Supported Validators

RedPen supports the following validators.

  • SentenceLength
  • InvalidExpression
  • InvalidWord
  • SpaceBeginningOfSentence
  • CommaNumber
  • WordNumber
  • SuggestExpression
  • InvalidSymbol
  • SymbolWithSpace
  • KatakanaEndHyphen
  • KatakanaSpellCheck
  • SectionLength
  • SpaceBetweenAlphabeticalWord
  • ParagraphNumber
  • ParagraphStartWith
  • Contraction
  • Spelling
  • DoubledWord
  • SuccessiveWord
  • DuplicatedSection
  • JapaneseStyle
  • DoubleNegative
  • FrequentSentenceStart
  • UnexpandedAcronym
  • WordFrequency
  • Hyphenation
  • NumberFormat
  • ParenthesizedSentence
  • WeakExpression

SentenceLength

SentenceLength validator checks the length of sentences in the input document. If the length of the sentence is greater than the specified maximum length, the validator generates a warning.

Properties

Property    Default Value    Description
max_len     50               Maximum length of sentence.
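
The check described above can be sketched in a few lines of Python. This is only an illustration, not RedPen's implementation: the function name is hypothetical, and it assumes that sentence length is measured in characters.

```python
def check_sentence_length(sentences, max_len=50):
    """Return (index, length) for every sentence longer than max_len.

    Assumption: length is counted in characters.
    """
    return [(i, len(s)) for i, s in enumerate(sentences) if len(s) > max_len]
```

A sentence whose length equals max_len passes, since the validator warns only when the length is greater than the maximum.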

Supported languages

SentenceLength can be applied to any language.

InvalidExpression

InvalidExpression validator checks if input sentences contain invalid expressions (words or phrases). If the input sentence contains invalid expressions, the validator generates a warning.

Properties

Property    Default Value    Description
dict        None             File name of dictionary.
list        None             Comma-separated list of invalid expressions.

The dictionary is a set of words or expressions. The following is an example of a dictionary.

like
you know
hey
kidding
what the hell
...
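
As a rough sketch of how such a dictionary might be applied, the following Python function reports every dictionary entry found in a sentence. The function name is hypothetical, and case-insensitive substring matching is an assumption; the matching rule RedPen actually uses may differ.

```python
def find_invalid_expressions(sentence, expressions):
    """Return the dictionary entries that occur in the sentence.

    Assumption: entries are matched as case-insensitive substrings.
    """
    lowered = sentence.lower()
    return [e for e in expressions if e.lower() in lowered]
```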

Supported languages

InvalidExpression can be applied to any language.

InvalidWord

InvalidWord validator checks if input sentences contain invalid words. If the input sentence contains invalid words, the validator generates a warning.

Properties

Property    Default Value    Description
dict        None             File name of dictionary.
list        None             Comma-separated list of invalid words.

The dictionary is a set of words. The following is an example of a dictionary.

like
hey
wow
...
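
The difference from InvalidExpression is that matching happens on whole words. A minimal Python sketch, assuming a simple regular-expression tokenizer for English (RedPen's own tokenizer is not shown in this document, and the function name is hypothetical):

```python
import re

def find_invalid_words(sentence, words):
    """Return the tokens of the sentence that appear in the word list.

    Assumption: tokens are runs of ASCII letters and apostrophes.
    """
    banned = {w.lower() for w in words}
    tokens = re.findall(r"[A-Za-z']+", sentence.lower())
    return [t for t in tokens if t in banned]
```

With the dictionary entry like, the word unlikely is not reported, because only whole tokens are compared.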

Supported languages

InvalidWord can be applied to any language (but the default dictionaries are supplied only for English and Japanese).

SpaceBeginningOfSentence

SpaceBeginningOfSentence validator checks if there is a white space at the end of input sentences (except for the very last sentence of a paragraph). If the input sentence does not end with a white space, a warning is given.

Supported languages

SpaceBeginningOfSentence can be applied to any language.

CommaNumber

CommaNumber validator checks the number of commas in a sentence.

Properties

Property    Default Value    Description
max_num     4                Maximum number of commas in a sentence.

Supported languages

CommaNumber can be applied to any language.

WordNumber

WordNumber validator checks the number of words in one sentence.

Properties

Property    Default Value    Description
max_num     50               Maximum number of words in a sentence.

Supported languages

WordNumber can be applied to any language except for some Asian languages (such as Chinese or Thai), since RedPen does not have a tokenizer for these unsupported languages.
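
The dependence on a tokenizer can be illustrated with a small Python sketch. It assumes whitespace tokenization, which is a simplification and not RedPen's tokenizer; the function name is hypothetical.

```python
def too_many_words(sentence, max_num=50):
    """Return True when the sentence has more than max_num words.

    Assumption: words are separated by whitespace.
    """
    return len(sentence.split()) > max_num
```

A sentence written without spaces between words, as in Chinese or Thai, is counted as a single word by this approach, which is why a language-specific tokenizer is needed.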

SuggestExpression

SuggestExpression validator works in a similar way to the InvalidExpression validator. If the input sentence contains invalid expressions, this validator returns a warning suggesting the correct expression.

Properties

Property    Default Value    Description
dict        None             File name of dictionary.

The dictionary is a TSV file with two columns. The first column contains the invalid expression, and the second column contains the suggested replacement expression.

SVM    Support Vector Machine
LLVM   Low Level Virtual Machine
...
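
A minimal Python sketch of reading such a two-column dictionary and producing suggestions follows. The function names are hypothetical, and plain substring matching is an assumption made for illustration.

```python
def load_suggestions(tsv_text):
    """Parse two-column TSV text into {invalid expression: suggestion}."""
    suggestions = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        invalid, _, replacement = line.partition("\t")
        suggestions[invalid.strip()] = replacement.strip()
    return suggestions

def suggest(sentence, suggestions):
    """Return (invalid, suggested) pairs for expressions found in the sentence.

    Assumption: expressions are matched as case-sensitive substrings.
    """
    return [(k, v) for k, v in suggestions.items() if k in sentence]
```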

Supported languages

SuggestExpression can be applied to any language, but the default dictionaries are provided only for English and Japanese.

InvalidSymbol

Some symbols or characters have alternate characters with the same role. For example, the question mark “?” (0x003F) has another Unicode variation, “？” (0xFF1F). InvalidSymbol checks if input sentences contain invalid characters or symbols. The symbol and character settings are entered into the character setting file (char-table.xml). In this file, we write the symbols we should use in the document and their invalid counterparts. The details of these settings are described in the next section.
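
The idea of pairing each preferred symbol with its invalid counterparts can be sketched in Python. The table below is a hypothetical stand-in for the contents of char-table.xml, whose actual format is not shown here; the function name is also hypothetical.

```python
# Hypothetical table: invalid character -> preferred character.
INVALID_TO_PREFERRED = {"\uFF1F": "?"}  # fullwidth question mark -> ASCII question mark

def find_invalid_symbols(sentence, table=INVALID_TO_PREFERRED):
    """Return (position, invalid character, preferred character) triples."""
    return [(i, ch, table[ch]) for i, ch in enumerate(sentence) if ch in table]
```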

Supported languages

InvalidSymbol works for any language. See the settings of symbols in the Configuration page.

SymbolWithSpace

Some symbols need a space before or after them. For example, if we want to ensure a space is added before a left parenthesis “(”, we could add this preference to the character setting file (char-table.xml).

Supported languages

SymbolWithSpace works for any language.

KatakanaEndHyphen

KatakanaEndHyphen validator checks the end hyphens of Katakana words in Japanese documents. Japanese Katakana words have variations in their end hyphen. For example, “computer” is written in Katakana as “コンピュータ” (without hyphen) or “コンピューター” (with hyphen). This validator checks to ensure that Katakana words match the predefined standard. See JIS Z8301, G.6.2.2 b) G.3.

  • a: Words of 3 characters or more cannot have an end hyphen.
  • b: Words of 2 characters or less can have an end hyphen.
  • c: A compound word should apply a and b to each component word.
  • d: In the cases from a to c, the length of a syllable which is represented by a hyphen is 1 except for Youon.
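
Rules a and b can be approximated with a short Python sketch. It is only an illustration under stated assumptions: the word length is counted without the trailing hyphen, rules c and d are not handled, and the function name is hypothetical.

```python
import re

# Runs of Katakana letters plus the prolonged sound mark (the "hyphen").
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FA\u30FC]+")

def find_end_hyphen_errors(text):
    """Return Katakana words of 3+ characters that end with a hyphen.

    Assumption: the trailing hyphen itself is not counted in the length.
    """
    errors = []
    for word in KATAKANA_RUN.findall(text):
        if word.endswith("ー") and len(word) - 1 >= 3:
            errors.append(word)
    return errors
```

Under these assumptions “コンピューター” is reported, while “コンピュータ” and the short word “キー” are accepted.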

Supported languages

KatakanaEndHyphen works only for Japanese texts.

KatakanaSpellCheck

KatakanaSpellCheck validator checks if Katakana words have very similar words with different spellings in the document. For example, if the Katakana word “インデックス” and the variation “インデクス” exist within the same document, this validator will return a warning.

Properties

Property     Default Value    Description
dict         None             Path to a user dictionary containing a skip list of Katakana words.
min_ratio    0.2              Threshold of the minimum similarity. KatakanaSpellCheck reports an error when the similarity of a pair of words is more than min_ratio.
min_freq     5                Threshold of the minimum word frequency. KatakanaSpellCheck checks only words whose frequency is less than min_freq.

Supported languages

KatakanaSpellCheck works only for Japanese texts.

SectionLength

SectionLength validator checks the maximum number of words allowed in a section.

Properties

Property    Default Value    Description
max_num     1000             Maximum number of words in a section.

Supported languages

SectionLength works for any language.

ParagraphNumber

ParagraphNumber validator checks the maximum number of paragraphs allowed in one section.

Properties

Property    Default Value    Description
max_num     5                Maximum number of paragraphs in a section.

Supported languages

ParagraphNumber works for any language.

ParagraphStartWith

ParagraphStartWith validator checks whether the characters at the beginning of paragraphs conform to the correct style.

Properties

Property      Default Value    Description
start_with    “ ”              Characters at the beginning of paragraphs.

Supported languages

ParagraphStartWith works for any language.

SpaceBetweenAlphabeticalWord

SpaceBetweenAlphabeticalWord validator checks that alphabetic words are surrounded by whitespace. This validator is used in non-Latin languages such as Japanese or Chinese.
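
One way to picture this check is a regular expression that finds alphabetic words touching a Japanese or Chinese character directly. The Python sketch below is illustrative only; the character ranges (Hiragana, Katakana, and common CJK ideographs) and the function name are assumptions.

```python
import re

CJK = r"[\u3040-\u30FF\u4E00-\u9FFF]"  # Hiragana, Katakana, common ideographs

# An alphabetic word directly preceded or directly followed by a CJK character.
MISSING_SPACE = re.compile(rf"(?<={CJK})[A-Za-z]+|[A-Za-z]+(?={CJK})")

def find_unspaced_words(sentence):
    """Return alphabetic words that lack whitespace on at least one side."""
    return MISSING_SPACE.findall(sentence)
```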

Supported languages

SpaceBetweenAlphabeticalWord works for languages whose words are not split by white spaces such as Japanese or Chinese.

Contraction

Contraction validator throws an error when contractions are used in a document in which more than half of the verbs are written in non-contracted form.

Supported languages

Contraction works only for English texts.

Spelling

Spelling validator throws an error if there are spelling mistakes in the input documents. This validator only works for English documents.

Supported languages

Spelling works only for English texts.

DoubledWord

DoubledWord validator throws an error if a word is used more than once in a sentence. For example, if an input document contains the following sentence, the validator will report an error since “good” is used twice.

this good item is very good.

Properties

Property    Default Value    Description
dict        None             File name of skip list dictionary.
list        None             Comma-separated list of skip words.
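
The role of the skip list can be seen in a small Python sketch. It assumes a simple English tokenizer and is not RedPen's implementation; the function name is hypothetical.

```python
import re
from collections import Counter

def find_doubled_words(sentence, skip=()):
    """Return words used more than once in the sentence, ignoring skip words.

    Assumption: tokens are runs of ASCII letters and apostrophes.
    """
    skip_set = {w.lower() for w in skip}
    counts = Counter(re.findall(r"[a-z']+", sentence.lower()))
    return sorted(w for w, n in counts.items() if n > 1 and w not in skip_set)
```

Words on the skip list, typically common function words, are never reported even when they repeat.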

Supported languages

DoubledWord works for any language except for Chinese or other Asian languages. Note that the default dictionaries are supplied for Japanese and English.

SuccessiveWord

SuccessiveWord validator throws an error if the same word is used twice in succession. For example, if an input document contains the following sentence, the validator will report an error since “is” is used twice in succession.

the item is is very good.
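
A minimal Python sketch of this check compares each token with its successor. The tokenizer is a simplifying assumption for English text, and the function name is hypothetical.

```python
import re

def find_successive_words(sentence):
    """Return each word that is immediately followed by itself.

    Assumption: tokens are runs of ASCII letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [a for a, b in zip(tokens, tokens[1:]) if a == b]
```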

Supported languages

SuccessiveWord works for any language except for Chinese or other Asian languages.

DuplicatedSection

DuplicatedSection validator throws an error if there are section pairs which have almost the same content.

Supported languages

DuplicatedSection works for any language.

JapaneseStyle

JapaneseStyle validator reports errors if the input file contains both “dearu” and “desu-masu” style.

Supported languages

JapaneseStyle works only for Japanese texts.

DoubleNegative

DoubleNegative validator reports errors when an input sentence contains a double negative expression.

Supported languages

DoubleNegative works only for English and Japanese texts.

FrequentSentenceStart

This validator reports an error if too many sentences start with the same sequence of words.

Properties

Property                Default Value    Description
leading_word_limit      3                Number of words starting each sentence to consider.
percentage_threshold    25               Maximum percentage of sentences that can start with the same words.
min_sentence_count      5                Minimum number of sentences required for the validator to report errors.
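
How the three properties interact can be sketched in Python. This is a simplified illustration: it assumes whitespace tokenization and compares only the first leading_word_limit words of each sentence, which may differ from what RedPen does. The function name is hypothetical.

```python
from collections import Counter

def frequent_starts(sentences, leading_word_limit=3,
                    percentage_threshold=25, min_sentence_count=5):
    """Return sentence openings shared by too large a share of sentences."""
    if len(sentences) < min_sentence_count:
        return []  # too few sentences to judge
    starts = Counter(
        " ".join(s.lower().split()[:leading_word_limit])
        for s in sentences if s.split()
    )
    return sorted(
        start for start, n in starts.items()
        if 100 * n / len(sentences) > percentage_threshold
    )
```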

Supported languages

FrequentSentenceStart works for any language.

UnexpandedAcronym

This validator ensures that there are candidates for expanded versions of acronyms somewhere in the document.

That is, if there exists an acronym ABC in the document, then there must also exist a sequence of capitalized words such as Axxx Bxx Cxxx.

Properties

Property              Default Value    Description
min_acronym_length    3                Minimum size for the acronym.
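
The rule can be sketched in Python by collecting the initials of every run of capitalized words and checking each acronym against them. The tokenizer, the definition of a capitalized word, and the function name are assumptions made for this illustration.

```python
import re

def unexpanded_acronyms(text, min_acronym_length=3):
    """Return acronyms with no matching run of capitalized words in the text."""
    words = re.findall(r"[A-Za-z]+", text)
    acronyms = {w for w in words
                if w.isupper() and len(w) >= min_acronym_length}
    candidates = set()
    run = ""  # initials of the current run of capitalized words
    for w in words + [""]:  # the empty sentinel flushes the last run
        if w[:1].isupper() and not w.isupper():
            run += w[0]
            continue
        # The run ended: record every contiguous slice that is long enough.
        for i in range(len(run)):
            for j in range(i + min_acronym_length, len(run) + 1):
                candidates.add(run[i:j])
        run = ""
    return sorted(acronyms - candidates)
```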

Supported languages

UnexpandedAcronym works only for English texts.

WordFrequency

This validator ensures that specific words in the document do not occur too frequently. It calculates the frequency with which words are used and compares it to a reference histogram of word frequency for written English.

Excessive deviation from normal usage generates a validation error.

Properties

Property            Default Value    Description
deviation_factor    3                Permitted factor of deviation from the norm. So if a word is normally used 3% of the time, your document can use it up to 9% of the time.
min_word_count      200              Minimum number of words in a document before this validator starts to validate.
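
The comparison against the reference histogram can be sketched in Python. The reference argument is a hypothetical mapping from a word to its normal share of all words (for example 0.03 for 3%); RedPen's actual histogram and its format are not shown here.

```python
from collections import Counter

def overused_words(words, reference, deviation_factor=3, min_word_count=200):
    """Return words used more than deviation_factor times their normal share."""
    if len(words) < min_word_count:
        return []  # document too short to validate
    counts = Counter(w.lower() for w in words)
    total = len(words)
    return sorted(
        w for w, n in counts.items()
        if w in reference and n / total > reference[w] * deviation_factor
    )
```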

Supported languages

WordFrequency works only for English texts.

Hyphenation

This validator ensures that sequences of words that are hyphenated in the dictionary are hyphenated in your document.

Supported languages

Hyphenation works only for English texts.

NumberFormat

This validator ensures that numbers in a sentence are formatted using commas (e.g., 12,000 instead of 12000), and do not have excessive decimal places.

Properties

Property                      Default Value    Description
decimal_delimiter_is_comma    false            Change the decimal delimiter from . to , (as in Europe).
ignore_years                  false            Ignore 4-digit integers (2015, 1998).
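
The comma check and the ignore_years option can be sketched in Python as follows. The sketch assumes decimal_delimiter_is_comma is false, treats any integer part of four or more digits without commas as unformatted, and leaves out the decimal-place check. The function name is hypothetical.

```python
import re

NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")

def unformatted_numbers(sentence, ignore_years=False):
    """Return numbers whose integer part has 4+ digits and no commas."""
    errors = []
    for match in NUMBER.finditer(sentence):
        token = match.group().rstrip(",")  # drop sentence punctuation
        integer_part = token.split(".")[0]
        if "," in integer_part or len(integer_part) < 4:
            continue
        if ignore_years and len(integer_part) == 4 and "." not in token:
            continue  # a 4-digit integer is treated as a year
        errors.append(token)
    return errors
```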

Supported languages

NumberFormat works for texts written in European languages such as English or French.

ParenthesizedSentence

This validator generates errors if parenthesized sentences (such as this) are used too frequently, or are nested too heavily.

Properties

Property             Default Value    Description
max_nesting_level    2                Maximum nesting depth of parenthesized expressions.
max_count            1                Maximum number of parenthesized expressions allowed.
max_length           4                Maximum number of words in a parenthesized expression.
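
The quantities behind max_count and max_nesting_level can be computed with a simple scan, sketched here in Python. The function name is hypothetical and max_length is not covered. Comparing the results against the limits is left to the caller, since this document does not say whether the limits are inclusive.

```python
def parenthesis_stats(sentence):
    """Return (number of parenthesized expressions, deepest nesting level)."""
    depth = max_depth = count = 0
    for ch in sentence:
        if ch == "(":
            depth += 1
            count += 1
            max_depth = max(max_depth, depth)
        elif ch == ")" and depth > 0:
            depth -= 1
    return count, max_depth
```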

Supported languages

ParenthesizedSentence works only for texts written in European languages.

WeakExpression

This validator generates errors if sequences of words form what is generally considered to be a “weak expression”.

Supported languages

WeakExpression works only for English texts.