Overview

RedPen is an open source proofreading tool to check if your technical documents meet writing standards.

RedPen Quickstart

This quickstart guide is to help you get started with RedPen. Let’s go through some of the basics.

Requirements

RedPen requires the following software.

  • Java 1.8.40 or greater

Example run

First, download the RedPen package from release page, and then decompress the package with the following commands.

$ tar xvf redpen-*-assembled.tar.gz
$ cd redpen-*

Then, run the redpen command with the supplied sample document and configuration file.

$ bin/redpen -c conf/redpen-conf-en.xml sample-doc/en/sampledoc-en.txt
14:32:37.639 [main] INFO  org.unigram.docvalidator.Main - loading character table file: sample/conf/symbol-conf-en.xml
14:32:37.652 [main] INFO  o.u.docvalidator.util.CharacterTable - Succeeded to load character table
14:32:37.654 [main] INFO  o.unigram.docvalidator.parser.Parser - comma is set to ","
14:32:37.655 [main] INFO  o.unigram.docvalidator.parser.Parser - full stop is set to "."
14:32:37.663 [main] INFO  o.u.d.v.s.ParagraphStartWithValidator - Using the default value of paragraph_start_with.
CheckError[sample/doc/txt/en/sampledoc-en.txt: 0] = The length of the line exceeds the maximum 265 in line: ln bibliometrics and link analysis studies many attempts have been made to analyze the \
relationship amongscientific papers, authors andjoumals and recently, these research results have been found to be effective for analyzing the link structure ofweb pages as we11.
CheckError[sample/doc/txt/en/sampledoc-en.txt: 0] = The length of the line exceeds the maximum 161 in line:  In addition,  Most of these methods are concernedwith the two link analysis measures: \
relatedness between documenatsndglobal importance of individual documents.
...

RedPen Commands

RedPen provides both a simple standalone command line tool and a server.

Command line tool

RedPen provides a simple command line tool called 'redpen' to check documents.

Using redpen

We use the redpen command as follows.

$ redpen [options] input-files

By default, input files are delimited by whitespace and then analysed. The redpen command supports the following options.

Options

The redpen command has the following options.

Specify the RedPen configuration file
-c <CONFIG_FILE>, --configuration <CONFIG_FILE>

RedPen CLI (redpen command) search the configuration file when users do not specify the configuration file with -c option. First RedPen loads redpen-conf.xml in the current directory. If redpen-conf.xml does not exist in the current directory, RedPen loads the localized setting file, redpen-conf-{lang}.xml in current directory, where {lang} is language code in ISO 639-1. The language codes are selected with the locale setting of the users computer. When there is no localized setting file, RedPen try to load the configuration file in $REDPEN_HOME/conf.

Input file format [default: plain]
-f <INPUT_FORMAT>, --format <INPUT_FORMAT>

This argument specifies the input format. Currently RedPen supports the following formats.

Value Description

plain

Plain text format

wiki

Wiki (Textile) format

markdown

Markdown format

asciidoc

AsciiDoc format

review

Re:VIEW format

latex

LaTeX format

properties

Java property file format

rest

reStructuredText format

Note
When users do not specify the input format with -f, redpen command guess the format from file extensions. The followings table shows the list of file extensions, which redpen command understand.
Value Extension

plain

txt

wiki

wiki

markdown

md, markdown

asciidoc

adoc, asciidoc

review

re, review

latex

tex, latex

properties

properties

rest

reStructuredText

Result format [default: plain]
option:: -r <RESULT_FORMAT>, --result_format <RESULT_FORMAT>

This argument determines the output format. Currently RedPen supports the following output formats.

Value Description

plain

plain text format

plain2

an alternate plain text format collated by sentence

xml

xml format

json

json format

json2

an alternate json format collated by sentence

Specify the limit of error number [default: 1]

The redpen command returns 0 when the number of found errors are less than the specified limit.

option:: -l <LIMIT NUMBER>, --limit  <LIMIT NUMBER>
Specify the language of error messages [default: depends on the locale settings of the machine]

For selecting language of errors, we specify the language code (en or ja).

option:: -L <LANGUAGE>,--lang <LANGUAGE>
Specify the input sentences

Commonly users specify input files, but sometimes making files is tidious. For testing purpose redpen command provides the parameter for input sentences.

option:: -s <INPUT SENTENCE>, --sentence  <INPUT SENTENCE>
Display help
-h, --help
Show the redpen version
--version

RedPen server

RedPen also provides the server. RedPen server provides not only UI but also practical REST API (see RedPen Server section). The following is an image of RedPen server UI.

Image

Usage: redpen-server

We can start and stop the redpen server with the following command.

$ redpen-server [start|stop]

Configuration

The redpen-server command is able to be configured with editing the variables in redpen-server file itself. The following table shows the configuration variables and the default values.

Configuration Default Value Description

REDPEN_PORT

8080

Specify Port number of RedPen server.

STOP_KEY

redpen.stop

RedPen server is able to stop with Stop key with http access. If you do not want to stop with stop key comment out the value.

REDPEN_CONF_FILE

Specify default redpen config file.

REDPEN_LANGUAGE

Depends on locale settings

Specify the language of error messages from RedPen.

The functionality of the RedPen server is described in the RedPen Server section.

RedPen Input Formats

RedPen supports several types of input formats:

  • Plain Text

  • Markdown

  • AsciiDoc

  • Wiki

  • LaTeX

  • Re:VIEW

  • Java Properties

  • reStructuredText

Plain text

Plain text supports a set of paragraphs. Paragraphs are separated by two new lines. For example, the following article has two paragraphs.

This is a first paragraph. This paragraph is the introduction of this article.
It introduces the central issue discussed throughout the rest of the article.

Second paragraph describes the details of the issue and attempts to present a solution.

AsciiDoc

LaTeX

Note
RedPen does not supports Macros defined by writers.

Wiki format

RedPen supports a subset of Wiki syntax. Currently, the supported elements of Wiki syntax are as follows.

Headings

To create a heading, add a line starting with h[1234]. The number after h represents the level of the heading or section.

Inline Formatting

RedPen supports the following inline formatting.

Bold

**this is a Bold sentence.**

Italic

//this is an italic sentence.//

Underline

__this is an underlined sentence.__

Strikethrough

--this is a strikethrough sentence.--

Links elements are included in Wiki formatted documents.

Lists

Wiki syntax supports two types of lists.

Bulleted Lists

To enter a bulleted list, start a line with an asterisk. The number of asterisks denotes the indent level of the list.

* List
* List
** Sub List
Numbered List

If you want to add numbered lists, use the hash/pound symbol (#) instead of the asterisk used by Bulleted Lists.

Comments

To add a comment to the wiki source, add a [!-- …​ --] block. The following shows a sample comment.

[!--
  This is a comment.
--]

Paragraphs

Paragraphs are separated by two new lines. This syntax is the same as for plain text.

Markdown

RedPen currently supports the following Markdown elements.

Headings

Two styles of headings are supported.

  • Underlined headings

First and second level headings can be specified using underlines.

First-level headings
====================
second-level headings
---------------------
  • Atx style headings

1-6 hash or pound characters (#) at the beginning of a line.

For example:

# First-level heading
## Second-level heading
### Third-level heading

Inline Formatting

RedPen supports the following inline formatting.

Bold

Wrap characters with double asterisks or underscores for bold. The following are samples of bold sentences.

**this is a Bold sentence.**
__this is also a Bold sentence.__
Italic

Wrap characters with a single asterisk or underscore for italics. The following are samples of italic sentences.

*this is a italic syntax.*
_this is also a italic syntax._

To create a link, wrap square brackets around the link’s label and parentheses around the URL. For example.

[label](url)

Lists

The Markdown parser used by RedPen supports two types of lists - Bulleted lists and Numbered lists.

Bulleted Lists

To create a bulleted list, start a line with an asterisk or a hyphen. The lists are nested according to how many leading spaces there are. The following is a example of a bulleted list using asterisks.

* List
* List
  * Sub List
  * Sub List
Numbered List

If you want to create a numbered list, use a number followed by a period, as in the following example.

1. List
2. List

Paragraphs

Paragraphs are separated by two new lines. This syntax is the same as for plain text.

Re:VIEW format

Java Properties

Properties files or Resource Bundles are commonly used for internalization in Java. RedPen treats every property as a section, which can have one or more sentences. Comments and values, but not keys are validated.

See the Properties Javadoc for more information on file format.

reStructuredText

Note
currently RedPen only supports basic notations

RedPen Output Formats

RedPen supports three basic output formats - Plain text, XML, and JSON.

Note
From v1.10, RedPen supports2 error level (default is error). For the configuration, please refer to the Validator configuration section.

Plain text

Plain text output format consists of the following lines.

FILE_NAME:LINE_NUM: Validation[Error|Info|Warn][ERROR_TYPE], ERROR_MESSAGE at line: SENTENCE

An alternate plain text form (plain2) prints each sentence, followed by all of the errors found in the sentence.

XML

The top section of the XML output format is validation-result element which contains multiple error sections. Each error section has the following sub-elements.

Block

Optional

Description

validator

false

Validator name

message

false

Error message

lineNum

false

Line Number

sentence

false

Sentence containing error

file

true

File name

level

String

Error level (Info, Warn, Error)

JSON

Block

Optional

Description

validator

false

Validator name

message

false

Error message

lineNum

false

Line Number

sentence

false

Sentence containing error

file

true

File name

level

String

Error level (Info, Warn, Error)

The alternative JSON output format (json2) collates each error by the sentence it relates to.

RedPen Configuration

RedPen’s configuration file consists of two separate blocks. One block configures the RedPen validators and the other lets you override characters and symbols for the input documents.

Configuration file

RedPen has a single configuration file, which contains all the settings RedPen requires. The main configuration file is an xml file with a root element of redpen-conf. Within this element there are two sub elements named "validators" and "symbols."

To apply the default validators and symbol settings for a target language, we just specify lang attributes in the redpen-conf element.

The validators section specifies which validators are to be loaded by RedPen. Each validator within this section can have their property values overridden.

The symbols section overrides the default symbol settings for the target language. The following is an example of a RedPen configuration file.

<redpen-conf lang="en">
    <validators>
        <validator name="SentenceLength">
            <property name="max_len" value="200"/>
        </validator>
        <validator name="InvalidSymbol" />
        <validator name="SpaceWithSymbol" />
        <validator name="SectionLength">
            <property name="max_num" value="2000"/>
        </validator>
        <validator name="ParagraphNumber" />
    </validators>
    <symbols>
         <symbol name="EXCLAMATION_MARK" value="!" invalid-chars="" after-space="true" />
         <symbol name="LEFT_QUOTATION_MARK" value="\'"  invalid-chars="" before-space="true" />
    </symbols>
</redpen-conf>

In the next section we will cover the configuration of validators in greater detail. The settings for the symbols section are described in the Setting Symbols section.

Validator configuration

The RedPen configuration file contains a "validators" section for registering Validators. RedPen will apply each validator specified in this section to the to the input document. The following is a sample "validators" section.

<validators>
    <validator name="SentenceLength">
        <property name="max_len" value="200"/>
    </validator>
    <validator name="InvalidSymbol" level="info"/>
    <validator name="SpaceWithSymbol" />
    <validator name="SectionLength" level="warn">
        <property name="max_num" value="2000"/>
    </validator>
    <validator name="ParagraphNumber" />
 </validators>

Each validator is configured within its own validator element. The name attribute of this element specifies the name of the function (called validator). Each validator is responsible for checking a particular aspect of the input document. For example, if the "SectionLength" is added in the configuration, then RedPen will check the length of each section in the input document.

Some validator components can be configured using "property" elements. For example, you can override the maximum character count in the "SectionLength" validator by specifying a "max_num" property. Some validators also have "sub-validators" which can also be configured within the validator section.

Users can specify the severity of errors from each validator using level attribute. For example, the error level of "InvalidSymbol" in the above example is set as "info" level. For details, please refert to RedPen Output Format section.

We cover all supported validators in the Validator section.

Setting symbols

The lang attribute of the redpen-conf element determines how various symbols are handled by RedPen. RedPen supports default symbols for "en" and "ja", which are described in English symbols and Japanese Symbols.

The default symbol settings for a target language can be overridden by configuring the symbols section of the RedPen configuration file.

The default settings are described in the following sections. Within the symbols configuration section we can use symbol elements to specify which symbols to use when validating documents. Each "symbol" element overrides a character found in the documents.

The following table describes the properties of the symbol element.

Property Mandatory Default Value Description

name

true

none

Name of the symbol

value

true

none

Value of the symbol

before-space

false

false

Need space before the symbol

after-space

false

false

Need space after the symbol

invalid-chars

false

""

List of invalid symbols

Sample: Setting symbols

In the following example, we can see a symbols section that defines 3 symbols. The first element defines exlamation mark as '!'. Then, FULL_STOP defines a period as the character "." and specifies that the symbol must be followed by a space. The third element defines comma as ',' and also defines '、' and ',' as invalid comma characters. This is because some characters have equivalent symbolic meanings.

For example, in Japanese both '.' and '。' can represent a FULL_STOP. The invalid-chars setting allows us to restrict which character alternatives are permitted in our documents.

<symbols>
    <symbol name="EXCLAMATION_MARK" value="!" />
    <symbol name="FULL_STOP" value="." after-space="true" />
    <symbol name="COMMA" value="," invalid-chars="、," after-space="true" />
</symbols>

Default Settings for English

The following table shows the default symbol settings for English and other latin based documents. In the table, the first column contains the name of each symbol and the second column (Value) shows the symbol’s character value. The columns 'NeedBeforeSpace', 'NeedAfterSpace' indicate if the symbol should be followed by or preceded by a space, respectively. The column, 'InvalidChars' shows the symbol’s invalid variant characters.

Symbol Value NeedBeforeSpace NeedAfterSpace InvalidChars Description

FULL_STOP

'.'

false

true

'.', '。'

Sentence period

SPACE

' '

false

false

' '

White space between words

EXCLAMATION_MARK

'!'

false

true

'!'

Exclamation mark

NUMBER_SIGN

'#'

false

false

'#'

Number sign

DOLLAR_SIGN

'$'

false

false

'$'

Dollar sign

PERCENT_SIGN

'%'

false

false

'%'

Percent sign

QUESTION_MARK

'?'

false

true

'?'

Question mark

AMPERSAND

'&'

false

true

'&'

Ampersand

LEFT_PARENTHESIS

'('

true

false

'('

Left parenthesis

RIGHT_PARENTHESIS

')'

false

true

')'

Right parenthesis

ASTERISK

'*'

false

false

'*'

Asterrisk

COMMA

','

false

true

'、',','

Comma

PLUS_SIGN

'+'

false

false

'+'

Plus sign

HYPHEN_SIGN

'-'

false

false

'ー'

Hyphenation

SLASH

'/'

false

false

'/'

Slash

COLON

':'

false

true

':'

Colon

SEMICOLON

';'

false

true

';'

Semicolon

LESS_THAN_SIGN

'<'

false

false

'<'

Less than sign

GREATER_THAN_SIGN

'>'

false

false

'>'

Greater than sign

EQUAL_SIGN

'='

false

false

'='

Equal sign

AT_MARK

'@'

false

false

'@'

At mark

LEFT_SQUARE_BRACKET

'['

true

false

Left square bracket

RIGHT_SQUARE_BRACKET

']'

false

true

Right square bracket

BACKSLASH

'\'

false

false

Backslash

CIRCUMFLEX_ACCENT

'^'

false

false

'^'

Circumflex accent

LOW_LINE

'_'

false

false

'_'

Low line (under bar)

LEFT_CURLY_BRACKET

'{'

true

false

'{'

Left curly bracket

RIGHT_CURLY_BRACKET

'}'

true

false

'}'

Right curly bracket

VERTICAL_BAR

'|'

false

false

'|'

Vertical bar

TILDE

'~'

false

false

'〜'

Tilde

LEFT_SINGLE_QUOTATION_MARK

'''

false

false

Left single quotation mark

RIGHT_SINGLE_QUOTATION_MARK

'''

false

false

Right single quotation mark

LEFT_DOUBLE_QUOTATION_MARK

'"'

false

false

Left double quotation mark

RIGHT_DOUBLE_QUOTATION_MARK

'"'

false

false

Right double quotation mark

These settings are used by several Validators such as InvalidSymbol and SpaceValidator.If you change the symbol definition, you can override the settings adding symbol elements to the symbols block in the configuration file.

Default Settings for Japanese

The following table shows the default symbol settings for Japanese documents.

Symbol Value NeedBeforeSpace NeedAfterSpace InvalidChars Description

FULL_STOP

'。'

false

false

'.','.'

Sentence period

SPACE

' '

false

false

White space between words

EXCLAMATION_MARK

'!'

false

false

'!'

Exclamation mark

NUMBER_SIGN

'#'

false

false

'#'

Number sign

DOLLAR_SIGN

'$'

false

false

'$'

Dollar sign

PERCENT_SIGN

'%'

false

false

'%'

Percent sign

QUESTION_MARK

'?'

false

false

'?'

Question mark

AMPERSAND

'&'

false

false

'&'

Ampersand

LEFT_PARENTHESIS

'('

false

false

'('

Left parenthesis

RIGHT_PARENTHESIS

')'

false

false

')'

Right parenthesis

ASTERISK

'*'

false

false

'*'

Asterrisk

COMMA

'、'

false

false

',',','

Comma

PLUS_SIGN

'+'

false

false

'+'

Plus sign

HYPHEN_SIGN

'ー'

false

false

'-'

Hyphenation

SLASH

'/'

false

false

'/'

Slash

COLON

':'

false

false

''

Colon

SEMICOLON

';'

false

false

''

Semicolon

LESS_THAN_SIGN

'<'

false

false

''

Less than sign

GREATER_THAN_SIGN

'>'

false

false

''

Greater than sign

EQUAL_SIGN

'='

false

false

'='

Equal sign

AT_MARK

'@'

false

false

'@'

At mark

LEFT_SQUARE_BRACKET

'「'

true

false

Left square bracket

RIGHT_SQUARE_BRACKET

'」'

false

false

Right square bracket

BACKSLASH

'¥'

false

false

Backslash

CIRCUMFLEX_ACCENT

'^'

false

false

'^'

Circumflex accent

LOW_LINE

'_'

false

false

'_'

Low line (under bar)

LEFT_CURLY_BRACKET

'{'

true

false

'{'

Left curly bracket

RIGHT_CURLY_BRACKET

'}'

true

false

'}'

Right curly bracket

VERTICAL_BAR

'|'

false

false

'|'

Vertical bar

TILDE

'〜'

false

false

'~'

Tilde

LEFT_SINGLE_QUOTATION_MARK

'‘'

false

false

Left single quotation mark

RIGHT_SINGLE_QUOTATION_MARK

'’'

false

false

Right single quotation mark

LEFT_DOUBLE_QUOTATION_MARK

'“'

false

false

Left double quotation mark

RIGHT_DOUBLE_QUOTATION_MARK

'”'

false

false

Right double quotation mark

Japanese Symbol Variations

Symbols in Japanese has vary by the author and the writing group. RedPen provides three default symbol settings for Japanese. The variations are specified with variant attribute. Currently there are three variants for Japanese symbol settings ("zenkaku" (default), "zenkaku2" and "hankaku").

For example the following is the sample of configuration file for Japanese text with the "zenkaku2" setting.

<redpen-conf lang="ja" variant="zenkaku2">
    <validators>
        <validator name="InvalidSymbol" />
        <validator name="SpaceWithSymbol" />
        <validator name="SectionLength" />
        <validator name="ParagraphNumber" />
    </validators>
</redpen-conf>

The symbols of "hankaku" variant is the same as the symbol settings as en. The symbols of "zenkaku2" is almost the same as default "zenkaku" variant of "ja" with the following exceptions.

Symbol Value NeedBeforeSpace NeedAfterSpace InvalidChars Description

FULL_STOP

'.'

false

false

' .', '。'

Sentence period

COMMA

','

false

false

',','、'

Comma

RedPen Language Settings

RedPen processes documents written in any language, such as German, French or Japanese. RedPen default configuration will process documents written in English or other Latin-based languages. Therefore, to make RedPen process documents written in other languages, users need to specify the language in the configuration file.

Override Language

Currently there are two language settings, en, ja and ru. en is for documents written in Latin-based languages such as English and German. "ja" is for Japanese documents.

In order to override the language, you must change the "lang" property of the "symbols" section of the configuration file. In the following example, the config file specifies Japanese ("ja") as the document language.

<symbols lang="ja">
    <symbol name="EXCLAMATION_MARK" value="!" invalid-chars="" after-space="true" />
    <symbol name="LEFT_QUOTATION_MARK" value="\'"  invalid-chars="" before-space="true" />
    <symbol name="RIGHT_QUOTATION_MARK" value="\'"  invalid-chars="" after-space="true" />
    <symbol name="NUMBER_SIGN" value="#" invalid-chars="" after-space="true" />
    <symbol name="FULL_STOP" value="" invalid-chars=".." after-space="true" />
    <symbol name="COMMA" value="" invalid-chars="," after-space="true" />
</symbols>

Override Symbol Settings

Depending on the documents and their authors, the characters and symbols used may differ. For example, one author prefers "'" for a left single quotation mark, whereas another author prefers to use "‘" for a left single quotation mark.

For such cases, RedPen provides the way to override the default symbols (see Setting Symbols section) used in the document. In addition, specify which symbols should not used in the input documents.

The following overrides the setting for single quotation marks to enforce the use of the ascii quotation mark (').

<symbols>
  <symbol name="LEFT_SINGLE_QUOTATION_MARK" value="'"  invalid-chars="" />
  <symbol name="RIGHT_SINGLE_QUOTATION_MARK" value="'" invalid-chars=""/>
</symbols>

RedPen Supported Validators

RedPen supports the following validators.

SentenceLength

SentenceLength validator checks the length of sentences in the input document. If the length of the sentence is greater than the specified maximum length, the validator generates a warning.

Properties

Property Default Value Description

max_len

50

Maximum length of sentence.

Supported languages

SentenceLength can be applied to any language.

InvalidExpression

InvalidExpression validator checks if input sentences contain invalid expressions (words or phrases). If the input sentence contains invalid expressions, the validator generates a warning.

Properties

Property Default Value Description

dict

None

File name of dictionary.

list

None

List of invalid expressions delimited by commas.

The dictionary is a set of words or expressions. The following is an example of a dictionary.

like
you know
hey
kidding
...

Supported languages

InvalidExpression can be applied to any language.

InvalidWord

InvalidWord validator checks if input sentences contain invalid words. If the input sentence contains invalid words, the validator generates a warning.

Properties

Property Default Value Description

dict

None

File name of dictionary.

list

None

List of invalid expressions delimited by commas.

The dictionary is a set of words. The following is an example of a dictionary.

like
hey
wow
...

Supported Languages

InvalidWord can be any of languages (but the default dictionaries are supplied only for English and Japanese).

SpaceBeginningOfSentenceValidator

This validator checks if there is a white space at the end of sentences (except for the last sentence of paragraph). If the input sentence does end with a white space, a warning is given.

Warning
SpaceBeginningOfSentenceValidator is deprecated.

Supported languages

SpaceBeginningOfSentenceValidator can be applied to any language.

CommaNumber

CommaNumber validator checks the number of commas in a sentence.

Properties

Property Default Value Description

max_num

3

Maximum number of commas in a sentence.

Supported languages

CommaNumber can be applied to any language.

WordNumber

WordNumber validator checks the number of words in one setnece.

Properties

Property Default Value Description

max_num

30

Maximum number of words in a sentence.

Supported languages

WordNumber can be applied to any languages except for several Asian languages (Chinese or Thai). RedPen does not have the tokenizer for the languages.

SuggestExpression

SuggestExpression validator works in a similar way to the InvalidExpression validator. If the input sentence contains invalid expressions, this validator returns a warning suggesting the correct expression.

Properties

Property Default Value Description

dict

None

File name of dictionary.

map

None

List of pairs of elements. e.g. {SVM,Support Vector Machine},{like,such as}

The dictionary is a TSV file with two columns. First column contains the invalid expression, and the second column contains a suggested replacement expression.

SVM    Support Vector Machine
LLVM   Low Level Virtual Machine
...

Supported languages

SuggestExpression can be any of languages but the default dictionaries are provided only for English and Japanese.

InvalidSymbol

Some symbols or characters have alternate characters with the same role. For example question mark ? (0x003F) has another unicode variation ?(0xFF1F). InvalidSymbol checks if input sentences contains invalid characters or symbols. The symbols settings are added into the character setting block int the configuration file. In this file, we write the symbols we should use in the document and their invalid counterparts. The details of these settings is described in the next section.

Supported languages

InvalidSymbol works for any langugages. See the settings of symbols in the Configuration section.

SymbolWithSpace

Some symbols need space before or after them. For example, if we want to ensure a space before a left parentheses, we can add the preference to the symbol setting block (see Setting symbols)

Supported languages

InvalidSymbol works for any language.

KatakanaEndHyphen

KatakanaEndHyphen validator checks the end hyphens of Katakana words in Japanese documents. Japanese Katakana words have variations in their end hyphen. For example, "computer" is written in Katakana as "コンピュータ" (without hyphen), and "コンピューター" (with hypen). This validator checks to ensure that Katakana words match the predefined standard. See JIS Z8301, G.6.2.2 b) G.3.

  • a: Words of 3 characters or more cannot have an end hyphen.

  • b: Words of 2 characters or less can have an end hyphen.

  • c: A compound word should apply a and b to each component word.

  • d: In the cases from a to c, the length of a syllable which is represented by a hyphen is 1 except for Youon.

Supported languages

KatakanaEndSymbol works only for Japanees texts.

KatakanaSpellCheck

KatakanaSpellCheck validator checks if Katakana words have variational written form. For example, if the Katakana word "インデックス" and the variational form "インデクス" exist within the same document, this validator will return a warning.

Properties

Property Default Value Description

dict

None

File name of dictionary.

min_ratio

0.2

Threshold of the minimum similarity. KatakanaSpellCheck reports an error when there is a pair of words of which the similarity is more than the min_ratio.

min_freq

5

Threshold of the minimum word frequency. KatakanaSpellCheck checks words of which frequencies are less than min_freq.

Supported languages

KatakanaSpellCheck works only for Japanese texts.

SectionLength

SectionLength validator checks the maximum number of words allowed in an section.

Properties

Property Default Value Description

max_num

1000

Maximum number of characters in a section.

Supported languages

SectionLength works for any language.

ParagraphNumber

ParagraphNumber validator checks the maximum number of paragraphs allowed in one section.

Properties

Property Default Value Description

max_num

5

Maximum number of paragraphs in a section.

Supported languages

ParagraphNumber works for any language.

ParagraphStartWith

ParagraphStartWith validator checks to see if the characters at the beginning of paragraphs conforms to the correct style.

Properties

Property Default Value Description

start_with

" "

Characters in the beginning of paragraphs.

Supported languages

ParagraphStartWith works for any langugaes.

SpaceBetweenAlphabeticalWord

SpaceBetweenAlphabeticalWord validator checks that alphabetic words are surrounded with whitespace. This validator is used in non-latin languages such as Japanese or Chinese.

Properties

Property Default Value Description

forbidden

false

Speces are enforce (false) or forbidden.

skip_before

""

Skip errors when there is no space before the specifed characters (symbols).

skip_after

""

Skip errors when there is no space after the specifed characters (symbols).

Supported languages

SpaceBetweenAlphabeticalWord works for languages whose words are not split by white spaces such as Japanese or Chinese.

Contraction

Contraction throws an error when contractions are used in a document in which more than half of the verbs are written in non-contracted form.

Supported languages

Contraction works only for English texts.

Spelling

Spelling validator throws an error if there are spelling mistakes in the input documents. This validator only works for English documents.

Properties

Property Default Value Description

dict

None

File name of known word dictionary.

list

None

List of known words delimited by commas.

Supported languages

Spelling works only for English texts.

DoubledWord

DoubledWord validator throws an error if a word is used more than once in a sentence. For example, if an input document contains the following sentence, the validator will report an error since good is used twice.

this good item is very good.

Properties

Property Default Value Description

dict

None

File name of skip list dictionary.

list

None

List of skip words delimited by commas.

Supported languages

DoubledWord works for any langages except for Chiense or other Asian languages.

Note
The default dictionaries are supplied only for Japanese and English.

DoubledJoshi

DoubledJoshi throws an error if a Joshi (Japanese part-of-speech) is used more than once in a Japanese sentence.

Properties

Property Default Value Description

dict

None

File name of skip list dictionary.

list

None

List of skip words delimited by commas.

Supported languages

DoubledJoshi works only for Japanese

SuccessiveWord

SuccessiveWord validator throws an error if the same word is used twice in succession. For example, if an input document contains the following sentence, the validator will report an error since is is used twice in succession.

the item is is very good.

Supported languages

SuccessiveWord works for any langages except for Chiense or other Asian languages.

DuplicatedSection

DuplicatedSection validator throws an error if there are section pairs which have almost the same content.

Supported languages

DuplicatedSection works for any language.

JapaneseStyle

JapaneseStyle validator reports errors if the input file contains both "dearu" and "desu-masu" style.

Supported languages

JapaneseStyle works only for Japanese

DoubleNegative

DoubleNegative validator reports errors when input sentence contains double negative expression.

Supported languages

DoubleNegative works only for English and Japanese texts.

FrequentSentenceStart

This validator reports an error if too many sentences start with the same sequence of words.

Properties

Property Default Value Description

leading_word_limit

3

Number of words starting each sentence to consider.

percentage_threshold

25

Maximum percentage of sentences that can start with the same words.

min_sentence_count

5

Minimum number of sentences required for the validator to report errors.

Supported languages

FrequentSentenceStart works for any language.

UnexpandedAcronym

This validator ensures that there are candidates for expanded versions of acronyms somewhere in the document.

That is, if there exists an acronym ABC in the document, then there must also exist a sequence of capitalized words such as Axxx Bxx Cxxx.

Properties

Property Default Value Description

min_acronym_length

3

Minimum size for the acronym

Supported languages

UnexpandedAcronym works only for English text.

WordFrequency

This validator ensures that usage of specific words in the document don’t occur too frequently. It calculates the frequency that words are used and compares them the a reference histogram of word frequency for written English.

Excessive deviation from normal usage generates a validation error.

Properties

Property Default Value Description

deviation_factor

3

Permitted factor of deviation from the norm. So if a word is normally used 3% of the time, your document can use it up to 9% of the time.

min_word_count

200

Minimum number of words in a document before this validator starts to validate

Supported languages

WordFrequency works only for English text.

Hyphenation

This validator ensures that sequences of words that are hyphenated in the dictionary are hyphenated in your document.

Supported languages

Hyphenation works only for English text.

NumberFormat

This validator ensures that numbers in a sentence are formatted using commas (ie: 12,000 instead of 120000), and don’t have excessive decimal points.

Properties

Property Default Value Description

decimal_delimiter_is_comma

false

Change the decimal delimiter from . to , (as in Europe)

ignore_years

true

Ignore 4 digit integers (2015, 1998)

Supported languages

NumberFormat works for texts written in European languages such as English or French.

ParenthesizedSentence

This validator generates errors if parenthesized sentences (such as this) are used too frequently, or are nested too heavily.

Properties

Property Default Value Description

max_nesting_level

2

The limit on how many parenthesized expressions are permitted

max_count

1

The number of parenthesized expressions allowed

max_length

4

The maximum number of words in a parenthesized expression

Supported languages

ParenthesizedSentence works for any langugages.

WeakExpression

This validator generates errors if sequences of words form what is generally considered to be a "weak expression."

Supported languages

WeakExpression works only for English.

EndOfSentenceSentence

This validator generates errors if the style end of sentence is American style.

Supported languages

EndOfSentence works for English.

HankakuKana

This validator generates errors if the Hankaku Kana characters are used in input document.

Supported languages

HanakakuKana works only for Japanese.

Okurigana

This validator generates errors if input sentence uses invalid Okurigana Style (Japanese).

Supported languages

Okurigana works for Japanese.

StartWithCapitalLetter

This validator generates errors if input sentence start with a capital character.

Supported languages

This validator works for English or other european langugages.

VoidSection

This validator generates errors if sections in input documents do not have any paragraphs or sentences.

Warning
VoidSection is deprecated and removed in the future release. Please use EmptySection.

Properties

Property Default Value Description

limit

5

Skip validation to the sections smaller than specified level.

Supported languages

VoidSection works for any languages.

EmptySection

This validator generates errors if sections in input documents do not have any paragraphs or sentences.

Properties

Property Default Value Description

limit

5

Skip validation to the sections smaller than specified level.

Supported languages

EmptySection works for any languages.

GappedSection

This validator generates errors when the level of child sections (chapters) has the gap. For example, The following is a misplaced section sample.

= chapter 1
...
=== section 1.1.1
=== section 1.1.2
...

In the above example, chapter 1 should have section 1.1 before subsection 1.1.1.

Supported languages

GappedSection works for any languages.

LongKanjiChain

This validator generates errors when input sentences has a words consist of too many Kanji characters.

In the above example, chapter 1 should have section 1.1 before subsection 1.1.1.

Properties

Property Default Value Description

max_len

5

The limit on how many characters are used in succession.

Supported languages

GappedSection works for Japanese text.

SectionLevel

This validator generates errors when input documents contains smaller sections than specified.

Properties

Property Default Value Description

max_num

5

The limit of the sub-section level.

Supported languages

SectionLevel works for any languages.

JapaneseAmbiguousNounConjunction

This validator generates errors when Japanese documents contains the ambiguous noun conjunction pattern. The ambigous pattern is that two nouns are conjuncted with Joshi, no (の).

The following is a sample of this pattern.

弊社の経営方針の説明を受けた。

Supported languages

JapaneseAmbigousNounConjunction works for Japanese.

JapaneseNumberExpression

JapaneseNumberExpression checks if the number expressions in the input text are in the consistent style.

Properties

Property Default Value Description

mode

numeric

Style of number expression. There is four types of styles ("numeric", "numeric-zenkaku", "kansuji", "hiragana").

Each style expects the following number expression.

Style Sample

numeric

1つ、2つ

numeric-zenkaku

1つ、2つ

kansuji

一つ、二つ

hiragana

ひとつ、ふたつ

Supported languages

JapaneseNumberExpression works only for Japanese text.

JapaneseJoyoKanji

This validator generates errors when Japanese documents contains a non-joyo kanji.

The following is a sample of this pattern.

踵を返して出て行った。

Supported languages

JapaneseJoyoKanji works for Japanese.

JapaneseExpressionVariation

JapaneseExpressionVariation checks spelling variation of Japanese words. The function is similar to KatakanaSpellCheck. This validator covers all types of Japanese words consist of not only Katakana but also Hiragana or Kanji.

Properties

Property Default Value Description

dict

None

File name of dictionary.

map

None

List of pairs of elements. e.g. {smart,スマート},{distributed,ディストリビューテッド}

Supported languages

JapaneseNumberExpression works only for Japanese text.

SuccessiveSentence

SuccessiveSentence throws an error when it find almost the same sentences are in succession. This validator is useful to check the human error as follows.

The component is useful for testing. Especially for unit level testing. Especially for unit level testing. Of course we can apply it for higher level testing.

In the above sample, the same sentences are used in succession. This is a human error.

Properties

Property Default Value Description

dist

3

Threshold of minimum distance in Edit Distance

min_len

5

Minimum sentence length to compute

Supported languages

SuccessiveSentence works for any languages.

ListLevel

ListLevel checks the level of list items. This validator generates errors when input sections contain list items nested too deeply.

Properties

Property Default Value Description

max_level

5

The maximum level of list items

The following example generates an error at the six list item if max_level is five.

* one
** two
*** three
**** four
***** five
****** six

Supported languages

ListLevel works for any languages.

Error Suppression

Sometimes writers do not want to fix errors from RedPen. The reason is the cost to remove error is high or writer intend to break the writing standard at particular points. For such cases, RedPen support error suppression by annotation. The annotations are added just before the section containing the errors. Currently error suppression is supported for four types of formats (AsciiDoc, Markdown, Re:VIEW, LaTeX).

AsciiDoc

For AsciiDoc text, writers add the suppress annotation in attribute block. The annotation is [suppress]. For example, the following AsciiDoc text suppresses the all the erorrs in the section.

[suppress]
= Instances
Some software tools work in more than one machine, and such distributed (cluster)systems can handle huge data or tasks, because such software tools make use of large amount of computer resources, such as CPU, Disk, and Memory.

When the writers want to suppress only the specified errors, they add Validator names after suppress. For example, the following example suppresses only two types of errors (Contraction WeakExpression).

[suppress='Contraction WeakExpression']
= Instances
Some software tools work in more than one machine, and such distributed (cluster)systems can handle huge data or tasks, because such software tools make use of large amount of computer resources, such as CPU, Disk and Memory.
Note
Multi line section is not supported for suppression error with annotation.

Markdown

For Markdown text, writers add the suppress annotation in HTML comments. Specifically Writers add annotation, @suppress in HTML comments. For example, the following text suppress all the error in the section.

<!-- @suppress -->
# Instances
Some software tools work in more than one machine, and such _distributed_ (cluster)systems can handle huge data or tasks, because such software tools make use of large amount of computer resources, such as CPU, Disk and Memory.

To suppress only specific errors, writers add the Validator names following @suppress annotation. The following sample suppresses the two type errors (Contraction WeakExpression).

<!-- @suppress Contraction WeakExpression -->
# Instances
Some software tools work in more than one machine, and such _distributed_ (cluster)systems can handle huge data or tasks, because such software tools make use of large amount of computer resources, such as CPU, Disk and Memory.
Note
Multi line comments style is not supported.

Re:VIEW

For Re:VIEW text, writers add the suppress annotation in comments. Specifically Writers add annotation, @suppress in Re:VIEW comment #@#. For example, the following text suppress all the error in the section.

#@# @suppress
= Instances
Some software tools work in more than one machine, and such distributed (cluster)systems can handle huge data or tasks, because such software tools make use of large amount of computer resources, such as CPU, Disk and Memory.

To suppress only specific errors, writers add the Validator names following @suppress annotation. The following sample suppresses the two type errors (Contraction WeakExpression).

#@# @suppress Contraction WeakExpression
= Instances
Some software tools work in more than one machine, and such distributed (cluster)systems can handle huge data or tasks, because such software tools make use of large amount of computer resources, such as CPU, Disk and Memory.

LaTeX

For LaTeX texts, writers can add Markdown style annotation in LaTeX comment (%).

reStructuredText

For reStructuredText texts, writers can add suppress annotation in inline comments as follows.

.. @suppress Contraction WeakExpression

Distributed system
##################

Some software tools work in more than one machine, and such distributed (cluster)systems can handle huge data or tasks, because such software tools make use of large amount of computer resources.

Extending RedPen

We can make RedPen extensions with Java and JavaScript. In the following sections, we learn the way to make the extensions.

Extending RedPen with Java

RedPen users can extend RedPen by creating new Validators. This section describes how we construct Validators and covers the basics of the internal document model used by Validators.

RedPen users can create new validators for themselves or their organization. Adding validator is simple - just write a class which extends the abstract class Validator.

Extending Validators

There are several methods which users can implment for their methods (validate, prevalidate and init).

validate methods

Minimally, we just need to implement the "validate" template method.

Users need to implement the one of validate methods provided by the Validator class. Currently there are three validate methods with three different parameters.

/**
 * validate the input document and returns the invalid points.
 * {@link cc.redpen.validator.Validator} provides empty implementation. Validator implementation validates documents can override this method.
 *
 * @param document input
 */
 public void validate(Document document)

 /**
  * validate the input document and returns the invalid points.
  * {@link cc.redpen.validator.Validator} provides empty implementation. Validator implementation validates sentences can override this method.
  *
  * @param sentence input
  */
  public void validate(Sentence sentence)

 /**
  * validate the input document and returns the invalid points.
  * {@link cc.redpen.validator.Validator} provides empty implementation. Validator implementation validates sections can override this method.
  *
  * @param section input
  */
  public void validate(Section section)
Note
The implemented class needs to be in one of the packages: 'cc.redpen.validator', 'cc.redpen.validator.sentence' or 'cc.redpen.validator.section.'

prevalidate method

The preValidate method called before validate method is run. This method is useful to create the prerequisite to run validate method. There are two prevalidte methods variations which have different parameters.

/**
 * Process input blocks before run validation. This method is used to store
 * the information needed to run Validator before the validation process.
 *
 * @param sentence input sentence
 */
 public void preValidate(Sentence sentence)

 /**
  * Process input blocks before run validation. This method is used to store
  * the information needed to run Validator before the validation process.
  *
  * @param section input section
  */
  public void preValidate(Section section)

Configuration properties

If validator needs Configuration properties, validator constructor can be used to define them. This way RedPen will understand that your validator supports these properties and various plugins can show them in a GUI.

/**
 * @param keyValues String key and Object value pairs for supported config properties.
 */
public Validator(Object...keyValues)

For example, SentencelengthValidator has a configuration property max_len, which to specifies the maximum length of sentences in input documents. Following configuration specifies the maximum length to 200.

<redpen-conf>
    <validators>
        ...
        <validator name="SentenceLength">
            <property name="max_len" value="200"/>
        </validator>
        ...
    </validators>
</redpen-conf>

SentenceLengthValidator loads the value of max_len automatically if it is defined in validator’s constructor. If max_len is missing from the config, the value is left as default.

public SentenceLengthValidator() {
  super("max_len", 30); // Default maximum length of sentences.
}

Later configuration value can be accessed via getInt("max_len") or similar methods.

In case you need to process config properties before they can be used in validate() method, you can override the init() method.

Adding Validators

There are two ways to add your Validator to RedPen.

One of the way is to add the Validator source file to the RedPen source tree and then build RedPen normally. This method is simple, but involves bundling the source code for the Validators with the source code for RedPen.

The second way is to create a Validator plugin. Creating a plugin has the advantage that you can then independently manage the source code for your Validator.

Note that in both cases, your Validator’s class name must have the suffix Validator.

In the next section, we will first explain the way to add a new Validator to Redpen’s source tree. Then we weil go thourgh the way to construct a Validator plugin.

Add a Validator to RedPen source

Let’s define a plain Validator (SentenceLengthValidator) which check if the input sentence is over 100 characters long. Then we will apply it to RedPen’s source tree.

SentenceLengthValidator

We create a PlainSentenceLengthValidator class and specify the package cc.redpen.validator.sentence. Therefore the class is stored in the redpen/redpen-core/src/main/java/cc/redpen/validator/sentence/ directory.

The following is an implementation of this class.

package cc.redpen.validator.sentence;

/**
 * Validate input sentences contain more characters more than specified.
 */
public class PlainSentenceLengthValidator extends Validator {

  /**
   * Default constructor initializes properties with their default values.
   */
  public PlainSentenceLengthValidator() {
    super("max_len", 30); // Default maximum length of sentences.
  }

  @Override
  public void validate(Sentence sentence) {
    if (sentence.getContent().length() > getInt("max_len")) {
      addValidationError(sentence, sentence.getContent().length(), maxLength);
    }
  }
}

The class has a validate method which takes a Sentence object as its parameter. When this class is registered in the configuration file, RedPen automatically applies the method to each sentence in the input document.

Register a new Validator to configuration file

To register your Validator in the RedPen configuration, add the Validator’s name removing the Validator suffix, to a configuration file.

For example, to activate our newly created Validator PlainSentenceLengthValidator, include the validator element as follows:

<redpen-conf>
    <validator>
        ...
        <validator name="PlainSentenceLength" />
        ...
    </validator>
</redpen-conf>

We would then run RedPen normally, using this configuration file.

Create a Validator plugin

When creating a Validator plugin, it is often easier to start by using another plugin’s project as a template.

As an example, I have written a simple Validator plugin hankaku_kana_validator.

The most significant file for the plugin is pom.xml which exists at the top of the project. This file is the Maven configuration file, which is a popular software project management tool for Java.

The following is the content of pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>redpen.cc</groupId>
    <artifactId>hankaku-kana-validator</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>hankaku-kana-validator</name>
    <url>http://maven.apache.org</url>
    <dependencies>
         <dependency>
             <groupId>redpen.cc</groupId>
             <artifactId>redpen-core</artifactId>
             <version>1.2</version>
             <scope>system</scope>
             <systemPath>${project.basedir}/lib/redpen-core-0.6.jar</systemPath>
         </dependency>
    </dependencies>
</project>

Usually you do not need to change the pom.xml file, except for the contents of the artifact-id and name elements. You should alter the name to fit the function of your Validator.

After changing pom.xml, you should delete the existing validator file (HankakuKanaValidator.java) from "main/java/cc/redpen/validator/sentence." Then, put your Validator’s source file in "main/java/cc/redpen/validator/sentence" or "main/java/cc/redpen/validator/section."

Once you have included your Validator implementation, you can build the plugin.

$ mvn install
Including a user-defined Validator plugin

When you have successfully built your Validator plugin, you can use the validator plugin in RedPen. To deply the plugin, copy the plugin’s jar file from the target directory to a directory in the classpath, such as the library directory ($REDPEN_HOME/lib). Once copied, you can add your Validator to the configuration file as described above. Remember to remove the Validator suffix from the name you enter in redpen-config.xml.

Extending RedPen in JavaScript

For those who are unfamiliar with Java, RedPen version 1.3 introduced JavaScriptValidator, which is a special validator that loads Validator implementations written in JavaScript.

Enabling JavaScriptValidator

To enable JavaScriptValidator, simply add <validator name="JavaScript"/> in your redpen-conf.xml as follows:

<redpen-conf lang="en">
  <validators>
       ...snip...
    <validator name="JavaScript" />
  </validators>
</redpen-conf>

Write your own validator in JavaScript

JavaScriptValidator will load all files with .js suffix from $REDPEN_HOME/js directory. This location can be overriden with script-path property of JavaScriptValidator. You can set multiple 'script-path' property.

<redpen-conf lang="en">
  <validators>
       ...snip...
     <validator name="JavaScript" >
       <property name="script-path" value="/path/to/your/validator/directory-a" />
     </validator>
     <validator name="JavaScript" >
       <property name="script-path" value="/path/to/your/validator/directory-b" />
     </validator>
  </validators>
</redpen-conf>

Functions with the following signature will be called upon validation time:

function preValidateSentence(sentence) {
}

function preValidateSection(section) {
}

function validateDocument(document) {
  // your validation logic for document here
}

function validateSentence(sentence) {
  // if(your validation logic for sentence here) {
  //   addError('validation error message', sentence);
  // }
}

function validateSection(section) {
  // your validation logic for section here
}

Example

Here is a JavaScript version of NumberOfCharacterValidator:

var MIN_LENGTH = 100;
var MAX_LENGTH = 1000;

function validateSentence(sentence) {
  if (sentence.getContent().length() < MIN_LENGTH) {
    addError("Sentence is shorter than "
      + MIN_LENGTH + " characters long.", sentence);
  }
  if (sentence.getContent().length() > MAX_LENGTH) {
    addError("Sentence is longer than " + MAX_LENGTH
      + " characters long.", sentence);
  }
}

The code looks pretty much similar to the Java version. The main difference is that the callback method validate(Sentence sentence) is referred as validateSentence(sentence) in the JavaScript version.

Run

Of course, it is JavaScript and there is no need to compile your validator. Actually your JavaScript code will be compiled into Java byte-code by Nashorn, and it runs fast than you expect. JavaScriptValidator will pick any *.js file located in $REDPEN_HOME/js directory. You can simply run the redpen command to get your file validated by the js validator.

$ ./bin/redpen -c myredpen-conf.xml 2be-validated.txt
2be-validated.txt:1: ValidationError[JavaScript], [NumberOfCharacter.js] Sentence is shorter than 100 characters long. at line: very short sentence.

Properties of JavaScript based extensions

We can spcify properties to JavaScript based validators. Adding properties, we can change the behaviors of the validators. Users specify the prooperties to the configuration block for the JavaScript validator. The following exmaple specify the value of propery, max_char_num to 5.

<validator name="JavaScript">
  <property name="max_char_num" value="5" />
</validator>

Specified properties in JavaScript configuration block are avaiable in JavaScript extensions. For example, the following extension get the value of the property, max_char_num with getInt method.

function validateSentence(sentence) {
  var content = sentence.getContent().split(" ");
  var limit= getInt("max_char_num");

  for(var i = 0; i<content.length;i++){
    if(content[i].length >= limit){
      addError("word [" + content[i] +"] is too long. length: " + content[i].length, sentence);
    }
  }
}

RedPen provides methods for getting values of the properties for various types The following table shows the list of provided get methods.

Method name Type

getInt

Int

getFloat

Float

getString

String

getBoolean

Boolean

getSet

Set

Model Structure

This section describes the internal document model structure generated by parser objects.

Generated RedPen documents consist of several blocks, which represent the elements of a document.

  • DocumentCollection represents a set of one or more files that contain a Document.

  • Document represents a single file which contains one or more Sections.

  • Section contains several blocks (Header, Paragraph, ListBlock). Except for Header, each Section can contain multiple blocks. A Section may also specify the section level and its subsections.

  • Header represents header sentences that contain a list of Sentence objects.

  • Paragraph contains one or more sentences.

  • ListBlock contains a set of ListElement objects.

The following image shows the document model used by RedPen.

image

RedPen Server

The RedPen server delivers the majority of RedPen’s functionality via a simple HTTP REST API.

Starting and stopping the RedPen server

Please refer to the Commands section for details on how-to start the RedPen server.

Heroku Button

RedPen server is also launch in the Heroku environment with few clicks. In the bottom of RedPen README, there is a Heroku Button.

Image

Clicking the buttom, a RedPen server get started in Heroku. Actually the RedPen sample server in the RedPen home page is working in Heroku environment.

RedPen Server API

Configuration

/rest/config/redpens

Return the configuration for available, preconfigured redpens.

GET Parameters
  • lang=xx restricts the returned configurations to those that match the specified language. By default, all configurations are returned.

The JSON response is as follows:

{
  "version": "1.1.2",
  "documentParsers": ["PLAIN", "MARKDOWN", "WIKI"],
  "redpens": {
     "en": {
      "lang": "en",
      "tokenizer": "cc.redpen.tokenizer.WhiteSpaceTokenizer",
      "validators": {
         "CommaNumber": { "languages": [], "properties": {} },
         "Contraction": { "languages": ["en"], "properties": {} },
         "DoubledWord": { "languages": [], "properties": {} },
         "EndOfSentence": { "languages": ["en"], "properties": {} },
         "InvalidExpression": { "languages": [], "properties": {} },
         "InvalidSymbol": { "languages": [], "properties": {} },
         "InvalidWord": { "languages": ["en"], "properties": {} },
         "ParagraphNumber": { "languages": [], "properties": {} },
         "Quotation": { "languages": ["en"], "properties": {} },
         "SectionLength": { "languages": [], "properties": {"max_char_num": "2000"} },
         "SentenceLength": { "languages": [], "properties": {"max_len": "200"} },
         "SpaceBetweenAlphabeticalWord": { "languages": [], "properties": {} },
         "Spelling": { "languages": [], "properties": {} },
         "StartWithCapitalLetter": { "languages": ["en"], "properties": {} },
         "SuccessiveWord": { "languages": [], "properties": {} },
         "SymbolWithSpace": { "languages": [], "properties": {} },
         "WordNumber": { "languages": [], "properties": {} }
      }
    },
    "ja": {
      "lang": "ja",
      "tokenizer": "cc.redpen.tokenizer.JapaneseTokenizer",
      "validators": {
         "CommaNumber": { "languages": [], "properties": {} },
         "DoubledWord": { "languages": [], "properties": {} },
         "HankakuKana": { "languages": ["ja"], "properties": {} },
         "InvalidSymbol": { "languages": [], "properties": {} },
         "KatakanaEndHyphen": { "languages": ["ja"], "properties": {} },
         "KatakanaSpellCheck": { "languages": ["ja"], "properties": {} },
         "ParagraphNumber": { "languages": [], "properties": {} },
         "SectionLength": { "languages": [], "properties": {"max_num": "1500"} },
         "SentenceLength": { "languages": [], "properties": {"max_len": "100"} },
         "SpaceBetweenAlphabeticalWord": { "languages": [], "properties": {} },
         "SuccessiveWord": { "languages": [], "properties": {} }
      }
    }
  }
}
  • The version property indicates the version of RedPen.

  • The documentParsers array contains all supported document parsers

  • The redpens object shows the available pre-configured redpens and how they are configured. Within each object:

    • lang specifies the language the redpen is designed for

    • tokenizer specifies the tokenizer class used by the redpen

    • validators shows which validators are configured within the redpen. This object is in a format suitable for the document/validate/json request below. For each validator:

      • The languages array indicates which languages for which the validator is suitable. An empty array indicates all languages.

      • The properties object specifies the currently configured properties for this validator, as described in validator

Document Validation

/rest/document/validate

This POST request validates a document and returns the errors.

POST Parameters
  • document contains the text of the document RedPen is to validate

  • documentParser specifies which parser should be used to parse the

    document. Valid options are
    • PLAIN

    • MARKDOWN

    • WIKI

  • lang specifies the language used to tokenize the document. Currently, values of ja (Japanese) and en (English/Whitespace) are supported.

  • The optional format field determines the format for the results. It can be one of json (the default), json2, plain, plain2 or xml.

  • The optional config field contains the contents of a RedPen XML configuration file

Examples using curl and document/validate
$ curl --data document="Twas brillig and the slithy toves did gyre and gimble in the wabe" \
     --data lang=en --data format=PLAIN2 \
     --data config="`cat ./redpen-server/target/classes/conf/redpen-conf.xml`" \
     localhost:8080/rest/document/validate/
Line: 1, Offset: 0
    Sentence: Twas brillig and the slithy toves did gyre and gimble in the wabe
        Spelling: Found possibly misspelled word "brillig".
        Spelling: Found possibly misspelled word "slithy".
        Spelling: Found possibly misspelled word "toves".
        Spelling: Found possibly misspelled word "gyre".
        Spelling: Found possibly misspelled word "gimble".
        Spelling: Found possibly misspelled word "wabe".
        DoubledWord: Found repeated word "and".
$ curl -s --data document="古池や,蛙飛び込む水の音" \
          --data config="`cat ./redpen-server/target/classes/conf/redpen-conf-ja.xml`" \
          localhost:8080/rest/document/validate/ | json_reformat
{
    "errors": [
        {
            "sentence": "古池や,蛙飛び込む水の音",
            "endPosition": {
                "offset": 4,
                "lineNum": 1
            },
            "validator": "InvalidSymbol",
            "lineNum": 1,
            "sentenceStartColumnNum": 0,
            "message": "Found invalid symbol \",\".",
            "startPosition": {
                "offset": 3,
                "lineNum": 1
            }
        }
    ]
}

/rest/document/validate/json

This POST request processes a redpen validation request, specified in JSON, and returns redpen errors in a supported RedPen format.

Request format
{
  "document": "Theyre is a blak rownd borl.",
  "format": "json2",
  "documentParser": "PLAIN",
  "config": {
    "lang": "en",
    "validators": {
      "CommaNumber": {},
      "Contraction": {},
      "DoubledWord": {},
      "EndOfSentence": {},
      "InvalidExpression": {},
      "InvalidSymbol": {},
      "InvalidWord": {},
      "ParagraphNumber": {},
      "Quotation": {},
      "SectionLength": {
        "properties": {
          "max_char_num": "2000"
        }
      },
      "SentenceLength": {
        "properties": {
          "max_len": "200"
        }
      },
      "SpaceBetweenAlphabeticalWord": {},
      "Spelling": {},
      "StartWithCapitalLetter": {},
      "SuccessiveWord": {},
      "SymbolWithSpace": {},
      "WordNumber": {}
    },
    "symbols": {
      "AMPERSAND": {
        "after_space": false,
        "before_space": true,
        "invalid_chars": "",
        "value": "&"
      },
      "ASTERISK": {
        "after_space": true,
        "before_space": true,
        "invalid_chars": "",
        "value": "*"
      }
    }
  }
}
  • The document property specifies the text of the document to validate

  • The documentParser property should contain the name of a valid RedPen documentparser (ie: PLAIN, MARKDOWN or WIKI)

  • The format property determines the format for the results. It can be one of json, json2, plain, plain2 or xml.

  • The config object specifies the validator configuration for the request. This consists of:

    • A config object, consisting of a series of objects that are named after a RedPen validator. If the object is present, the validator will be configured. Within this named object, a properties object can be used to set the name and values of any property used by the validator.

    • The lang property indicates the language of the document. It determines how the document will be tokenized by RedPen.

    • A symbols object containing overridden symbols, as described in configuration. Each entry must be a validate symbol name, and can contain the following elements:

      • value specifies the Symbol’s value

      • invalid_chars is a string of invalid alternatives for this Symbol

      • before_space and after_space specify if a space is required before or after the Symbol.

Response (json2 format):

{
  "errors": [
    {
      "sentence": "Theyre is a blak rownd borl.",
      "position": {
        "start": {
          "offset": 0,
          "line": 1
        },
        "end": {
          "offset": 27,
          "line": 1
        }
      },
      "errors": [
        {
          "subsentence": {
            "offset": 0,
            "length": 6
          },
          "validator": "Spelling",
          "position": {
            "start": {
              "offset": 0,
              "line": 1
            },
            "end": {
              "offset": 6,
              "line": 1
            }
          },
          "message": "Found possibly misspelled word \"Theyre\"."
        },
        {
          "subsentence": {
            "offset": 12,
            "length": 4
          },
          "validator": "Spelling",
          "position": {
            "start": {
              "offset": 12,
              "line": 1
            },
            "end": {
              "offset": 16,
              "line": 1
            }
          },
          "message": "Found possibly misspelled word \"blak\"."
        },
        {
          "subsentence": {
            "offset": 17,
            "length": 5
          },
          "validator": "Spelling",
          "position": {
            "start": {
              "offset": 17,
              "line": 1
            },
            "end": {
              "offset": 22,
              "line": 1
            }
          },
          "message": "Found possibly misspelled word \"rownd\"."
        },
        {
          "subsentence": {
            "offset": 23,
            "length": 4
          },
          "validator": "Spelling",
          "position": {
            "start": {
              "offset": 23,
              "line": 1
            },
            "end": {
              "offset": 27,
              "line": 1
            }
          },
          "message": "Found possibly misspelled word \"borl\"."
        }
      ]
    }
  ]
}
Some examples using curl and document/validate/json
$ curl -s --data "document=fish and chips" http://localhost:8080/rest/document/validate | json_reformat
{
    "errors": [
        {
            "sentence": "fish and chips",
            "validator": "StartWithCapitalLetter",
            "lineNum": 1,
            "sentenceStartColumnNum": 0,
            "message": "Sentence starts with a lowercase character \"f\"."
        }
    ]
}
$ curl -s --data "document=ここはどこでうか?&lang=ja&" http://localhost:8080/rest/document/validate | json_reformat
{
    "errors": [
        {
            "sentence": "ここはどこでうか?",
            "endPosition": {
                "offset": 9,
                "lineNum": 1
            },
            "validator": "InvalidSymbol",
            "lineNum": 1,
            "sentenceStartColumnNum": 0,
            "message": "Found invalid symbol \"?\".",
            "startPosition": {
                "offset": 8,
                "lineNum": 1
            }
        }
    ]
}
$ curl -s --data "document=# Markdown Test%0A%0ASpellink Errah&lang=en&documentParser=MARKDOWN" http://localhost:8080/rest/document/validate | json_reformat
{
    "errors": [
        {
            "sentence": "Spellink Errah",
            "endPosition": {
                "offset": 8,
                "lineNum": 3
            },
            "validator": "Spelling",
            "lineNum": 3,
            "sentenceStartColumnNum": 0,
            "message": "Found possibly misspelled word \"Spellink\".",
            "startPosition": {
                "offset": 0,
                "lineNum": 3
            }
        },
        {
            "sentence": "Spellink Errah",
            "endPosition": {
                "offset": 14,
                "lineNum": 3
            },
            "validator": "Spelling",
            "lineNum": 3,
            "sentenceStartColumnNum": 0,
            "message": "Found possibly misspelled word \"Errah\".",
            "startPosition": {
                "offset": 9,
                "lineNum": 3
            }
        }
    ]
}
curl -s -H "Content-Type: application/json" \
     --data '{document:"fisch and chipps",format:"plain",config:{validators:{Spelling:{},SentenceLength:{properties:{max_len:6}}}}}' \
     http://localhost:8080/rest/document/validate/json
1: ValidationError[Spelling], Found possibly misspelled word "fisch". at line: fisch and chipps
1: ValidationError[Spelling], Found possibly misspelled word "chipps". at line: fisch and chipps
1: ValidationError[SentenceLength], The length of the sentence (16) exceeds the maximum of 6. at line: fisch and chipps

For Developers

Feature requests are warm welcome through the GitHub. If you are interested in joining the development team. The following instructions help you to understand building RedPen development environment. Contributions in the form of pull requests are easy to integrate to the master branch.

Preliminary

To build RedPen, you need to install the following tools.

Development Policy

We will develop this tiny tool slowly but steadily. Therefore we would not be able to accept the proposal containing drastic changes, even when the proposed change is useful for the project.

Build RedPen

Building all the RedPen package is simple. You just run the following command.

$./mvnw clean install

The built RedPen package is flushed in redpen/redpen-distribution/target. After you get the built RedPen package, you unzip it with the following command.

$ cd redpen/redpen-distribution/target
$ tar xvf redpen-distribution-*.*.*-SNAPSHOT-assembled.tar.gz
x redpen-distribution-*.*.*-SNAPSHOT/bin/
x redpen-distribution-*.*.*-SNAPSHOT/conf/
x redpen-distribution-*.*.*-SNAPSHOT/js/
x redpen-distribution-*.*.*-SNAPSHOT/js/test/
x redpen-distribution-*.*.*-SNAPSHOT/lib/
...

Now the RedPen package is ready to use.

$cd redpen-distribution-*.*.*-SNAPSHOT
$bin/redpen -h
usage: redpen-cli [Options] [<INPUT FILE>]
  -c,--conf <CONF FILE>                Configuration file (REQUIRED)
  -f,--format <FORMAT>                 Input file format
                                       (markdown,plain,wiki,asciidoc,latex)
  -h,--help                            Displays this help information and exits
  -l,--limit <LIMIT NUMBER>            error limit number
  -r,--result-format <RESULT FORMAT>   Output result format
                                       (json,json2,plain,plain2,xml)
  -v,--version                         Displays version information and exits

Testing

When you modify RedPen source files, please make sure to add the test cases to make the reviewer understand the expected behaviors. In addition run whole test cases before making the Pull Request. RedPen test cases are run with the following command.

 ./mvnw test

Text Editor / IDE

Some software engineers develop RedPen packages for various editors such as Emacs or Atom. This section introduces the RedPen packages. With the following packages, we can enjoy the linting texts in the editors or IDEs.

Atom

Atom is a modern text editor which works in various platforms such as OS X, Linux, and Windows. linter-redpen is an sophisticated Atom package created by griffin-stewie. This provides real-time checking of texts.

Emacs

Emacs is a real-time display text editor. Many software engineers love this editor for the customisablity. redpen-paragraph is a Emacs package for RedPen created by karronoli.

Vim

Famous linter plugin ale provides Redpen integration for automatic proofreading.

IntelliJ IDEA

Image

IntelliJ IDEA is a popular Java IDE (Integrated Development Environment) provided JetBrains. But this IDE also provides many plugins to support editing markup texts such as AsiiDoc or Markdown. We have been developing IntelliJ IDEA plugin.

RedPen FAQ

This section shows the frequent asked questions and the answers.

  • Does RedPen check LaTeX documents for common style errors?

    RedPen supports LaTex format from version 1.4. The following is a sample usage of the redpen command.

    $ redpen -c redpen-conf.xml -f latex content.tex