class Cadmium::PragmaticTokenizer::Languages::Common
- Cadmium::PragmaticTokenizer::Languages::Common
- Reference
- Object
Direct Known Subclasses
- Cadmium::PragmaticTokenizer::Languages::Bulgarian
- Cadmium::PragmaticTokenizer::Languages::Czech
- Cadmium::PragmaticTokenizer::Languages::Deutsch
- Cadmium::PragmaticTokenizer::Languages::English
- Cadmium::PragmaticTokenizer::Languages::Portuguese
- Cadmium::PragmaticTokenizer::Languages::Spanish
Defined in:
cadmium/tokenizer/pragmatic/languages/common.cr
Constant Summary
-
ABBREVIATIONS =
Set(String).new
-
ALNUM_QUOTE =
/(\w|\D)'(?!')(?=\W|$)/
Single quotes handling
-
CONTRACTIONS =
{} of String => String
-
PUNCTUATION_MAP =
{"。" => "♳", "．" => "♴", "." => "♵", "！" => "♶", "!" => "♷", "？" => "♸", "?" => "♹", "、" => "♺", "¡" => "⚀", "¿" => "⚁", "„" => "⚂", "“" => "⚃", "[" => "⚄", "]" => "⚅", "\"" => "☇", "#" => "☈", "$" => "☉", "%" => "☊", "&" => "☋", "(" => "☌", ")" => "☍", "*" => "☠", "+" => "☢", "," => "☣", ":" => "☤", ";" => "☥", "<" => "☦", "=" => "☧", ">" => "☀", "@" => "☁", "^" => "☂", "_" => "☃", "`" => "☄", "'" => "☮", "{" => "♔", "|" => "♕", "}" => "♖", "~" => "♗", "-" => "♘", "«" => "♙", "»" => "♚", "”" => "⚘", "‘" => "⚭"}
-
QUOTE_NOT_TWAS1 =
/(\W|^)'(?!twas)/i
-
QUOTE_NOT_TWAS2 =
/(\W|^)‘(?!twas)/i
-
QUOTE_WORD =
/(\W|^)'(?=\w)/
-
STOP_WORDS =
Set(String).new
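The quote-related patterns above are easier to tell apart when run against a few inputs. The following is a minimal sketch, not code from the library: it copies three of the patterns into local constants so it runs without the Cadmium shard, and the sample strings are made up for illustration.

```crystal
# Local copies of three patterns from the constant summary (assumption:
# copied verbatim so the sketch runs without requiring Cadmium).
QUOTE_WORD      = /(\W|^)'(?=\w)/
QUOTE_NOT_TWAS1 = /(\W|^)'(?!twas)/i
ALNUM_QUOTE     = /(\w|\D)'(?!')(?=\W|$)/

# QUOTE_WORD: a quote at the start of the text (or after a non-word
# character) that is immediately followed by a word character.
puts QUOTE_WORD.matches?("'hello") # true
puts QUOTE_WORD.matches?("don't")  # false: the quote follows a word character
puts QUOTE_WORD.matches?("'twas")  # true: no exception for "twas" here

# QUOTE_NOT_TWAS1: the same leading position, but skipped when the quote
# is followed by "twas" in any letter case.
puts QUOTE_NOT_TWAS1.matches?("'hello") # true
puts QUOTE_NOT_TWAS1.matches?("'twas")  # false
puts QUOTE_NOT_TWAS1.matches?("'Twas")  # false: the pattern is case-insensitive

# ALNUM_QUOTE: a quote that is not followed by another quote and sits
# just before a non-word character or the end of the text.
puts ALNUM_QUOTE.matches?("dogs' toys") # true
puts ALNUM_QUOTE.matches?("hello'")     # true
puts ALNUM_QUOTE.matches?("don't")      # false: a letter follows the quote
```

The apostrophe inside "don't" matches none of the three patterns, which suggests that contractions are meant to be left intact by the quote handling.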
Class Method Summary
- .abbreviations
- .contractions
-
.handle_single_quotes(text)
This 'special treatment' is actually relevant for many other tests.
- .punctuation_map
- .stop_words
Class Method Detail
def self.handle_single_quotes(text)
This 'special treatment' is actually relevant for many other tests. Alter core regular expressions!
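The page does not show the method body, so the following is only a sketch of what single-quote handling along these lines could look like. It assumes the method replaces leading and trailing single quotes with their PUNCTUATION_MAP placeholders, padded with spaces, while leaving "'twas" and in-word apostrophes alone. The name `sketch_handle_single_quotes` and the two `*_MARK` constants are hypothetical; the patterns and placeholder symbols are copied from the constants on this page.

```crystal
# Hypothetical stand-ins for PUNCTUATION_MAP["'"] and PUNCTUATION_MAP["‘"].
APOSTROPHE_MARK = "☮"
LEFT_QUOTE_MARK = "⚭"

# Local copies of the documented patterns, so the sketch is self-contained.
QUOTE_NOT_TWAS1 = /(\W|^)'(?!twas)/i
QUOTE_NOT_TWAS2 = /(\W|^)‘(?!twas)/i
ALNUM_QUOTE     = /(\w|\D)'(?!')(?=\W|$)/

# Assumed behaviour, not the library's confirmed implementation.
def sketch_handle_single_quotes(text : String) : String
  # Leading quotes (ASCII and curly), except in front of "twas".
  text = text.gsub(QUOTE_NOT_TWAS1, "\\1 #{APOSTROPHE_MARK} ")
  text = text.gsub(QUOTE_NOT_TWAS2, "\\1 #{LEFT_QUOTE_MARK} ")
  # Trailing quotes that close a word or end the text.
  text.gsub(ALNUM_QUOTE, "\\1 #{APOSTROPHE_MARK} ")
end

puts sketch_handle_single_quotes("say 'hello'")     # both quotes become placeholders
puts sketch_handle_single_quotes("'Twas the night") # left unchanged
puts sketch_handle_single_quotes("don't")           # left unchanged
```

Swapping quotes for placeholder symbols surrounded by spaces lets a later whitespace split treat each quote as its own token, which is presumably why the placeholders in PUNCTUATION_MAP are symbols unlikely to occur in ordinary text.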