class Cadmium::PragmaticTokenizer
- Cadmium::PragmaticTokenizer
- Cadmium::Tokenizer
- Reference
- Object
Overview
This tokenizer is based on the pragmatic_tokenizer Ruby gem. It is much more robust than the other tokenizers, but it has more features than you'll need for most use cases.
Constructor Options
- filter_languages : Array - user-supplied array of languages whose stop words, abbreviations and contractions should be used when calculating the resulting tokens
- language : Symbol | String - two-character ISO 639-1 code; can be a String or Symbol (default: :en)
- expand_contractions : Bool - (default: false)
- remove_stop_words : Bool - (default: false)
- abbreviations : Set(String) - user-supplied set of abbreviations (each element should be downcased with the final period removed)
- stop_words : Set(String) - user-supplied set of stop words
- contractions : Hash(String, String) - user-supplied hash of contractions (the key is the contracted form and the value is the expanded form; both should be downcased)
- punctuation : PunctuationOptions - see description below
  - all: Does not remove any punctuation from the result
  - semi: Removes common punctuation (such as full stops) and keeps less common punctuation (such as question marks). This is useful for text alignment, as less common punctuation can help identify a sentence (like a fingerprint) while common punctuation (like stop words) should be removed.
  - none: Removes all punctuation from the result
  - only: Removes everything except punctuation. The returned result is an array of only the punctuation.
- numbers : NumbersOptions - see description below
  - all: Does not remove any numbers from the result
  - semi: Removes tokens that include only digits
  - none: Removes all tokens that include a number from the result (including Roman numerals)
  - only: Removes everything except tokens that include a number
- minimum_length : Int32 - minimum length of the token in characters
- long_word_split : Int32 - the length at which long words are split at any hyphen or underscore. 0 = no split (default).
- mentions : MentionsOptions - what to do with mentions (such as '@watzon')
  - remove: completely removes the mention
  - keep_and_clean: removes the prefix (@)
  - keep_original: doesn't alter the token at all (default)
- hashtags : MentionsOptions - what to do with hashtags (such as '#crystal')
  - remove: completely removes the hashtag
  - keep_and_clean: removes the prefix (#)
  - keep_original: doesn't alter the token at all (default)
- downcase : Bool - downcase all tokens (default: true)
- clean : Bool - removes some symbols (default: false)
- classic_filter : Bool - removes dots from acronyms and 's from the end of tokens; see https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ClassicFilter (default: false)
- remove_emoji : Bool - strip emojis (default: false)
- remove_emails : Bool - strip emails (default: false)
- remove_urls : Bool - strip URLs (default: false)
- remove_domains : Bool - strip domains (default: false)
Examples
tokenizer = Cadmium::PragmaticTokenizer.new
tokenizer.tokenize("Hello world.")
# => ["hello", "world", "."]
tokenizer.tokenize("Jan. 2015 was 20% colder than now. But not in inter- and outer-space.")
# => ["jan.", "2015", "was", "20%", "colder", "than", "now", ".", "but", "not", "in", "inter", "-", "and", "outer-space", "."]
tokenizer.contractions = {"supa'soo" => "super smooth"}
tokenizer.expand_contractions = true
tokenizer.tokenize("Hello supa'soo guy.")
# => ["hello", "super", "smooth", "guy", "."]
tokenizer.clean = true
tokenizer.tokenize("This sentence has a long string of dots .......................")
# => ["this", "sentence", "has", "a", "long", "string", "of", "dots"]
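To make the contraction example above more concrete, here is a minimal standalone sketch of how a user-supplied contractions hash could be applied to a list of tokens. It is not Cadmium's implementation; `expand_contractions` here is a hypothetical free function written for illustration, and it assumes that tokens are matched case-insensitively and that the expanded form is split on spaces.

```crystal
# Hypothetical helper, not part of Cadmium: replaces each token that appears
# as a key in `contractions` with the words of its expanded form.
def expand_contractions(tokens : Array(String), contractions : Hash(String, String)) : Array(String)
  tokens.flat_map do |token|
    # Keys are expected to be downcased, so look up the downcased token.
    expanded = contractions[token.downcase]?
    expanded ? expanded.split(' ') : [token]
  end
end

contractions = {"supa'soo" => "super smooth"}
p expand_contractions(["hello", "supa'soo", "guy", "."], contractions)
# => ["hello", "super", "smooth", "guy", "."]
```

Because one contraction can expand into several words, the sketch uses `flat_map` so the result stays a flat array of tokens.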
Defined in:
cadmium/tokenizer/pragmatic/languages.cr
cadmium/tokenizer/pragmatic/languages/bulgarian.cr
cadmium/tokenizer/pragmatic/languages/common.cr
cadmium/tokenizer/pragmatic/languages/czech.cr
cadmium/tokenizer/pragmatic/languages/deutsch.cr
cadmium/tokenizer/pragmatic/languages/english.cr
cadmium/tokenizer/pragmatic/languages/portuguese.cr
cadmium/tokenizer/pragmatic/languages/spanish.cr
cadmium/tokenizer/pragmatic/post_processor.cr
cadmium/tokenizer/pragmatic/pre_processor.cr
cadmium/tokenizer/pragmatic/regex.cr
cadmium/tokenizer/pragmatic_tokenizer.cr
Constant Summary
-
DOT = "."
-
MAX_TOKEN_LENGTH = 50
-
NOTHING = ""
-
SINGLE_QUOTE = "'"
-
SPACE = " "
Constructors
Instance Method Summary
-
#abbreviations : Set(String)
Set of recognized abbreviations
-
#abbreviations=(abbreviations : Set(String))
Set of recognized abbreviations
-
#classic_filter : Bool
Run the classic filter?
-
#classic_filter=(classic_filter : Bool)
Run the classic filter?
-
#clean : Bool
Run the cleaner after we've tokenized?
-
#clean=(clean : Bool)
Run the cleaner after we've tokenized?
-
#contractions : Hash(String, String)
Contractions to be replaced
-
#contractions=(contractions : Hash(String, String))
Contractions to be replaced
-
#downcase : Bool
Downcase all tokens?
-
#downcase=(downcase : Bool)
Downcase all tokens?
-
#expand_contractions : Bool
Do we want to expand contractions ("he's" => "he is")?
-
#expand_contractions=(expand_contractions : Bool)
Do we want to expand contractions ("he's" => "he is")?
-
#filter_languages : Array(String | Symbol)
Other languages to include in the filtering of abbreviations, contractions, and stop words
-
#filter_languages=(filter_languages : Array(String | Symbol))
Other languages to include in the filtering of abbreviations, contractions, and stop words
-
#hashtags : MentionsOptions
What to do with hashtags (#awesome)
-
#hashtags=(hashtags : MentionsOptions)
What to do with hashtags (#awesome)
-
#long_word_split : Int32
The specified length to split long words at any hyphen or underscore
-
#long_word_split=(long_word_split : Int32)
The specified length to split long words at any hyphen or underscore
-
#mentions : MentionsOptions
What to do with mentions (@watzon)
-
#mentions=(mentions : MentionsOptions)
What to do with mentions (@watzon)
-
#minimum_length : Int32
Minimum length for tokens
-
#minimum_length=(minimum_length : Int32)
Minimum length for tokens
-
#numbers : NumbersOptions
What to do with numbers
-
#numbers=(numbers : NumbersOptions)
What to do with numbers
-
#punctuation : PunctuationOptions
What to do with punctuation
-
#punctuation=(punctuation : PunctuationOptions)
What to do with punctuation
-
#remove_domains : Bool
Should we remove domains?
-
#remove_domains=(remove_domains : Bool)
Should we remove domains?
-
#remove_emails : Bool
Should we remove emails?
-
#remove_emails=(remove_emails : Bool)
Should we remove emails?
-
#remove_emoji : Bool
Should we remove emojis?
-
#remove_emoji=(remove_emoji : Bool)
Should we remove emojis?
-
#remove_stop_words : Bool
Should we remove stop words?
-
#remove_stop_words=(remove_stop_words : Bool)
Should we remove stop words?
-
#remove_urls : Bool
Should we remove URLs?
-
#remove_urls=(remove_urls : Bool)
Should we remove URLs?
-
#stop_words : Set(String)
A set of stop words
-
#stop_words=(stop_words : Set(String))
A set of stop words
-
#tokenize(string : String) : Array(String)
-
#tokens : Array(String)
Array of output tokens
Instance methods inherited from class Cadmium::Tokenizer
tokenize(string : String) : Array(String), trim(arr)
Constructor Detail
Creates a new Pragmatic tokenizer.
Instance Method Detail
Do we want to expand contractions ("he's" => "he is")?
Other languages to include in the filtering of abbreviations, contractions, and stop words
Other languages to include in the filtering of abbreviations, contractions, and stop words
The specified length to split long words at any hyphen or underscore
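The long word splitting described above can be pictured with the following standalone sketch. It is not taken from Cadmium; `split_long_words` is a hypothetical helper, and it assumes that only tokens strictly longer than the configured length are split and that 0 disables splitting, as the option description states.

```crystal
# Hypothetical helper, not part of Cadmium: splits tokens longer than
# `long_word_split` characters at every hyphen or underscore.
def split_long_words(tokens : Array(String), long_word_split : Int32) : Array(String)
  # 0 (the documented default) means no splitting at all.
  return tokens if long_word_split <= 0

  tokens.flat_map do |token|
    if token.size > long_word_split
      token.split(/[-_]/, remove_empty: true)
    else
      [token]
    end
  end
end

p split_long_words(["outer-space", "a-b"], 5)
# => ["outer", "space", "a-b"]
```

Short tokens such as "a-b" are left alone because they do not exceed the threshold, which keeps ordinary hyphenated words intact while breaking up very long compounds.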