class Cadmium::PragmaticTokenizer

Overview

This tokenizer is based on the pragmatic_tokenizer Ruby gem. It is more robust than the other tokenizers, but it also has more features than most use cases need.

Constructor Options

filter_languages : Array - user-supplied array of additional languages whose stop words, abbreviations and contractions should be used when calculating the resulting tokens
language : Symbol | String - two-character ISO 639-1 code, given as a String or a Symbol (default: :en)
expand_contractions : Bool - expand contractions, e.g. "he's" => "he is" (default: false)
remove_stop_words : Bool - remove stop words (default: false)
abbreviations : Set(String) - user-supplied set of abbreviations (each element should be downcased, with the final period removed)
stop_words : Set(String) - user-supplied set of stop words
contractions : Hash(String, String) - user-supplied hash of contractions (the key is the contracted form and the value is the expanded form; both should be downcased)
punctuation : PunctuationOptions - see description below
- all: Does not remove any punctuation from the result
- semi: Removes common punctuation (such as full stops) but keeps less common punctuation (such as question marks). This is useful for text alignment: less common punctuation can help identify a sentence (like a fingerprint), while common punctuation (like stop words) should be removed.
- none: Removes all punctuation from the result
- only: Removes everything except punctuation. The returned result is an array of only the punctuation.
numbers : NumbersOptions - see description below
- all: Does not remove any numbers from the result
- semi: Removes tokens that include only digits
- none: Removes all tokens that include a number from the result (including Roman numerals)
- only: Removes everything except tokens that include a number
minimum_length : Int32 - minimum length of a token in characters (default: 0)
long_word_split : Int32 - token length at which long words are split at any hyphen or underscore; 0 = no split (default)
mentions : MentionsOptions - what to do with mentions (such as '@watzon')
- remove: removes the mention completely
- keep_and_clean: keeps the mention but removes the '@' prefix
- keep_original: does not alter the token at all (default)
hashtags : MentionsOptions - what to do with hashtags (such as '#crystal')
- remove: removes the hashtag completely
- keep_and_clean: keeps the hashtag but removes the '#' prefix
- keep_original: does not alter the token at all (default)
downcase : Bool - downcase all tokens (default: true)
clean : Bool - removes some symbols (default: false)
classic_filter : Bool - removes dots from acronyms and 's from the end of tokens; see the Solr [Classic Filter](https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ClassicFilter) description (default: false)
remove_emoji : Bool - strip emojis (default: false)
remove_emails : Bool - strip emails (default: false)
remove_urls : Bool - strip urls (default: false)
remove_domains : Bool - strip domains (default: false)
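
The four `numbers` modes described above can be pictured as simple filters over a list that has already been tokenized. The sketch below is a standalone illustration, not Cadmium's implementation: `NumberMode` and `filter_numbers` are hypothetical names, and Roman numerals are ignored for brevity.

```crystal
# Hypothetical stand-in for the documented `numbers` modes; not Cadmium's code.
enum NumberMode
  All
  Semi
  None
  Only
end

# Applies one mode to an already-tokenized list.
def filter_numbers(tokens : Array(String), mode : NumberMode) : Array(String)
  case mode
  in .all?
    # keep every token
    tokens
  in .semi?
    # drop tokens made up of digits only
    tokens.reject { |t| t.matches?(/\A\d+\z/) }
  in .none?
    # drop every token that contains a digit (Roman numerals not handled here)
    tokens.reject { |t| t.matches?(/\d/) }
  in .only?
    # keep only tokens that contain a digit
    tokens.select { |t| t.matches?(/\d/) }
  end
end

tokens = ["jan.", "2015", "was", "20%", "colder"]
p filter_numbers(tokens, NumberMode::Semi) # => ["jan.", "was", "20%", "colder"]
p filter_numbers(tokens, NumberMode::None) # => ["jan.", "was", "colder"]
p filter_numbers(tokens, NumberMode::Only) # => ["2015", "20%"]
```

Note that `semi` keeps "20%" because the token is not digits-only, while `none` drops it.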

Examples

tokenizer = Cadmium::PragmaticTokenizer.new
tokenizer.tokenize("Hello world.")
# => ["hello", "world", "."]

tokenizer.tokenize("Jan. 2015 was 20% colder than now. But not in inter- and outer-space.")
# => ["jan.", "2015", "was", "20%", "colder", "than", "now", ".", "but", "not", "in", "inter", "-", "and", "outer-space", "."]

tokenizer.contractions = {"supa'soo" => "super smooth"}
tokenizer.expand_contractions = true
tokenizer.tokenize("Hello supa'soo guy.")
# => ["hello", "super", "smooth", "guy", "."]

tokenizer.clean = true
tokenizer.tokenize("This sentence has a long string of dots .......................")
# => ["this", "sentence", "has", "a", "long", "string", "of", "dots"]
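
To make the contraction example above more concrete, here is a minimal standalone sketch of how a contractions table could be applied to a token list. The helper `expand` is hypothetical and only illustrates the idea; the tokenizer's real expansion step may work differently.

```crystal
# Hypothetical helper: replaces each token found in the contractions table
# with the words of its expanded form. Keys are expected to be downcased,
# as the option description asks.
def expand(tokens : Array(String), contractions : Hash(String, String)) : Array(String)
  tokens.flat_map do |token|
    if expanded = contractions[token.downcase]?
      expanded.split(' ') # one contraction may expand to several tokens
    else
      [token]
    end
  end
end

table = {"supa'soo" => "super smooth"}
p expand(["hello", "supa'soo", "guy", "."], table)
# => ["hello", "super", "smooth", "guy", "."]
```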


Constant Summary

DOT = "."
MAX_TOKEN_LENGTH = 50
NOTHING = ""
SINGLE_QUOTE = "'"
SPACE = " "

Constructors

.new(*, language = :en, abbreviations = Set(String).new, stop_words = Set(String).new, contractions = {} of String => String, filter_languages = [] of String | Symbol, hashtags : MentionsOptions = :keep_original, mentions : MentionsOptions = :keep_original, punctuation : PunctuationOptions = :all, numbers : NumbersOptions = :all, expand_contractions : Bool = false, remove_stop_words : Bool = false, remove_emoji : Bool = false, remove_emails : Bool = false, remove_urls : Bool = false, remove_domains : Bool = false, clean : Bool = false, classic_filter : Bool = false, downcase : Bool = true, minimum_length : Int32 = 0, long_word_split : Int32 = 0)
Creates a new Pragmatic tokenizer.

Instance Method Summary

#abbreviations : Set(String)
Set of recognized abbreviations
#abbreviations=(abbreviations : Set(String))
Set of recognized abbreviations
#classic_filter : Bool
Run the classic filter?
#classic_filter=(classic_filter : Bool)
Run the classic filter?
#clean : Bool
Run the cleaner after we've tokenized?
#clean=(clean : Bool)
Run the cleaner after we've tokenized?
#contractions : Hash(String, String)
Contractions to be replaced
#contractions=(contractions : Hash(String, String))
Contractions to be replaced
#downcase : Bool
Downcase all tokens?
#downcase=(downcase : Bool)
Downcase all tokens?
#expand_contractions : Bool
Do we want to expand contractions ("he's" => "he is")?
#expand_contractions=(expand_contractions : Bool)
Do we want to expand contractions ("he's" => "he is")?
#filter_languages : Array(String | Symbol)
Other languages to include in the filtering of abbreviations, contractions, and stop words
#filter_languages=(filter_languages : Array(String | Symbol))
Other languages to include in the filtering of abbreviations, contractions, and stop words
#hashtags : MentionsOptions
What to do with hashtags (#awesome)
#hashtags=(hashtags : MentionsOptions)
What to do with hashtags (#awesome)
#long_word_split : Int32
The specified length to split long words at any hyphen or underscore
#long_word_split=(long_word_split : Int32)
The specified length to split long words at any hyphen or underscore
#mentions : MentionsOptions
What to do with mentions (@watzon)
#mentions=(mentions : MentionsOptions)
What to do with mentions (@watzon)
#minimum_length : Int32
Minimum length for tokens
#minimum_length=(minimum_length : Int32)
Minimum length for tokens
#numbers : NumbersOptions
What to do with numbers
#numbers=(numbers : NumbersOptions)
What to do with numbers
#punctuation : PunctuationOptions
What to do with punctuation
#punctuation=(punctuation : PunctuationOptions)
What to do with punctuation
#remove_domains : Bool
Should we remove domains?
#remove_domains=(remove_domains : Bool)
Should we remove domains?
#remove_emails : Bool
Should we remove emails?
#remove_emails=(remove_emails : Bool)
Should we remove emails?
#remove_emoji : Bool
Should we remove emojis?
#remove_emoji=(remove_emoji : Bool)
Should we remove emojis?
#remove_stop_words : Bool
Should we remove stop words?
#remove_stop_words=(remove_stop_words : Bool)
Should we remove stop words?
#remove_urls : Bool
Should we remove URLs?
#remove_urls=(remove_urls : Bool)
Should we remove URLs?
#stop_words : Set(String)
Set of stop words
#stop_words=(stop_words : Set(String))
Set of stop words
#tokenize(string : String) : Array(String)
#tokens : Array(String)
Array of output tokens

Instance methods inherited from class `Cadmium::Tokenizer`

Constructor Detail

def self.new(*, language = :en, abbreviations = Set(String).new, stop_words = Set(String).new, contractions = {} of String => String, filter_languages = [] of String | Symbol, hashtags : MentionsOptions = :keep_original, mentions : MentionsOptions = :keep_original, punctuation : PunctuationOptions = :all, numbers : NumbersOptions = :all, expand_contractions : Bool = false, remove_stop_words : Bool = false, remove_emoji : Bool = false, remove_emails : Bool = false, remove_urls : Bool = false, remove_domains : Bool = false, clean : Bool = false, classic_filter : Bool = false, downcase : Bool = true, minimum_length : Int32 = 0, long_word_split : Int32 = 0)

Creates a new Pragmatic tokenizer.

Instance Method Detail

def abbreviations : Set(String)

Set of recognized abbreviations

def abbreviations=(abbreviations : Set(String))

Set of recognized abbreviations

def classic_filter : Bool

Run the classic filter?

def classic_filter=(classic_filter : Bool)

Run the classic filter?
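
The classic filter is described as removing dots from acronyms and `'s` from the end of tokens. The sketch below shows one way to read that description on a single token. It is an assumption-laden illustration: the method name `classic_filter_sketch` and the acronym pattern are hypothetical and are not taken from Cadmium's source.

```crystal
# Hypothetical illustration of the classic filter on one token.
def classic_filter_sketch(token : String) : String
  # Strip a trailing possessive 's, if present
  token = token.rchop("'s")
  # Treat "letter-dot" runs such as "u.s.a." or "i.b.m" as acronyms
  # and delete their dots; leave every other token untouched
  token.matches?(/\A(?:\p{L}\.)+\p{L}?\z/) ? token.delete('.') : token
end

p classic_filter_sketch("u.s.a.")   # => "usa"
p classic_filter_sketch("watzon's") # => "watzon"
p classic_filter_sketch("jan.")     # => "jan."
```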

def clean : Bool

Run the cleaner after we've tokenized?

def clean=(clean : Bool)

Run the cleaner after we've tokenized?

def contractions : Hash(String, String)

Contractions to be replaced

def contractions=(contractions : Hash(String, String))

Contractions to be replaced

def downcase : Bool

Downcase all tokens?

def downcase=(downcase : Bool)

Downcase all tokens?

def expand_contractions : Bool

Do we want to expand contractions ("he's" => "he is")?

def expand_contractions=(expand_contractions : Bool)

Do we want to expand contractions ("he's" => "he is")?

def filter_languages : Array(String | Symbol)

Other languages to include in the filtering of abbreviations, contractions, and stop words

def filter_languages=(filter_languages : Array(String | Symbol))

Other languages to include in the filtering of abbreviations, contractions, and stop words

def hashtags : MentionsOptions

What to do with hashtags (#awesome)

def hashtags=(hashtags : MentionsOptions)

What to do with hashtags (#awesome)

def long_word_split : Int32

The specified length to split long words at any hyphen or underscore

def long_word_split=(long_word_split : Int32)

The specified length to split long words at any hyphen or underscore
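
As a rough illustration of the long-word split, the standalone sketch below breaks tokens at hyphens and underscores once they exceed a given length. The helper `split_long_words` is hypothetical, and whether the real tokenizer compares with "longer than" or "at least as long as" is an assumption here.

```crystal
# Hypothetical helper: splits tokens longer than `limit` at '-' or '_'.
def split_long_words(tokens : Array(String), limit : Int32) : Array(String)
  return tokens if limit <= 0 # 0 means "no split", matching the default

  tokens.flat_map do |token|
    # Assumption: only tokens strictly longer than the limit are split
    token.size > limit ? token.split(/[-_]/, remove_empty: true) : [token]
  end
end

tokens = ["some-really_long-token", "a-b"]
p split_long_words(tokens, 10) # => ["some", "really", "long", "token", "a-b"]
p split_long_words(tokens, 0)  # => ["some-really_long-token", "a-b"]
```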

def mentions : MentionsOptions

What to do with mentions (@watzon)

def mentions=(mentions : MentionsOptions)

What to do with mentions (@watzon)
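
The three modes shared by mentions and hashtags can be sketched as follows. This is a standalone illustration under one assumption: that `keep_and_clean` keeps the word and drops the leading `@` or `#`. `PrefixMode` and `handle_prefixed` are hypothetical names and not part of Cadmium.

```crystal
# Hypothetical stand-in for the remove / keep_and_clean / keep_original modes.
enum PrefixMode
  Remove
  KeepAndClean
  KeepOriginal
end

# Applies a mode to every token that starts with `prefix` ('@' or '#').
def handle_prefixed(tokens : Array(String), prefix : Char, mode : PrefixMode) : Array(String)
  case mode
  in .remove?
    # drop the whole token
    tokens.reject { |t| t.starts_with?(prefix) }
  in .keep_and_clean?
    # keep the token, strip the prefix (assumed meaning of keep_and_clean)
    tokens.map { |t| t.lchop(prefix) }
  in .keep_original?
    # leave everything as it is
    tokens
  end
end

tokens = ["hi", "@watzon", "#crystal"]
p handle_prefixed(tokens, '@', PrefixMode::Remove)       # => ["hi", "#crystal"]
p handle_prefixed(tokens, '#', PrefixMode::KeepAndClean) # => ["hi", "@watzon", "crystal"]
```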

def minimum_length : Int32

Minimum length for tokens

def minimum_length=(minimum_length : Int32)

Minimum length for tokens

def numbers : NumbersOptions

What to do with numbers

def numbers=(numbers : NumbersOptions)

What to do with numbers

def punctuation : PunctuationOptions

What to do with punctuation

def punctuation=(punctuation : PunctuationOptions)

What to do with punctuation

def remove_domains : Bool

Should we remove domains?

def remove_domains=(remove_domains : Bool)

Should we remove domains?

def remove_emails : Bool

Should we remove emails?

def remove_emails=(remove_emails : Bool)

Should we remove emails?

def remove_emoji : Bool

Should we remove emojis?

def remove_emoji=(remove_emoji : Bool)

Should we remove emojis?

def remove_stop_words : Bool

Should we remove stop words?

def remove_stop_words=(remove_stop_words : Bool)

Should we remove stop words?
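
Stop-word removal amounts to a set-membership test per token, which is presumably why the stop words are held in a `Set(String)`. The sketch below is a minimal standalone illustration; `drop_stop_words` is a hypothetical helper, and the tokens are assumed to be downcased already.

```crystal
# Hypothetical helper: removes every token that appears in the stop-word set.
def drop_stop_words(tokens : Array(String), stop_words : Set(String)) : Array(String)
  tokens.reject { |t| stop_words.includes?(t) }
end

stop_words = Set{"a", "of", "the"}
p drop_stop_words(["the", "name", "of", "a", "rose"], stop_words)
# => ["name", "rose"]
```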

def remove_urls : Bool

Should we remove URLs?

def remove_urls=(remove_urls : Bool)

Should we remove URLs?

def stop_words : Set(String)

Set of stop words

def stop_words=(stop_words : Set(String))

Set of stop words

def tokenize(string : String) : Array(String)

def tokens : Array(String)

Array of output tokens
