class Cadmium::Tokenizer::Pragmatic
- Cadmium::Tokenizer::Pragmatic
- Cadmium::Tokenizer::Base
- Reference
- Object
Overview
This tokenizer is based on the pragmatic_tokenizer Ruby gem. It is much more robust than the other tokenizers, but it offers more features than most use cases need.
Constructor Options
- filter_languages : Array(Symbol) - user-supplied array of additional languages whose stop words, abbreviations, and contractions should also be used when calculating the resulting tokens
- language : Symbol | String - two-character ISO 639-1 code; can be a String or a Symbol (default: :en)
- expand_contractions : Bool - (default: false)
- remove_stop_words : Bool - (default: false)
- abbreviations : Set(String) - user-supplied set of abbreviations (each element should be downcased, with the final period removed)
- stop_words : Set(String) - user-supplied set of stop words
- contractions : Hash(String, String) - user-supplied hash of contractions (the key is the contracted form and the value is the expanded form; both should be downcased)
- punctuation : PunctuationOptions - what to do with punctuation
  - all: does not remove any punctuation from the result
  - semi: removes common punctuation (such as full stops) but keeps less common punctuation (such as question marks). This is useful for text alignment, because less common punctuation can help identify a sentence (like a fingerprint), while common punctuation (like stop words) should be removed.
  - none: removes all punctuation from the result
  - only: removes everything except punctuation, so the returned result is an array of only the punctuation
- numbers : NumbersOptions - what to do with numbers
  - all: does not remove any numbers from the result
  - semi: removes tokens that include only digits
  - none: removes all tokens that include a number from the result (including Roman numerals)
  - only: removes everything except tokens that include a number
- minimum_length : Int32 - minimum length of a token in characters
- long_word_split : Int32 - the specified length to split long words at any hyphen or underscore; 0 = no split (default)
- mentions : MentionsOptions - what to do with mentions (such as '@watzon')
  - remove: completely removes the mention
  - keep_and_clean: keeps the token but strips the '@' prefix
  - keep_original: does not alter the token at all (default)
- hashtags : MentionsOptions - what to do with hashtags (such as '#crystal')
  - remove: completely removes the hashtag
  - keep_and_clean: keeps the token but strips the '#' prefix
  - keep_original: does not alter the token at all (default)
- downcase : Bool - downcase all tokens (default: true)
- clean : Bool - removes some symbols (default: false)
- classic_filter : Bool - removes dots from acronyms and 's from the end of tokens; see [link](https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ClassicFilter) (default: false)
- remove_emoji : Bool - strip emojis (default: false)
- remove_emails : Bool - strip emails (default: false)
- remove_urls : Bool - strip URLs (default: false)
- remove_domains : Bool - strip domains (default: false)
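The four numbers modes described above can be illustrated with a small standalone sketch. This is not Cadmium's implementation: the `NumberMode` enum and the `filter_numbers` helper are hypothetical names introduced here, the input is assumed to be an already tokenized array, and only ASCII digits are detected, so the Roman numeral handling mentioned for `none` is left out.

```crystal
# Hypothetical stand-in for the documented numbers option (not the library's enum).
enum NumberMode
  All
  Semi
  None
  Only
end

# Filters already-split tokens following the documented rules.
# Simplification: only ASCII digits count as numbers; Roman numerals are ignored.
def filter_numbers(tokens : Array(String), mode : NumberMode) : Array(String)
  case mode
  in .all?
    # Keep every token.
    tokens
  in .semi?
    # Drop tokens made up of digits only, such as "2015".
    tokens.reject { |t| t.matches?(/\A\d+\z/) }
  in .none?
    # Drop any token that contains a digit, such as "20%".
    tokens.reject { |t| t.matches?(/\d/) }
  in .only?
    # Keep only tokens that contain a digit.
    tokens.select { |t| t.matches?(/\d/) }
  end
end

tokens = ["jan.", "2015", "was", "20%", "colder"]
p filter_numbers(tokens, NumberMode::Semi) # => ["jan.", "was", "20%", "colder"]
p filter_numbers(tokens, NumberMode::None) # => ["jan.", "was", "colder"]
p filter_numbers(tokens, NumberMode::Only) # => ["2015", "20%"]
```

With the real tokenizer the same choice would presumably be made through the `numbers` option, with the filtering applied as part of tokenization rather than as a separate pass.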
Examples
tokenizer = Cadmium::Tokenizer::Pragmatic.new
tokenizer.tokenize("Hello world.")
# => ["hello", "world", "."]
tokenizer.tokenize("Jan. 2015 was 20% colder than now. But not in inter- and outer-space.")
# => ["jan.", "2015", "was", "20%", "colder", "than", "now", ".", "but", "not", "in", "inter", "-", "and", "outer-space", "."]
tokenizer.contractions = {"supa'soo" => "super smooth"}
tokenizer.expand_contractions = true
tokenizer.tokenize("Hello supa'soo guy.")
# => ["hello", "super", "smooth", "guy", "."]
tokenizer.clean = true
tokenizer.tokenize("This sentence has a long string of dots .......................")
# => ["this", "sentence", "has", "a", "long", "string", "of", "dots"]
Defined in:
cadmium/tokenizer/pragmatic.cr
cadmium/tokenizer/pragmatic/languages.cr
cadmium/tokenizer/pragmatic/languages/bulgarian.cr
cadmium/tokenizer/pragmatic/languages/common.cr
cadmium/tokenizer/pragmatic/languages/czech.cr
cadmium/tokenizer/pragmatic/languages/deutsch.cr
cadmium/tokenizer/pragmatic/languages/english.cr
cadmium/tokenizer/pragmatic/languages/french.cr
cadmium/tokenizer/pragmatic/languages/portuguese.cr
cadmium/tokenizer/pragmatic/languages/spanish.cr
cadmium/tokenizer/pragmatic/post_processor.cr
cadmium/tokenizer/pragmatic/pre_processor.cr
cadmium/tokenizer/pragmatic/regex.cr
Constant Summary
- DOT = "."
- MAX_TOKEN_LENGTH = 50
- NOTHING = ""
- SINGLE_QUOTE = "'"
- SPACE = " "
Constructors
Instance Method Summary
- #abbreviations : Set(String)
  Set of recognized abbreviations
- #abbreviations=(abbreviations : Set(String))
  Set of recognized abbreviations
- #classic_filter : Bool
  Run the classic filter?
- #classic_filter=(classic_filter : Bool)
  Run the classic filter?
- #clean : Bool
  Run the cleaner after we've tokenized?
- #clean=(clean : Bool)
  Run the cleaner after we've tokenized?
- #contractions : Hash(String, String)
  Contractions to be replaced
- #contractions=(contractions : Hash(String, String))
  Contractions to be replaced
- #downcase : Bool
  Downcase all tokens?
- #downcase=(downcase : Bool)
  Downcase all tokens?
- #expand_contractions : Bool
  Do we want to expand contractions ("he's" => "he is")
- #expand_contractions=(expand_contractions : Bool)
  Do we want to expand contractions ("he's" => "he is")
- #filter_languages : Array(Symbol)
  Other languages to include in the filtering of abbreviations, contractions, and stop words
- #filter_languages=(filter_languages : Array(Symbol))
  Other languages to include in the filtering of abbreviations, contractions, and stop words
- #hashtags : MentionsOptions
  What to do with hashtags (#awesome)
- #hashtags=(hashtags : MentionsOptions)
  What to do with hashtags (#awesome)
- #long_word_split : Int32
  The specified length to split long words at any hyphen or underscore
- #long_word_split=(long_word_split : Int32)
  The specified length to split long words at any hyphen or underscore
- #mentions : MentionsOptions
  What to do with mentions (@watzon)
- #mentions=(mentions : MentionsOptions)
  What to do with mentions (@watzon)
- #minimum_length : Int32
  Minimum length for tokens
- #minimum_length=(minimum_length : Int32)
  Minimum length for tokens
- #numbers : NumbersOptions
  What to do with numbers
- #numbers=(numbers : NumbersOptions)
  What to do with numbers
- #punctuation : PunctuationOptions
  What to do with punctuation
- #punctuation=(punctuation : PunctuationOptions)
  What to do with punctuation
- #remove_domains : Bool
  Should we remove domains?
- #remove_domains=(remove_domains : Bool)
  Should we remove domains?
- #remove_emails : Bool
  Should we remove emails?
- #remove_emails=(remove_emails : Bool)
  Should we remove emails?
- #remove_emoji : Bool
  Should we remove emojis?
- #remove_emoji=(remove_emoji : Bool)
  Should we remove emojis?
- #remove_stop_words : Bool
  Should we remove stop words?
- #remove_stop_words=(remove_stop_words : Bool)
  Should we remove stop words?
- #remove_urls : Bool
  Should we remove URLs?
- #remove_urls=(remove_urls : Bool)
  Should we remove URLs?
- #stop_words : Set(String)
  A set of stop words
- #stop_words=(stop_words : Set(String))
  A set of stop words
- #tokenize(string : String) : Array(String)
- #tokens : Array(String)
  Array of output tokens
Instance methods inherited from class Cadmium::Tokenizer::Base
- tokenize(string : String) : Array(String)
- trim(arr)
Instance methods inherited from module Cadmium::Tokenizer::Diacritics
- remove_diacritics(str : String)
Instance methods inherited from module Cadmium::Tokenizer::StopWords
- add_stopwords_list(language : Symbol)
Constructor Detail
Creates a new Pragmatic tokenizer.
Instance Method Detail
Do we want to expand contractions ("he's" => "he is")
Other languages to include in the filtering of abbreviations, contractions, and stop words
The specified length to split long words at any hyphen or underscore
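As a rough illustration of the splitting rule, the sketch below uses a hypothetical `split_long_word` helper that is not part of Cadmium. It assumes that a token is split only when it is strictly longer than the configured length, and that 0 disables splitting as stated in the constructor options; the library's exact threshold comparison may differ.

```crystal
# Hypothetical helper, not part of Cadmium: splits a token at hyphens and
# underscores once it is longer than `limit` characters.
# A limit of 0 (the documented default) disables splitting.
def split_long_word(token : String, limit : Int32) : Array(String)
  # Assumption: "longer than" is the threshold test.
  return [token] if limit <= 0 || token.size <= limit
  token.split(/[-_]/, remove_empty: true)
end

p split_long_word("outer-space", 0)       # => ["outer-space"]
p split_long_word("outer-space", 5)       # => ["outer", "space"]
p split_long_word("state_of_the_art", 10) # => ["state", "of", "the", "art"]
```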