# Cadmium::Tokenizer (WIP)
This module contains several string tokenizers. Each has its own use cases, and some are significantly faster (or slower) than others.
## Installation

- Add the dependency to your `shard.yml`:

  ```yaml
  dependencies:
    cadmium_tokenizer:
      github: cadmiumcr/tokenizer
  ```

- Run `shards install`
## Usage

```crystal
require "cadmium_tokenizer"
```
## Aggressive Tokenizer
The aggressive tokenizer currently has localization available for:
- English (:en)
- Spanish (:es)
- Persian (:fa)
- French (:fr)
- Indonesian (:id)
- Dutch (:nl)
- Norwegian (:no)
- Polish (:pl)
- Portuguese (:pt)
- Russian (:ru)
- Serbian (:sb)
- Ukrainian (:uk)
- Bulgarian (:bg)
- Swedish (:sv)
If no language is specified, it will default to English. Use it like so:

```crystal
tokenizer = Cadmium.aggressive_tokenizer.new(lang: :es)
tokenizer.tokenize("hola yo me llamo eduardo y esudié ingeniería")
# => ["hola", "yo", "me", "llamo", "eduardo", "y", "esudié", "ingeniería"]
```
## Case Tokenizer

The case tokenizer doesn't rely on Regex, so it should be quite fast. It should also handle international text fairly well.

```crystal
tokenizer = Cadmium.case_tokenizer.new
tokenizer.tokenize("these are strings")
# => ["these", "are", "strings"]

tokenizer = Cadmium.case_tokenizer.new(preserve_apostrophes: true)
tokenizer.tokenize("Affectueusement surnommé « Gabo » dans toute l'Amérique latine")
# => ["Affectueusement", "surnommé", "Gabo", "dans", "toute", "l", "Amérique", "latine"]
```
## Regex Tokenizer

The whitespace tokenizer, word punctuation tokenizer, and word tokenizer all extend the regex tokenizer, which uses Regex patterns to match tokens.

```crystal
tokenizer = Cadmium.word_punctuation_tokenizer.new
tokenizer.tokenize("my dog hasn't any fleas.")
# => ["my", "dog", "hasn", "'", "t", "any", "fleas", "."]
```
## Treebank Word Tokenizer

The treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This implementation is a port of the tokenizer sed script written by Robert McIntyre. To read more about treebanks, you can visit Wikipedia.

```crystal
tokenizer = Cadmium.treebank_word_tokenizer.new
tokenizer.tokenize("If we 'all' can't go. I'll stay home.")
# => ["If", "we", "'all", "'", "ca", "n't", "go.", "I", "'ll", "stay", "home", "."]
```
## Pragmatic Tokenizer

The pragmatic tokenizer is based on the Ruby gem diasks2/pragmatic_tokenizer. It is a multilingual tokenizer which provides a wide array of options for tokenizing strings; for complete documentation, see that gem's README.

The example below is taken directly from the diasks2/pragmatic_tokenizer documentation, with a few modifications. Currently supported languages are:
- English (:en)
- Deutsch (:de)
- Czech (:cz)
- Bulgarian (:bg)
- Spanish (:sp)
- Portuguese (:pt)
text = "\"I said, 'what're you? Crazy?'\" said Sandowsky. \"I can't afford to do that.\""
Cadmium.pragmatic_tokenizer.new.tokenize(text)
# => ["\"", "i", "said", ",", "'", "what're", "you", "?", "crazy", "?", "'", "\"", "said", "sandowsky", ".", "\"", "i", "can't", "afford", "to", "do", "that", ".", "\""]
The initializer accepts the following options:

```crystal
language: :en, # the language of the string you are tokenizing
abbreviations: Set{"a.b", "a"}, # a user-supplied set of abbreviations (downcased, with ending periods removed)
stop_words: Set{"is", "the"}, # a user-supplied set of stop words (downcased)
remove_stop_words: true, # remove stop words
contractions: {"i'm" => "i am"}, # a user-supplied hash of contractions (the key is the contracted form, the value the expanded form; both should be downcased)
expand_contractions: true, # i.e. ["isn't"] will become two tokens ["is", "not"]
filter_languages: [:en, :de], # process abbreviations, contractions, and stop words for this array of languages
punctuation: :none, # see below for more details
numbers: :none, # see below for more details
remove_emoji: true, # remove any emoji tokens
remove_urls: true, # remove any URLs
remove_emails: true, # remove any email addresses
remove_domains: true, # remove any domains
hashtags: :keep_and_clean, # remove the hashtag prefix
mentions: :keep_and_clean, # remove the @ prefix
clean: true, # remove some special characters
classic_filter: true, # remove dots from acronyms and 's from the end of tokens
downcase: false, # do not downcase tokens
minimum_length: 3, # remove any tokens shorter than 3 characters
long_word_split: 10 # split tokens longer than 10 characters at hyphens or underscores
```
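For example, combining a few of these options (a minimal sketch; the exact token list returned may vary between versions of the shard):

```crystal
tokenizer = Cadmium.pragmatic_tokenizer.new(
  language:          :en,
  punctuation:       :none,
  remove_stop_words: true
)
tokenizer.tokenize("The quick brown fox jumps over the lazy dog.")
# With punctuation removed and stop words filtered, roughly:
# => ["quick", "brown", "fox", "jumps", "lazy", "dog"]
```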
## Contributing

- Fork it (https://github.com/cadmiumcr/cadmium_tokenizer/fork)
- Create your feature branch (`git checkout -b my-new-feature`)
- Commit your changes (`git commit -am 'Add some feature'`)
- Push to the branch (`git push origin my-new-feature`)
- Create a new Pull Request
## Contributors
- Chris Watson - creator and maintainer