class Cadmium::Tokenizer::TreebankWord

Overview

The Treebank tokenizer uses regular expressions to split text into tokens following the Penn Treebank conventions. This implementation is a port of the tokenizer sed script written by Robert McIntyre, available at http://www.cis.upenn.edu/~treebank/tokenizer.sed.

Defined in:

cadmium/tokenizer/treebank_word.cr

Constant Summary

CONTRACTIONS_2 = [/(.)('ll|'re|'ve|n't|'s|'m|'d)\b/i, /\b(can)(not)\b/i, /\b(D)('ye)\b/i, /\b(Gim)(me)\b/i, /\b(Gon)(na)\b/i, /\b(Got)(ta)\b/i, /\b(Lem)(me)\b/i, /\b(Mor)('n)\b/i, /\b(T)(is)\b/i, /\b(T)(was)\b/i, /\b(Wan)(na)\b/i]
CONTRACTIONS_3 = [/\b(Whad)(dd)(ya)\b/i, /\b(Wha)(t)(cha)\b/i]
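Each pattern above captures the two (or three) parts of a contraction as separate groups, so the pieces can be separated by inserting a space between the groups. The following is only a sketch of that idea using the Crystal standard library. The constant names `CLITIC` and `CANNOT` and the helper `split_contractions` are hypothetical and not part of Cadmium, and the real `tokenize` method may apply the patterns differently.

```crystal
# Two patterns copied verbatim from CONTRACTIONS_2.
CLITIC = /(.)('ll|'re|'ve|n't|'s|'m|'d)\b/i
CANNOT = /\b(can)(not)\b/i

# Hypothetical helper (not part of Cadmium): insert a space between the
# captured groups of each pattern, then split the result on whitespace.
def split_contractions(text : String) : Array(String)
  spaced = text.gsub(CLITIC, "\\1 \\2").gsub(CANNOT, "\\1 \\2")
  spaced.split
end

puts split_contractions("I'll stay").inspect # => ["I", "'ll", "stay"]
puts split_contractions("can't").inspect     # => ["ca", "n't"]
puts split_contractions("cannot").inspect    # => ["can", "not"]
```

Because the first group of the clitic pattern is a single arbitrary character, "can't" splits as "ca" plus "n't", which is the Penn Treebank convention for negated auxiliaries.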

Instance Method Summary

Instance methods inherited from class Cadmium::Tokenizer::Base

tokenize(string : String) : Array(String), trim(arr)

Instance methods inherited from module Cadmium::Tokenizer::Diacritics

remove_diacritics(str : String)

Instance methods inherited from module Cadmium::Tokenizer::StopWords

add_stopwords_list(language : Symbol)

Instance Method Detail

def tokenize(string : String) : Array(String)
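
A minimal usage sketch, based only on the signature above. It assumes the shard is loaded with `require "cadmium"` and that the class can be instantiated without arguments; neither is stated on this page, and the sample sentence is illustrative.

```crystal
require "cadmium" # assumption: entry point of the Cadmium shard

# Assumption: the tokenizer takes no constructor arguments.
tokenizer = Cadmium::Tokenizer::TreebankWord.new

# tokenize returns an Array(String), one element per token.
tokens = tokenizer.tokenize("I can't go. She'll stay home.")
puts tokens.inspect
```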
