class Cadmium::Tokenizer::TreebankWord
- Cadmium::Tokenizer::TreebankWord
- Cadmium::Tokenizer::Base
- Reference
- Object
Overview
The Treebank tokenizer uses regular expressions to tokenize text according to Penn Treebank conventions. This implementation is a port of the tokenizer sed script written by Robert McIntyre, available at http://www.cis.upenn.edu/~treebank/tokenizer.sed.
Defined in:
cadmium/tokenizer/treebank_word.cr
Constant Summary
- CONTRACTIONS_2 = [/(.)('ll|'re|'ve|n't|'s|'m|'d)\b/i, /\b(can)(not)\b/i, /\b(D)('ye)\b/i, /\b(Gim)(me)\b/i, /\b(Gon)(na)\b/i, /\b(Got)(ta)\b/i, /\b(Lem)(me)\b/i, /\b(Mor)('n)\b/i, /\b(T)(is)\b/i, /\b(T)(was)\b/i, /\b(Wan)(na)\b/i]
- CONTRACTIONS_3 = [/\b(Whad)(dd)(ya)\b/i, /\b(Wha)(t)(cha)\b/i]
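The following is a minimal sketch of how two-group and three-group patterns like these might be used to split contractions; it is not taken from the Cadmium source. It uses only the standard library's `String#gsub` with backreferences. The helper `split_contractions`, the local arrays (a small subset of the constants above), and the replacement strings `"\\1 \\2"` and `"\\1 \\2 \\3"` are assumptions made for illustration; the actual tokenizer may apply the patterns differently.

```crystal
# Hypothetical helper (not part of Cadmium): applies each pattern in turn
# and rewrites every match using the given replacement template.
def split_contractions(text : String, patterns : Array(Regex), replacement : String) : String
  patterns.reduce(text) { |acc, re| acc.gsub(re, replacement) }
end

# A small subset of the patterns listed above, copied into local variables.
contractions_2 = [/\b(can)(not)\b/i, /\b(Gim)(me)\b/i, /\b(Wan)(na)\b/i]
contractions_3 = [/\b(Wha)(t)(cha)\b/i]

text = "I cannot stay, gimme the keys. Whatcha doing?"

# Two capture groups become two space-separated pieces (assumed replacement).
text = split_contractions(text, contractions_2, "\\1 \\2")
# Three capture groups become three pieces (assumed replacement).
text = split_contractions(text, contractions_3, "\\1 \\2 \\3")

puts text
```

Because the patterns carry the `i` flag, they match regardless of case, and the captured groups keep the casing of the input text.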