class Cadmium::Tokenizer::Sentence
- Cadmium::Tokenizer::Sentence
- Cadmium::Tokenizer::Base
- Reference
- Object
Defined in:
cadmium/tokenizer/sentence.cr
Constant Summary
-
ABBR_DETECT =
/(?:\s(?:(?:(?:\w\.){2,}\w?)|(?:\w\.\w)))/
-
Finds abbreviations, like e.g., i.e., U.S., u.S., U.S.S.R.
-
CORRECT_ABBR =
/(#{ABBR_DETECT})#{EOS}(\s+[a-z0-9])/
-
DAYS =
["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
-
ENTITIES =
["dept", "univ", "uni", "assn", "bros", "inc", "ltd", "co", "corp", "plc"]
-
EOS =
"\u0001"
-
MISC =
["vs", "etc", "no", "esp", "cf"]
-
MONTHS =
["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec", "sept"]
-
PUNCTUATION_DETECT =
/((?:[\.?!]|[\r\n]+)(?:\"|\'|\)|\]|\})?)(\s+)/
-
Finds sentence-ending punctuation (".", "?" or "!") or line breaks, optionally followed by a closing quote or bracket, before whitespace.
-
STREETS =
["ave", "bld", "blvd", "cl", "ct", "cres", "dr", "rd", "st"]
-
TITLES =
["jr", "mr", "mrs", "ms", "dr", "prof", "sr", "sen", "rep", "rev", "gov", "atty", "supt", "det", "rev", "col", "gen", "lt", "cmdr", "adm", "capt", "sgt", "cpl", "maj"]
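To illustrate what the two detection patterns above match, here is a minimal sketch that uses only the Crystal standard library. The patterns are copied from the constant summary; the sample text and the local variable names are made up for the example, and the sketch does not reproduce how the tokenizer itself combines the patterns.

```crystal
# Patterns copied verbatim from the constant summary.
ABBR_DETECT        = /(?:\s(?:(?:(?:\w\.){2,}\w?)|(?:\w\.\w)))/
PUNCTUATION_DETECT = /((?:[\.?!]|[\r\n]+)(?:\"|\'|\)|\]|\})?)(\s+)/

# Made-up sample text.
text = "She moved to the U.S. in 2010. \"Really?\" he asked."

# ABBR_DETECT requires leading whitespace, so every match starts with a
# space; strip it to get the bare abbreviation.
abbrs = text.scan(ABBR_DETECT).map { |m| m[0].strip }

# Group 1 of PUNCTUATION_DETECT is the end mark plus an optional closing
# quote or bracket; group 2 is the whitespace that follows it.
marks = text.scan(PUNCTUATION_DETECT).map { |m| m[1] }

puts abbrs.inspect # => ["U.S."]
puts marks.inspect # => [".", ".", "?\""]
```

The final "." in the sample is not reported because PUNCTUATION_DETECT only matches an end mark that is followed by whitespace.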
Class Method Summary
-
.abbreviation(*abbreviations)
Adds a list of abbreviations to the list that's used to detect false sentence ends.
Instance Method Summary
-
#tokenize(string : String) : Array(String)
Splits the passed string into individual sentences, trims them, and returns them as an array.
Instance methods inherited from class Cadmium::Tokenizer::Base
tokenize(string : String) : Array(String)
trim(arr)
Instance methods inherited from module Cadmium::Tokenizer::Diacritics
remove_diacritics(str : String)
Instance methods inherited from module Cadmium::Tokenizer::StopWords
add_stopwords_list(language : Symbol)
Class Method Detail
def self.abbreviation(*abbreviations)
Adds a list of abbreviations to the list that's used to detect false sentence ends. Returns the current list of abbreviations in use.
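A usage sketch, under stated assumptions: the cadmium shard is installed and loadable with require "cadmium", it still exposes the Cadmium::Tokenizer::Sentence namespace documented here, and the tokenizer can be constructed without arguments. The abbreviations "approx" and "est" are made-up additions, written in the same form as the built-in lists (lowercase, no trailing stop). The exact type of the returned list is not stated on this page, so the example only inspects it.

```crystal
# Assumption: the cadmium shard is listed in shard.yml and still exposes
# the Cadmium::Tokenizer::Sentence namespace documented on this page.
require "cadmium"

# Register two made-up abbreviations and inspect the list that comes back.
current = Cadmium::Tokenizer::Sentence.abbreviation("approx", "est")
puts current.inspect

# Assumption: the tokenizer can be built without arguments.
tokenizer = Cadmium::Tokenizer::Sentence.new
sentences = tokenizer.tokenize("It cost approx. ten dollars. Was that too much?")
sentences.each { |sentence| puts sentence }
```

Because abbreviation is a class method, the additions presumably apply to every tokenizer instance rather than to a single one.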
Instance Method Detail
def tokenize(string : String) : Array(String)
Splits the passed string into individual sentences, trims them, and returns them as an array. A sentence is marked by one of the punctuation marks ".", "?" or "!" followed by whitespace. Sequences of full stops (such as an ellipsis marker "...") and stops after a known abbreviation are ignored.
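The rule described above can be sketched with the standard library alone. This is a simplified illustration, not the library's implementation: naive_sentences and KNOWN_ABBREVIATIONS are hypothetical names, the abbreviation set is a small subset chosen for the example, and "sequences of full stops are ignored" is interpreted here as "a stop directly preceded by another stop is not a boundary".

```crystal
# Hypothetical, much-reduced abbreviation list for illustration only.
KNOWN_ABBREVIATIONS = Set{"mr", "mrs", "dr", "etc", "vs"}

# Splits text at ".", "?" or "!" followed by whitespace, skipping stops
# that belong to an ellipsis or follow a known abbreviation.
def naive_sentences(text : String) : Array(String)
  sentences = [] of String
  start = 0

  # Candidate boundaries: an end mark with whitespace right after it.
  text.scan(/[.?!](?=\s)/) do |match|
    mark_at = match.begin(0)

    if match[0] == "."
      # Skip a stop that directly follows another stop, e.g. in "...".
      next if mark_at > 0 && text[mark_at - 1] == '.'

      # Skip a stop that follows a known abbreviation such as "Dr.".
      word = text[start...mark_at].split.last?
      next if word && KNOWN_ABBREVIATIONS.includes?(word.downcase)
    end

    sentences << text[start..mark_at].strip
    start = mark_at + 1
  end

  # Whatever remains after the last boundary is the final sentence.
  rest = text[start..].strip
  sentences << rest unless rest.empty?
  sentences
end

result = naive_sentences("Dr. Smith arrived... late. Was it Mr. Jones? Yes!")
puts result.inspect
# => ["Dr. Smith arrived... late.", "Was it Mr. Jones?", "Yes!"]
```

The sketch checks only the word immediately before the stop, so an abbreviation wrapped in quotes or brackets would not be recognised.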