class Cadmium::Tokenizer::Sentence

Defined in:

cadmium/tokenizer/sentence.cr

Constant Summary

ABBR_DETECT = /(?:\s(?:(?:(?:\w\.){2,}\w?)|(?:\w\.\w)))/

Finds dotted abbreviations that are preceded by whitespace, such as e.g., i.e., U.S., u.S., U.S.S.R.

CORRECT_ABBR = /(#{ABBR_DETECT})#{EOS}(\s+[a-z0-9])/
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
ENTITIES = ["dept", "univ", "uni", "assn", "bros", "inc", "ltd", "co", "corp", "plc"]
EOS = "\u0001"
MISC = ["vs", "etc", "no", "esp", "cf"]
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec", "sept"]
PUNCTUATION_DETECT = /((?:[\.?!]|[\r\n]+)(?:\"|\'|\)|\]|\})?)(\s+)/

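Finds punctuation (".", "?" or "!") or line breaks that end a sentence or paragraph, optionally followed by a closing quote or bracket, together with the whitespace that follows.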

STREETS = ["ave", "bld", "blvd", "cl", "ct", "cres", "dr", "rd", "st"]
TITLES = ["jr", "mr", "mrs", "ms", "dr", "prof", "sr", "sen", "rep", "rev", "gov", "atty", "supt", "det", "rev", "col", "gen", "lt", "cmdr", "adm", "capt", "sgt", "cpl", "maj"]
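
To see what the two detection patterns pick up, the sketch below copies ABBR_DETECT and PUNCTUATION_DETECT from the summary above and scans two sample strings with them. It needs only the Crystal standard library; the sample strings and variable names are illustrative assumptions, not taken from Cadmium.

```crystal
# Patterns copied verbatim from the constant summary above.
ABBR_DETECT        = /(?:\s(?:(?:(?:\w\.){2,}\w?)|(?:\w\.\w)))/
PUNCTUATION_DETECT = /((?:[\.?!]|[\r\n]+)(?:\"|\'|\)|\]|\})?)(\s+)/

# ABBR_DETECT requires leading whitespace, so strip it from each match.
abbr_text = "She moved to the U.S. in 1999, i.e. before the U.S.S.R. article ran."
abbreviations = abbr_text.scan(ABBR_DETECT).map { |m| m[0].strip }
puts abbreviations.inspect # => ["U.S.", "i.e.", "U.S.S.R."]

# Group 1 of PUNCTUATION_DETECT holds the closing punctuation, including an
# optional closing quote or bracket. The final "." is not reported because
# no whitespace follows it.
end_text = "He said \"Stop!\" Then he left. (Really?) Yes."
sentence_ends = end_text.scan(PUNCTUATION_DETECT).map { |m| m[1] }
puts sentence_ends.inspect # => ["!\"", ".", "?)"]
```

Note that an abbreviation at the very start of a string is not matched by ABBR_DETECT, since the pattern begins with `\s`.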

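Class Method Summary

.abbreviation(*abbreviations)

Adds a list of abbreviations to the list that's used to detect false sentence ends.

Instance Method Summary

#tokenize(string : String) : Array(String)

Split the passed string into individual sentences, trim these and return as an array.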

Instance methods inherited from class Cadmium::Tokenizer::Base

tokenize(string : String) : Array(String), trim(arr)

Instance methods inherited from module Cadmium::Tokenizer::Diacritics

remove_diacritics(str : String)

Instance methods inherited from module Cadmium::Tokenizer::StopWords

add_stopwords_list(language : Symbol)

Class Method Detail

def self.abbreviation(*abbreviations)

Adds the given abbreviations to the list that's used to detect false sentence ends. Returns the current list of abbreviations in use.
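
The add-then-return behaviour can be modelled with a small stand-in. The module below is hypothetical and is not Cadmium's source: the name `AbbreviationList`, the seed entries, the lower-casing and the de-duplication are all assumptions made for illustration.

```crystal
# Hypothetical stand-in that mimics the documented behaviour; not Cadmium's source.
module AbbreviationList
  # Seed entries borrowed from the TITLES and MISC constants on this page.
  @@abbreviations = ["mr", "dr", "etc"]

  # Appends the given abbreviations and returns the full list.
  # Lower-casing and de-duplication are assumptions of this sketch.
  def self.abbreviation(*abbreviations : String) : Array(String)
    abbreviations.each { |abbr| @@abbreviations << abbr.downcase }
    @@abbreviations.uniq!
    @@abbreviations
  end
end

current = AbbreviationList.abbreviation("Approx", "est", "dr")
puts current.inspect # => ["mr", "dr", "etc", "approx", "est"]
```

Because the list lives in a class-level variable, additions in this model affect every later call, which is the likely reason the documented method is a class method rather than an instance method.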



Instance Method Detail

def tokenize(string : String) : Array(String)

Splits the passed string into individual sentences, trims them, and returns them as an array. A sentence end is marked by one of the punctuation marks ".", "?" or "!" followed by whitespace. Sequences of full stops (such as an ellipsis marker "...") and stops after a known abbreviation are ignored.
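
One plausible way the documented constants fit together is a mark-then-split pass: insert the EOS marker after each candidate sentence end, remove the markers that turn out to be false ends, then split on what remains. The sketch below is a simplified illustration under that assumption and is not Cadmium's implementation. The helper `split_sentences`, its `known_abbreviations` parameter, and the patterns used in steps 2 and 4 are hypothetical; only the four constants are copied from this page.

```crystal
# Constants copied from the constant summary of this page.
EOS                = "\u0001"
ABBR_DETECT        = /(?:\s(?:(?:(?:\w\.){2,}\w?)|(?:\w\.\w)))/
CORRECT_ABBR       = /(#{ABBR_DETECT})#{EOS}(\s+[a-z0-9])/
PUNCTUATION_DETECT = /((?:[\.?!]|[\r\n]+)(?:\"|\'|\)|\]|\})?)(\s+)/

# Hypothetical helper, not part of Cadmium's API.
def split_sentences(text : String, known_abbreviations : Array(String)) : Array(String)
  # 1. Put an EOS marker after every candidate sentence end.
  marked = text.gsub(PUNCTUATION_DETECT) { |_, m| "#{m[1]}#{EOS}#{m[2]}" }

  # 2. Drop markers that follow a run of two or more full stops (assumed pattern).
  marked = marked.gsub(/(\.{2,})#{EOS}/) { |_, m| m[1] }

  # 3. Drop markers after dotted abbreviations followed by a lower-case letter or digit.
  marked = marked.gsub(CORRECT_ABBR) { |_, m| "#{m[1]}#{m[2]}" }

  # 4. Drop markers after known abbreviations such as "Dr." (assumed pattern).
  unless known_abbreviations.empty?
    alternatives = known_abbreviations.map { |abbr| Regex.escape(abbr) }.join("|")
    marked = marked.gsub(/(\b(?:#{alternatives})\.)#{EOS}/i) { |_, m| m[1] }
  end

  # 5. Split on the remaining markers and trim each sentence.
  marked.split(EOS).map(&.strip).reject(&.empty?)
end

text = "Dr. Smith arrived late... or so he said. Was the train delayed, e.g. by snow? It seems so!"
sentences = split_sentences(text, ["dr", "mr"])
sentences.each { |sentence| puts sentence }
```

With this input the sketch prints three sentences: "Dr. Smith arrived late... or so he said.", "Was the train delayed, e.g. by snow?" and "It seems so!". Because the marker is inserted without altering any other character, splitting on it recovers the original text of each sentence.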

