class Cadmium::Tokenizer::Sentence

Defined in:

cadmium/tokenizer/sentence.cr

Constant Summary

ABBR_DETECT = /(?:\s(?:(?:(?:\w\.){2,}\w?)|(?:\w\.\w)))/: Finds abbreviations, like e.g., i.e., U.S., u.S., U.S.S.R.
CORRECT_ABBR = /(#{ABBR_DETECT})#{EOS}(\s+[a-z0-9])/
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
ENTITIES = ["dept", "univ", "uni", "assn", "bros", "inc", "ltd", "co", "corp", "plc"]
EOS = "\u0001"
MISC = ["vs", "etc", "no", "esp", "cf"]
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec", "sept"]
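PUNCTUATION_DETECT = /((?:[\.?!]|[\r\n]+)(?:\"|\'|\)|\]|\})?)(\s+)/: Finds candidate sentence ends: a ".", "?" or "!" (or a run of line breaks), optionally followed by a closing quote or bracket, and then whitespace.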
STREETS = ["ave", "bld", "blvd", "cl", "ct", "cres", "dr", "rd", "st"]
TITLES = ["jr", "mr", "mrs", "ms", "dr", "prof", "sr", "sen", "rep", "rev", "gov", "atty", "supt", "det", "rev", "col", "gen", "lt", "cmdr", "adm", "capt", "sgt", "cpl", "maj"]
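The constants above suggest a marker-based approach: tag every candidate boundary with `EOS`, undo the tag where an abbreviation is followed by a lowercase letter or digit, then split on the marker. The sketch below is one plausible reading of how they fit together, not code taken from the library source. It uses local copies of the patterns and the Crystal standard library only, `split_sentences` is a hypothetical helper name, and the known-abbreviation lists (`TITLES`, `MONTHS`, and so on) are left out entirely.

```crystal
# Local copies of the documented constants; the cadmium_tokenizer shard is not required.
EOS                = "\u0001"
ABBR_DETECT        = /(?:\s(?:(?:(?:\w\.){2,}\w?)|(?:\w\.\w)))/
CORRECT_ABBR       = /(#{ABBR_DETECT})#{EOS}(\s+[a-z0-9])/
PUNCTUATION_DETECT = /((?:[\.?!]|[\r\n]+)(?:\"|\'|\)|\]|\})?)(\s+)/

# Hypothetical helper: a simplified illustration, not the library's implementation.
def split_sentences(text : String) : Array(String)
  # 1. Insert the EOS marker after every candidate sentence end.
  marked = text.gsub(PUNCTUATION_DETECT) { |_, m| "#{m[1]}#{EOS}#{m[2]}" }

  # 2. Remove the marker again when it follows an abbreviation and the next
  #    word starts with a lowercase letter or a digit.
  marked = marked.gsub(CORRECT_ABBR) { |_, m| "#{m[1]}#{m[2]}" }

  # 3. Split on the marker, trim each piece, and drop empty pieces.
  marked.split(EOS).map(&.strip).reject(&.empty?)
end

puts split_sentences("He moved to the U.S. in 1999. It was cold! Really?").inspect
# => ["He moved to the U.S. in 1999.", "It was cold!", "Really?"]
```

`EOS` is a control character that is unlikely to occur in ordinary text, which is presumably why it serves as the temporary boundary marker.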

Class Method Summary

.abbreviation(*abbreviations)
Adds a list of abbreviations to the list that's used to detect false sentence ends.

Instance Method Summary

#tokenize(string : String) : Array(String)
Splits the passed string into individual sentences, trims each one, and returns them as an array.

Instance methods inherited from class `Cadmium::Tokenizer::Base`

Instance methods inherited from module `Cadmium::Tokenizer::Diacritics`

Instance methods inherited from module `Cadmium::Tokenizer::StopWords`

Class Method Detail

def self.abbreviation(*abbreviations) #

Adds a list of abbreviations to the list that's used to detect false sentence ends. Returns the current list of abbreviations in use.
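A minimal usage sketch, under stated assumptions: the require path is taken from the shard name `cadmium_tokenizer`, and the arguments are strings written lowercase without the trailing full stop, mirroring the built-in lists such as `TITLES`. The entries `"approx"` and `"est"` are purely illustrative.

```crystal
require "cadmium_tokenizer" # assumed require path, based on the shard name

# Register two extra abbreviations (illustrative values; the lowercase,
# no-trailing-dot form is an assumption based on the built-in lists).
current = Cadmium::Tokenizer::Sentence.abbreviation("approx", "est")

# Per the description above, the return value is the list currently in use.
puts current.inspect
```

Because this is a class method, the added abbreviations presumably apply to every tokenizer instance rather than to a single one.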


Instance Method Detail

def tokenize(string : String) : Array(String) #

Splits the passed string into individual sentences, trims each one, and returns them as an array. A sentence is marked by one of the punctuation marks ".", "?" or "!" followed by whitespace. Sequences of full stops (such as an ellipsis marker "...") and stops after a known abbreviation are ignored.
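A short usage sketch, assuming the shard is required as `cadmium_tokenizer` and that the class can be instantiated with no arguments (neither detail is stated on this page). The sample text mixes a title from `TITLES`, an ellipsis, and all three terminators. No expected output is shown, since the exact split depends on the abbreviation handling.

```crystal
require "cadmium_tokenizer" # assumed require path, based on the shard name

# Assumption: the default constructor takes no arguments.
tokenizer = Cadmium::Tokenizer::Sentence.new

text = "Dr. Smith was late... Did anyone notice? Apparently not! He sat down."
sentences = tokenizer.tokenize(text) # documented return type: Array(String)

# Print one sentence per line, numbered from 1.
sentences.each_with_index do |sentence, index|
  puts "#{index + 1}: #{sentence}"
end
```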

