class Cadmium::Classifier::Bayes
- Cadmium::Classifier::Bayes
- Reference
- Object
Overview
This is a naive Bayes classifier that uses Laplace smoothing. It can be trained to categorize sentences based on the words they contain.
Example:
classifier = Cadmium::Classifier::Bayes.new
# Train some angry examples
classifier.train("omg I can't believe you would do that to me", "angry")
classifier.train("I hate you so much!", "angry")
classifier.train("Just go. I don't need this.", "angry")
classifier.train("You're so full of shit!", "angry")
# Some happy ones
classifier.train("omg you're the best!", "happy")
classifier.train("I can't believe how happy you make me", "happy")
classifier.train("I love you so damn much!", "happy")
classifier.train("You're the best!", "happy")
# And some indifferent ones
classifier.train("Idk, what do you think?", "indifferent")
classifier.train("yeah that's ok", "indifferent")
classifier.train("cool", "indifferent")
classifier.train("I guess we could do that", "indifferent")
# Now let's test it on a sentence
classifier.classify("You shit head!")
# => {"angry" => 85.5, "happy" => 10.2, "indifferent" => 4.3}
# Or just get the top category
classifier.classify_category("You're the best :)")
# => "happy"
Included Modules
- JSON::Serializable
- MessagePack::Serializable
- YAML::Serializable
Defined in:
cadmium/classifier/bayes.cr
Constant Summary
-
DEFAULT_TOKENIZER = Cadmium::Tokenizer::Word.new
Constructors
- .new(ctx : YAML::ParseContext, node : YAML::Nodes::Node)
- .new(pull : JSON::PullParser)
- .new(pull : MessagePack::Unpacker)
- .new(tokenizer = nil)
Instance Method Summary
-
#categories : Array(String)
Category names
-
#classify(text : String)
Determines what category the text belongs to.
-
#classify_category(text : String) : String
Convenience method that returns just the top category name instead of all probabilities.
-
#doc_count : Hash(String, Int32)
Document frequency table for each of our categories.
-
#frequency_table(tokens)
Build a frequency hash map where the keys are the entries in tokens and the values are the frequency of each entry in tokens.
-
#initialize_category(name)
Initializes each of our data structure entities for this new category and returns self.
-
#token_probability(token, category)
Calculate the probability that a token belongs to a category.
- #tokenizer : Cadmium::Tokenizer::Base
- #tokenizer=(tokenizer : Cadmium::Tokenizer::Base)
-
#total_documents : Int32
Number of documents we have learned from.
-
#train(text, category)
Train our naive Bayes classifier by telling it what category the training text corresponds to.
-
#vocabulary : Set(String)
The words to learn from.
-
#vocabulary_size : Int32
The total number of words in the vocabulary.
-
#word_count : Hash(String, Int32)
For each category, how many total words were mapped to it.
-
#word_frequency_count : Hash(String, Hash(String, Int32))
Word frequency table for each category.
Constructor Detail
Instance Method Detail
#classify(text : String)
Determines what category the text belongs to. Returns a Hash with all categories and their probabilities.
#classify_category(text : String) : String
Convenience method that returns just the top category name instead of all probabilities. Use this when you only need the most likely category.
Example:
classifier = Cadmium::Classifier::Bayes.new
classifier.train("I love this!", "positive")
classifier.train("This is terrible", "negative")
classifier.classify_category("This is amazing!") # => "positive"
#frequency_table(tokens)
Build a frequency hash map where
- the keys are the entries in tokens
- the values are the frequency of each entry in tokens
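The mapping described above can be sketched in plain Crystal. This is only an illustration: the helper name `frequency_table_sketch` is hypothetical, it assumes `tokens` is an `Array(String)`, and the library's own implementation may differ.

```crystal
# Hypothetical stand-alone helper; not the library's implementation.
def frequency_table_sketch(tokens : Array(String)) : Hash(String, Int32)
  # A default value of 0 lets unseen keys be incremented directly.
  table = Hash(String, Int32).new(0)
  tokens.each { |token| table[token] += 1 }
  table
end

p frequency_table_sketch(["you", "are", "the", "best", "the"])
# => {"you" => 1, "are" => 1, "the" => 2, "best" => 1}
```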
#initialize_category(name)
Initializes each of our data structure entities for this new category and returns self.
#token_probability(token, category)
Calculate the probability that a token belongs to a category.
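As a rough sketch of how Laplace (add-one) smoothing is usually applied to this kind of estimate, the function below uses the textbook formula `(frequency + 1) / (words in category + vocabulary size)`. The function name `token_probability_sketch` and its explicit parameters are assumptions made for illustration; they mirror the documented `word_frequency_count`, `word_count`, and `vocabulary_size` tables, but this page does not show the library's exact computation.

```crystal
# Hypothetical stand-alone function; the parameters mirror the documented
# tables, but this is not the library's implementation.
def token_probability_sketch(
  token : String,
  category : String,
  word_frequency_count : Hash(String, Hash(String, Int32)),
  word_count : Hash(String, Int32),
  vocabulary_size : Int32
) : Float64
  # How often the token was seen in this category (0 if never seen).
  frequency = word_frequency_count[category]?.try { |counts| counts[token]? } || 0
  # Total number of words seen in this category.
  total = word_count[category]? || 0
  # Add-one smoothing keeps unseen tokens from getting probability 0.
  (frequency + 1) / (total + vocabulary_size)
end

frequencies = {"happy" => {"best" => 2, "love" => 1}}
totals = {"happy" => 3}
p token_probability_sketch("best", "happy", frequencies, totals, 4) # (2 + 1) / (3 + 4)
p token_probability_sketch("hate", "happy", frequencies, totals, 4) # (0 + 1) / (3 + 4)
```

Without the added 1, a token never seen in a category would get probability 0 and cancel out every other word in the sentence.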
#train(text, category)
Train our naive Bayes classifier by telling it what category the training text corresponds to.
#vocabulary : Set(String)
The words to learn from. Uses a Set for O(1) average-case lookups instead of an Array's O(n).
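A small sketch of why a `Set` suits this role, using only Crystal's standard library. The local variable `vocabulary` here is an illustrative stand-in, not the classifier's own property.

```crystal
# Illustrative stand-in for a vocabulary; not the classifier's own state.
vocabulary = Set(String).new
["you", "are", "the", "best", "the"].each { |word| vocabulary << word }

p vocabulary.size               # => 4 (the duplicate "the" is stored once)
p vocabulary.includes?("best")  # => true
p vocabulary.includes?("worst") # => false
```

Membership checks hash the word instead of scanning every element, and duplicates are dropped automatically, so the size of the set is the number of distinct words.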
#word_frequency_count : Hash(String, Hash(String, Int32))
Word frequency table for each category.