class Cadmium::Classifier::Bayes

Overview

This is a naive Bayes classifier that uses Laplace smoothing. It can be trained to categorize sentences based on the words they contain.

Example:

classifier = Cadmium::Classifier::Bayes.new

# Train some angry examples
classifier.train("omg I can't believe you would do that to me", "angry")
classifier.train("I hate you so much!", "angry")
classifier.train("Just go. I don't need this.", "angry")
classifier.train("You're so full of shit!", "angry")

# Some happy ones
classifier.train("omg you're the best!", "happy")
classifier.train("I can't believe how happy you make me", "happy")
classifier.train("I love you so damn much!", "happy")
classifier.train("You're the best!", "happy")

# And some indifferent ones
classifier.train("Idk, what do you think?", "indifferent")
classifier.train("yeah that's ok", "indifferent")
classifier.train("cool", "indifferent")
classifier.train("I guess we could do that", "indifferent")

# Now let's test it on a sentence
classifier.classify("You shit head!")
# => {"angry" => 85.5, "happy" => 10.2, "indifferent" => 4.3}

# Or just get the top category
classifier.classify_category("You're the best :)")
# => "happy"

Defined in:

cadmium/classifier/bayes.cr

Constant Summary

DEFAULT_TOKENIZER = Cadmium::Tokenizer::Word.new

Constructor Detail

def self.new(ctx : YAML::ParseContext, node : YAML::Nodes::Node)

def self.new(pull : JSON::PullParser)

def self.new(pull : MessagePack::Unpacker)

def self.new(tokenizer = nil)


Instance Method Detail

def categories : Array(String)

Category names


def classify(text : String)

Determines what category the text belongs to. Returns a Hash with all categories and their probabilities.
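
As a rough illustration of how a naive Bayes classifier can score each category, the stand-alone sketch below sums a log prior and Laplace-smoothed log likelihoods per category. The function name `log_scores`, its parameters, and the toy counts are hypothetical; this is not the library's source, and the library may normalize or scale its returned values differently.

```crystal
# Hypothetical stand-alone sketch; not Cadmium's implementation.
# Returns an unnormalized log score per category (higher is more likely).
def log_scores(tokens : Array(String),
               doc_count : Hash(String, Int32),
               word_count : Hash(String, Int32),
               word_frequency_count : Hash(String, Hash(String, Int32)),
               vocabulary_size : Int32) : Hash(String, Float64)
  total_documents = doc_count.values.sum
  scores = {} of String => Float64

  doc_count.each do |category, docs|
    # Log prior: share of training documents that belong to this category.
    score = Math.log(docs.to_f64 / total_documents)

    tokens.each do |token|
      # Laplace smoothing: add 1 to the count so unseen tokens never give log(0).
      count = word_frequency_count[category].fetch(token, 0)
      score += Math.log((count + 1).to_f64 / (word_count[category] + vocabulary_size))
    end

    scores[category] = score
  end

  scores
end

# Toy counts (assumed values, for illustration only).
scores = log_scores(
  ["best", "love"],
  {"happy" => 2, "angry" => 2},
  {"happy" => 4, "angry" => 4},
  {
    "happy" => {"love" => 2, "best" => 2},
    "angry" => {"hate" => 3, "best" => 1},
  },
  3
)

best = scores.max_by { |_, s| s }[0]
puts best # => happy
```

Working in log space avoids floating-point underflow when many small probabilities are multiplied together.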


def classify_category(text : String) : String

Convenience method that returns just the top category name instead of all probabilities. Use this when you only need the most likely category.

Example:

classifier = Cadmium::Classifier::Bayes.new
classifier.train("I love this!", "positive")
classifier.train("This is terrible", "negative")
classifier.classify_category("This is amazing!") # => "positive"

def doc_count : Hash(String, Int32)

Document frequency table for each of our categories.


def frequency_table(tokens)

Build a frequency hash map where

  • the keys are the entries in tokens
  • the values are the frequency of each entry in tokens
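
The mapping described above can be sketched in a few lines. The helper name `frequency_table_sketch` and the assumption that tokens arrive as an `Array(String)` are illustrative only; the library's own method may accept other token types.

```crystal
# Hypothetical stand-alone helper; not the library's source.
def frequency_table_sketch(tokens : Array(String)) : Hash(String, Int32)
  # A default value of 0 lets us increment keys that have not been seen yet.
  table = Hash(String, Int32).new(0)
  tokens.each { |token| table[token] += 1 }
  table
end

table = frequency_table_sketch(["you", "are", "the", "best", "the", "best"])
puts table # => {"you" => 1, "are" => 1, "the" => 2, "best" => 2}
```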

def initialize_category(name)

Initializes the data structure entries for a new category and returns self.


def token_probability(token, category)

Calculates the probability that a token belongs to a category.
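
With Laplace (add-one) smoothing, this probability is commonly computed as (token count in category + 1) / (total words in category + vocabulary size). The sketch below assumes that standard formula; the function name `smoothed_probability` and its parameters are hypothetical and do not mirror the library's signature.

```crystal
# Hypothetical stand-alone sketch of add-one (Laplace) smoothing.
def smoothed_probability(token_count : Int32,
                         category_word_count : Int32,
                         vocabulary_size : Int32) : Float64
  (token_count + 1).to_f64 / (category_word_count + vocabulary_size).to_f64
end

# A token never seen in the category still gets a small non-zero probability.
puts smoothed_probability(0, 10, 20) # 1/30
# A token seen 3 times among 10 words, with a 20-word vocabulary.
puts smoothed_probability(3, 10, 20) # 4/30
```

Without the added 1, a single unseen token would drive the whole product of probabilities for that category to zero.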


def tokenizer : Cadmium::Tokenizer::Base

def tokenizer=(tokenizer : Cadmium::Tokenizer::Base)

def total_documents : Int32

Number of documents we have learned from.


def train(text, category)

Trains the naive Bayes classifier by telling it which category the training text belongs to.
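
To show how training might update the tables documented on this page (`doc_count`, `word_count`, `word_frequency_count`, `vocabulary`, `total_documents`), here is a minimal stand-alone sketch. The class `TinyBayesSketch` is hypothetical, and it assumes simple lowercase whitespace tokenization rather than the library's configured tokenizer.

```crystal
# Hypothetical sketch of the bookkeeping done during training; not Cadmium's source.
class TinyBayesSketch
  getter doc_count = Hash(String, Int32).new(0)
  getter word_count = Hash(String, Int32).new(0)
  getter word_frequency_count = Hash(String, Hash(String, Int32)).new
  getter vocabulary = Set(String).new
  getter total_documents = 0

  def train(text : String, category : String) : Nil
    # One more document seen overall and for this category.
    @doc_count[category] += 1
    @total_documents += 1

    # Create the per-category frequency table on first use.
    frequencies = (@word_frequency_count[category] ||= Hash(String, Int32).new(0))

    # Assumed tokenization: lowercase, split on whitespace.
    text.downcase.split.each do |token|
      @vocabulary << token
      frequencies[token] += 1
      @word_count[category] += 1
    end
  end
end

model = TinyBayesSketch.new
model.train("I love you", "happy")
model.train("I hate you", "angry")

puts model.total_documents     # => 2
puts model.vocabulary.size     # => 4
puts model.word_count["happy"] # => 3
```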


def vocabulary : Set(String)

The words to learn from. A Set is used so that membership checks take O(1) time on average, rather than the O(n) of an Array.


def vocabulary_size : Int32

The total number of words in the vocabulary.


def word_count : Hash(String, Int32)

For each category, how many total words were mapped to it.


def word_frequency_count : Hash(String, Hash(String, Int32))

Word frequency table for each category.

