class Cadmium::BayesClassifier

Overview

This is a naive Bayes classifier that uses Laplace smoothing. It can be trained to categorize sentences based on the words they contain.

Example:

classifier = Cadmium.bayes_classifier.new

# Train some angry examples
classifier.train("omg I can't believe you would do that to me", "angry")
classifier.train("I hate you so much!", "angry")
classifier.train("Just go. I don't need this.", "angry")
classifier.train("You're so full of shit!", "angry")

# Some happy ones
classifier.train("omg you're the best!", "happy")
classifier.train("I can't believe how happy you make me", "happy")
classifier.train("I love you so damn much!", "happy")
classifier.train("You're the best!", "happy")

# And some indifferent ones
classifier.train("Idk, what do you think?", "indifferent")
classifier.train("yeah that's ok", "indifferent")
classifier.train("cool", "indifferent")
classifier.train("I guess we could do that", "indifferent")

# Now let's test it on a sentence
classifier.categorize("You shit head!")
# => "angry"

puts classifier.categorize("You're the best :)")
# => "happy"

classifier.categorize("idk, my bff jill?")
# => "indifferent"

Included Modules

Defined in:

cadmium/classifier/bayes.cr

Constant Summary

DEFAULT_TOKENIZER = Cadmium::WordTokenizer.new

Constructor Detail

def self.new(ctx : YAML::ParseContext, node : YAML::Nodes::Node)

def self.new(pull : JSON::PullParser)

def self.new(tokenizer = nil)

Instance Method Detail

def categories : Array(String)

Category names


def categorize(text)

Determines what category the text belongs to.
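How a naive Bayes classifier typically picks a category can be sketched as follows. This is a toy illustration, not Cadmium's source: the method name `categorize_sketch`, its parameters, and the use of log-probabilities (a common way to avoid floating-point underflow) are all assumptions. It scores each category as the log prior plus the sum of add-one smoothed log likelihoods, then returns the highest-scoring category.

```crystal
# Toy scoring sketch; every name here is hypothetical, not Cadmium's source.
# doc_count:  documents seen per category
# word_count: total words seen per category
# word_freq:  per-category count of each word
def categorize_sketch(tokens : Array(String),
                      doc_count : Hash(String, Int32),
                      word_count : Hash(String, Int32),
                      word_freq : Hash(String, Hash(String, Int32)),
                      vocabulary_size : Int32) : String?
  total_documents = doc_count.values.sum
  best_category = nil
  best_score = -Float64::INFINITY

  doc_count.each do |category, docs|
    # Start from the log prior: log P(category).
    score = Math.log(docs.to_f / total_documents)

    tokens.each do |token|
      count = word_freq[category].fetch(token, 0)
      # Add-one (Laplace) smoothed likelihood: log P(token | category).
      score += Math.log((count + 1).to_f / (word_count[category] + vocabulary_size))
    end

    if score > best_score
      best_score = score
      best_category = category
    end
  end

  best_category
end

doc_count = {"angry" => 2, "happy" => 2}
word_count = {"angry" => 4, "happy" => 4}
word_freq = {
  "angry" => {"hate" => 2, "you" => 2},
  "happy" => {"love" => 2, "you" => 2},
}
puts categorize_sketch(["love", "you"], doc_count, word_count, word_freq, 3) # => happy
```

The sketch takes pre-tokenized input, whereas the documented method accepts raw text and presumably runs it through the configured tokenizer first.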


def doc_count : Hash(String, Int32)

Document frequency table for each of our categories.


def frequency_table(tokens)

Build a frequency hash map where

  • the keys are the entries in tokens
  • the values are the frequency of each entry in tokens
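A frequency hash map of this kind can be built in a few lines. The following is a minimal stand-alone sketch, assuming the tokens arrive as an `Array(String)`; the helper name `frequency_table_sketch` is hypothetical and the library's own implementation may differ.

```crystal
# Hypothetical stand-alone helper; not the library's source.
def frequency_table_sketch(tokens : Array(String)) : Hash(String, Int32)
  # A default value of 0 lets unseen tokens start counting from zero.
  table = Hash(String, Int32).new(0)
  tokens.each { |token| table[token] += 1 }
  table
end

counts = frequency_table_sketch(["you", "are", "the", "best", "you"])
puts counts # => {"you" => 2, "are" => 1, "the" => 1, "best" => 1}
```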

def initialize_category(name)

Initializes the data structure entries for a new category and returns self.


def token_probability(token, category)

Calculate the probability that a token belongs to a category.
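With Laplace (add-one) smoothing, the textbook estimate adds 1 to the token's count in the category and adds the vocabulary size to the category's total word count, so a token never seen in a category still gets a small non-zero probability. The sketch below assumes that textbook formula; the helper name `smoothed_probability` and its parameters are hypothetical, and the library's exact computation is not shown in this page.

```crystal
# Hypothetical helper illustrating add-one smoothing; not the library's source.
# token_count:         times the token appeared in the category
# category_word_count: total words seen in the category
# vocabulary_size:     number of distinct words seen across all categories
def smoothed_probability(token_count : Int32,
                         category_word_count : Int32,
                         vocabulary_size : Int32) : Float64
  (token_count + 1).to_f / (category_word_count + vocabulary_size)
end

puts smoothed_probability(2, 10, 8) # 3/18, roughly 0.167
puts smoothed_probability(0, 10, 8) # 1/18, non-zero even for an unseen token
```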


def tokenizer : Cadmium::Tokenizer

def tokenizer=(tokenizer : Cadmium::Tokenizer)

def total_documents : Int32

Number of documents we have learned from.


def train(text, category)

Train our naive Bayes classifier by telling it what category the training text corresponds to.
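Training a classifier of this kind usually amounts to updating a handful of count tables. The sketch below is an illustration under stated assumptions, not Cadmium's source: the class `CountTables` is hypothetical, its getters merely mirror the names documented on this page, and it takes pre-tokenized input rather than raw text.

```crystal
# Hypothetical container mirroring the documented getters; not Cadmium's source.
class CountTables
  getter total_documents : Int32 = 0
  getter doc_count : Hash(String, Int32) = Hash(String, Int32).new(0)
  getter word_count : Hash(String, Int32) = Hash(String, Int32).new(0)
  getter word_frequency_count : Hash(String, Hash(String, Int32)) = Hash(String, Hash(String, Int32)).new
  getter vocabulary : Array(String) = [] of String

  def train(tokens : Array(String), category : String) : self
    # One more document overall, and one more for this category.
    @total_documents += 1
    @doc_count[category] += 1
    # Create the per-category word table on first use.
    frequencies = (@word_frequency_count[category] ||= Hash(String, Int32).new(0))

    tokens.each do |token|
      @vocabulary << token unless @vocabulary.includes?(token)
      frequencies[token] += 1
      @word_count[category] += 1
    end
    self
  end
end

tables = CountTables.new
tables.train(["you", "are", "the", "best"], "happy")
tables.train(["i", "hate", "you"], "angry")
puts tables.doc_count  # => {"happy" => 1, "angry" => 1}
puts tables.word_count # => {"happy" => 4, "angry" => 3}
```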


def vocabulary : Array(String)

The words to learn from.


def word_count : Hash(String, Int32)

For each category, how many total words were mapped to it.


def word_frequency_count : Hash(String, Hash(String, Int32))

Word frequency table for each category.

