class Cadmium::Classifier::Bayes

Overview

This is a naive Bayes classifier that uses Laplace smoothing. It can be trained to categorize sentences based on the words they contain.

Example:

classifier = Cadmium::Classifier::Bayes.new

# Train some angry examples
classifier.train("omg I can't believe you would do that to me", "angry")
classifier.train("I hate you so much!", "angry")
classifier.train("Just go. I don't need this.", "angry")
classifier.train("You're so full of shit!", "angry")

# Some happy ones
classifier.train("omg you're the best!", "happy")
classifier.train("I can't believe how happy you make me", "happy")
classifier.train("I love you so damn much!", "happy")
classifier.train("You're the best!", "happy")

# And some indifferent ones
classifier.train("Idk, what do you think?", "indifferent")
classifier.train("yeah that's ok", "indifferent")
classifier.train("cool", "indifferent")
classifier.train("I guess we could do that", "indifferent")

# Now let's test it on a sentence
classifier.classify("You shit head!")
# => "angry"

puts classifier.classify("You're the best :)")
# => "happy"

classifier.classify("idk, my bff jill?")
# => "indifferent"

Included Modules

JSON::Serializable
YAML::Serializable
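
Because the class includes JSON::Serializable, a trained classifier can presumably be written out and read back with the standard `to_json` and `from_json` methods that module provides. The following is a minimal sketch, not taken from the shard's documentation; the require path is an assumption based on the shard name, and how the tokenizer is handled on reload has not been checked here.

```crystal
# Assumption: the shard is required under its shard name.
require "cadmium_classifier"

classifier = Cadmium::Classifier::Bayes.new
classifier.train("I love you so damn much!", "happy")
classifier.train("I hate you so much!", "angry")

# to_json and from_json come from JSON::Serializable.
json = classifier.to_json
restored = Cadmium::Classifier::Bayes.from_json(json)

# The restored classifier should report the same training statistics.
puts restored.total_documents == classifier.total_documents
```

Persisting the trained state this way avoids retraining on every program start.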

Defined in:

cadmium/classifier/bayes.cr

Constant Summary

DEFAULT_TOKENIZER = Cadmium::Tokenizer::Word.new

Constructors

.new(ctx : YAML::ParseContext, node : YAML::Nodes::Node)
.new(pull : JSON::PullParser)
.new(tokenizer = nil)

Instance Method Summary

#categories : Array(String)
Category names
#classify(text : String)
Determines what category the text belongs to.
#doc_count : Hash(String, Int32)
Document frequency table for each of our categories.
#frequency_table(tokens)
Builds a frequency hash map whose keys are the entries in tokens and whose values are the frequency of each entry.
#initialize_category(name)
Initializes each of our data structure entities for this new category and returns self.
#token_probability(token, category)
Calculates the probability that a token belongs to a category.
#tokenizer : Cadmium::Tokenizer::Base
#tokenizer=(tokenizer : Cadmium::Tokenizer::Base)
#total_documents : Int32
Number of documents we have learned from.
#train(text, category)
Trains our naive Bayes classifier by telling it what category the training text corresponds to.
#vocabulary : Array(String)
The words to learn from.
#vocabulary_size : Int32
The total number of words in the vocabulary
#word_count : Hash(String, Int32)
For each category, how many total words were mapped to it.
#word_frequency_count : Hash(String, Hash(String, Int32))
Word frequency table for each category.

Constructor Detail

def self.new(ctx : YAML::ParseContext, node : YAML::Nodes::Node) #

[View source]

def self.new(pull : JSON::PullParser) #

[View source]

def self.new(tokenizer = nil) #

[View source]

Instance Method Detail

def categories : Array(String) #

Category names

[View source]

def classify(text : String) #

Determines what category the text belongs to.
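
As an illustration of how a naive Bayes classifier typically picks a category, the sketch below scores each category by its log prior plus the summed log likelihoods of the tokens, then takes the highest score. It is a stand-alone example with hand-written statistics, not the shard's source; the variable names only mirror the getters documented on this page, and the add-one smoothing formula is the textbook one, assumed rather than confirmed.

```crystal
# Hypothetical training statistics, written out by hand for illustration.
doc_count = {"angry" => 2, "happy" => 2}
word_count = {"angry" => 4, "happy" => 4}
word_frequency_count = {
  "angry" => {"hate" => 2, "you" => 2},
  "happy" => {"love" => 2, "you" => 2},
}
vocabulary_size = 3 # hate, love, you
total_documents = doc_count.values.sum

tokens = ["love", "you"]

scores = {} of String => Float64
doc_count.each do |category, docs|
  # Log prior: share of training documents that belong to this category.
  score = Math.log(docs / total_documents)
  tokens.each do |token|
    count = word_frequency_count[category].fetch(token, 0)
    # Add-one smoothed likelihood, accumulated in log space to avoid underflow.
    score += Math.log((count + 1) / (word_count[category] + vocabulary_size))
  end
  scores[category] = score
end

# The category with the highest log score wins.
best = scores.max_by { |_, score| score }[0]
puts best # => happy
```

Working in log space keeps the product of many small probabilities from rounding to zero on longer texts.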

[View source]

def doc_count : Hash(String, Int32) #

Document frequency table for each of our categories.

[View source]

def frequency_table(tokens) #

Build a frequency hash map where

the keys are the entries in tokens
the values are the frequency of each entry in tokens
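
A frequency table of this kind can be sketched in a few lines. The helper below is a hypothetical stand-alone version that assumes the tokens arrive as an array of strings; it is not the shard's implementation.

```crystal
# Hypothetical stand-alone sketch; the name is illustrative only.
def frequency_table_sketch(tokens : Array(String)) : Hash(String, Int32)
  table = Hash(String, Int32).new(0) # missing keys read as 0
  tokens.each { |token| table[token] += 1 }
  table
end

p frequency_table_sketch(%w(you are the best the best))
# => {"you" => 1, "are" => 1, "the" => 2, "best" => 2}
```

The standard library's `Enumerable#tally` produces the same mapping in a single call.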

[View source]

def initialize_category(name) #

Initializes each of our data structure entities for this new category and returns self.

[View source]

def token_probability(token, category) #

Calculates the probability that a token belongs to a category.
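
With Laplace (add-one) smoothing, this estimate is conventionally the token's count in the category plus one, divided by the category's total word count plus the vocabulary size. The helper below is a hypothetical sketch of that textbook formula; its name and parameters are illustrative, and the shard's exact computation is assumed rather than confirmed.

```crystal
# Hypothetical helper: add-one (Laplace) smoothed estimate of P(token | category).
def smoothed_probability(token_count : Int32, category_word_count : Int32, vocabulary_size : Int32) : Float64
  # Int32 / Int32 yields a Float64 in Crystal, so no explicit conversion is needed.
  (token_count + 1) / (category_word_count + vocabulary_size)
end

# A token seen 3 times in a category holding 20 words, with 30 distinct words overall.
puts smoothed_probability(3, 20, 30) # => 0.08

# A token never seen in that category still gets a small non-zero probability.
puts smoothed_probability(0, 20, 30) # => 0.02
```

The added one is what prevents a single unseen word from driving a category's whole score to zero.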

[View source]

def tokenizer : Cadmium::Tokenizer::Base #

[View source]

def tokenizer=(tokenizer : Cadmium::Tokenizer::Base) #

[View source]

def total_documents : Int32 #

Number of documents we have learned from.

[View source]

def train(text, category) #

Trains our naive Bayes classifier by telling it what category the training text corresponds to.
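
To show what training plausibly updates, the sketch below keeps counters named after the getters documented on this page and increments them for each training text. `TinyTrainer` is a hypothetical class, not part of the shard; it uses a lowercasing whitespace split in place of the real tokenizer and a Set for the vocabulary, both of which are assumptions made for brevity.

```crystal
# Hypothetical stand-in that only tracks training statistics.
class TinyTrainer
  getter doc_count = Hash(String, Int32).new(0)
  getter word_count = Hash(String, Int32).new(0)
  getter word_frequency_count = Hash(String, Hash(String, Int32)).new
  getter vocabulary = Set(String).new
  getter total_documents = 0

  def train(text : String, category : String)
    @doc_count[category] += 1
    @total_documents += 1
    # Create the per-category frequency table on first use.
    frequencies = @word_frequency_count[category] ||= Hash(String, Int32).new(0)
    # Assumption: a lowercasing whitespace split stands in for the tokenizer.
    text.downcase.split.each do |token|
      @vocabulary << token
      frequencies[token] += 1
      @word_count[category] += 1
    end
    self
  end
end

trainer = TinyTrainer.new
trainer.train("I love you", "happy").train("I hate you", "angry")

puts trainer.total_documents     # => 2
puts trainer.word_count["happy"] # => 3
puts trainer.vocabulary.size     # => 4
```

Returning self lets calls be chained, as in the usage lines above.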

[View source]

def vocabulary : Array(String) #

The words to learn from.

[View source]

def vocabulary_size : Int32 #

The total number of words in the vocabulary

[View source]

def word_count : Hash(String, Int32) #

For each category, how many total words were mapped to it.

[View source]

def word_frequency_count : Hash(String, Hash(String, Int32)) #

Word frequency table for each category.

[View source]
