Marisa
Marisa is a Crystal shard that wraps the very efficient Marisa Trie C++ library. See also Marisa Trie's README. This is based off of the Ruby version.
A "trie" is a useful data structure for storing strings, especially ngrams. It's useful for autocompletions, spell checking, IP routing, text prediction and many more things.
Installation
-
Add the dependency to your
shard.yml
:dependencies: marisa: git: https://codeberg.org/bendangelo/marisa.cr.git
-
Run
shards install
Usage
require "marisa"
trie = Marisa::Trie.new
trie.add("snow")
trie.add("snow cone")
# same as above
trie << "ice cream"
# can add weight
trie.add("ice", 1_f32)
trie.get_weight "ice"
# => 1.0e-45_f32
# add many
trie.add_many(["icicle", "snowball"])
trie.built?
# => false
trie.search("ice").keys
# => ["ice", "ice cream"]
# after running a query
trie.built?
# => true
trie.get_id("snow")
# => 0
trie.get_id("not in store")
# => nil
trie.get_key(0)
# => "snow"
trie.keys
# => ["snow", ...]
trie.each do |key|
# => "snow"
# => "snow cone"
# etc...
end
trie.size
# => 3
trie.has_keys?
# => true
You can also save and load complete tries. Marisa Tries are space efficient, often achieving significant compression over regular string or hash storage.
trie = Marisa::Trie.new(["snow", "snow cone"])
trie.save("winter.trie")
trie = Marisa::Trie.new
trie.load("winter.trie")
trie.include? "snow"
# => true
# @keys An array of UTF-8 strings
# @weights An array of corresponding weights
# @opts
# :binary Boolean, true for a binary Trie, false for text
# :num_tries An integer from 1 to 127 representing the depth of recursive Tries
# :cache_size One of [:tiny, :small, :normal, :large, :huge]
# :order One of [:label, :weight]
trie = Marisa::Trie.new(
["test"],
[1.0_f32],
binary: true,
num_tries: 10,
cache_size: :large,
order: :weight
)
trie = Marisa::BytesTrie.new "one" => "1", "two" => "2", "onetwo" => "3"
trie["one"]
# => "1"
trie.get_all("one")
# => ["1", "3"]
trie.each_pairs do |(key, value)|
# => ("one, "1")
# => ("two, "2")
# etc...
end
Marisa includes an IntTrie type that can be used to time- and space-efficiently store integer values in association with a string:
trie = Marisa::IntTrie.new "one" => 1, "two" => 2, "onetwo" => 3
trie["one"]
# => 1
trie.get_all("one")
# => [1, 3]
trie.each_pairs do |(key, value)|
# => ("one", 1)
# => ("two", 2)
# etc...
end
trie.sum("one")
# => 4
Features
- fast search for exact strings and prefixes
- has a BytesTrie that can be used to store binary data
- has an IntTrie that can be used to store integer values easily
- Crustal bindings for Marisa Trie built into the gem (require 'marisa')
Development
shards install
make all && crystal spec
Contributing
- Fork it (https://codeberg.org/bendangelo/marisa.cr/fork)
- Create your feature branch (
git checkout -b my-new-feature
) - Commit your changes (
git commit -am 'Add some feature'
) - Push to the branch (
git push origin my-new-feature
) - Create a new Pull Request
Contributors
- Ben D'Angelo - creator and maintainer