Marisa

Marisa is a Crystal shard that wraps the very efficient Marisa Trie C++ library. See also Marisa Trie's README. This is based off of the Ruby version.

A "trie" is a useful data structure for storing strings, especially ngrams. It's useful for autocompletions, spell checking, IP routing, text prediction and many more things.

Installation

  1. Add the dependency to your shard.yml:

    dependencies:
      marisa:
        git: https://codeberg.org/bendangelo/marisa.cr.git
  2. Run shards install

Usage

require "marisa"

trie = Marisa::Trie.new
trie.add("snow")
trie.add("snow cone")

# same as above
trie << "ice cream"

# can add weight
trie.add("ice", 1_f32)

trie.get_weight "ice"
# => 1.0e-45_f32

# add many
trie.add_many(["icicle", "snowball"])

trie.built?
# => false

trie.search("ice").keys
# => ["ice", "ice cream"]

# after running a query
trie.built?
# => true

trie.get_id("snow")
# => 0

trie.get_id("not in store")
# => nil

trie.get_key(0)
# => "snow"

trie.keys
# => ["snow", ...]

trie.each do |key|
    # => "snow"
    # => "snow cone"
    # etc...
end

trie.size
# => 3

trie.has_keys?
# => true

You can also save and load complete tries. Marisa Tries are space efficient, often achieving significant compression over regular string or hash storage.

trie = Marisa::Trie.new(["snow", "snow cone"])
trie.save("winter.trie")
trie = Marisa::Trie.new
trie.load("winter.trie")

trie.include? "snow"
# => true
# @keys    An array of UTF-8 strings
# @weights An array of corresponding weights
# @opts
#   :binary      Boolean, true for a binary Trie, false for text
#   :num_tries   An integer from 1 to 127 representing the depth of recursive Tries
#   :cache_size  One of [:tiny, :small, :normal, :large, :huge]
#   :order       One of [:label, :weight]
trie = Marisa::Trie.new(
    ["test"],
    [1.0_f32],
    binary: true,
    num_tries: 10,
    cache_size: :large,
    order: :weight
)
trie = Marisa::BytesTrie.new "one" => "1", "two" => "2", "onetwo" => "3"
trie["one"]
# => "1"

trie.get_all("one")
# => ["1", "3"]

trie.each_pairs do |(key, value)|
    # => ("one, "1")
    # => ("two, "2")
    # etc...
end

Marisa includes an IntTrie type that can be used to time- and space-efficiently store integer values in association with a string:

trie = Marisa::IntTrie.new "one" => 1, "two" => 2, "onetwo" => 3
trie["one"]
# => 1

trie.get_all("one")
# => [1, 3]

trie.each_pairs do |(key, value)|
    # => ("one", 1)
    # => ("two", 2)
    # etc...
end

trie.sum("one")
# => 4

Features

Development

shards install
make all && crystal spec

Contributing

  1. Fork it (https://codeberg.org/bendangelo/marisa.cr/fork)
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Contributors