Bloom Filter
Implementation of Bloom Filter in Crystal lang.
Installation
Add this to your application's shard.yml
:
dependencies:
bloom_filter:
github: crystal-community/bloom_filter
Usage
Basic
require "bloom_filter"
# Create filter with bitmap size of 32 bytes and 3 hash functions.
filter = BloomFilter.new(bytesize = 32, hash_num = 3)
# Insert elements
filter.insert("Esperanto")
filter.insert("Toki Pona")
# Check elements presence
filter.has?("Esperanto") # => true
filter.has?("Toki Pona") # => true
filter.has?("Englsh") # => false
Creating a filter with optimal parameters
Based on your needs(expected number of items and desired probability of false positives), your can create an optimal bloom filter:
# Create a filter, that with one million inserted items, gives 2% of false positives for #has? method
filter = BloomFilter.new_optimal(1_000_000, 0.02)
filter.bytesize # => 1017796 (993Kb)
filter.hash_num # => 6
Dumping into a file and loading
It's possible to save existing bloom filter as a binary file and then load it back.
filter = BloomFilter.new_optimal(2, 0.01)
filter.insert("Esperanto")
filter.dump_file("/tmp/bloom_languages")
loaded_filter = BloomFilter.load_file("/tmp/bloom_languages")
loaded_filter.has?("Esperanto") # => true
loaded_filter.has?("English") # => false
Union and intersection
Having two filters of the same size and number of hash functions, it's possible to perform union and intersection operations:
f1 = BloomFilter.new(32, 3)
f1.insert("Esperanto")
f1.insert("Spanish")
f2 = BloomFilter.new(32, 3)
f2.insert("Esperanto")
f2.insert("English")
# Union
f3 = f1 | f2
f3.has?("Esperanto") # => true
f3.has?("Spanish") # => true
f3.has?("English") # => true
# Intersection
f4 = f1 & f2
f4.has?("Esperanto") # => true
f4.has?("Spanish") # => false
f4.has?("English") # => false
Visualization
If you want to see how your filter looks like, you can visualize it:
f1 = BloomFilter.new(16, 2)
f1.insert("Esperanto")
puts "f1 = (Esperanto)"
puts f1.visualize
f2 = BloomFilter.new(16, 2)
f2.insert("Spanish")
puts "f2 = (Spanish)"
puts f2.visualize
f3 = f1 | f2
puts "f3 = f1 | f2 = (Esperanto, Spanish)"
puts f3.visualize
Output:
f1 = (Esperanto)
░░░░░░░░ ░░░░░░█░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░
░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░█ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░
f2 = (Spanish)
░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░
░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░█░ ░█░░░░░░
f3 = f1 | f2 = (Esperanto, Spanish)
░░░░░░░░ ░░░░░░█░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░
░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░░ ░░░░░░░█ ░░░░░░░░ ░░░░░░█░ ░█░░░░░░
In this way, you can actually see which bits are set:)
Benchmark
Performance of Bloom filter depends on the following parameters:
- Size of the filter
- Number of hash functions
- Length of the input string
To run benchmark from ./samples/benchmark.cr
, simply run make task:
$ make benchmark
Number of items: 100000000
Filter size: 117005Kb
Hash functions: 7
String size: 13
user system total real
insert 0.004227 0.000000 0.004227 ( 2.769349)
has? (present) 0.007980 0.000000 0.007980 ( 5.223778)
has? (missing) 0.004318 0.000000 0.004318 ( 2.829521)
Contributors
- greyblake Potapov Sergey - creator, maintainer
- funny-falcon Sokolov Yura - better hash algorithms