Fast hierarchical clustering algorithms in pure Crystal


This shard provides types and methods for fast hierarchical agglomerative clustering featuring efficient linkage algorithms.

The current implementation is heavily based on the work of Daniel Müllner [1] and derived from Müllner's own implementation in the fastcluster C++ library [2]. Parts of the implementation were also inspired by the kodama Rust crate (code for updating distances) and the SciPy Python library (code for generating flat clusters).

The runtime performance of this library is on par with the reference implementations (see benchmark).

The most relevant types and methods for practical usage are HClust.cluster, HClust.linkage, HClust::DistanceMatrix, and HClust::Dendrogram#flatten, all of which appear in the usage examples below.

The available linkage rules, following those of fastcluster, are: single, complete, average, weighted, ward, centroid, and median.

This library is released under the MIT license.

Installation

  1. Add the dependency to your shard.yml:

    dependencies:
      hclust:
        github: franciscoadasme/hclust
  2. Run shards install

Usage

First define the data points to be clustered:

coords = [
  [-0.30818828, 2.70462841, 1.84344886],
  [2.9666203, -1.39874721, 4.76223947],
  [3.21737027, 4.09489028, -4.60403434],
  [-3.51140292, -0.83953645, 2.31887739],
  [2.08457843, 4.24960773, -3.91378835],
  [2.88992367, -0.97659082, 0.75464131],
  [0.43808545, 3.70042294, 4.99126146],
  [-1.71676206, 4.93399583, 0.27392482],
  [1.12130963, -1.09646418, 1.45833231],
  [-3.45524705, 0.92812111, 0.15155981],
]
labels = (0...coords.size).to_a # for demonstration purposes
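
The examples below compute the Euclidean distance between two coordinate vectors inside the block passed to each method. For clarity, that computation can be factored into a helper like the following (a sketch only; euclidean is not part of the hclust API, and the examples below keep the inline form):

# Hypothetical helper: Euclidean distance between two equal-length vectors,
# i.e. the square root of the sum of squared component differences.
def euclidean(u : Array(Float64), v : Array(Float64)) : Float64
  Math.sqrt((0...u.size).sum { |k| (u[k] - v[k])**2 })
end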

The easiest way is to use the convenience method #cluster. The following code clusters the data points into groups based on the Euclidean distance, with a distance cutoff of 4 and single linkage (the default rule, so it can be omitted):

require "hclust"

clusters = HClust.cluster(labels, 4, :single) { |i, j|
  u, v = coords[i], coords[j]
  Math.sqrt((0...u.size).sum { |k| (u[k] - v[k])**2 }) # euclidean distance
}
clusters.size # => 3
clusters      # => [[0, 3, 6, 7, 9], [1, 5, 8], [2, 4]]
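
Each cluster is an array of the original elements (here the integer labels, which double as indices into coords), so mapping a cluster back to its data points is direct. A quick sketch computing a simple per-cluster centroid (not part of the library):

clusters.each_with_index do |cluster, n|
  points = cluster.map { |i| coords[i] }
  # Average each coordinate axis across the cluster members.
  centroid = points.transpose.map { |axis| axis.sum / points.size }
  puts "cluster #{n}: #{points.size} points, centroid #{centroid}"
end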

Use the into: named argument to limit the number of clusters:

clusters = HClust.cluster(labels, into: 2) { |i, j|
  u, v = coords[i], coords[j]
  Math.sqrt((0...u.size).sum { |k| (u[k] - v[k])**2 }) # euclidean distance
}
clusters.size # => 2
clusters      # => [[0, 1, 3, 5, 6, 7, 8, 9], [2, 4]]
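
Because into: cuts the merge hierarchy at a given cluster count, and a distance cutoff corresponds to a horizontal cut of the same hierarchy, requesting three clusters here should reproduce the cutoff-based partition from the first example (the expected count below is inferred, not verified output):

clusters = HClust.cluster(labels, into: 3) { |i, j|
  u, v = coords[i], coords[j]
  Math.sqrt((0...u.size).sum { |k| (u[k] - v[k])**2 }) # euclidean distance
}
clusters.size # expected: 3, as in the cutoff example above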

Alternatively, the procedure can be replicated by doing each step manually:

dism = HClust::DistanceMatrix.new(coords.size) { |i, j|
  u, v = coords[i], coords[j]
  Math.sqrt((0...u.size).sum { |k| (u[k] - v[k])**2 }) # euclidean distance
}
dendrogram = HClust.linkage(dism, :single)
clusters = dendrogram.flatten(height: 4)
clusters.size # => 3
clusters      # => [[0, 3, 6, 7, 9], [1, 5, 8], [2, 4]]

The latter approach avoids recomputing the dissimilarities when testing different clustering arguments, and the intermediate dendrogram itself can be useful for visual inspection.
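
For instance, once the dendrogram has been built, several cutoffs can be tried at negligible cost (the heights below are illustrative):

[2.0, 4.0, 6.0].each do |height|
  clusters = dendrogram.flatten(height: height)
  puts "height #{height}: #{clusters.size} clusters"
end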

Refer to the API documentation for further details.

Benchmark

A Bash script is provided to benchmark the code and compare it against the reference implementations. Run the following command in the project folder:

bash bench/bench.sh

This will temporarily download the required libraries (except for SciPy, which must be available in the current Python environment) to run the corresponding code. Each benchmark runs several times, and the best time is printed out.

The expected output below was obtained on an AMD® Ryzen 9 5950X under Pop!_OS 22.04 LTS with default values:

| name        | version | compiler     | time (ms) |
| ----------- | ------- | ------------ | --------- |
| fastcluster | 1.2.6   | 11.2.0 (gcc) | 0.032     |
| kodama      | 0.2.3   | 1.61.0       | 0.041     |
| scipy       | 1.7.3   | 3.9.12       | 0.094     |
| hclust      | 0.1.0   | 1.4.1        | 0.067     |

The benchmark can be configured via environment variables; refer to bench/bench.sh for the available options.

Contributing

  1. Fork it (https://github.com/franciscoadasme/hclust/fork)
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Contributors