module HClust

Overview

The HClust module provides methods for fast hierarchical agglomerative clustering featuring efficient linkage algorithms.

Cluster analysis or clustering arranges a set of objects into distinct groups or clusters such that the elements within a cluster are more similar to each other than to those in other clusters based on a given criterion. The latter is defined as a measure of dissimilarity between the objects, and it's usually encoded in a DistanceMatrix. Hierarchical agglomerative clustering builds a hierarchy of clusters, where each object or element starts in its own cluster, and clusters are then progressively merged to form a hierarchy among them. The dissimilarities between a newly-created cluster and all other clusters are updated after each merging step. The dissimilarity between two clusters is dictated by the chosen linkage criterion or rule. The obtained hierarchy is often encoded as a Dendrogram, which can be cut at a given dissimilarity value to obtain flat clusters such that the elements of a flat cluster have a cophenetic distance no greater than the given value (see Dendrogram#flatten).

The standard algorithm has a time complexity of Θ(N³) and is implemented by the .primitive method. However, optimal algorithms are provided based on the work of Daniel Müllner [1], which describes several optimizations over the standard algorithm (implemented by the .generic method) as well as fast methods for special cases: the minimum spanning tree (MST) method (.mst) for single linkage, and the nearest-neighbor-chain (NN-chain) method (.nn_chain) for average, complete, Ward's, and weighted linkage. Thus, the best-case complexity is reduced to Θ(N²), and in practice the runtime of the general case stays close to this bound.

The current implementation is heavily based on Müllner's own implementation found in the fastcluster C++ library. Parts of the implementation were also inspired by the kodama Rust crate (code for updating distances) and the SciPy Python library (code for generating flat clusters).

The most relevant types and methods for practical usage are the following:

- .cluster: convenience methods that cluster the given elements and return flat clusters directly
- .linkage: performs the clustering and returns a Dendrogram
- DistanceMatrix: encodes the pairwise dissimilarities between the elements
- Dendrogram: encodes the clustering hierarchy, which can be cut into flat clusters via Dendrogram#flatten

The available linkage rules are: average, centroid, complete, median, single, Ward's, and weighted.

See the Rule documentation for details.

Clustering example

Let's first define a list of random coordinates in 3D space as an example case:

coords = [
  [-0.30818828, 2.70462841, 1.84344886],
  [2.9666203, -1.39874721, 4.76223947],
  [3.21737027, 4.09489028, -4.60403434],
  [-3.51140292, -0.83953645, 2.31887739],
  [2.08457843, 4.24960773, -3.91378835],
  [2.88992367, -0.97659082, 0.75464131],
  [0.43808545, 3.70042294, 4.99126146],
  [-1.71676206, 4.93399583, 0.27392482],
  [1.12130963, -1.09646418, 1.45833231],
  [-3.45524705, 0.92812111, 0.15155981],
]
labels = (0...coords.size).to_a # for demonstration purposes

We'd like to group them by the proximity to each other defined by Euclidean distance:

def euclidean(u, v)
  Math.sqrt((0...u.size).sum { |i| (u[i] - v[i])**2 })
end

Let's say we want to split the coordinates into groups with a distance of no more than 4. The easiest way is to use the .cluster convenience method:

clusters = HClust.cluster(labels, 4) { |i, j| euclidean(coords[i], coords[j]) }
clusters.size # => 3
clusters      # => [[0, 3, 6, 7, 9], [1, 5, 8], [2, 4]]

The method receives the elements to be clustered, a distance cutoff, and a block that receives two of the elements and must return the dissimilarity between them. We observe that the coordinates can be grouped into 3 distinct clusters containing 5, 3, and 2 elements, respectively. The order of the clusters is arbitrary and depends on the order of the elements.
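
Note that the block receives the elements themselves, so the coordinates could also be clustered directly without the auxiliary labels. A minimal sketch equivalent to the call above, where each resulting cluster holds coordinates instead of indexes:

clusters = HClust.cluster(coords, 4) { |u, v| euclidean(u, v) }
clusters.size # => 3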

Alternatively, one can set the maximum number of clusters to be generated, instead of a distance cutoff, using the named argument into:

clusters = HClust.cluster(labels, into: 2) { |i, j| euclidean(coords[i], coords[j]) }
clusters.size # => 2
clusters      # => [[0, 1, 3, 5, 6, 7, 8, 9], [2, 4]]

As stated above, the linkage rule dictates how the dissimilarities are updated upon merging two clusters during the clustering procedure. Therefore, the clustering output will depend on the selected rule. In turn, the choice of the linkage rule depends both on the problem domain and on performance requirements. By default, the single linkage rule is used, but a different rule can be given as the third (optional) argument to the cluster methods if required.

clusters = HClust.cluster(labels, 2, :centroid) { |i, j|
  euclidean(coords[i], coords[j])
}
clusters.size # => 5
clusters      # => [[0, 7], [1, 5, 8], [2, 4], [3, 9], [6]]

Note the different number and composition of the clusters obtained with the centroid linkage compared to the previous results using the single linkage.

Advanced usage

Using the .cluster methods is enough in most cases, though obtaining the dendrogram can be useful for visualization purposes or for testing different clustering arguments without recomputing the dissimilarities.

Under the hood, the .cluster methods construct a DistanceMatrix with the given block, invoke the .linkage method, and then call Dendrogram#flatten on the obtained dendrogram with the given argument. The latter returns an array of clusters, each containing a list of indexes that can be used to fetch the original elements. The equivalent code to the above example would be:

dism = HClust::DistanceMatrix.new(coords.size) { |i, j| euclidean(coords[i], coords[j]) }
dendrogram = HClust.linkage(dism, :single)
clusters = dendrogram.flatten(4)

The dendrogram represents the hierarchy as a series of merge steps, each containing the merged clusters and the computed dissimilarity. For instance, printing the merge steps of the dendrogram:

pp dendrogram.steps.map { |s| [s.clusters[0], s.clusters[1], s.distance, 0.0] }

will print:

# cluster1, cluster2, dissimilarity, size (unused)
[[2, 4, 1.3355127737375514, 0.0],
 [5, 8, 1.9072352420201895, 0.0],
 [3, 9, 2.7973259058854163, 0.0],
 [0, 7, 3.068805125647255, 0.0],
 [6, 13, 3.384856775520168, 0.0],
 [1, 11, 3.796359969884454, 0.0],
 [12, 14, 3.9902939298123274, 0.0],
 [15, 16, 4.079225895869359, 0.0],
 [10, 17, 5.696974476555648, 0.0]]

The output can be copied into a Python terminal and visualized using the dendrogram function in SciPy or similar software. It would look something like:

  |       ________________
5 |      |                |
  |      |                |
4 |      |         _______|_______
  |      |      __|__         ____|___
3 |      |     |     |       |      __|__
  |      |     |     |      _|_    |    _|_
2 |      |     |    _|_    |   |   |   |   |
  |      |     |   |   |   |   |   |   |   |
1 |     _|_    |   |   |   |   |   |   |   |
  |    |   |   |   |   |   |   |   |   |   |
0 |    2   4   1   5   8   3   9   6   0   7

Using this graph, one can deduce a suitable cutoff for generating flat clusters; a cutoff of 4 would indeed produce the three clusters shown above.
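
Because the dendrogram encodes the complete hierarchy, different cutoffs can be explored by flattening the same dendrogram repeatedly, without recomputing the dissimilarities. A minimal sketch using the dendrogram from above:

[2.0, 4.0, 4.5].each do |cutoff|
  clusters = dendrogram.flatten(cutoff)
  puts "cutoff=#{cutoff}: #{clusters.size} clusters"
end
# prints 8, 3, and 2 clusters, respectively, as can be read off the merge steps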

Defined in:

hclust.cr
hclust/dendrogram.cr
hclust/version.cr

Constant Summary

VERSION = {{ (`shards version "#{__DIR__}"`).chomp.stringify }}

Class Method Detail

def self.cluster(elements : Indexable(T), cutoff : Number, rule : Rule = :single, & : T, T -> Float64) : Array(Array(T)) forall T

Clusters elements using the linkage rule rule based on the distances computed by the given block. The clusters are generated such that the cophenetic distance between any two elements in a cluster is less than or equal to cutoff.
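
For instance, a cutoff of 0 merges nothing, so every element ends up in its own cluster. A minimal sketch, assuming the coords, labels, and euclidean helper from the clustering example above:

clusters = HClust.cluster(labels, 0) { |i, j| euclidean(coords[i], coords[j]) }
clusters.size # => 10 (one singleton cluster per element)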


def self.cluster(elements : Indexable(T), *, into count : Int, rule : Rule = :single, & : T, T -> Float64) : Array(Array(T)) forall T

Clusters elements into count clusters or fewer using the linkage rule rule based on the distances computed by the given block.
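
For instance, asking for at most five clusters with the data from the clustering example above (a sketch; the exact composition depends on the merge order):

clusters = HClust.cluster(labels, into: 5) { |i, j| euclidean(coords[i], coords[j]) }
clusters.size # => 5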


def self.generic(dism : DistanceMatrix, rule : Rule) : Dendrogram

Performs hierarchical clustering based on the distances stored in dism using a fast generic linkage algorithm with the given linkage rule.

The so-called generic algorithm published by Müllner [1] includes several optimizations over the classic hierarchical clustering algorithm, reducing the best-case complexity from Θ(N³) to Θ(N²). In practice, it is considerably faster than the standard method, mainly because the algorithm keeps track of the nearest neighbors of clusters in a priority queue, which speeds up the repeated minimum searches.

This algorithm can deal with inversions in the dendrogram, so it can be used with any linkage rule including Rule::Centroid and Rule::Median.
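
For reference, a direct call with one of these rules might look as follows (a sketch, assuming the dism distance matrix from the advanced usage example):

dendrogram = HClust.generic(dism, :median)
clusters = dendrogram.flatten(4)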

The merge steps are encoded as an unordered Dendrogram, which is sorted before being returned.

The current implementation is described in section 3.1 of Müllner's article [1].

NOTE Prefer to use the .linkage method since it provides a general interface and picks the best algorithm depending on the linkage rule.


def self.linkage(dism : DistanceMatrix, rule : Rule, reuse : Bool = false) : Dendrogram

Returns the hierarchical clustering based on the pairwise distances dism using the linkage rule rule.

This method simply selects and invokes the optimal algorithm based on the given linkage rule as follows:

- Rule::Single: the minimum spanning tree (MST) algorithm (.mst)
- Rule::Average, Rule::Complete, Rule::Ward, and Rule::Weighted: the nearest-neighbor-chain (NN-chain) algorithm (.nn_chain)
- Rule::Centroid and Rule::Median: the generic algorithm (.generic)

If reuse is true, the distance matrix dism is forwarded directly to the underlying method and will potentially be modified. If reuse is false, a copy is created first and then forwarded. Setting reuse to true avoids a potentially large memory allocation when the distance matrix is not needed after clustering.
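
A minimal sketch of both modes, assuming the coords and euclidean helper from the clustering example above:

dism = HClust::DistanceMatrix.new(coords.size) { |i, j| euclidean(coords[i], coords[j]) }
HClust.linkage(dism, :single)              # clusters a copy; dism stays intact
HClust.linkage(dism, :single, reuse: true) # avoids the copy; dism may be modified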


def self.mst(dism : DistanceMatrix) : Dendrogram

Performs hierarchical clustering based on the distances stored in dism using the minimum spanning tree (MST) algorithm.

The MST algorithm keeps track of the distance to the nearest neighbor for each cluster after every merge step, which leads to a significant speed-up, as obtaining the next pair of nearest clusters is very efficient. By definition, this algorithm can only be used with the Rule::Single linkage rule.
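
A direct call sketch, assuming the dism distance matrix from the advanced usage example (equivalent to HClust.linkage(dism, :single)):

dendrogram = HClust.mst(dism)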

The merge steps are encoded as an unordered Dendrogram, which is sorted before being returned.

The current implementation is described in section 3.3 of Müllner's article [1], which includes several optimizations over the classic implementation.

NOTE Prefer to use the .linkage method since it provides a general interface and picks the best algorithm depending on the linkage rule.


def self.nn_chain(dism : DistanceMatrix, rule : ChainRule) : Dendrogram

Performs hierarchical clustering based on the distances stored in dism using the nearest-neighbor-chain (NN-chain) algorithm with the given linkage rule.

The NN-chain algorithm follows paths in the nearest neighbor graph of the clusters to find a pair of clusters that are nearest neighbors of each other, which are then merged into a new cluster. The algorithm uses a stack data structure (Deque) to store the paths, which can lead to a speed-up by efficiently re-using parts of the existing paths.

By definition, the NN-chain algorithm can only be used with the following linkage rules: Rule::Single, Rule::Complete, Rule::Average, Rule::Weighted, and Rule::Ward. Consequently, this method accepts a ChainRule enum (not Rule), which contains only these rules, ensuring correctness at compile time.
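
For example, a call with Ward's linkage (a sketch, assuming the dism distance matrix from the advanced usage example). Passing, say, :centroid here would not compile, since it is not a ChainRule member:

dendrogram = HClust.nn_chain(dism, :ward)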

The merge steps are encoded as an unordered Dendrogram, which is sorted before being returned.

The current implementation is described in section 3.2 of Müllner's article [1].

NOTE Prefer to use the .linkage method since it provides a general interface and picks the best algorithm depending on the linkage rule.


def self.primitive(dism : DistanceMatrix, rule : Rule) : Dendrogram

Performs hierarchical clustering based on the distances stored in dism using the classic algorithm with the given linkage rule.

The classic or primitive algorithm iteratively finds the pair of clusters that are nearest neighbors of each other and merges them into a new cluster. Then, the distances stored in dism are updated by computing the distances to the newly-created cluster according to the linkage rule. The procedure is repeated N - 1 times, where N is the number of elements or observations. Since all pairwise distances are searched in each iteration, the time complexity of this algorithm is Θ(N³).

This algorithm can deal with inversions in the dendrogram, so it can be used with any linkage rule including Rule::Centroid and Rule::Median.

The merge steps are encoded as an unordered Dendrogram, which is sorted before being returned.

The current implementation is described in section 2.4 of Müllner's article [1].

WARNING This method is painfully slow and should not be used in production; it serves only as a reference for testing the other methods. Prefer the .linkage method since it provides a general interface and picks the best algorithm depending on the linkage rule.

