## C++ fast hierarchical clustering algorithms

This is a simplified C++ interface to the fast implementations of hierarchical clustering by Daniel Müllner. The original library with interfaces to R and Python can be found on danifold.net and is described in:

- Daniel Müllner:
*fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python.*Journal of Statistical Software 53, no. 9, pp. 1-18 (2013)

#### About

Hierarchical, or agglomerative clustering is a powerful technique for partitioning a set of observables. In contrast to clustering schemes like K-means, hierarchical clustering does not require the observables to be members of a vector space, but it works on a distance matrix and is thus applicable to arbitrary observables for which a distance metric can be defined.

Daniel Müllner has compared the performance of different hierarchical
clustering algorithms and implemented the fastest of them in C++ with R
and Python interfaces. Whilst these interfaces are described in his
R journal article, direct
use of the underlying C++ functions is tricky and undocumented.
To simplify this, I have written a C++ interface that hides the intricacies
of the internal output formats behind a single function (*hclust_fast*)
and provides two simple functions for the actual partitioning step
(*cutree_k* and *cutree_cdist*).

#### Usage

A complete usage example can be found in the file *demo.cpp*
in the source package. See the file *README* for details.

The interface for the basic clustering algorithms requires a condensed
distance matrix, which is the upper triangle (without the diagonal elements)
of the full distance matrix. Here is an example for its construction
where *n* is the number of observables *x*:

double* distmat = new double[(n*(n-1))/2]; int k,i,j; for (i=k=0; i<n; i++) { for (j=i+1; j<n; j++) { // compute distance between observables i and j distmat[k] = distance(x[i], x[j]); k++; } }

The actual clustering is done by the function *hclust_fast*, which
supports four methods for defining a cluster distance from individual
distances (*HCLUST_METHOD_SINGLE, HCLUST_METHOD_COMPLETE,
HCLUST_METHOD_AVERAGE, HCLUST_METHOD_MEDIAN*):

int* merge = new int[2*(n-1)]; double* height = new double[n-1]; hclust_fast(n, distmat, HCLUST_METHOD_SINGLE, merge, height);

*height* is filled with the cluster distance for each step. This
can be useful for automatically determining the clustering break point,
e.g, with the "elbow method".
*merge* contains the dendrogram in the encoding of the R
function *hclust*. Fortunately, you do not need to understand this
encoding, because there are two function for further processing the output:

int* labels = new int[n]; // partitioning into nclust clusters cutree_k(n, merge, nclust, labels); // stop clustering at step with custer distance >= cdist cutree_cdist(npoints, merge, height, cdist, labels);

*labels[i]* is filled with the cluster label
of observable *x[i]*. Cluster labels start at zero.
Final note: do not forget to free the memory after you are done with
the variables:

delete[] distmat; delete[] merge; delete[] height; delete[] labels;

#### Download

A copy of the source code of the library together with a demo program, sample data, and a script for visualization of the clustering result can be found on Github. Beware that Github blocks users from some countries. As an alternative, you can download the latest file release of the complete package:

- hclust-cpp.tgz (version 1.2 from 2020/01/13)

#### Authors and license

The fastcluster algorithm implementation is copyrighted by Daniel Müllner. The simplified C++ interface and the sample program is copyrighted by Christoph Dalitz. The complete code can be freely used under the terms a two-clause BSD-style license. See the file LICENSE in the source package for details.