hac

Hierarchical agglomerative clustering

Usage no npm install needed!

<script type="module">
  import hac from 'https://cdn.skypack.dev/hac';
</script>

README

HAC

HAC stands for Hierarchical Agglomerative Clustering, a commeon technique for unsupervised document clustering.

NOTICE: HAC requires unpublished modules on github, it will just work fine with npm install, but will fail on Tonic (the Try it out on npm website), since it requires all modules published on npm. Future works will try to publish these required modules on npm.

Installation

npm install hac --save

Usage

Instantiate

var HAC = require("hac");
var hac = new HAC();

Add documents

hac.addDocument(doc, id, class);

Arguments:

  • doc String: the document to be added, could be string of text or array of terms
  • id String/int (optional): the id of the docuemnt. If ignored, a uuid would generated automatically
  • class String/int (optional): the class(or label) of this document. You probably won't need this, but if specified, you could use getMeasure() to get F measure or Randon Index to see clustering performance.

Clustering

hac.cluster(clusterMethod);

Arguments:

  • clusterMethod Class Method: the clustering algorithm to be used. Available options are as following:
    • HAC.GA: Group-average Agglomerative clustering
    • HAC.SingleLink: single link clustering
    • HAC.CompleteLink: complete link clustering
    • HAC.Centroid: centroid clustering. To Be Implemented

Get clustering result

var clusters = hac.getClusters(k, fields);

Arguments:

  • k int: the number of clusters
  • fields Array: array of fields of a document that you want in the final clustering result. Available fields are as following:
    • "id": the id of the document
    • "class": the class(label) of the document, if specified when calling addDocument()
    • "content": string of document content
    • "terms": document content represented as array of terms
    • "tfs": array of term frequencies for this document
    • "vector": vector representation of this document

Alternatively, you could use following method to get clusters with cluster labeling:

var clusters = hac.getClustersWithLabels(k, fields, featureCount, featureMethod);

The cluster labeling algorithm uses feature selection, which is a module called FeatureSelector.

Arguments:

  • k int: number of clusters.
  • fields Array: array of fields. see above description of getClusters()
  • featureCount int: the number of feature terms that you want for each cluster
  • featureMethod Class Method: the feature selection algorithm to be used. Available options are as following:;
    • FeatureSelector.MI: Expected Mutual Information feature selection
    • FeatureSelectr.LLR: Likelihood Ratio feature selection

Get performance measurement

You could get F measure or Random index for the clustering result.

NOTE: if you want to see performance measurements, you must specify the class argument when calling addDocument(). Also, when calling getClusters() or getClustersWithLabels(), you must include the field "class" in the argment fields.

var measure = getMeasure(clusters, method, beta, showRawScore);

Arguments:

  • clusters Array: the clustering result that you get by calling getClusters() or getClustersWithLabels()
  • method Class Method: the measuring algorithm to be used. Available options are as following:
    • HAC.F: F measure
    • HAC.RI: Random Index
  • beta int (optional): If you use HAC.F, you should give hac a beta value, which should be integer greater than or equal to 1
  • showRawScore boolean (optional): If set to true, print the tp, fp, fn, tn, total negative and total positive on the console

Complete example

var hac = new HAC();
var docs = [];
docs.push(["嗨", "你好"]);
docs.push(["嗨", "很", "高興", "認識", "你"]);
docs.push("hello, how's everything today? is everything ok today?")
docs.push("let's test one more document!");
docs.push("documents are always not large enough");

for(var i = 0; i < docs.length; i++) {
    hac.addDocument(docs[i], i);
}
hac.cluster(HAC.GA);

var clusters = hac.getClusters(2, ["id", "content"]);
_.forEach(clusters, function(cluster) {
    console.log("cluster id: " + cluster.id)
    _.forEach(cluster.docs, function(doc) {
        console.log("doc id: " + doc.id)
        console.log("doc content: " + doc.content);
    })
    console.log()
})

the result would be:

cluster id: 7
doc id: 0
doc content: 嗨,你好
doc id: 1
doc content: 嗨,很,高興,認識,你
doc id: 2
doc content: hello, how's everything today? is everything ok today?

cluster id: 6
doc id: 3
doc content: let's test one more document!
doc id: 4
doc content: documents are always not large enough

Release Notes

  • 1.0.7: update url of modules hosted on github to a simpler form
  • 1.0.6: correct require path of the heap module
  • 1.0.5: make statements in README for incompatibility with Tonic
  • 1.0.4: require es6-shim to support older node engine
  • 1.0.3: change arrow functions to anonymous functions for backward compatibility
  • 1.0.2: subtle modification to README
  • 1.0.1: first publishment