@stdlib/ml-incr-kmeans

Incrementally partition data into `k` clusters.

Usage no npm install needed!

<script type="module">
  import stdlibMlIncrKmeans from 'https://cdn.skypack.dev/@stdlib/ml-incr-kmeans';
</script>

README

incrkmeans

NPM version Build Status Coverage Status dependencies

Incrementally partition data into k clusters.

Installation

npm install @stdlib/ml-incr-kmeans

Usage

var incrkmeans = require( '@stdlib/ml-incr-kmeans' );

incrkmeans( k[, ndims][, options] )

Returns an accumulator function which incrementally partitions k clusters.

// Create an accumulator for partitioning 2-dimensional data into 5 clusters:
var accumulator = incrkmeans( 5, 2 );

To specify initial centroids, provide a 2-dimensional k-by-ndims ndarray containing centroid locations.

var Float64Array = require( '@stdlib/array-float64' );
var ndarray = require( '@stdlib/ndarray-ctor' );

// Specify initial centroids:
var buffer = new Float64Array([
    0.0, 0.0,
    1.0, 1.0,
    1.0, -1.0,
    -1.0, -1.0,
    -1.0, 1.0
]);
var shape = [ 5, 2 ];
var strides = [ 2, 1 ];

var centroids = ndarray( 'float64', buffer, shape, strides, 0, 'row-major' );

// Create an accumulator for partitioning 2-dimensional data into 5 clusters:
var accumulator = incrkmeans( centroids );

The function accepts the following options:

  • metric: distance metric. Must be one of the following:

    • 'euclidean': Euclidean distance (default).
    • 'cosine': cosine distance.
    • 'correlation': correlation distance.
  • init: an array containing the centroid initialization method and associated (optional) parameters. The first array element specifies the initialization method and must be one of the following:

    • 'kmeans++': k-means++ initialization (default).
    • 'sample': randomly sample from a specified number of data points.
    • 'forgy': randomly assign data points to one of k clusters and compute cluster centroids.

    The second array element specifies the number of data points to use when calculating initial centroids. When performing kmeans++ initialization, the third array element specifies the number of trials to perform when randomly selecting candidate centroids. Typically, more trials is correlated with initial centroids which lead to better clustering; however, a greater number of trials increases computational overhead. Default: [ 'kmeans++', k, 2+⌊ln(k)⌋ ].

  • normalize: boolean indicating whether to normalize incoming data. This option is only relevant for non-Euclidean distance metrics. If set to true, an accumulator partitioning data based on cosine distance normalizes provided data to unit Euclidean length. If set to true, an accumulator partitioning data based on correlation distance first centers provided data and then normalizes data dimensions to have zero mean and unit variance. If this option is set to false and the metric is either cosine or correlation distance, then incoming data must already be normalized. Default: true.

  • copy: boolean indicating whether to copy incoming data to prevent mutation during normalization. Default: true.

  • seed: PRNG seed. Setting this option is useful when wanting reproducible centroid initialization.

accumulator( [vector] )

If provided a data point vector, the accumulator function returns updated cluster results. If not provided a data point vector, the accumulator function returns the current cluster results.

var Float64Array = require( '@stdlib/array-float64' );
var ndarray = require( '@stdlib/ndarray-ctor' );

// Create a data vector:
var buffer = new Float64Array( 2 );
var shape = [ 2 ];
var strides = [ 1 ];
var vec = ndarray( 'float64', buffer, shape, strides, 0, 'row-major' );

// Create an accumulator for partitioning 2-dimensional data into 5 clusters:
var accumulator = incrkmeans( 5, 2 );

// Provide data to the accumulator:
vec.set( 0, 2.0 );
vec.set( 1, 1.0 );

var out = accumulator( vec );
// e.g., returns {...}

vec.set( 0, 1.0 );
vec.set( 1, -5.0 );

out = accumulator( vec );
// e.g., returns {...}

vec.set( 0, 3.0 );
vec.set( 1, 3.14 );

out = accumulator( vec );
// e.g., returns {...}

out = accumulator();
// e.g., returns {...}

If not provided initial centroids, an accumulator caches data point vectors for subsequent centroid initialization. Until an accumulator computes initial centroids, an accumulator returns null. Once an accumulator has initial centroids (either provided or computed), an accumulator returns cluster results.

Cluster results are comprised of the following:

  • centroids: a k-by-ndims matrix containing centroid locations. Each centroid is the component-wise mean of the data points assigned to a centroid's corresponding cluster.
  • stats: a k-by-4 matrix containing cluster statistics.

Cluster statistics consists of the following columns:

  • 0: number of data points assigned to a cluster.
  • 1: total within-cluster sum of squared distances.
  • 2: arithmetic mean of squared distances.
  • 3: corrected sample standard deviation of squared distances.

accumulator.predict( [out,] X )

Predicts centroid assignment for each data point in a provided matrix X.

var Float64Array = require( '@stdlib/array-float64' );
var ndarray = require( '@stdlib/ndarray-ctor' );

// Create a data vector:
var buffer = new Float64Array( 2 );
var shape = [ 2 ];
var strides = [ 1 ];
var vec = ndarray( 'float64', buffer, shape, strides, 0, 'row-major' );

// Create an accumulator for partitioning 2-dimensional into 2 clusters:
var accumulator = incrkmeans( 2, 2, {
    'init': [ 'sample', 2 ]
});

// Provide data to the accumulator:
vec.set( 0, 2.0 );
vec.set( 1, 1.0 );
accumulator( vec );

vec.set( 0, 1.0 );
vec.set( 1, -5.0 );
accumulator( vec );

vec.set( 0, 3.0 );
vec.set( 1, 3.14 );
accumulator( vec );

// Create a matrix containing the data points for which we want to predict cluster assignment:
buffer = new Float64Array( 4 );
shape = [ 2, 2 ];
strides = [ 2, 1 ];
var mat = ndarray( 'float64', buffer, shape, strides, 0, 'row-major' );

mat.set( 0, 0, 0.0 );
mat.set( 0, 1, 0.0 );

mat.set( 1, 0, 0.5 );
mat.set( 1, 1, -0.5 );

var out = accumulator.predict( mat );
// returns <ndarray>

To specify an output vector, provide a 1-dimensional ndarray as the first argument. Each element in the returned vector corresponds to a predicted cluster index for a respective data point.

Notes

  • Because an accumulator incrementally partitions data, one should not expect cluster statistics to match similar statistics had provided data been analyzed via a batch algorithm. In an incremental context, data points which would not be considered part of a particular cluster when analyzed via a batch algorithm may contribute to that cluster's statistics when analyzed incrementally. In general, the more data provided to an accumulator, the more reliable the cluster statistics.
  • Forgy's method for centroid initialization is generally discouraged, as the method generates initial clusters without internal homogeneity and no theoretical basis. The method's inclusion is due to its historical usage.

Examples

var discreteUniform = require( '@stdlib/random-base-discrete-uniform' );
var normal = require( '@stdlib/random-base-normal' ).factory;
var ndarray = require( '@stdlib/ndarray-ctor' );
var Float64Array = require( '@stdlib/array-float64' );
var Int8Array = require( '@stdlib/array-int8' );
var incrkmeans = require( '@stdlib/ml-incr-kmeans' );

// Define the number of data points to simulate:
var N = 1e4;

// Define the number of clusters:
var k = 5;

// Define cluster properties:
var clusters = new Float64Array([
    0.0, 1.0, 0.0, 1.0, // meanX, stdevX, meanY, stdevY
    -5.0, 1.0, 5.0, 1.0,
    5.0, 1.0, 5.0, 1.0,
    5.0, 1.0, -5.0, 1.0,
    -5.0, 1.0, -5.0, 1.0
]);
clusters = ndarray( 'float64', clusters, [ k, 4 ], [ 4, 1 ], 0, 'row-major' );

// Define accumulator options:
var opts = {
    'metric': 'euclidean',
    'init': [ 'kmeans++', 100 ]
};

// Initialize a 2-dimensional k-means accumulator:
var acc = incrkmeans( k, 2, opts );

// Create PRNGs for generating pseudorandom numbers drawn from 2-d uncorrelated normal distributions...
var randn = ndarray( 'generic', new Array( k*2 ), [ k, 2 ], [ 2, 1 ], 0, 'row-major' );
var i;
for ( i = 0; i < k; i++ ) {
    randn.set( i, 0, normal( clusters.get( i, 0 ), clusters.get( i, 1 ) ) );
    randn.set( i, 1, normal( clusters.get( i, 2 ), clusters.get( i, 3 ) ) );
}

// Create a vector for storing simulated data:
var v = ndarray( 'float64', new Float64Array( 2 ), [ 2 ], [ 1 ], 0, 'row-major' );

// Wrap the vector in a matrix for generating cluster predictions:
var m = ndarray( 'float64', v.data, [ 1, 2 ], [ 2, 1 ], 0, 'row-major' );

// Create a vector for storing cluster predictions:
var p = ndarray( 'int8', new Int8Array( 1 ), [ 1 ], [ 1 ], 0, 'row-major' );

// Simulate data points and incrementally perform k-means clustering...
var totals = [ 0, 0, 0, 0, 0 ];
var X = [];
var Y = [];
for ( i = 0; i < k; i++ ) {
    X.push( [] );
    Y.push( [] );
}

var results;
var x;
var y;
var c;
var r;
for ( i = 0; i < N; i++ ) {
    // Pick a random cluster from which to sample:
    c = discreteUniform( 0, k-1 );
    totals[ c ] += 1;

    // Generate a random cluster data point:
    x = randn.get( c, 0 )();
    v.set( 0, x );
    X[ c ].push( x );

    y = randn.get( c, 1 )();
    v.set( 1, y );
    Y[ c ].push( y );

    // Generate a cluster prediction:
    r = acc.predict( p, m );
    if ( r ) {
        console.log( 'Data point: (%d, %d). Prediction: %d.', x.toFixed( 3 ), y.toFixed( 3 ), r.get( 0 )+1 );
    }
    // Update the accumulator:
    results = acc( v );
}

// Print cluster results:
results = acc();
if ( results ) {
    console.log( '' );
    for ( i = 0; i < k; i++ ) {
        console.log( 'Cluster %d', i+1 );
        console.log( '  centroid: (%d, %d)', results.centroids.get( i, 0 ), results.centroids.get( i, 1 ) );
        console.log( '  size: %d', results.stats.get( i, 0 ) );
    }
    console.log( '' );
}

console.log( '' );
console.log( 'True cluster distribution: %s', totals.join( ', ' ) );
console.log( '' );

References

  • Forgy, E. 1965. "Cluster Analysis of Multivariate Data: Efficiency versus Interpretability of Classification." Biometrics 21 (3): 768–69.
  • MacQueen, J. 1967. "Some methods for classification and analysis of multivariate observations." In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, 281–97. Berkeley, California, USA: University of California Press. https://projecteuclid.org/euclid.bsmsp/1200512992.
  • Lloyd, S. 1982. "Least Squares Quantization in PCM." IEEE Transactions on Information Theory 28 (2). Piscataway, NJ, USA: IEEE Press: 129–37. doi:10.1109/TIT.1982.1056489.
  • Arthur, David, and Sergei Vassilvitskii. 2007. "K-means++: The Advantages of Careful Seeding." In Proceedings of the Eighteenth Annual Acm-Siam Symposium on Discrete Algorithms, 1027–35. SODA '07. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. http://dl.acm.org/citation.cfm?id=1283383.1283494.

Notice

This package is part of stdlib, a standard library for JavaScript and Node.js, with an emphasis on numerical and scientific computing. The library provides a collection of robust, high performance libraries for mathematics, statistics, streams, utilities, and more.

For more information on the project, filing bug reports and feature requests, and guidance on how to develop stdlib, see the main project repository.

Community

Chat


License

See LICENSE.

Copyright

Copyright © 2016-2021. The Stdlib Authors.