Sequential K-Means

Sequential K-means clustering is a modification to the traditional K-means method where the algorithm acts upon subsets of supplied data sequentially rather than in a single batch. This is important in two capacities:

It allows clusters and their associated centroids to be discerned without loading an entire dataset into memory.
It allows streaming/online updates to a model to be made thus providing the ability to react to changes in underlying supplied data

Both traditional and sequential K-means are initialized in the same manner - utilizing K-means++ or random assignment.

Within sequential K-means, a sample point x_t is used to update the closest cluster center from a value c_t-1 to c_t using the formula:

\[c_t=c_{t-1}+a(x_t-c_{t-1})\]

Where

Parameter	Description
\(a\)	Learning rate (0<a<1) defining how quickly a model changes based on new information.

The a value above controls the rate at which the concept of forgetfulness within the algorithm is applied. Forgetful K-means describes a method by which K-means clustering can be modified to account for changes to the position of clusters over time, unlike traditional K-means the introduction of a forgetful component allows the model to evolve and react to new information. This is a balance between prioritising recent information vs the past.

The forgetful option provided with this implementation controls how this algorithm operates, if set to true 1b then the learning rate a is used in the above formula as a floating point value, if however forgetful is set to 0b then the value a in the above formula is set to 1/(n+1) where n is the number of points in cluster c_t-1.

By default in the below implementation the algorithm operates as a forgetful implementation with a learning rate of 0.1

`.ml.online.clust.sequentialKMeans.fit`

Fit a Sequential K-means model

.ml.online.clust.sequentialKMeans.fit[X;df;k;centers;config]

Parameters:

name	type	description
`X`	`any`	Input/training data of N dimensions.
`df`	`symbol`	Distance function used in clustering. This must be one of `edist` or `e2dist` denoting the euclidean distance and euclidean squared distance respectively.
`k`	`long`	The number of clusters.
`centers`	`dictionary\|null`	Initial cluster centers. If null, initial centers are calculated using k++/random initialisation. If dictionary, must contain `num` and `centroids` which define the number of points in a cluster and the cluster location often calculated from a previous 'fit' phase.
`config`	`dictionary`	Any modifications to be applied during the fitting process of Sequential K-means (See here for more details).

In the above function, the following are the optional entries for config:

name	type	description	default
`init`	boolean	Cluster centroid initialization - K-means++/randomization (`1b/0b`)	`1b`
`a`	float	Learning rate	`0.1`
`forgetful`	boolean	Forgetful/normal sequential K-Means (`1b/0b`)	`1b`

Returns:

type	description
`dictionary`	All information collected during the fitting of a model, along with prediction and update functionality.

Note that the information collected during the fitting of the model is contained within the modelInfo key and includes:

name	type	description
`num`	`long[]`	Number of data points in each cluster.
`centroids`	`float[]`	Cluster centers associated with each cluster.
`inputs`	dictionary	Configuration inputs used when fitting the model.

Prediction functionality is contained within the predict key. The function takes X, the feature data which is to be predicted, as input and returns the predicted values. While the update function is stored within the update key and again takes X as input and returns an updated representation of the original model.

Examples:

Example 1: Fit a vanilla sequential K-means algorithm

// Generate a dataset
q)data:([]1000?10f;1000?10f)

// Select random sample of points to pass to K-means
q)sample:neg[500]?data

// Set parameters
q)df:`e2dist           // Distance function - Euclidean squared
q)k:3                  // Number of clusters
q)centers:(::)         // Cluster information
q)config :(::)         // Additional parameters

// Fit K-means based on sample
q)show model:.ml.online.clust.sequentialKMeans.fit[sample;df;k;centers;config]
modelInfo| `num`centroids`inputs!(170 180 150;(5.59595 2.110237 6.841375;8.10..
predict  | {[returnInfo;data]
  modelInfo:returnInfo`modelInfo;
  data:clust...
update   | {[returnInfo;data]
  modelInfo:returnInfo`modelInfo;
  inputs:mode..

// Calculated centroids
q)model[`modelInfo]`centroids
5.59595  2.110237 6.841375
8.104928 4.878808 3.219346

// Use the fitted model to predict on new data
q)newData:2 50#100?10.
q)model.predict newData
2 1 1 1 1 1 0 1 0 1 1 0 0 0 2 0 0 0 0 2 1 1 1 2 1 0 0 1 0 1 1 0 0 2 1 1 0 2 1..

// Update the fitted model with new data
q)show updModel:model.update newData
modelInfo| `num`centroids`inputs!(183 199 168;(6.092505 1.509578 6.613747;8.5..
predict  | {[returnInfo;data]
  modelInfo:returnInfo`modelInfo;
  data:clust...
update   | {[returnInfo;data]
  modelInfo:returnInfo`modelInfo;
  inputs:mode..

// Updated centroids
q)updModel[`modelInfo]`centroids
6.092505 1.509578 6.613747
8.509002 5.429571 2.729484

Example 2: Fit a sequential K-means algorithm providing initial cluster centers and using the non-forgetful version of the algorithm

// Generate a dataset
q)data:([]1000?10f;1000?10f)

// Select random sample of points to pass to K-means
q)sample:neg[500]?data

// Set parameters
q)df:`e2dist                              // Distance function - Euclidean squared
q)k:3                                     // Number of clusters
q)centers:neg[1]_model[`modelInfo]        // Cluster centers from previous fitting
q)config :enlist[`forgetful]!enlist 0b    // Additional parameters

// Fit K-means based on sample
q)show model1:.ml.online.clust.sequentialKMeans.fit[sample;df;k;centers;config]
modelInfo| `num`centroids`inputs!(337 353 310;(5.874493 2.054478 6.938887;8.1..
predict  | {[returnInfo;data]
  modelInfo:returnInfo`modelInfo;
  data:clust...
update   | {[returnInfo;data]
  modelInfo:returnInfo`modelInfo;
  inputs:mode..

// Calculated centroids
q)model1[`modelInfo]`centroids
5.874493 2.054478 6.938887
8.124769 4.845081 3.034611

// Use the fitted model to predict on new data
q)newData:2 50#100?10.
q)model1.predict newData
0 2 2 2 0 2 1 0 1 0 1 1 0 1 2 1 2 0 1 0 2 1 2 1 1 1 1 0 1 1 1 2 1 2 1 1 1 2 2..

// Update the fitted model with new data
q)show updModel1:model1.update newData
modelInfo| `num`centroids`inputs!(350 376 324;(5.896282 2.03924 6.971541;8.14..
predict  | {[returnInfo;data]
  modelInfo:returnInfo`modelInfo;
  data:clust...
update   | {[returnInfo;data]
  modelInfo:returnInfo`modelInfo;
  inputs:mode..