# Sequential K-means¶

Sequential K-means clustering is a modification to the traditional K-means method where the algorithm acts on subsets of supplied data sequentially rather than in a single batch. This is important in two ways:

• Clusters and and their associated centroids can be discerned without loading an entire dataset into memory
• Streaming/online updates to a model can be made, letting them react to changes in the supplied data

Both traditional and sequential K-means are initialized in the same manner – using K-means++ or random assignment.

Within sequential K-means, a sample point $$x_t$$ is used to update the closest cluster center from a value $$c_{t-1}$$ to $$c_t$$ using the formula:

$c_t=c_{t-1}+a(x_t-c_{t-1})$

Where $$a$$ is the learning rate (0<$$a$$<1) defining how quickly a model changes based on new information.

The $$a$$ value above controls the rate at which the concept of forgetfulness within the algorithm is applied. Forgetful K-means is a method by which K-means clustering can be modified to account for changes to the position of clusters over time. Unlike traditional K-means, the introduction of a forgetful component allows the model to evolve and react to new information. This is a balance between prioritizing recent information vs the past.

The forgetful option provided with this implementation controls how this algorithm operates, if set to true the learning rate $$a$$ is used in the above formula as a floating-point value. If, however, forgetful is set to false then the value $$a$$ in the above formula is set to $$\frac{1}{n+1}$$, where $$n$$ is the number of points in cluster $$c_{t-1}$$.

In the implementation below, the algorithm operates by default as a forgetful implementation with a learning rate of 0.1.

## .ml.online.clust.sequentialKMeans.fit¶

Fit a Sequential K-means model

.ml.online.clust.sequentialKMeans.fit[X;df;k;centers;config]


Where

• X is the input/training data of $$N$$ dimensions
• df what distance function is used to calculate cluster-centroid distance
• k is the number of expected cluster centers
• centers are the initial cluster centers to be used: two options are supported.
• If :: then initial centers are calculated using k++/random initialization
• A dictionary: num and centroids define the number of points in a cluster and the cluster location often calculated from a previous ‘fit’ phase.
• config is the configurable dictionary defining any modifications to be applied during the fitting process of Sequential K-means

returns a dictionary containing all information collected during the fitting of a model, along with prediction and update functionality.

The information collected during the fitting of the model are contained within modelInfo key and includes:

num       | number of data points in each cluster
centroids | cluster centers associated with each cluster
inputs    | configuration inputs used when fitting the model


Prediction functionality is contained within the predict key. The function takes as argument the feature data which is to be predicted, and returns the predicted values.

An update function within the update key takes as argument the feature data with which the cluster centers are to be updated, and returns an updated representation of the original model.

Fit a vanilla sequential K-means algorithm:

// Generate a dataset
q)data:([]1000?10f;1000?10f)

// Select random sample of points to pass to K-means
q)sample:neg[500]?data

// Set parameters
q)df:e2dist           // Distance function - Euclidean squared
q)k:3                  // Number of clusters
q)centers:(::)         // Cluster information

// Fit K-means based on sample
q)show model:.ml.online.clust.sequentialKMeans.fit[sample;df;k;centers;config]
modelInfo| numcentroidsinputs!(170 180 150;(5.59595 2.110237 6.841375;8.10..
predict  | {[returnInfo;data]
modelInfo:returnInfomodelInfo;
data:clust...
update   | {[returnInfo;data]
modelInfo:returnInfomodelInfo;
inputs:mode..

// Calculated centroids
q)model[modelInfo]centroids
5.59595  2.110237 6.841375
8.104928 4.878808 3.219346

// Use the fitted model to predict on new data
q)newData:2 50#100?10.
q)model.predict newData
2 1 1 1 1 1 0 1 0 1 1 0 0 0 2 0 0 0 0 2 1 1 1 2 1 0 0 1 0 1 1 0 0 2 1 1 0 2 1..

// Update the fitted model with new data
q)show updModel:model.update newData
modelInfo| numcentroidsinputs!(183 199 168;(6.092505 1.509578 6.613747;8.5..
predict  | {[returnInfo;data]
modelInfo:returnInfomodelInfo;
data:clust...
update   | {[returnInfo;data]
modelInfo:returnInfomodelInfo;
inputs:mode..

// Updated centroids
q)updModel[modelInfo]centroids
6.092505 1.509578 6.613747
8.509002 5.429571 2.729484


Fit a sequential K-means algorithm providing initial cluster centers and using the non-forgetful version of the algorithm:

// Generate a dataset
q)data:([]1000?10f;1000?10f)

// Select random sample of points to pass to K-means
q)sample:neg[500]?data

// Set parameters
q)df:e2dist                              // Distance function - Euclidean squared
q)k:3                                     // Number of clusters
q)centers:neg[1]_model[modelInfo]        // Cluster centers from previous fitting
q)config :enlist[forgetful]!enlist 0b    // Additional parameters

// Fit K-means based on sample
q)show model1:.ml.online.clust.sequentialKMeans.fit[sample;df;k;centers;config]
modelInfo| numcentroidsinputs!(337 353 310;(5.874493 2.054478 6.938887;8.1..
predict  | {[returnInfo;data]
modelInfo:returnInfomodelInfo;
data:clust...
update   | {[returnInfo;data]
modelInfo:returnInfomodelInfo;
inputs:mode..

// Calculated centroids
q)model1[modelInfo]centroids
5.874493 2.054478 6.938887
8.124769 4.845081 3.034611

// Use the fitted model to predict on new data
q)newData:2 50#100?10.
q)model1.predict newData
0 2 2 2 0 2 1 0 1 0 1 1 0 1 2 1 2 0 1 0 2 1 2 1 1 1 1 0 1 1 1 2 1 2 1 1 1 2 2..

// Update the fitted model with new data
q)show updModel1:model1.update newData
modelInfo| numcentroidsinputs!(350 376 324;(5.896282 2.03924 6.971541;8.14..
predict  | {[returnInfo;data]
modelInfo:returnInfomodelInfo;
data:clust...
update   | {[returnInfo;data]
modelInfo:returnInfomodelInfo;
inputs:mode..


## Configurable parameters¶

In the above function, the following are the optional entries for config:

key type default description
init boolean 1b How to initialize cluster centroids - K-means++ (1b) or randomization (0b)
a float 0.1 Learning rate - value between 0-1 used to define how much past centroid information to retain
forgetful boolean 1b Apply forgetful (1b) or normal sequential K Means (0b)