Sequential K-Means
Sequential K-means clustering is a modification to the traditional K-means method where the algorithm acts upon subsets of supplied data sequentially rather than in a single batch. This is important in two capacities:
- It allows clusters and their associated centroids to be discerned without loading an entire dataset into memory.
- It allows streaming/online updates to a model to be made thus providing the ability to react to changes in underlying supplied data
Both traditional and sequential K-means are initialized in the same manner - utilizing K-means++ or random assignment.
Within sequential K-means, a sample point xt is used to update the closest cluster center from a value ct-1 to ct using the formula:
Where
Parameter | Description |
---|---|
\(a\) | Learning rate (0<a<1) defining how quickly a model changes based on new information. |
The a
value above controls the rate at which the concept of forgetfulness within the algorithm is applied. Forgetful K-means describes a method by which K-means clustering can be modified to account for changes to the position of clusters over time, unlike traditional K-means the introduction of a forgetful component allows the model to evolve and react to new information. This is a balance between prioritising recent information vs the past.
The forgetful
option provided with this implementation controls how this algorithm operates, if set to true 1b
then the learning rate a
is used in the above formula as a floating point value, if however forgetful
is set to 0b
then the value a
in the above formula is set to 1/(n+1)
where n
is the number of points in cluster ct-1.
By default in the below implementation the algorithm operates as a forgetful
implementation with a learning rate of 0.1
.ml.online.clust.sequentialKMeans.fit
Fit a Sequential K-means model
.ml.online.clust.sequentialKMeans.fit[X;df;k;centers;config]
Parameters:
name | type | description |
---|---|---|
X |
any |
Input/training data of N dimensions. |
df |
symbol |
Distance function used in clustering. This must be one of edist or e2dist denoting the euclidean distance and euclidean squared distance respectively. |
k |
long |
The number of clusters. |
centers |
dictionary|null |
Initial cluster centers. If null, initial centers are calculated using k++/random initialisation. If dictionary, must contain num and centroids which define the number of points in a cluster and the cluster location often calculated from a previous 'fit' phase. |
config |
dictionary |
Any modifications to be applied during the fitting process of Sequential K-means (See here for more details). |
In the above function, the following are the optional entries for config
:
name | type | description | default |
---|---|---|---|
init |
boolean | Cluster centroid initialization - K-means++/randomization (1b/0b ) |
1b |
a |
float | Learning rate | 0.1 |
forgetful |
boolean | Forgetful/normal sequential K-Means (1b/0b ) |
1b |
Returns:
type | description |
---|---|
dictionary |
All information collected during the fitting of a model, along with prediction and update functionality. |
Note that the information collected during the fitting of the model is contained within the modelInfo
key and includes:
name | type | description |
---|---|---|
num |
long[] |
Number of data points in each cluster. |
centroids |
float[] |
Cluster centers associated with each cluster. |
inputs |
dictionary | Configuration inputs used when fitting the model. |
Prediction functionality is contained within the predict
key. The function takes X
, the feature data which is to be predicted, as input and returns the predicted values. While the update function is stored within the update
key and again takes X
as input and returns an updated representation of the original model.
Examples:
Example 1: Fit a vanilla sequential K-means algorithm
// Generate a dataset
q)data:([]1000?10f;1000?10f)
// Select random sample of points to pass to K-means
q)sample:neg[500]?data
// Set parameters
q)df:`e2dist // Distance function - Euclidean squared
q)k:3 // Number of clusters
q)centers:(::) // Cluster information
q)config :(::) // Additional parameters
// Fit K-means based on sample
q)show model:.ml.online.clust.sequentialKMeans.fit[sample;df;k;centers;config]
modelInfo| `num`centroids`inputs!(170 180 150;(5.59595 2.110237 6.841375;8.10..
predict | {[returnInfo;data]
modelInfo:returnInfo`modelInfo;
data:clust...
update | {[returnInfo;data]
modelInfo:returnInfo`modelInfo;
inputs:mode..
// Calculated centroids
q)model[`modelInfo]`centroids
5.59595 2.110237 6.841375
8.104928 4.878808 3.219346
// Use the fitted model to predict on new data
q)newData:2 50#100?10.
q)model.predict newData
2 1 1 1 1 1 0 1 0 1 1 0 0 0 2 0 0 0 0 2 1 1 1 2 1 0 0 1 0 1 1 0 0 2 1 1 0 2 1..
// Update the fitted model with new data
q)show updModel:model.update newData
modelInfo| `num`centroids`inputs!(183 199 168;(6.092505 1.509578 6.613747;8.5..
predict | {[returnInfo;data]
modelInfo:returnInfo`modelInfo;
data:clust...
update | {[returnInfo;data]
modelInfo:returnInfo`modelInfo;
inputs:mode..
// Updated centroids
q)updModel[`modelInfo]`centroids
6.092505 1.509578 6.613747
8.509002 5.429571 2.729484
Example 2: Fit a sequential K-means algorithm providing initial cluster centers and using the non-forgetful version of the algorithm
// Generate a dataset
q)data:([]1000?10f;1000?10f)
// Select random sample of points to pass to K-means
q)sample:neg[500]?data
// Set parameters
q)df:`e2dist // Distance function - Euclidean squared
q)k:3 // Number of clusters
q)centers:neg[1]_model[`modelInfo] // Cluster centers from previous fitting
q)config :enlist[`forgetful]!enlist 0b // Additional parameters
// Fit K-means based on sample
q)show model1:.ml.online.clust.sequentialKMeans.fit[sample;df;k;centers;config]
modelInfo| `num`centroids`inputs!(337 353 310;(5.874493 2.054478 6.938887;8.1..
predict | {[returnInfo;data]
modelInfo:returnInfo`modelInfo;
data:clust...
update | {[returnInfo;data]
modelInfo:returnInfo`modelInfo;
inputs:mode..
// Calculated centroids
q)model1[`modelInfo]`centroids
5.874493 2.054478 6.938887
8.124769 4.845081 3.034611
// Use the fitted model to predict on new data
q)newData:2 50#100?10.
q)model1.predict newData
0 2 2 2 0 2 1 0 1 0 1 1 0 1 2 1 2 0 1 0 2 1 2 1 1 1 1 0 1 1 1 2 1 2 1 1 1 2 2..
// Update the fitted model with new data
q)show updModel1:model1.update newData
modelInfo| `num`centroids`inputs!(350 376 324;(5.896282 2.03924 6.971541;8.14..
predict | {[returnInfo;data]
modelInfo:returnInfo`modelInfo;
data:clust...
update | {[returnInfo;data]
modelInfo:returnInfo`modelInfo;
inputs:mode..