Clustering algorithms¶
.ml.clust   Clustering
Algorithms
  Affinity Propagation (AP):
    ap.fit            Fit AP algorithm
  Clustering Using REpresentatives (CURE):
    cure.fit          Fit CURE algorithm
    cure.fitPredict   Fit CURE algorithm to data and convert dendrogram to clusters
  Density-Based Spatial Clustering of Applications with Noise (DBSCAN):
    dbscan.fit        Fit DBSCAN algorithm
  Hierarchical Clustering (HC):
    hc.fit            Fit HC algorithm
    hc.fitPredict     Fit HC algorithm to data and convert dendrogram to clusters
  K-Means:
    kmeans.fit        Fit K-Means algorithm
Dendrogram cutting functionality
  Clustering Using REpresentatives (CURE):
    cure.cutK         Cut dendrogram to k clusters
    cure.cutDist      Cut dendrogram to clusters based on distance threshold
  Hierarchical Clustering (HC):
    hc.cutK           Cut dendrogram to k clusters
    hc.cutDist        Cut dendrogram to clusters based on distance threshold
The clustering library provides q implementations of a number of common clustering algorithms, with fit and predict functions provided for each. Update functions are also available for K-Means and DBSCAN.
For the hierarchical clustering methods (including CURE), which produce dendrograms, functions are also provided to cut the dendrogram into a given number of clusters or at a given distance threshold, allowing a user to produce appropriate clusters.
Affinity Propagation¶
Affinity Propagation groups data based on the similarity between points and subsequently finds exemplars, which best represent the points in each cluster. The algorithm does not require the number of clusters to be provided at run time, but determines the optimum solution by exchanging real-valued messages between points until a high-quality set of exemplars is produced.
The algorithm uses a user-specified damping coefficient to dampen the availability and responsibility messages passed between points, while a preference value is used to set the diagonal values of the similarity matrix.
Affinity Propagation Algorithm Explained
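As a rough illustration of the two parameters described above, the following Python sketch (not the library's q implementation) builds a similarity matrix from negative squared Euclidean distances, places a preference value on its diagonal, and applies a damped message update. The helper names, the conventional damping formula, and the choice to take the preference over the off-diagonal similarities are assumptions made for this example; points are given one per row.

```python
from statistics import median

def neg_sq_dist(a, b):
    """Negative squared Euclidean distance between two points."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))

def similarity_matrix(points, preference=median):
    """Pairwise similarities with the preference value on the diagonal."""
    n = len(points)
    sim = [[neg_sq_dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Assumption: the preference function is applied to the off-diagonal values
    off_diagonal = [sim[i][j] for i in range(n) for j in range(n) if i != j]
    pref = preference(off_diagonal)
    for i in range(n):
        sim[i][i] = pref
    return sim

def damped(old, new, damping):
    """Blend a freshly computed message with its previous value."""
    return damping * old + (1 - damping) * new

sim = similarity_matrix([(0, 0), (1, 0), (3, 0)])
```

A larger preference makes every point a more attractive exemplar, so it tends to produce more clusters; a larger damping value makes the messages change more slowly between iterations.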
Clustering Using Representatives¶
Clustering Using REpresentatives (CURE) is a technique used to deal with datasets containing outliers and clusters of varying sizes and shapes. Each cluster is represented by a specified number of representative points. These points are chosen by taking the most scattered points in each cluster and shrinking them towards the cluster center using a compression ratio.
Introduction to Clustering Techniques, Ch.7 p.242
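The selection and shrinking of representative points can be sketched in Python as follows. This is an illustrative version only, not the library's q code: the function names are hypothetical, points are given one per row, and "most scattered" is interpreted as a farthest-point selection, which is one common reading of the CURE description.

```python
def centroid(cluster):
    """Mean of the points in a cluster, axis by axis."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def representatives(cluster, n, c):
    """Pick up to n scattered points and shrink them towards the centroid."""
    center = centroid(cluster)
    # Start with the point farthest from the centroid
    chosen = [max(cluster, key=lambda p: sq_dist(p, center))]
    while len(chosen) < min(n, len(cluster)):
        rest = [p for p in cluster if p not in chosen]
        # Next, the point farthest from everything chosen so far
        chosen.append(max(rest, key=lambda p: min(sq_dist(p, q) for q in chosen)))
    # Compression ratio c: 0 leaves points in place, 1 moves them onto the centroid
    return [tuple(x + c * (m - x) for x, m in zip(p, center)) for p in chosen]

reps = representatives([(0, 0), (1, 0), (5, 0)], n=2, c=0.5)
```

With a compression ratio of 0, as used in the q examples on this page, the representative points are simply the scattered points themselves.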
Density-Based Spatial Clustering of Applications with Noise¶
The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm groups points that are closely packed in areas of high density. Any points in low-density regions are seen as outliers.
Unlike many clustering algorithms, which require the user to input the desired number of clusters, DBSCAN calculates how many clusters are in the dataset based on two criteria.
- The minimum number of points required within a neighborhood in order for a cluster to be defined.
- The epsilon radius: The distance from each point within which points will be defined as being part of the same cluster.
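The interaction of the two criteria can be shown with a compact Python sketch of the standard DBSCAN procedure. It is not the library's q implementation: it assumes that a point counts as its own neighbor, that the radius comparison is inclusive, and that plain Euclidean distance is used (via math.dist, Python 3.8+). Points are given one per row and outliers are labelled -1.

```python
import math

def dbscan(points, min_pts, eps):
    """Return a cluster label per point; -1 marks an outlier."""
    n = len(points)
    # Neighborhood of each point: every point within eps (itself included)
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = [len(nb) >= min_pts for nb in nbrs]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outwards from an unlabelled core point
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in nbrs[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if core[q]:  # only core points extend the cluster
                        stack.append(q)
        cluster += 1
    return labels

labels = dbscan([(0,), (0.5,), (1.0,), (5,), (5.4,), (10,)], min_pts=2, eps=0.6)
```

The number of clusters is never supplied: two groups emerge from the density criteria alone, and the isolated point at 10 is left as an outlier.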
Hierarchical clustering¶
Agglomerative hierarchical clustering iteratively groups data, using a bottom-up approach that initially treats all data points as individual clusters.
Introduction to Clustering Techniques, Ch.7 p.225
There are five possible linkages in hierarchical clustering: single, complete, average, centroid and ward. Euclidean or Manhattan distances can be used with each linkage except for ward (which only works with Euclidean squared distances) and centroid (which only works with Euclidean distances).
In the single and centroid implementations, a k-d tree is used to store the representative points of each cluster (see k-d tree).
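For reference, the textbook definitions of the five linkages can be written out in Python as below. This is a sketch under stated assumptions rather than the library's q code: it fixes the point-to-point metric to Euclidean distance (math.dist, Python 3.8+), takes clusters as lists of points, and uses the usual increase-in-variance form for ward; the library may compute these quantities differently, for example via the k-d tree mentioned above.

```python
import math
from itertools import product

def centroid(cluster):
    """Mean of the points in a cluster, axis by axis."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def linkage_distance(a, b, linkage):
    """Distance between clusters a and b under the named linkage."""
    pairwise = [math.dist(p, q) for p, q in product(a, b)]
    if linkage == "single":      # closest pair of points
        return min(pairwise)
    if linkage == "complete":    # farthest pair of points
        return max(pairwise)
    if linkage == "average":     # mean over all pairs
        return sum(pairwise) / len(pairwise)
    gap = math.dist(centroid(a), centroid(b))
    if linkage == "centroid":    # distance between cluster centers
        return gap
    if linkage == "ward":        # increase in within-cluster sum of squares
        return len(a) * len(b) / (len(a) + len(b)) * gap ** 2
    raise ValueError(f"unknown linkage: {linkage}")

left, right = [(0, 0), (0, 2)], [(4, 0), (4, 2)]
```

At each step of the agglomerative procedure, the two clusters with the smallest linkage distance are merged and the merge is recorded as one row of the dendrogram.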
The dendrogram returned can be passed to a mixture of Matplotlib and SciPy functions which plot the dendrogram structure represented in the table. For example:
q)data:2 10#20?5.
q)HCfit:.ml.clust.hc.fit[data;`e2dist;`complete]
q)show dgram:HCfit[`modelInfo;`dgram]
i1 i2 dist n
------------------
2 7 0.3069262 2
0 8 0.6538798 2
10 4 0.8766167 3
1 5 1.018976 2
11 6 1.409634 3
3 9 2.487168 2
14 12 4.015938 6
16 13 17.68578 8
17 15 30.19258 10
q)plt:.p.import`matplotlib.pyplot
q).p.import[`scipy.cluster][`:hierarchy][`:dendrogram]flip value flip dgram;
q)plt[`:title]"Dendrogram";
q)plt[`:xlabel]"Data Points";
q)plt[`:ylabel]"Distance";
q)plt[`:show][];
Ward linkage
Ward linkage only works in conjunction with Euclidean squared distances (e2dist), while centroid linkage only works with Euclidean distances (e2dist, edist). If you pass a different distance metric as an argument, an error is signalled.
K-means¶
K-means clustering begins by selecting \(k\) data points as cluster centers and assigning data to the cluster with the nearest center.
The algorithm follows an iterative refinement process which runs a specified number of times, updating the cluster centers and reassigning each point to the cluster with the nearest center at each iteration.
The distance metrics that can be used with the K-means algorithm are the Euclidean distances (e2dist, edist). The use of any other distance metric will result in an error.
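The refinement loop can be sketched in Python as follows. This is a minimal illustration and not the library's q implementation: the initial centers are passed in explicitly (neither random nor k-means++ initialization is shown), Euclidean distance is assumed (math.dist, Python 3.8+), points are given one per row, and the stopping rule based on the largest movement along any axis is an assumption for the example.

```python
import math

def kmeans(points, centers, iters=100, thresh=1e-5):
    """Refine the given centers; return (centers, cluster index per point)."""
    for _ in range(iters):
        # Assignment step: each point joins the cluster with the nearest center
        clusters = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                    for p in points]
        # Update step: each center moves to the mean of its assigned points
        new_centers = []
        for c, old in enumerate(centers):
            members = [p for p, label in zip(points, clusters) if label == c]
            new_centers.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
                if members else old)  # an empty cluster keeps its old center
        moved = max(abs(a - b)
                    for old, new in zip(centers, new_centers)
                    for a, b in zip(old, new))
        centers = new_centers
        if moved <= thresh:  # no center moved noticeably along any axis
            break
    return centers, clusters

centers, clusters = kmeans([(0, 0), (0, 2), (10, 0), (10, 2)], [(0, 0), (10, 0)])
```

In this small example the centers settle after the first update, so the loop stops on the second pass instead of running all 100 iterations.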
Distance metrics¶
The distance functions available in the clustering library are:
edist      Euclidean distance
e2dist     squared Euclidean distance
nege2dist  negative squared Euclidean distance (predominantly for affinity propagation)
mdist      Manhattan distance
If you use an invalid distance metric, an error will occur.
Point matrix: a matrix in which each column represents a single datapoint
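The four metrics and the point-matrix layout can be illustrated with a few lines of Python. The function names mirror the q symbols above purely for readability; this is a sketch of the mathematical definitions, not the library's code.

```python
import math

def e2dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def edist(a, b):
    """Euclidean distance."""
    return math.sqrt(e2dist(a, b))

def nege2dist(a, b):
    """Negative squared Euclidean distance."""
    return -e2dist(a, b)

def mdist(a, b):
    """Manhattan distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

# A point matrix holds one point per column: here 2 dimensions x 2 points
matrix = [[0.0, 3.0],
          [0.0, 4.0]]
points = list(zip(*matrix))  # transpose to one point per row
p, q = points
```

Because every point is a column, the 2 10# matrices used in the q examples on this page each describe ten two-dimensional points.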
.ml.clust.ap.fit¶
Fit Affinity Propagation algorithm
.ml.clust.ap.fit[data;df;damp;diag;iter]
Where
- data is a point matrix
- df is the distance function as a symbol: nege2dist is recommended for this algorithm (see Distance metrics)
- damp is the damping coefficient to be applied to the availability and responsibility matrices
- diag is the preference function for the diagonal of the similarity matrix (e.g. min, med, max etc.)
- iter is a dictionary containing the max allowed iterations and the max iterations without a change in clusters, with default values `total`noChange!200 50 (to use the defaults, pass in (::))
returns a dictionary containing information collected during the fitting process (modelInfo) along with a projection of the prediction function to use on new data (predict).
Result dictionary
All relevant information needed to evaluate the model is contained within modelInfo. This includes
- data – original data used to fit the model
- inputs – original input parameters to the fitted model
- clust – cluster index each data point belongs to
- exemplars – indices of the exemplar points
The predict functionality is contained within the predict key. This function takes a point matrix argument and returns the predicted clusters of the new data.
q)show data:2 10#20?10.
4.353367 2.253873 0.3467574 7.672766 3.332201 7.319711 1.692002 1.7..
1.552261 1.904628 2.108777 9.994787 3.753674 4.77256 6.354137 6.1..
// Fit an Affinity Propagation model
q)show APfit:.ml.clust.ap.fit[data;`nege2dist;.3;med;(::)]
modelInfo| `data`inputs`clust`exemplars!((4.353367 2.253873 0.3467 ..
predict | {[config;data]
config:config`modelInfo;
data:clust.i.floatConv..
// Information generated during the fitting of the model
q)APfit.modelInfo
data | (4.353367 2.253873 0.3467574 7.672766 3.332201 7.319711 ..
inputs | `df`damp`diag`iter!(`nege2dist;0.3;k){avg x(<x)@_.5*-1 0..
clust | 0 0 0 1 0 2 3 3 3 3
exemplars| 1 1 1 3 1 5 6 6 6 6
// Predict on new data
q)show newData:2 5#10?10.
4.457843 1.588047 8.627901 1.187397 7.657092
2.781109 7.581456 5.733454 0.02703805 1.695153
q)APfit.predict newData
0 3 2 0 2
.ml.clust.cure.fit¶
Fit CURE algorithm
.ml.clust.cure.fit[data;df;n;c]
Where
- data is a point matrix
- df is the distance function as a symbol: `e2dist`edist`mdist (see Distance metrics)
- n is the number of representative points
- c is the compression ratio
returns a dictionary containing information collected during the fitting process (modelInfo), along with a projection of the prediction function to use on new data (predict).
Result dictionary
All relevant information needed to evaluate the model is contained within modelInfo. This includes the following information:
- data – original data used to fit the model
- inputs – original input parameters to the fitted model
- dgram – dendrogram generated during the fitting process
The predict functionality is contained within the predict key. This function takes arguments
- data is a point matrix
- cutDict is a dictionary where the key defines which cutting algorithm to use when splitting the data into clusters (k/dist) and the value defines the cutting threshold (see cutDist and cutK)
and returns the predicted clusters of the new data.
q)show data:2 10#20?10.
6.12576 9.773429 6.538218 2.012211 1.841789 8.267402 7.237186 2.68311..
9.73078 8.271735 9.635953 5.188231 5.815475 3.546833 3.189686 6.27793..
// Fit a CURE model
q)show CUREfit:.ml.clust.cure.fit[data;`e2dist;2;0.]
modelInfo| `data`inputs`dgram!((6.12576 9.773429 6.538218 2.012211 1...
predict | {[config;data;cutDict]
updConfig:clust.i.prepPred[config;cutDict..
// Information generated during the fitting of the model
q)CUREfit.modelInfo
data | (6.12576 9.773429 6.538218 2.012211 1.841789 8.267402 7.23718..
inputs| `df`n`c!(`e2dist;2;0f)
dgram | +`idx1`idx2`dist`n!(0 3 11 5 10 14 13 15 17i;2 4 7 6 9 1 8 12..
// Dendrogram created
q)CUREfit[`modelInfo;`dgram]
idx1 idx2 dist n
----------------------
0 2 0.1791135 2
3 4 0.4224797 2
11 7 0.9216997 3
5 6 1.1889 2
10 9 3.063088 3
14 1 4.208402 4
13 8 10.06965 3
15 12 23.77395 7
17 16 24.59282 10
// Predict on new data
q)show newData:2 5#10?10.
6.619148 7.345548 6.878925 7.044121 6.007517
0.9989967 5.158208 9.662082 8.046487 3.449115
// Create 2 clusters
q)CUREfit.predict[newData;enlist[`k]!enlist 2]
1 1 0 0 1
// Create clusters based on distance threshold
q)CUREfit.predict[newData;enlist[`dist]!enlist 1]
6 2 0 0 3
.ml.clust.cure.cutDist¶
Generate clusters - cutting the dendrogram based on a threshold distance
.ml.clust.cure.cutDist[config;dist]
Where
- config is the output dictionary produced by the CURE fit function
- dist is the threshold distance applied when cutting the dendrogram into clusters
returns an updated config containing a new key clust indicating the cluster to which each data point belongs.
q)show data:2 10#20?10.
0.8501293 9.66548 9.718821 9.04914 0.6350621 7.396237 6.32245 4.2..
7.106457 7.385984 2.024464 3.601803 1.818919 3.010721 4.025844 8.7..
// Fit CURE algorithm
q)show CUREfit:.ml.clust.cure.fit[data;`e2dist;2;0.]
modelInfo| `data`inputs`dgram!((0.8501293 9.66548 9.718821 9.04914 0..
predict | {[config;data;cutDict]
updConfig:clust.i.prepPred[config;cutDict..
// Dendrogram created
q)CUREfit[`modelInfo;`dgram]
idx1 idx2 dist n
---------------------
5 6 2.183492 2
10 9 2.197704 3
2 3 2.936469 2
12 11 3.081467 5
7 8 5.345063 2
1 13 13.03234 6
15 14 13.18999 8
0 16 14.45383 9
17 4 28.00431 10
// Cut the dendrogram using a distance threshold of 5
q)show cutDgram:.ml.clust.cure.cutDist[CUREfit;5]
modelInfo| `data`inputs`dgram!((0.8501293 9.66548 9.718821 9.04914 0..
predict | {[config;data;cutDict]
updConfig:clust.i.prepPred[config;cutDict..
clust | 3 2 0 0 5 0 0 1 4 0
q)cutDgram`clust
3 2 0 0 5 0 0 1 4 0
.ml.clust.cure.cutK¶
Generate clusters - cutting the dendrogram into k clusters
.ml.clust.cure.cutK[config;k]
Where
- config is the output dictionary produced by the CURE fit function
- k is the number of clusters to be produced from cutting the dendrogram
returns an updated config containing a new key clust indicating the cluster to which each data point belongs.
q)show data:2 10#20?10.
0.8501293 9.66548 9.718821 9.04914 0.6350621 7.396237 6.32245 4.26..
7.106457 7.385984 2.024464 3.601803 1.818919 3.010721 4.025844 8.77..
// Fit CURE algorithm
q)show CUREfit:.ml.clust.cure.fit[data;`e2dist;2;0.]
modelInfo| `data`inputs`dgram!((0.8501293 9.66548 9.718821 9.04914 0..
predict | {[config;data;cutDict]
updConfig:clust.i.prepPred[config;cutDict..
// Dendrogram created
q)CUREfit[`modelInfo;`dgram]
idx1 idx2 dist n
---------------------
5 6 2.183492 2
10 9 2.197704 3
2 3 2.936469 2
12 11 3.081467 5
7 8 5.345063 2
1 13 13.03234 6
15 14 13.18999 8
0 16 14.45383 9
17 4 28.00431 10
// Cut the dendrogram into 3 clusters
q)show cutDgram:.ml.clust.cure.cutK[CUREfit;3]
modelInfo| `data`inputs`dgram!((0.8501293 9.66548 9.718821 9.04914 0..
predict | {[config;data;cutDict]
updConfig:clust.i.prepPred[config;cutDict..
clust | 1 0 0 0 2 0 0 0 0 0
q)cutDgram`clust
1 0 0 0 2 0 0 0 0 0
.ml.clust.cure.fitPredict¶
Fit CURE algorithm to data and convert dendrogram to clusters
.ml.clust.cure.fitPredict[data;df;n;c;cutDict]
Where
- data is a point matrix
- df is the distance function as a symbol: `e2dist`edist`mdist (see Distance metrics)
- n is the number of representative points
- c is the compression ratio
- cutDict is a dictionary where the key defines which cutting algorithm to use when splitting the data into clusters (k/dist) and the value defines the cutting threshold (see cutDist and cutK)
returns a dictionary containing information collected during the fitting process (modelInfo), a projection of the prediction function to use on new data (predict) and the cluster to which each data point belongs (clust).
Result dictionary
All relevant information needed to evaluate the model is contained within modelInfo. This includes the following information:
- data – original data used to fit the model
- inputs – original input parameters to the fitted model
- dgram – dendrogram generated during the fitting process
The predict functionality is contained within the predict key. This function takes arguments
- data is a point matrix
- cutDict is a dictionary where the key defines which cutting algorithm to use when splitting the data into clusters (k/dist) and the value defines the cutting threshold (see cutDist and cutK)
and returns the predicted clusters of the new data.
The cluster each data point belongs to is contained within clust.
q)show data:2 10#20?10.
1.473702 4.080537 3.03448 9.659883 7.874197 4.734442 8.423141 2.7..
0.72077 5.450964 4.625792 0.6486378 6.951865 9.674697 7.26315 2.4..
// Fit a CURE model and cut the dendrogram into 3 clusters
q).ml.clust.cure.fitPredict[data;`e2dist;2;0.;enlist[`k]!enlist 3]
modelInfo| `data`inputs`dgram!((1.473702 4.080537 3.03448 9.659883 ..
predict | {[config;data;cutDict]
updConfig:clust.i.prepPred[config;cutDict..
clust | 0 0 0 1 1 2 1 0 1 0
.ml.clust.dbscan.fit¶
Fit DBSCAN algorithm
.ml.clust.dbscan.fit[data;df;minPts;eps]
Where
- data is a point matrix
- df is the distance function as a symbol: `e2dist`edist`mdist (see Distance metrics)
- minPts is the minimum number of points required in a given neighborhood to define a cluster
- eps is the epsilon radius, the distance from each point within which points are defined as being in the same cluster
returns a dictionary containing information collected during the fitting process (modelInfo), a projection of the prediction function to use on new data (predict) along with a projection of the update function (update).
Result dictionary
All relevant information needed to evaluate the model is contained within modelInfo. This includes
- data – original data used to fit the model
- inputs – original input parameters to the fitted model
- clust – cluster index each data point belongs to. Any outliers in the data will return a value of -1 as their cluster.
- tab – neighborhood table defining information about the clusters
The predict functionality is contained within the predict key. This function takes a point matrix argument, and returns the predicted clusters of the new data.
The update function can be used to update the clusters such that the model can react to new data. This function takes a point matrix argument, and returns the updated dictionary (the result of .ml.clust.dbscan.fit) with new data points added.
q)show data:2 10#20?10.
2.210442 8.001283 9.50319 7.346766 3.633887 5.076864 4.483854 4.28..
3.247794 7.064748 5.497131 1.792938 5.106208 2.162566 7.440406 3.08..
// Fit a DBSCAN model
q)show DBSCANfit:.ml.clust.dbscan.fit[data;`e2dist;2;1]
modelInfo| `data`inputs`clust`tab!((5.17263 5.250215 3.552399 1.58..
predict | {[config;data]
config:config[`modelInfo];
data:clust.i.floatCo..
update | {[config;data]
modelConfig:config[`modelInfo];
data:clust.i.fl..
// Information generated during the fitting of the model
q)DBSCANfit.modelInfo
data | (5.17263 5.250215 3.552399 1.588559 5.040167 9.484854 3.11..
inputs| `df`minPts`eps!(`e2dist;2;1)
clust | -1 -1 0 -1 -1 -1 0 -1 -1 -1
tab | +`nbhood`cluster`corePoint!((`long$();`long$();,6;`long$()..;
q)DBSCANfit[`modelInfo;`clust]
-1 -1 0 -1 -1 -1 0 -1 -1 -1
// Update model using new data
q)show newData:2 10#20?10.
3.369498 9.356007 1.147945 2.684219 1.860831 3.774197 6.081109 3..
3.415333 0.03463214 6.797509 6.255361 7.520247 5.643469 7.430837 9..
q)show updDBSCAN:DBSCANfit.update[newData]
modelInfo| `data`inputs`clust`tab!((5.17263 5.250215 3.552399 1.58..
predict | {[config;data]
config:config[`modelInfo];
data:clust.i.floatCo..
update | {[config;data]
modelConfig:config[`modelInfo];
data:clust.i.fl..
// Clusters from updated model
q)updDBSCAN[`modelInfo;`clust]
0 -1 1 -1 -1 2 1 0 3 0 -1 2 -1 -1 -1 1 0 -1 -1 3
// Predict on new data
q)DBSCANfit.predict[newData]
-1 -1 -1 -1 -1 0 -1 -1 -1 -1
.ml.clust.hc.fit¶
Fit HC algorithm
.ml.clust.hc.fit[data;df;lf]
Where
- data is a point matrix
- df is the distance function as a symbol: `e2dist`edist`mdist (see Distance metrics)
- lf is the linkage function as a symbol: `single`complete`average`centroid`ward
returns a dictionary containing information collected during the fitting process (modelInfo), along with a projection of the prediction function to use on new data (predict).
Result dictionary
All relevant information needed to evaluate the model is contained within modelInfo. This includes
- data – original data used to fit the model
- inputs – original input parameters to the fitted model
- dgram – dendrogram generated during the fitting process
The predict functionality is contained within the predict key. This function takes arguments
- data is a point matrix
- cutDict is a dictionary where the key defines which cutting algorithm to use when splitting the data into clusters (k/dist) and the value defines the cutting threshold (see cutDist and cutK)
and returns the predicted clusters of the new data.
q)show data:2 10#20?10.
4.799813 5.330975 3.083698 2.415329 3.472484 4.094012 0.5718782 9.2..
1.897236 6.968966 2.173592 4.644757 8.286445 3.946073 1.496389 8.0..
// Fit single hierarchical model
q)show HCfit:.ml.clust.hc.fit[data;`e2dist;`single]
modelInfo| `data`inputs`dgram!((5.17263 5.250215 3.552399 1.588559 ..
predict | {[config;data;cutDict]
updConfig:clust.i.prepPred[config;cutDict..
// Information generated during the fitting of the model
q)HCfit.modelInfo
data | (5.17263 5.250215 3.552399 1.588559 5.040167 9.484854 3.11..
inputs| `df`lf!`e2dist`single
dgram | +`idx1`idx2`dist`n!(2 7 0 4 10 1 15 12 17i;6 9 11 8 13 14 ..
// Dendrogram created
q)HCfit[`modelInfo;`dgram]
idx1 idx2 dist n
----------------------
2 6 0.7045331 2
7 9 1.274268 2
0 11 1.355958 3
4 8 1.46799 2
10 13 4.666131 4
1 14 7.598059 5
15 3 7.880744 6
12 16 8.508274 9
17 5 18.10505 10
// Predict on new data
q)show newData:2 10#20?10.
8.655105 2.809443 1.733521 3.591677 9.347341 9.735056 6.817983 7.624..
2.809547 4.501989 4.289929 4.224477 4.106569 3.559825 1.712474 5.554..
// Create 3 clusters
q)HCfit.predict[newData;enlist[`k]!enlist 3]
2 1 1 1 2 2 1 0 2 0
// Fit complete hierarchical model
q)show HCfitComp:.ml.clust.hc.fit[data;`e2dist;`complete]
modelInfo| `data`inputs`dgram!((5.17263 5.250215 3.552399 1.588559 5..
predict | {[config;data;cutDict]
updConfig:clust.i.prepPred[config;cutDict.
// Dendrogram created
q)HCfitComp[`modelInfo;`dgram]
idx1 idx2 dist n
----------------------
2 6 0.7045331 2
7 9 1.274268 2
4 8 1.46799 2
0 11 1.904437 3
10 12 7.164842 4
1 3 15.32325 2
15 14 29.27081 6
16 5 63.28895 7
13 17 72.8679 10
.ml.clust.hc.cutDist¶
Generate clusters - cutting the dendrogram based on a threshold distance
.ml.clust.hc.cutDist[config;dist]
Where
- config is the output dictionary produced by the hierarchical clustering fit function
- dist is the threshold distance applied when cutting the dendrogram into clusters
returns an updated config containing a new key clust indicating the cluster to which each data point belongs.
q)show data:2 10#20?10.
7.263153 2.624281 8.388946 7.931885 6.323605 9.69682 4.856966 9.1..
4.637059 7.549387 2.165773 7.280013 4.368342 5.276732 4.636653 1.0..
// Fit HC algorithm
q)show HCfit:.ml.clust.hc.fit[data;`e2dist;`single]
modelInfo| `data`inputs`dgram!((5.17263 5.250215 3.552399 1.588559..
predict | {[config;data;cutDict]
updConfig:clust.i.prepPred[config;cutDict..
// Dendrogram of data
q)HCfit[`modelInfo;`dgram]
idx1 idx2 dist n
----------------------
2 6 0.7045331 2
7 9 1.274268 2
0 11 1.355958 3
4 8 1.46799 2
10 13 4.666131 4
1 14 7.598059 5
15 3 7.880744 6
12 16 8.508274 9
17 5 18.10505 10
// Cut the dendrogram using a distance threshold of 3
q)show cutDgram:.ml.clust.hc.cutDist[HCfit;3]
modelInfo| `data`inputs`dgram!((5.17263 5.250215 3.552399 1.588559 ..
predict | {[config;data;cutDict]
updConfig:clust.i.prepPred[config;cutDict..
clust | 1 3 0 4 2 5 0 1 2 1
q)cutDgram`clust
1 3 0 4 2 5 0 1 2 1
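The way a dendrogram table turns into cluster labels can be reproduced with a short Python sketch. It is not the library's q implementation and rests on assumptions inferred from the example output rather than stated in the text: an index of 10 or more refers to the cluster created by an earlier row (10 is the first row, 11 the second, and so on), merges whose distance is below the threshold are applied, and distances do not decrease down the table. Label numbering is arbitrary, so results are compared as groupings.

```python
def cut_dendrogram(merges, n_points, keep):
    """Apply the first `keep` merges and return a cluster label per point."""
    parent = list(range(n_points + len(merges)))

    def find(i):
        # Follow parent links up to the current cluster of node i
        while parent[i] != i:
            i = parent[i]
        return i

    for row, (a, b) in enumerate(merges[:keep]):
        new = n_points + row  # node created by this merge
        parent[find(a)] = new
        parent[find(b)] = new
    roots, labels = {}, []
    for p in range(n_points):
        labels.append(roots.setdefault(find(p), len(roots)))
    return labels

def cut_dist(dgram, n_points, threshold):
    """Keep every merge whose distance is below the threshold."""
    keep = sum(1 for _, _, d in dgram if d < threshold)
    return cut_dendrogram([(a, b) for a, b, _ in dgram], n_points, keep)

def cut_k(dgram, n_points, k):
    """Stop merging once k clusters remain."""
    return cut_dendrogram([(a, b) for a, b, _ in dgram], n_points, n_points - k)

def groups(labels):
    """Clusters as sorted lists of point indices, independent of label numbers."""
    return sorted(sorted(i for i, l in enumerate(labels) if l == c) for c in set(labels))

# idx1, idx2 and dist columns of the single-linkage dendrogram shown above
dgram = [(2, 6, 0.7045331), (7, 9, 1.274268), (0, 11, 1.355958),
         (4, 8, 1.46799), (10, 13, 4.666131), (1, 14, 7.598059),
         (15, 3, 7.880744), (12, 16, 8.508274), (17, 5, 18.10505)]
by_dist = cut_dist(dgram, 10, 3)
```

With a threshold of 3 only the first four merges are applied, giving the same six groups as the clust output above: points 2 and 6 together, points 0, 7 and 9 together, points 4 and 8 together, and points 1, 3 and 5 on their own.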
.ml.clust.hc.cutK¶
Generate clusters - cutting the dendrogram into k clusters
.ml.clust.hc.cutK[config;k]
Where
- config is the output dictionary produced by the hierarchical clustering fit function
- k is the number of clusters to be produced from cutting the dendrogram
returns an updated config containing a new key clust indicating the cluster to which each data point belongs.
q)show data:2 10#20?10.
7.263153 2.624281 8.388946 7.931885 6.323605 9.69682 4.856966 9.1..
4.637059 7.549387 2.165773 7.280013 4.368342 5.276732 4.636653 1.0..
// Fit HC algorithm
q)show HCfit:.ml.clust.hc.fit[data;`e2dist;`single]
modelInfo| `data`inputs`dgram!((5.17263 5.250215 3.552399 1.588559..
predict | {[config;data;cutDict]
updConfig:clust.i.prepPred[config;cutDict..
// Dendrogram of data
q)HCfit[`modelInfo;`dgram]
idx1 idx2 dist n
----------------------
2 6 0.7045331 2
7 9 1.274268 2
0 11 1.355958 3
4 8 1.46799 2
10 13 4.666131 4
1 14 7.598059 5
15 3 7.880744 6
12 16 8.508274 9
17 5 18.10505 10
// Cut the dendrogram into 4 clusters
q)show cutDgram:.ml.clust.hc.cutK[HCfit;4]
modelInfo| `data`inputs`dgram!((5.17263 5.250215 3.552399 1.588559..
predict | {[config;data;cutDict]
updConfig:clust.i.prepPred[config;cutDict..
clust | 1 0 0 2 0 3 0 1 0 1
q)cutDgram`clust
1 0 0 2 0 3 0 1 0 1
.ml.clust.hc.fitPredict¶
Fit HC algorithm to data and convert dendrogram to clusters
.ml.clust.hc.fitPredict[data;df;lf;cutDict]
Where
- data is a point matrix
- df is the distance function as a symbol: `e2dist`edist`mdist (see Distance metrics)
- lf is the linkage function as a symbol: `single`complete`average`centroid`ward
- cutDict is a dictionary where the key defines which cutting algorithm to use when splitting the data into clusters (k/dist) and the value defines the cutting threshold (see cutDist and cutK)
returns a dictionary containing information collected during the fitting process (modelInfo), a projection of the prediction function to use on new data (predict) and the cluster to which each data point belongs (clust).
Result dictionary
All relevant information needed to evaluate the model is contained within modelInfo. This includes
- data – original data used to fit the model
- inputs – original input parameters to the fitted model
- dgram – dendrogram generated during the fitting process
The predict functionality is contained within the predict key. This function takes arguments
- data is a point matrix
- cutDict is a dictionary where the key defines which cutting algorithm to use when splitting the data into clusters (k/dist) and the value defines the cutting threshold (see cutDist and cutK)
and returns the predicted clusters of the new data.
The cluster each data point belongs to is contained in clust.
q)show data:2 10#20?10.
6.01551 9.775468 9.809354 4.237163 5.424916 1.994707 2.496307 2.599..
1.046143 7.154895 8.098937 2.546309 6.298331 0.249301 5.341463 4.106..
// Fit a HC model and cut the dendrogram into 4 clusters
q).ml.clust.hc.fitPredict[data;`e2dist;`single;enlist[`k]!enlist 4]
modelInfo| `data`inputs`dgram!((6.01551 9.775468 9.809354 4.237163 5..
predict | {[config;data;cutDict]
updConfig:clust.i.prepPred[config;cutDict..
clust | 0 2 2 0 1 3 0 0 0 1
.ml.clust.kmeans.fit¶
Fit K-means algorithm
.ml.clust.kmeans.fit[data;df;k;config]
Where
- data is a point matrix
- df is the distance function as a symbol: `e2dist`edist (see Distance metrics)
- k is the number of clusters
- config is a dictionary allowing a user to change the following model parameters (for entirely default values use (::))
  - iter: the number of iterations to be completed. Default = 100
  - init: the algorithm used to initialize cluster centers. This is either random (0b) or uses k-means++ (1b). Default = 1b
  - thresh: if a cluster center moves by more than this value along any axis, continue the algorithm; otherwise stop. Default = 1e-5
returns a dictionary containing information collected during the fitting process (modelInfo), a projection of the prediction function to use on new data (predict) along with a projection of the update function (update).
Result dictionary
All relevant information needed to evaluate the model is contained within modelInfo. This includes
- data – original data used to fit the model
- df – distance metric used
- repPts – calculated k centers
- clust – cluster index each data point belongs to
The predict functionality is contained within the predict key. This function takes a point matrix argument, and returns the predicted clusters of the new data.
The update function can be used to update the cluster centers such that the model can react to new data. This function takes a point matrix argument, and returns the updated dictionary (the result of .ml.clust.kmeans.fit) with new data points added.
q)show data:2 10#20?10.
9.906212 5.073676 9.560646 6.719448 3.42593 6.010412 6.137498 6.56..
0.2663305 7.935343 1.485224 8.540814 5.74697 2.619185 1.379876 3.23..
// Fit a kmeans model
q)show kmeansFit:.ml.clust.kmeans.fit[data;`e2dist;3;::]
modelInfo| `repPts`clust`data`inputs!((4.888017 5.941845;7.367534 ..
predict | {[config;data]
config:config[`modelInfo];
data:clust.i.floatCo..
update | {[config;data]
modelConfig:config[`modelInfo];
data:clust.i.fl..
// Information generated during the fitting of the model
q)kmeansFit.modelInfo
repPts| (4.888017 5.941845;7.367534 0.2570141;2.870303 2.005258)
clust | 0 1 0 2 0 1 0 0 2 0
data | (5.17263 5.250215 3.552399 1.588559 5.040167 9.484854 3.1..
inputs| `df`k`iter`kpp!(`e2dist;3;100;1b)
q)kmeansFit[`modelInfo;`clust]
0 1 0 2 0 1 0 0 2 0
// Update model using new data
q)show newData:2 10#20?10.
1.09627 6.292455 5.072447 2.393823 9.210309 3.421872 5.107752 9..
8.593928 2.928818 8.618995 4.764543 0.1244285 0.1204939 0.8363438 8..
q)show updKmeans:kmeansFit.update[newData]
modelInfo| `repPts`clust`data`inputs!((5.10427 6.862332;7.057065 1..
predict | {[config;data]
config:config[`modelInfo];
data:clust.i.floatCo..
update | {[config;data]
modelConfig:config[`modelInfo];
data:clust.i.fl..
// Clusters from updated model
q)updKmeans[`modelInfo;`clust]
0 1 0 2 2 1 2 0 2 0 0 1 0 0 1 2 1 0 1 0
// Predict on new data
q)kmeansFit.predict[newData]
0 1 0 0 1 2 1 0 1 0