Scoring metrics reference¶
`.ml.clust` scoring metrics:

| Use case | Function | Description |
|---|---|---|
| Unsupervised learning | `daviesBouldin` | Davies-Bouldin index |
| | `dunn` | Dunn index |
| | `silhouette` | Silhouette score |
| Supervised learning | `homogeneity` | Homogeneity score between predictions and actual values |
| Optimum number of clusters | `elbow` | Distortion scores for increasing numbers of clusters |
Scoring metrics allow you to validate the performance of your clustering algorithms in three distinct use cases.
Unsupervised learning

: These metrics analyze how well data has been assigned to clusters, measuring intra-cluster similarity (cohesion) and inter-cluster differences (separation). In general, clustering is said to be successful if clusters are well spaced and densely packed. Used when the true cluster assignments are not known.

Supervised learning

: If the true and predicted labels of the dataset are known, clusters can be analyzed in a supervised manner by comparing the two sets of labels.

Optimum number of clusters

: The optimum number of clusters can be estimated in a number of ways using the techniques above. If the required number of clusters is not known prior to clustering, the elbow method estimates it by running K-means clustering for increasing numbers of clusters and comparing the resulting distortion scores.
.ml.clust.daviesBouldin¶
Davies-Bouldin index
.ml.clust.daviesBouldin[data;clusts]
Where

- `data` represents the points being analyzed in matrix format, where each column is an individual datapoint
- `clusts` is the list of clusters returned by one of the clustering algorithms in `.ml.clust`

returns the Davies-Bouldin index, where a lower value indicates better clustering, with well-separated, tightly-packed clusters.
q)show data:2 10#20?10.
4.126605 8.429965 6.214154 5.365242 7.470449 6.168275 6.876426 6.123797 9.363..
4.45644 7.274244 1.301704 2.018829 1.451855 9.819545 7.490215 6.372719 5.856..
q)show clusts1:10?3
0 1 2 0 1 0 0 1 0 1
q)show clusts2:10?3
2 2 1 0 2 2 1 2 0 0
q)// Lower values indicate better clustering
q).ml.clust.daviesBouldin[data;clusts1]
9.014795
q).ml.clust.daviesBouldin[data;clusts2]
5.890376
The Davies-Bouldin index works by calculating, for each pair of clusters, the ratio of how scattered the data points are within the clusters to the separation between them, then averaging the worst such ratio for each cluster.
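Outside q, the index can be sanity-checked with scikit-learn's `davies_bouldin_score`. This is a sketch, not part of the toolkit; it assumes Python with NumPy and scikit-learn is available, and note that scikit-learn expects one sample per row, whereas `.ml.clust` takes one datapoint per column:

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# random 2 x 10 data matrix, mirroring the q example (columns are points)
rng = np.random.default_rng(0)
data = rng.uniform(0, 10, size=(2, 10))

# cluster labels taken from the q example above
clusts1 = np.array([0, 1, 2, 0, 1, 0, 0, 1, 0, 1])

# scikit-learn expects samples as rows, so transpose the q-style matrix
score = davies_bouldin_score(data.T, clusts1)
print(score)  # lower values indicate better clustering
```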
.ml.clust.dunn¶
Dunn index
.ml.clust.dunn[data;df;clusts]
Where

- `data` represents the points being analyzed in matrix format, where each column is an individual datapoint
- `df` is the distance function as a symbol, e.g. `e2dist`, `edist`, `mdist`
- `clusts` is the list of clusters returned by the clustering algorithms in `.ml.clust`

returns the Dunn index, where a higher value indicates better clustering, with well-separated, tightly-packed clusters.
q)show data:2 10#20?10.
3.927524 5.170911 5.159796 4.066642 1.780839 3.017723 7.85033 5.347096..
4.931835 5.785203 0.8388858 1.959907 3.75638 6.137452 5.294808 6.916099..
q)show clusts1:10?3
0 0 1 1 0 0 2 0 1 0
q)show clusts2:10?3
0 0 1 1 0 2 0 2 1 2
q)// Higher values indicate better clustering
q).ml.clust.dunn[data;`edist;clusts1]
0.5716933
q).ml.clust.dunn[data;`e2dist;clusts2]
0.03341283
The Dunn index is calculated as the minimum inter-cluster distance divided by the maximum cluster diameter, i.e. the largest distance between two points in the same cluster.
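The Dunn index is not in scikit-learn, but the definition above is short enough to sketch directly in NumPy. This is an illustrative implementation (Euclidean distance only), not the toolkit's q code:

```python
import numpy as np

def dunn(data, clusts):
    """Dunn index: minimum inter-cluster distance divided by the maximum
    cluster diameter. data is d x n (columns are points), matching the
    .ml.clust convention; clusts is one label per point."""
    pts = data.T
    labels = np.unique(clusts)
    # full pairwise Euclidean distance matrix between all points
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    inter = min(d[np.ix_(clusts == a, clusts == b)].min()
                for i, a in enumerate(labels) for b in labels[i + 1:])
    intra = max(d[np.ix_(clusts == c, clusts == c)].max() for c in labels)
    return inter / intra

# two tight, well-separated clusters give a large Dunn index
data = np.array([[0.0, 0.1, 10.0, 10.1],
                 [0.0, 0.1, 10.0, 10.1]])
clusts = np.array([0, 0, 1, 1])
print(dunn(data, clusts))  # ~99: inter-cluster gap ~14.0, diameter ~0.14
```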
.ml.clust.elbow¶
The elbow method
.ml.clust.elbow[data;df;k]
Where

- `data` represents the points being analyzed in matrix format, where each column is an individual datapoint
- `df` is the distance function as a symbol, e.g. `e2dist`, `edist`
- `k` is the maximum number of clusters

returns the distortion score for each set of clusters produced by k-means, for each value of k from 2 up to the user-defined maximum.
q)show data:2 10#20?10.
3.927524 5.170911 5.159796 4.066642 1.780839 3.017723 7.85033 5.347096..
4.931835 5.785203 0.8388858 1.959907 3.75638 6.137452 5.294808 6.916099..
q).ml.clust.elbow[data;`edist;5]
16.74988 13.01954 10.91546 9.271871
If the values produced by `.ml.clust.elbow` are plotted against the number of clusters, the optimum number of clusters can be read from the "elbow" of the curve: the point at which the distortion score stops decreasing sharply. In the example above, the elbow occurs at 3 clusters, indicating the data should be grouped into 3 clusters.
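The same procedure can be sketched outside q with scikit-learn's `KMeans`, using its inertia (sum of squared distances to the nearest centre) as the distortion score. The helper name `elbow` and the choice of inertia are illustrative assumptions, not the toolkit's implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow(data, maxk):
    """Distortion (k-means inertia) for k = 2..maxk.
    data is d x n, with one datapoint per column."""
    pts = data.T
    return [KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts).inertia_
            for k in range(2, maxk + 1)]

rng = np.random.default_rng(2)
data = rng.uniform(0, 10, size=(2, 10))
scores = elbow(data, 5)
print(scores)  # 4 scores, for k = 2 3 4 5; look for where the drop flattens
```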
.ml.clust.homogeneity¶
Homogeneity score
.ml.clust.homogeneity[pred;true]
Where

- `pred` is the predicted cluster labels
- `true` is the true cluster labels

returns the homogeneity score, bounded between 0 and 1, with a higher value indicating a more accurate labeling of clusters.
q)show true:10?3
2 1 0 0 0 0 2 0 1 2
q)show pred:10?3
2 1 2 0 1 0 1 2 0 1
q).ml.clust.homogeneity[pred;true]
0.225179
q).ml.clust.homogeneity[true;true]
1f
The homogeneity score works on the basis that a cluster should contain only samples belonging to a single class.
.ml.clust.silhouette¶
Silhouette coefficient
.ml.clust.silhouette[data;df;clusts;isAvg]
Where

- `data` represents the points being analyzed in matrix format, where each column is an individual datapoint
- `df` is the distance function as a symbol, e.g. `e2dist`, `edist`, `mdist`
- `clusts` is the list of clusters returned by the clustering algorithms in `.ml.clust`
- `isAvg` is a boolean: `1b` to return the average coefficient, `0b` to return a list of coefficients

returns the Silhouette coefficient, ranging from -1 (overlapping clusters) to +1 (well-separated clusters).
q)show data:2 10#20?10.
3.927524 5.170911 5.159796 4.066642 1.780839 3.017723 7.85033 5.347096..
4.931835 5.785203 0.8388858 1.959907 3.75638 6.137452 5.294808 6.916099..
q)show clusts1:10?3
0 0 1 1 0 0 2 0 1 0
q)show clusts2:10?3
0 0 1 1 0 2 0 2 1 2
q)// Return the averaged coefficients across all points
q).ml.clust.silhouette[data;`edist;clusts1;1b]
0.3698386
q).ml.clust.silhouette[data;`e2dist;clusts2;1b]
0.2409856
q)// Return the individual coefficients for each point
q).ml.clust.silhouette[data;`e2dist;clusts2;0b]
-0.4862092 -0.6652588 0.8131323 0.595948 -0.2540023 0.5901292 -0.2027718 0.61..
The Silhouette coefficient measures how similar an object is to the members of its own cluster when compared to other clusters.
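scikit-learn exposes both variants of this metric: `silhouette_score` for the average (the `isAvg=1b` case) and `silhouette_samples` for per-point coefficients (the `isAvg=0b` case). A sketch, assuming Python with NumPy and scikit-learn, and remembering to transpose the q-style column-per-point matrix:

```python
import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(3)
data = rng.uniform(0, 10, size=(2, 10))   # columns are points, as in .ml.clust
clusts = np.array([0, 0, 1, 1, 0, 0, 2, 0, 1, 0])

avg = silhouette_score(data.T, clusts)    # average coefficient (isAvg=1b)
per = silhouette_samples(data.T, clusts)  # one coefficient per point (isAvg=0b)
print(avg)
print(per)
```

The average returned by `silhouette_score` is exactly the mean of the per-point coefficients from `silhouette_samples`.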