
K-D tree reference

.ml.clust.kd   k-d tree functions

  • newtree    build a k-d tree
  • nn         find the nearest neighbor for a datapoint
  • findleaf   find the leaf node to which a datapoint belongs

KxSystems/ml/clust/kdtree.q

A k-dimensional tree (k-d tree) is a special case of a binary search tree, commonly used in computer science to organize data points in k-dimensional space. Each leaf node in the tree contains a set of k-dimensional points, while each non-leaf node generates a splitting hyperplane which divides the surrounding space.

At each non-leaf node, the dataset is split roughly in two. The splitting dimension is chosen as the axis with the highest variance, and the median value along that axis is used to split the data. The placement of each point in the tree is determined by whether its value along the splitting axis is less than (to the left of) or greater than (to the right of) the preceding median value. This splitting process repeats recursively throughout the tree, until a small enough number of points remains to form a leaf.
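
The split rule can be sketched directly in q. The snippet below is only an illustration of the idea, not the library's implementation, and the names (d, ax, mid, left, right) are introduced here for clarity: it picks the axis with the highest variance and partitions the columns of a data matrix around the median along that axis.

q)d:2 10#20?10.                 / each column is a datapoint
q)ax:first idesc var each d     / splitting axis: dimension with highest variance
q)mid:med d ax                  / median value along the splitting axis
q)left:where d[ax]<mid          / columns placed in the left subtree
q)right:where not d[ax]<mid     / remaining columns placed in the right subtree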

.ml.clust.kd.findleaf

Find the tree index of the leaf that a datapoint belongs to

Syntax: .ml.clust.kd.findleaf[tree;pt;node]

Where

  • tree is a k-d tree
  • pt is the datapoint to search for
  • node is the node in the k-d tree at which to start the search (e.g. the root node, first tree)

returns the row of the k-d tree (as a dictionary) representing the leaf to which the datapoint belongs.

q)show d:2 10#20?10.
8.132478 1.291938  1.477547 2.74227  5.635053 8.83823 ...
5.426371 0.7757332 6.374637 9.761246 5.396816 7.162858...

q)show tree:.ml.clust.kd.newtree[d;2]
leaf left self parent children axis midval   idxs
-----------------------------------------------------
0    0    0           1 4      0    6.718125 `long$()
0    1    1    0      2 3      1    5.396816 `long$()
1    1    2    1      `long$()               1 6
1    0    3    1      `long$()               2 3 4
0    0    4    0      5 6      1    6.568734 `long$()
1    1    5    4      `long$()               0 7
1    0    6    4      `long$()               5 8 9

q)// Point to search
q)show pt:2?10f
3.414991 9.516746

q).ml.clust.kd.findleaf[tree;pt;first tree]
leaf    | 1b
left    | 0b
self    | 3
parent  | 1
children| `long$()
axis    | 0N
midval  | 0n
idxs    | 2 3 4

Both .ml.clust.kd.nn and .ml.clust.kd.findleaf can use either a q or a C implementation (see the instructions for building the C code). To switch between the two, use .ml.clust.kd.qC[b], where b is a boolean indicating whether to use the q (1b) or C (0b) code. If the C code is not available, the q implementation is used regardless of the argument.
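
For example, to select an implementation explicitly (a minimal usage sketch, assuming the library has been loaded):

q).ml.clust.kd.qC[1b]    / use the q implementation
q).ml.clust.kd.qC[0b]    / use the C implementation, falling back to q if it is unavailable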

.ml.clust.kd.newtree

Build a k-d tree

Syntax: .ml.clust.kd.newtree[data;leafsz]

Where

  • data represents the points being analyzed in matrix format, where each column is an individual datapoint
  • leafsz is the minimum number of datapoints contained within each leaf of the tree.

returns a k-d tree table with columns:

leaf        whether leaf (boolean)
left        whether node/leaf is to the left of its parent node (boolean)
self        tree index of current node
parent      tree index of the parent node
children    tree indexes of any child nodes
axis        splitting dimension of current node (null if leaf node)
midval      splitting value of current node (null if leaf node)
idxs        indices (columns in data) of the datapoints contained in a leaf

q)show d:2 10#20?10.
5.497936 1.958467   5.615261 0.7043811 2.124007 7.77882 ...
4.57328  0.08062521 1.039343 1.044512  3.380097 4.861546...

q).ml.clust.kd.newtree[d;2]
leaf left self parent children axis midval   idxs
-----------------------------------------------------
0    0    0           1 4      1    4.57328  `long$()
0    1    1    0      2 3      0    2.124007 `long$()
1    1    2    1      `long$()               1 3
1    0    3    1      `long$()               2 4 9
0    0    4    0      5 6      0    5.497936 `long$()
1    1    5    4      `long$()               6 8
1    0    6    4      `long$()               0 5 7

q).ml.clust.kd.newtree[d;4]
leaf left self parent children axis midval  idxs
-----------------------------------------------------
0    0    0           1 2      1    4.57328 `long$()
1    1    1    0      `long$()              1 2 3 4 9
1    0    2    0      `long$()              0 5 6 7 8

The value of leafsz can affect the speed when searching for nearest neighbors.
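
As an illustrative sketch of such a comparison (timings vary by machine and dataset, so none are shown here), trees built with different leaf sizes can be timed over repeated searches:

q)d:2 10000#20000?10.
q)small:.ml.clust.kd.newtree[d;5]
q)large:.ml.clust.kd.newtree[d;500]
q)\t:100 .ml.clust.kd.nn[small;d;`e2dist;();2?10f]
q)\t:100 .ml.clust.kd.nn[large;d;`e2dist;();2?10f]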

.ml.clust.kd.nn

Find the nearest neighbor of a datapoint

Syntax: .ml.clust.kd.nn[tree;data;df;xidxs;pt]

Where

  • tree is a k-d tree
  • data represents the points being analyzed in matrix format, where each column is an individual datapoint
  • df is the distance function, given as a symbol: e2dist (squared Euclidean), edist (Euclidean) or mdist (Manhattan)
  • xidxs are the indices (columns in data) to be excluded from the nearest-neighbor search (() if any point can be chosen)
  • pt is the floating-point datapoint whose nearest neighbor is sought

returns a dictionary containing the distance (closestDist) and the column index in data (closestPoint) of the nearest neighbor.

q)show d:2 10#20?10.
3.927524 5.170911 5.159796  4.066642 1.780839 3.017723 ..
4.931835 5.785203 0.8388858 1.959907 3.75638  6.137452 ..

q)show tree:.ml.clust.kd.newtree[d;3]
leaf left self parent children axis midval   idxs
------------------------------------------------------
0    0    0           1 2      1    5.294808 `long$()
1    1    1    0      `long$()               0 2 3 4 8
1    0    2    0      `long$()               1 5 6 7 9

q).ml.clust.kd.nn[tree;d;`e2dist;();1 2f]
closestPoint| 4
closestDist | 3.694579

q)// finds nearest neighbor excluding point 4
q).ml.clust.kd.nn[tree;d;`e2dist;4;1 2f]
closestPoint| 3
closestDist | 9.4059
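
The xidxs argument also makes it possible to collect several neighbors by excluding the indices already found. The helper nn3 below is a hypothetical sketch, not part of the library; it assumes xidxs accepts a list of column indices, as the parameter description above suggests, and reuses tree and d from the previous example.

q)nn3:{[tree;data;pt] found:(); do[3;found,:.ml.clust.kd.nn[tree;data;`e2dist;found;pt]`closestPoint]; found}
q)found:nn3[tree;d;1 2f]    / column indices of the 3 nearest neighbors of the point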