Machine Learning Toolkit

The Machine Learning Toolkit (ML Toolkit) is at the core of kdb+/q-centered machine learning. It comprises open-source libraries and scripts that allow users to apply machine learning models, preprocessing techniques and scoring functionality to a wide variety of kdb+ datasets.

The ML Toolkit is open source and available on GitHub, with relevant documentation provided on code.kx.com. The ML Toolkit documentation should be used in conjunction with the documentation presented here; links to the appropriate sections are given below.

Note that in addition to exposing the functionality previously available within the ML Toolkit, a number of the sections below have been 'wrapped' as part of an experimental API to make them easier to use within the kdb Insights ML Analytics functionality. This is outlined in the variadic section of this documentation.

Sections

Relevant documentation found within the ML Toolkit:

  1. Clustering algorithms used to group data points and to identify patterns in their distributions. The algorithms make use of a k-dimensional tree to store points and scoring functions to analyze how well they performed.
  2. An implementation of the FRESH (FeatuRe Extraction and Scalable Hypothesis testing) algorithm in q. This allows a kdb+/q user to perform feature-extraction and feature-significance tests on structured time-series data for forecasting, regression and classification.
  3. Implementations of a number of cross-validation and hyperparameter search procedures. These allow kdb+/q users to validate the performance of machine learning models when exposed to new data, test the stability of models over time or find the best hyperparameters for tuning their models.
  4. Various implementations of time-series models for kdb+ including, but not limited to, ARMA, ARIMA, SARIMA and ARCH. These allow kdb+/q users to predict the future value of datasets based on historical observations and to measure statistical properties of future data.
  5. Statistical algorithms allowing users to retrieve information about the contents of their data and to build regression models such as Ordinary Least Squares and Weighted Least Squares.
  6. Miscellaneous utilities including, but not limited to, model metrics, data manipulation and preprocessing functions.
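
To make the distinction between the two regression methods named in item 5 concrete, the sketch below fits a single-feature linear model in plain Python. It does not use or reproduce the ML Toolkit API: the function name `weighted_least_squares` and the sample data are hypothetical and serve only to illustrate the underlying technique. Ordinary Least Squares appears here as the special case of Weighted Least Squares in which every observation has equal weight.

```python
def weighted_least_squares(x, y, w=None):
    """Fit y ~ intercept + slope * x by minimising sum(w_i * residual_i ** 2).

    Hypothetical helper for illustration only, not an ML Toolkit function.
    With w=None every point gets weight 1, which is Ordinary Least Squares.
    """
    if w is None:
        w = [1.0] * len(x)
    total = sum(w)
    # Weighted means of the feature and the target.
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted covariance and variance (unnormalised; the factor cancels).
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data: the first three points lie on y = 1 + 2x, the last does not.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 10.0]

ols = weighted_least_squares(x, y)                        # approximately (0.4, 2.9)
wls = weighted_least_squares(x, y, w=[1.0, 1.0, 1.0, 0.0])  # approximately (1.0, 2.0)
print(ols, wls)
```

Giving the last observation zero weight removes its influence entirely, so the weighted fit recovers the line through the other three points, whereas the unweighted fit is pulled towards the outlying value.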