Machine Learning
Perform analytics and produce predictions on streaming data
The machine learning operators allow you to use an existing model or to create a model from streaming data to predict future values based on data within the stream.
See APIs for more details
A q interface can be used to build pipelines programatically. See the q API for API details.
A Python interface is included along side the q interface and can be used if PyKX is enabled. See the Python API for API details.
The pipeline builder uses a drag-and-drop interface to link together operations within a pipeline. For details on how to wire together a transformation, see the building a pipeline guide.
Classification Models
The below classification models are a group of supervised learning models used to assign a class to a data record. This is done using a user-provided set of labeled data records (the features and their corresponding class values) to guide the model's predictions.
AdaBoost Classifier
An AdaBoost Classifier is a model which fits a sequence of weak learners on the data and aggregates their predictions to produce a final prediction. In each iteration, the results of the previous estimator are used to adjust the weights of incorrectly classified data records to ensure the subsequent model focuses on correctly labeling these records
See APIs for more details
q API: .qsp.ml.adaBoostClassifier
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Target Column Name | Name of the column containing the data's target labels. | |
Prediction Column Name | Name of the column which is to house the model's predicted class labels. |
Optional Parameters:
name | description | default |
---|---|---|
Number Of Estimators | Maximum number of estimators to train in each boosting iteration. Each estimator is fit on the dataset and adjusted to focus on difficult classification cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is 1 . |
50 |
Learning Rate | Controls the loss function used to set the weight of each classifier at each boosting iteration. The higher this value, the more each classifier will contribute to our final model. This value depends highly on the maximum number of estimators. Minimum value is 0.0 . |
1.0 |
Boosting Algorithm | Multi-class AdaBoost function used to extend the AdaBoost operator to have multi-class capabilities. | SAMME:R |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the model is to be stored once it is fit. | Local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
AdaBoost, Understanding the AdaBoost Classification Algorithm
Decision Tree Classifier
A decision tree classifier predicts the value of a target by learning simple decision rules inferred from the data. These rules are used to split each node in the decision tree into child nodes to minimize impurity/entropy. It will stop splitting based on the configured Maximum Tree Depth
, Minimum Samples To Split
, and Minimum Samples At Leaf
model parameters
See APIs for more details
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Target Column Name | Name of the column containing the data's target labels. | |
Prediction Column Name | Name of the column which is to house the model's predicted class labels. |
Optional Parameters:
name | description | default |
---|---|---|
Split Quality Measure | Criteria function used to measure the quality of a split each time a decision tree node is split into children. | Gini |
Node Splitting Strategy | Strategy used to split the nodes in the tree. | Best Split Chosen |
Maximum Tree Depth | Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If no value is supplied, the tree will expand until all leaves are pure or contain less than the Minimum Samples To Split Node value. Minimum value is 1 . |
|
Minimum Samples To Split Node | Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is 2 . |
2 |
Minimum Samples At Leaf | Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is 1 . |
1 |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | Local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
Decision Tree Classifier explained in real-life: picking a vacation destination.
Gaussian Naïve Bayes Classifier
A Gaussian Naïve Bayes Classifier uses Bayes' theorem and an independence assumption to obtain a likelihood that a data record belongs to a given class. It works off the simplifying assumption that each classes probability distribution is Gaussian
See APIs for more details
q API: .qsp.ml.gaussianNB
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Target Column Name | Name of the column containing the data's target labels. | |
Prediction Column Name | Name of the column which is to house the model's predicted class labels. |
Optional Parameters:
name | description | default |
---|---|---|
Prior Class Probabilities | Prior probabilities for each class. This refers to the probability that a random data record is an instance the given class before any evidence or other factors are considered. Minimum value for each prior is 0.0 . If no value is supplied, the priors will be adjusted according to the data. |
|
Variance Smoothing | Value added to the Gaussian distributions variance to widen the curve and account for more samples further away from the distributions mean. Minimum value is 0.0 . |
1e-9 |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | Local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
Gaussian Naive Bayes: What you need to know?
K-Nearest Neighbors Classifier
A K-Nearest Neighbors Classifier predicts the class of a point by using the 'k' points that are nearest (neighbors) to our given point. The class of each of the k-nearest neighbors is factored into a calculation which assigns a class to our unclassified point
See APIs for more details
q API: .qsp.ml.kNeighborsClassifier
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Target Column Name | Name of the column containing the data's target labels. | |
Prediction Column Name | Name of the column which is to house the model's predicted class labels. |
Optional Parameters:
name | description | default |
---|---|---|
Number Of Neighbors | Number of already classified points, which lie closest to a given unclassified point (neighbors), to factor in when predicting the points class. Minimum value is 1 . |
5 |
Neighbor Selection Weights | Weight function used to decide how much weight to give to the classes of each of the neighboring points when predicting a points class. | Uniform |
Algorithm For Finding Neighbors | Algorithm used to parse the vector space and decide which points are the nearest neighbors to a given unclassified point. | Auto |
Size Of The Leaves | If Ball Tree or KD Tree is selected as the Algorithm For Finding Neighbors , this is the minimum number of points in a given leaf node, after which point, brute force algorithm will be used to find the nearest neighbors. Setting this value either very close to 1 or very close to the total number of points in the data may have a noticeable impact on model runtime. Minimum value is 1 . |
30 |
Distance Metric | Distance metric used to measure the distance between points. | Minkowski |
Minkowski Power | Power parameter used when the distance metric Minkowski Distance is selected. Minimum value is 0 . |
2 |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | Local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
Quick Introduction to K-Nearest Neighbors Classifier
Logistic Classifier
A Logistic Classifier is a classification model in which the conditional probability of one of the possible output classes is assumed to be equal to a linear combination of the input variables, transformed by the logistic function. This is an online model so after the initial model fit, further training on new data can be done to update its behavior
See APIs for more details
q API: .qsp.ml.logClassifier
Python API: kxi.sp.ml.log_classifier
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Target Column Name | Name of the column containing the data's target labels. | |
Prediction Column Name | Name of the column which is to house the model's predicted class labels. |
Optional Parameters:
name | description | default |
---|---|---|
Learning Rate | Learning rate value used in the optimization function to dictate the step size taken towards the minimum of the loss function at each iteration. A high value will override information about previous data more in favor of newly acquired information. Generally, this value is set to be very small. Minimum value is 0.0 . |
0.01 |
Maximum Iterations | Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is 1 . |
100 |
Optimization Tolerance | Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is 0.0 . |
0.00001 |
Random Seed | Integer value used to control the randomness of the model's initialization state. Specifying this allows for reproducible results across function calls. If a value is not supplied, the randomness is based off the current timestamp. | |
Regularization Method | Penalty term used to shrink the coefficients of the less contributive variables. | L2 |
Regularization Coefficient | Lambda value used to define the strength of the regularization applied. The higher this value is, the stronger the regularization will be. Minimum value is 0.0 . |
0.001 |
ElasticNet Mixing Parameter | If Elastic Net is chosen as the regularization method, this parameter determines the balance between the L1 and L2 penalty terms. If this value is set to 0 , this is the same as using L2 regularization, if this value is set to 1 , this is the same as using L1 regularization. This value must lie in the range (0.0, 1.0] . |
0.5 |
Decay Coefficient | Describes how much weight to give to historical predictions from previously fit iterations. The higher this value, the less important the historic predictions will be. Minimum value is 0.0 . |
0.0 |
Momentum Coefficient | Coefficient used to help accelerate the gradient vectors in the right direction, leading to faster convergence. Minimum value is 0.0 . |
0.0 |
Add Fit Intercept | Whether to add a constant value (intercept) to the classification function - c in the y=mx+c . |
1b |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | Local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
Logistic Classification using SGD, Logistic Regression Classifier
Multi-Layer Perceptron Classifier
A Multi-Layer Perceptron Classifier is a simple feed-forward artificial neural network (ANN) model. It tackles classification problems by minimizing the log-loss function.
See APIs for more details
q API: .qsp.ml.MLPClassifier
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Target Column Name | Name of the column containing the data's target labels. | |
Prediction Column Name | Name of the column which is to house the model's predicted class labels. |
Optional Parameters:
name | description | default |
---|---|---|
Hidden Layer Sizes | List of the number of neurons in each hidden layer in the neural network. Minimum size of each layer is 1 . |
enlist 100 |
Activation Function Name | Activation function used to transform the output of the hidden layers into a single scalar value. | ReLU |
Optimization Function Name | Optimization function used to search for the inputs that minimize/maximize the results of the model function. | Adam |
L2 Regularization Term Strength | Strength of the L2 regularization term. The L2 regularization term is divided by the sample size when added to the loss function and is used to reduce the chance of model overfitting. Minimum value is 0.0 . |
0.0001 |
Batch Size | Number of training examples used in each stochastic optimization iteration. Minimum value is 1 . |
auto |
Learning Rate | Learning rate schedule for updating the weights of the neural network. Only used when the optimization function set is Stochastic Gradient Descent . |
constant |
Starting Learning Rate | Starting learning rate value. This controls the step-size used when updating the neural network weights. Not used when the optimization function is set to Limited-Memory BFGS . Minimum value is 0.0 . |
0.001 |
Exponent For Inverse Scaling | Exponent used to update the learning rate when the learning rate is set to Inverse Scaling Learning Rate and the optimization function is set to Stochastic Gradient Descent . |
0.5 |
Maximum Iterations | Maximum number of optimization epochs/iterations. The model will iterate until it converges or until it completes this number of iterations. Minimum value is 1 . |
200 |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | Local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
Multilayer Perceptron, Multilayer Perceptron: A simple multilayer neural network
Quadratic Discriminant Analysis Classifier
A Quadratic Discriminant Analysis Classifier uses a quadratic decision boundary to try to separate points from different classes. This decision boundary is generated by fitting a Gaussian conditional density function to each class in the data and using Bayes' rule to classify new points
See APIs for more details
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Target Column Name | Name of the column containing the data's target labels. | |
Prediction Column Name | Name of the column which is to house the model's predicted class labels. |
Optional Parameters:
name | description | default |
---|---|---|
Prior Class Probabilities | Prior probabilities for each class. This refers to the probability that a random data record is an instance of the given class before any evidence or other factors are considered. Minimum value for each prior is 0.0 . If no value is supplied, the priors will be adjusted according to the data. |
|
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | Local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
Quadratic Discriminant Analysis
Random Forest Classifier
A Random Forest Classifier is a meta-estimator that fits a number of decision tree classifiers to various sub-samples of the dataset. The average of the predictions from these classifiers is then used to inform the output of the random forest model. Using multiple smaller models helps to increase predictive accuracy and reduce overfitting
See APIs for more details
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Target Column Name | Name of the column containing the data's target labels. | |
Prediction Column Name | Name of the column which is to house the model's predicted class labels. |
Optional Parameters:
name | description | default |
---|---|---|
Number Of Estimators | Maximum number of decision tree estimators to train and use. Each estimator is fit on the dataset and adjusted to focus on difficult classification cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is 1 . |
100 |
Split Quality Measure | Criteria function used to measure the quality of a split each time a decision tree node is split into children. | gini |
Maximum Tree Depth | Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If no value is supplied, the tree will expand until all leaves are pure or contain less than the Minimum Samples To Split Node value. Minimum value is 1 . |
|
Minimum Samples To Split Node | Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is 2 . |
2 |
Minimum Samples At Leaf | Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is 1 . |
1 |
Leaf Minimum Weight Fraction | Minimum proportion of sample weight required to be at any leaf node relative to the total weight of all samples in the tree. In this case, each sample carries equal weight. This value must lie in the range [0.0, 1.0] . |
0.0 |
Maximum Features Considered | Maximum number of features to consider when looking for the best way to split a node. | auto |
Maximum Leaf Nodes | Maximum number of leaf nodes in each decision tree. This forces the tree to grow in a best-first fashion with the best nodes based on their relative reduction in impurity. If no value is supplied, there may be unlimited leaf nodes. Minimum value is 1 . |
|
Minimum Impurity Decrease | Minimum impurity decrease value required to split a node. If the tree impurity would not decrease by more than this value, the node will not be split. Minimum value is 0.0 . |
0.0 |
Use Bootstrap Samples | Whether bootstrap samples are used when building the decision trees. If this box is unchecked, the whole dataset is used to build each tree. | 1b |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | Local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
Clustering Models
The below clustering models are a group of unsupervised learning models used to group similar data records together into clusters. These models do not need direction from the user but will define these groups using the features in the input data.
Affinity Propagation Clustering
Affinity Propagation is a graph based clustering algorithm which does not require users to specify the number of clusters in the data beforehand. This algorithm identifies the most representative points in the data (exemplars) and then iteratively sends messages between pairs of these exemplars to decide whether one exemplar represents the other. Redundant exemplars are iteratively removed until there is convergence on a final set of exemplars. These become the final output clusters
See APIs for more details
q API: .qsp.ml.affinityPropagation
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Cluster Prediction Column Name | Name of the column which is to house the model's predicted cluster labels. |
Optional Parameters:
name | description | default |
---|---|---|
Damping Factor | Provides numerical stabilization and limits oscillations and “overshooting” of parameters by controlling the extent to which the current value is maintained relative to incoming values. This value must lie in the range [0.5, 1.0) . |
0.5 |
Maximum Iterations | Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is 1 . |
200 |
Convergence Iterations | Number of iterations, during which there is no change in the number of estimated clusters, needed to stop the convergence. Minimum value is 1 . |
15 |
Affinity | Statistical measure used to define similarities between the representative points. | Negative Squared Euclidean Distance |
Random State | Integer value used to control the state of the random generator used in this model. Specifying this allows for reproducible results across function calls. If a value is not supplied, the randomness is based off the current timestamp. | |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | Local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
Affinity Propagation Clustering, How Affinity Propagation works
Birch Clustering
Birch is a hierarchical clustering method which involves the construction of a Clustering Feature Tree (CFT) where the leaves on the tree are the centroids of the clusters. It is particularly useful for large datasets due to the limited memory required by the tree structure
See APIs for more details
q API: .qsp.ml.birch
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Cluster Prediction Column Name | Name of the column which is to house the model's predicted cluster labels. |
Optional Parameters:
name | description | default |
---|---|---|
Subcluster Radius Threshold | Maximum cluster radius allowed for a new sample to be merged into its closest subcluster. If adding this point to a cluster would cause that clusters radius to exceed this maximum, the new point is not added and instead becomes a new subcluster. Minimum value is 0.0 . |
0.5 |
Maximum Clusters In Node | Maximum number of subclusters in each node in the tree, where each leaf node contains a subcluster. If a new sample arrives causing the number of subclusters to exceed this value for a given node, the node is split into two nodes. Minimum value is 1 . |
50 |
Number Of Clusters | Final number of clusters to be defined by the model. Minimum value is 2 . |
3 |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | Local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
Cure Clustering
Clustering Using REpresentatives (CURE) is a hierarchical clustering method whereby each point is initially set as its own cluster and then similar clusters are merged until convergence on a final set of clusters. As clusters get bigger, each cluster is represented by a specified number of points which are used to measure similarity of clusters
See APIs for more details
q API: .qsp.ml.cure
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Cluster Prediction Column Name | Name of the column which is to house the model's predicted cluster labels. |
Optional Parameters:
name | description | default |
---|---|---|
Distance Function | Distance function used to measure the distance between points when clustering. | Euclidean Distance |
Representative Points | Number of representative points to choose from each cluster to compare the similarity of clusters for the purposes of potentially merging them. Minimum value is 1 . |
2 |
Compression Ratio | Compression factor used for grouping the representative points together. Minimum value is 0.0 . |
0.0 |
Number Of Clusters | Final number of clusters to be defined by the model. Minimum value is 2 . |
|
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | Local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
CURE Algorithm, Clustering in Machine Learning
DBSCAN Clustering
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a clustering algorithm that initially finds core samples of high-density points and expands clusters out from these samples. This works well when the data contains clusters of a similar density
See APIs for more details
q API: .qsp.ml.dbscan
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Cluster Prediction Column Name | Name of the column which is to house the model's predicted cluster labels. |
Optional Parameters:
name | description | default |
---|---|---|
Distance Function | Distance function used to measure the distance between points when clustering. | Euclidean Distance |
Minimum Points To Create Cluster | Minimum number of points required to be close together before this group of points is defined as a cluster. The distance away from one another must be less than or equal to the Maximum Distance Between Points parameter. Minimum value is 1 . |
2 |
Maximum Distance Between Points | Maximum distance points are allowed to be away from one another to still be classed as close enough to be in the same cluster. Minimum value is 0.0 . |
1.0 |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | Local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
DBSCAN, DBSCAN Clustering - Explained
Sequential K-Means Clustering
Sequential K-Means is a clustering algorithm which sets out to minimize the distance between each point and its respective cluster centroid. Once an initial set of cluster centroids are determined, the algorithm first assigns each value to a cluster, and then updates the cluster centroids, so they are in the center of their respective data points. This is done iteratively until convergence. This is an online model so after the initial model fit, further training on new data can be done to update its behavior
See APIs for more details
q API: .qsp.ml.sequentialKMeans
Python API: kxi.sp.ml.sequential_k_means
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Prediction Column Name | Name of the column which is to house the model's predicted cluster labels. |
Optional Parameters:
name | description | default |
---|---|---|
Distance Function | Distance function used to measure the distance between points when clustering. | Euclidean Distance |
Number Of Clusters | Final number of clusters to be defined by the model. Minimum value is 2 . |
3 |
Cluster Centroid Initialization | Initialization method for the cluster centroids. This value can either be K-means++ or randomized initialization. | K-Means++ |
Apply Forgetful | Whether to apply forgetful Sequential K-Means or normal Sequential K-Means. Forgetful Sequential K-Means will allow the model to evolve its cluster boundaries over time by forgetting about old data as new data comes in. | 1b |
Learning Rate | Controls the rate at which the concept of forgetfulness is applied within the algorithm. If forgetful Sequential K-Means is applied, this value defines how much past cluster centroid information is retained, if not, this is set to 1/(n+1) where n is the number of points in a given cluster. This value must lie in the range [0.0, 1.0] . |
0.1 |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | Local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
Regression Models
The below regression models are a group of supervised models used to predict a target value for each data record. This is done by using a user-provided set of labeled data records (the features and their corresponding target values) to guide the model's predictions.
AdaBoost Regressor
An AdaBoost Regressor fits a sequence of weak learners on the data and aggregates their predictions to produce a final prediction. In each iteration, the results of the previous estimator are used to adjust the weights of incorrectly predicted records to ensure the subsequent model focuses on correctly labeling these data records.
See APIs for more details
q API: .qsp.ml.adaBoostRegressor
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Target Column Name | Name of the column containing the data's target labels. | |
Prediction Column Name | Name of the column which is to house the model's predicted target values. |
Optional Parameters:
name | description | default |
---|---|---|
Number Of Estimators | Maximum number of estimators to train in each boosting iteration. Each estimator is fit on the dataset and adjusted to focus on difficult prediction cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is 1 . |
50 |
Learning Rate | Weight applied to each regressor at each boosting iteration. The higher this value, the more each regressor will contribute to our final model. This value depends highly on the maximum number of estimators. Minimum value is 0.0 . |
1.0 |
Loss Function | Loss function used to update the contributing weights of the regressors after each boosting iteration. | linear |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. This value must lie in the range (0.0, inf) . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
Understanding the AdaBoost Regressor Algorithm
Decision Tree Regressor
A Decision Tree Regressor predicts the value of a target by learning simple decision rules inferred from the data. These rules are used to split each node in the decision tree into child nodes to minimize a given loss function. It will stop splitting based on the configured Maximum Tree Depth
, Minimum Samples To Split
, and Minimum Samples At Leaf
model parameters.
See APIs for more details
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Target Column Name | Name of the column containing the data's target labels. | |
Prediction Column Name | Name of the column which is to house the model's predicted target values. |
Optional Parameters:
name | description | default |
---|---|---|
Split Quality Measure | Criteria function used to measure the quality of a split each time a decision tree node is split into children. | squared_error |
Node Splitting Strategy | Strategy used to split the nodes in the tree. | |
best |
||
Maximum Tree Depth | Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If no value is supplied, the tree will expand until all leaves are pure or contain less than the Minimum Samples To Split Node value. Minimum value is 1 . |
|
Minimum Samples To Split | Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is 2 . |
2 |
Minimum Samples At Leaf | Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is 1 . |
1 |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
Decision Tree Regressor Explained in Depth
Gradient Boosting Regressor
A Gradient Boosting Regressor is an additive model that is built using a sequence of sparse regression estimators. It allows for the optimization of an arbitrary differentiable loss function. At each stage, a regression tree is fit on the negative gradient of a given loss function.
See APIs for more details
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Target Column Name | Name of the column containing the data's target labels. | |
Prediction Column Name | Name of the column which is to house the model's predicted target values. |
Optional Parameters:
name | description | default |
---|---|---|
Number Of Estimators | Maximum number of tree estimators to train. Each estimator is fit on the dataset and adjusted to focus on difficult prediction cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is 1 . |
100 |
Loss Function | Loss function that is optimized using gradient descent to get the best model fit. | squared_error |
Learning Rate | Controls the loss function used to set the weight of each regressor at each boosting iteration. The higher this value, the more each regressor will contribute to our final model. This value depends highly on the maximum number of estimators. Minimum value is 0.0 . |
0.1 |
Minimum Samples To Split | Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is 2 . |
2 |
Minimum Samples At Leaf | Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is 1 . |
1 |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be | |
stored. | local |
|
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
Gradient Boosting for Beginners
K-Nearest Neighbors Regressor
A K-Nearest Neighbors Regressor predicts the target value for a given point by using the 'k' points that are nearest (neighbors) to our given point. The target associated with each of the k-nearest neighbors is factored into an interpolation calculation which assigns a target prediction to our point.
See APIs for more details
q API: .qsp.ml.kNeighborsRegressor
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Target Column Name | Name of the column containing the data's target labels. | |
Prediction Column Name | Name of the column which is to house the model's predicted target values. |
Optional Parameters:
name | description | default |
---|---|---|
Number Of Neighbors | Number of points already labeled or predicted, which lie closest to a given unclassified point (neighbors), to factor in when predicting the points class. Minimum value is 1 . |
5 |
Neighbor Selection Weights | Weight function used to decide how much weight to give to each of the neighboring points when predicting the target of a point. | uniform |
Distance Metric | Distance metric used to measure the distance between points. | minkowski |
Algorithm For Finding Neighbors | Algorithm used to parse the vector space and decide which points are the nearest neighbors. | auto |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
k-nearest neighbors algorithm, also
Lasso Regressor
A Lasso Regressor is a linear model using L1 as the regularization term. If the L1 Regularization Coefficient
parameter is set to 0
, hence removing the L1 regularization term, this is the same as running a Linear Regression model.
See APIs for more details
q API: .qsp.ml.lasso
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Target Column Name | Name of the column containing the data's target labels. | |
Prediction Column Name | Name of the column which is to house the model's predicted target values. |
Optional Parameters:
name | description | default |
---|---|---|
Maximum Iterations | Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is 1 . |
1000 |
L1 Regularization Coefficient | Constant that controls the regularization strength by multiplying the L1 regularization term. Minimum value is 0.0 . |
1,0 |
Optimization Tolerance | Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is 0.0 . |
0.0001 |
Add Fit Intercept | Whether to add a constant value (intercept) to the regression function - c in y=mx+c . |
true |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
Linear Regressor
A Linear Regressor is a regression model in which the target value for a data record is assumed to be equal to a linear combination of the input variables. This is an online model so after the initial model fit, further training on new data can be done to update its behavior.
See APIs for more details
q API: .qsp.ml.linearRegression
Python API: kxi.sp.ml.linear_regression
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Target Column Name | Name of the column containing the data's target labels. | |
Prediction Column Name | Name of the column which is to house the model's predicted target values. |
Optional Parameters:
name | description | default |
---|---|---|
Learning Rate | Learning rate schedule used during model training to control how the model’s coefficients are changed each time the model is updated. Minimum value is 0.0 . |
0.01 |
Maximum Iterations | Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is 1 . |
100 |
Optimization Tolerance | Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is 0.0 . |
0.00001 |
Random Seed | Integer value used to control the randomness of the model's initialization state. Specifying this allows for reproducible results across function calls. If a value is not supplied, the randomness is based off the current timestamp. | |
Regularization Method | Penalty term used to shrink the coefficients of the less contributive variables. | l2 |
Regularization Coefficient | Lambda value used to define the strength of the regularization applied. The higher this value is, the stronger the regularization will be. Minimum value is 0.0 . |
0.001 |
Elastic Net Mixing Parameter | If Elastic Net is used as the regularization method, this parameter determines the balance between the L1 and L2 penalty terms. If this value is set to 0 , this is the same as using L2 regularization, if this value is set to 1 , this is the same as using L1 regularization. This value must lie in the range [0.0, 1.0] . |
0.5 |
Decay Coefficient | Describes how much weight to give to historical predictions from previously fit iterations. The higher this value, the less important historic predictions will be. Minimum value is 0.0 . |
0 |
Momentum Coefficient | Coefficient used to help accelerate the gradient vectors in the right direction, leading to faster convergence. Minimum value is 0.0 . |
0 |
Add Fit Intercept | Whether to add a constant value (intercept) to the regression function - c in y=mx+c . |
true |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
Random Forest Regressor
A Random Forest Regressor is a meta-estimator that fits a number of decision tree regressors to various sub-samples of the dataset. The average of the predictions from these regressors is then used to inform the output of the random forest model. Using multiple smaller models helps to increase predictive accuracy and reduce overfitting.
See APIs for more details
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Target Column Name | Name of the column containing the data's target labels. | |
Prediction Column Name | Name of the column which is to house the model's predicted target values. |
Optional Parameters:
name | description | default |
---|---|---|
Number Of Estimators | Maximum number of decision tree estimators to train and use. Each estimator is fit on the dataset and adjusted to focus on difficult prediction cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is 1 . |
100 |
Split Quality Measure | Criteria function used to measure the quality of a split each time a decision tree node is split into children. | squared_error |
Maximum Tree Depth | Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If no value is supplied, the tree will expand until all leaves are pure or contain less than the Minimum Samples To Split Node value. Minimum value is 1 . |
|
Minimum Samples To Split | Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is 2 . |
2 |
Minimum Samples At Leaf | Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is 1 . |
1 |
Buffer Size | Number of records to observe before fitting the model. If set to 0 , the model will be fit on the first batch. Minimum value is 0 . |
0 |
Registry Type | Type of the registry where the fitted model is to be stored. | local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. |
Registry Functions
The below registry functions are a group of functions that deal with models in the registry.
Predict
Make a prediction using a model that is stored in the registry. This model will be loaded in from the registry, the input features will be run through this model, and the model will return a target prediction
See APIs for more details
q API: .qsp.ml.registry.predict
•
Python API: kxi.sp.ml.predict
Required Parameters:
name | description | default |
---|---|---|
Feature Column Name | Name of the column(s) containing the features from the data. | |
Prediction Column Name | Name of the column which is to house the model's predicted class/cluster/target values. |
Optional Parameters:
name | description | default |
---|---|---|
Registry Type | Type of the registry where the fitted model is to be stored. | Local |
Registry Path | Location within the specified registry type where the fitted model is to be stored. This can be a local path or a cloud storage path. | /tmp |
Experiment Name | Name of the experiment in the registry that the fitted model is to be stored under. If no value is supplied, the model will be stored under unnamedExperiments . |
|
Model Name | Name of the model to be stored in the registry. If no value is supplied, the model will not be stored in the registry. | |
Model Version | Version of the fitted model we want to load in the registry. If no value is supplied, the latest version of the model will be loaded. |
FRESH
FRESH stands for FeatuRe Extraction and Scalable Hypothesis testing.
FRESH Create
This model creates aggregate statistical features based on the input data
See APIs for more details
q API: .qsp.ml.freshCreate
•
Python API: kxi.sp.ml.fresh_create
Required Parameters:
name | description | default |
---|---|---|
Column Name | Name of the column(s) in the data to use for FRESH feature generation. |
Optional Parameters:
name | description | default |
---|---|---|
FRESH Feature Name | Name of the FRESH feature(s) we want to define from the data. If no value is supplied, all FRESH features will be defined. A full list of these features can be found here. |
Preprocessing Functions
The below preprocessing functions are a group of functions used to manipulate and restructure the data to get it into a format which will enhance the predictive performance of the models we want to train on this data.
Drop Constant
Removes columns that contain a single constant value throughout from an input table
See APIs for more details
q API: .qsp.ml.dropConstant
•
Python API: kxi.sp.ml.drop_constant
Required Parameters:
name | description | default |
---|---|---|
Column Name | Name of the column(s) in the input table to remove because they contain a constant value throughout. |
Optional Parameters:
name | description | default |
---|---|---|
Buffer Size | Number of records to observe before dropping the constant columns from the data. If set to 0 , the operator will be applied on the first batch. Minimum value is 0 . |
0 |
Label Encoder
Encodes symbol or string typed columns in an input table as integer values. E.g., if there were 3 symbols in the original categorical column, the Label Encoded column would contain 3 integers with each value corresponding to one of the 3 original symbols
See APIs for more details
q API: .qsp.ml.labelEncode
•
Python API: kxi.sp.ml.label_encode
Required Parameters:
name | description | default |
---|---|---|
Column Name | Name of the column(s) in the input table whose labels we want to encode. |
Optional Parameters:
name | description | default |
---|---|---|
Buffer Size | Number of records to observe before dropping the constant columns from the data. If set to 0 , the operator will be applied on the first batch. Minimum value is 0 . |
0 |
Min-Max Scaler
Scales the numerical columns in an input table by setting the minimum value in each column to 0
, setting the maximum value in the column to 1
, and then scaling the values which lie between these minimum and maximum values so that the relative distance between each value is maintained but that all values now lie between 0
and 1
.
See APIs for more details
q API: .qsp.ml.minMaxScaler
•
Python API: kxi.sp.ml.min_max_scaler
Required Parameters:
name | description | default |
---|---|---|
Column Name | Name of the column(s) in the input table whose values we want to scale. |
Optional Parameters:
name | description | default |
---|---|---|
Raise Range Error | Whether to raise a range error if new input data falls outside the minimum and maximum data range observed during the initialization of the operator. | 0b |
Buffer Size | Number of records to observe before scaling the numeric columns in the data. If set to 0 , the operator will be applied on the first batch. Minimum value is 0 . |
0 |
One-Hot Encoder
Encodes each symbol and string typed column as a number of integer typed columns. If a categorical column has ‘n’ distinct symbols, this will be encoded into ‘n’ individual new columns, one for each symbol. Each of these new columns will be populated by either a 0
or a 1
stating whether a given row in the table was previously populated with the underlying symbol value this column represents
See APIs for more details
q API: .qsp.ml.oneHot
•
Python API: kxi.sp.ml.one_hot_encode
Required Parameters:
name | description | default |
---|---|---|
Column Name | Name of the column(s) in the input table to one-hot encode. |
Optional Parameters:
name | description | default |
---|---|---|
Buffer Size | Number of records to observe before scaling the numeric columns in the data. If set to 0 , the operator will be applied on the first batch. Minimum value is 0 . |
0 |
Standardize
Scales the values in each numeric column in an input table so that the mean of these columns becomes 0
and the standard deviation of these columns becomes 1
. This can be done simply by subtracting the columns mean value from each value in the column and then dividing the resulting values by the column’s standard deviation
See APIs for more details
q API: .qsp.ml.standardize
•
Python API: kxi.sp.ml.standardize
Required Parameters:
name | description | default |
---|---|---|
Column Name | Name of the column(s) in the input table to standardize. |
Optional Parameters:
name | description | default |
---|---|---|
Buffer Size | Number of records to observe before scaling the numeric columns in the data. If set to 0 , the operator will be applied on the first batch. Minimum value is 0 . |
0 |