Model generation and deployment
The following example provides a sample workflow for:
- Generation of a model to be used in a production environment
- Persistence of this model to cloud storage for use in deployment
- Deployment of the model and preprocessing steps to a production environment
This is intended to provide a sample of such a workflow and is not intended to be fully descriptive; users are encouraged to follow the API documentation here to make full use of the functionality.
Model Generation
1) Start the Docker container as a development environment following the instructions here.
Ensure that the image has been started such that it points explicitly to a cloud storage bucket; in the example below this is done using S3.
Note
For this example a user is expected to have write access to a pre-generated AWS bucket at s3://my-aws-storage.
docker run -it -p 5000:5000 \
-e "KDB_LICENSE_B64=$KDB_LICENSE_B64" \
-e "AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID" \
-e "AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
-e "AWS_REGION=$AWS_REGION" \
registry.dl.kx.com/kxi-ml:latest \
-aws s3://my-aws-storage -p 5000
2) Retrieve a dataset for generation of a model
In this case we are using the Wisconsin Breast Cancer dataset to predict whether a tumour is malignant or benign. This example broadly follows the one outlined in the ml-notebooks here.
q)dataset :.p.import[`sklearn.datasets;`:load_breast_cancer][]
q)features:dataset[`:data]`
q)target :dataset[`:target]`
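The features matrix retrieved above contains 569 samples, each described by 30 numeric measurements; a quick sanity check of the shapes:
q)count features         / 569 samples
q)count first features   / 30 features per sample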
3) Split the data into a training and testing set to validate model performance
To validate that the model is performing appropriately, we set aside a testing set which can be used to independently assess its performance. This is done using the function .ml.trainTestSplit provided with the kdb Insights Machine Learning package. To ensure enough samples are seen during the training phase, the test size is set to 10% of the original data.
q)data:.ml.trainTestSplit[features;target;0.1]
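The result is a dictionary containing the training and testing splits under the keys xtrain, ytrain, xtest and ytest, which are used throughout the remaining steps. As a quick check of the split sizes:
q)count each data[`xtrain`xtest]   / roughly 90% and 10% of the 569 rows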
4) Build and train a model
In this example we use embedPy to generate the models. These could equally be created using functionality within the kdb Insights Machine Learning package; however, this example is intended to showcase the use of Python models in this workflow.
// Fit a Decision Tree Classifier
q)clf:.p.import[`sklearn.tree]`:DecisionTreeClassifier
q)clf:clf[`max_depth pykw 3]
q)clf[`:fit][data`xtrain;data`ytrain];
// Fit a Random Forest Classifier
q)rf:.p.import[`sklearn.ensemble]`:RandomForestClassifier
q)rfkwargs:`class_weight`max_depth!(`balanced;80)
q)rf:rf[pykwargs rfkwargs]
q)rf[`:fit][data`xtrain;data`ytrain]
5) Validate model performance
Calculate the accuracy of predictions for each of the models:
q)show .ml.accuracy[clf[`:predict][data`xtest]`;data`ytest];
q)show .ml.accuracy[rf[`:predict][data`xtest]`;data`ytest];
6) Publish models to the Registry
Once you are happy with the performance of the models, publish them to the Machine Learning Registry at s3://my-aws-storage. This follows the documentation outlined in the Registry section here.
// Set the decision tree classifier to the 'Wisconsin' experiment
q).ml.registry.set.model[::;"Wisconsin";clf;"DecisionTree";"sklearn";::]
// Set the random forest classifier to the 'Wisconsin' experiment
q).ml.registry.set.model[::;"Wisconsin";rf;"RandomForest";"sklearn";::]
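To confirm that both models have been stored, the contents of the registry can be listed. The call below is a minimal sketch assuming the default registry location configured at container start-up; see the Registry documentation for the full retrieval API:
q).ml.registry.get.modelStore[::;::]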
Model Docker Deployment
1) Generate a spec.q file defining deployment of the model generated above
// spec.q
.qsp.run
  .qsp.read.fromCallback[`publish]          / ingest tables pushed to the `publish callback
  .qsp.ml.registry.predict[
    {select from x};                        / applied to each batch to extract the data used for prediction
    `yhat;                                  / column in which to store the predictions
    .qsp.use (!) . flip (
      (`registry ; enlist[`aws]!enlist "s3://my-aws-storage");   / cloud storage location of the ML Registry
      (`model    ; "RandomForest");         / name of the model to retrieve
      (`version  ; 1 0)                     / version of the model to retrieve (major minor)
      )
    ]
  .qsp.write.toConsole[]                    / write the predictions to the console
2) Set up a Docker Compose file for the example
# docker-compose.yaml
version: "3.3"
services:
  controller:
    image: registry.dl.kx.com/kxi-sp-controller:0.11.0
    ports:
      - 6000:6000
    environment:
      - KDB_LICENSE_B64                       # Which kdb+ license to use, see note below
    command: ["-p", "6000"]
  worker:
    image: registry.dl.kx.com/kxi-ml:0.8.0
    ports:
      - 5000:5000
    volumes:
      - .:/app                                # Bind in the spec.q file
    environment:
      - KXI_SP_SPEC=/app/spec.q               # Point to the bound spec.q file
      - KXI_SP_PARENT_HOST=controller:6000    # Point to the parent Controller
      - KDB_LICENSE_B64
      - AWS_ACCESS_KEY_ID                     # Use AWS_ACCESS_KEY_ID defined in the process environment
      - AWS_SECRET_ACCESS_KEY                 # Use AWS_SECRET_ACCESS_KEY defined in the process environment
      - AWS_REGION                            # Use AWS_REGION defined in the process environment
      - KXI_SP_CHECKPOINT_FREQ=0              # Set the checkpoint frequency to 0
    command: ["-p", "5000"]
3) Start the container and follow logs
$ docker-compose up -d
$ docker-compose logs -f
4) Connect to the q process running the pipeline and push data for prediction
$ q p.q
q)h:hopen 5000
q)data :.p.import[`sklearn.datasets;`:load_breast_cancer][]
q)feat :data[`:data]`
q)fnames:`$ssr[;" ";"_"]each data[`:feature_names][`:tolist][]`
q)tab:flip fnames!flip feat
q)h(`publish;3?tab)
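Because the pipeline writes its output to the console, the predictions (the input table with the appended yhat column) appear in the logs of the worker service started in step 3:
$ docker-compose logs -f worker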
Model Kubernetes Deployment
1) Generate a spec.q file defining deployment of the model generated above
// spec.q
.qsp.run
  .qsp.read.fromCallback[`publish]          / ingest tables pushed to the `publish callback
  .qsp.ml.registry.predict[
    {select from x};                        / applied to each batch to extract the data used for prediction
    `yhat;                                  / column in which to store the predictions
    .qsp.use (!) . flip (
      (`registry ; enlist[`aws]!enlist "s3://my-aws-storage");   / cloud storage location of the ML Registry
      (`model    ; "RandomForest");         / name of the model to retrieve
      (`version  ; 1 0)                     / version of the model to retrieve (major minor)
      )
    ]
  .qsp.write.toConsole[]                    / write the predictions to the console
2) Follow the Kubernetes setup outlined here to generate a Stream Processor Coordinator.
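Step 3 below assumes that the Coordinator's REST interface is reachable on localhost:5000. If it is not exposed directly, port-forward the Coordinator service first, substituting the name of the Coordinator service created during setup:
$ kubectl port-forward svc/<INSERT_COORDINATOR> 5000:5000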
3) Deploy the SP ML Worker image with the specification defined in step 1 above.
$ jobname=$(curl -X POST http://localhost:5000/pipeline/create -d \
"$(jq -n --arg spec "$(cat spec.q)" \
'{
name : "ml-example",
type : "spec",
base : "q-ml",
config : { content: $spec },
settings : { minWorkers: "1", maxWorkers: "1" },
env : { AWS_ACCESS_KEY_ID : "'"$AWS_ACCESS_KEY_ID"'",
AWS_SECRET_ACCESS_KEY : "'"$AWS_SECRET_ACCESS_KEY"'",
AWS_REGION : "'"$AWS_REGION"'",
KXI_SP_CHECKPOINT_FREQ : 0}
}' | jq -asR .)" | jq -r .id)
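To check that the pipeline has been created and its Worker is running, the Coordinator can be queried; the endpoint below is a sketch, refer to the Stream Processor REST API documentation for the available calls:
$ curl http://localhost:5000/pipeline/status/$jobname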
4) Port-forward the SP Worker so that it is reachable locally on port 7000
$ kubectl port-forward <INSERT_WORKER> 7000:8080
5) In a new terminal, follow the logs of the worker process
$ kubectl logs <INSERT_WORKER> spwork -f
6) Start a q process and publish data to the SP ML Pipeline
$ q p.q
q)h:hopen 7000
q)data :.p.import[`sklearn.datasets;`:load_breast_cancer][]
q)feat :data[`:data]`
q)fnames:`$ssr[;" ";"_"]each data[`:feature_names][`:tolist][]`
q)tab:flip fnames!flip feat
q)h(`publish;3?tab)