Package Content
As discussed here, packages currently contain the following fundamental components:
- A manifest file describing the content of a package and its dependencies.
- An initialization script defining the file/files to be loaded when a package is loaded.
- Arbitrary code written in Python/q which can be used when loaded by a package.
- User-Defined Functions (UDFs), described below, which mark functions of particular significance within your source code, namely those intended to be deployed directly to a pipeline.
Package Structure
A package in its simplest form generated by the CLI consists of the following structure:
$ kxi package init test_package
$ tree test_package
test_package
├── init.q
└── manifest.json
You can extend this to contain arbitrary code. For example, the following more complex package contains code in an arbitrary structure, which is valid provided the manifest.json file remains in place:
└── ml
    ├── init.q
    ├── manifest.json
    ├── ml.q
    └── machine_learning
        ├── preprocessing
        │   └── preproc.q
        ├── model.py
        └── model.q
Using the APIs provided you can then load the code as described within the q and Python API outlines.
Manifest File
The manifest.json file, which is present when a new package is initialized, is centrally important to the use of a package. Without a defined manifest.json file, a package cannot be used by kdb Insights Enterprise or the APIs provided for package interaction.
On initialization of a package you are presented with a manifest.json file with the following structure:
{
  "name": "qpackage",
  "version": "0.0.1",
  "license": "",
  "dependencies": {},
  "entrypoints": {
    "default": "init.q"
  },
  "metadata": {
    "description": "",
    "authors": {
      "bob": {
        "email": ""
      }
    },
    "entitlements": {}
  },
  "system": {
    "_pakx_version": "1.1.0"
  },
  "databases": {},
  "pipelines": {},
  "udfs": {
    "names": [
      "udf"
    ]
  }
}
The following table provides a brief description of each configurable section within the manifest.json and whether defining its content is required for the package to be used effectively.
Warning
The keys defined at initialization must not be deleted.
section | description | required |
---|---|---|
name | The name associated with the package by default when building it. | yes |
version | The version associated with the package by default when building it. | yes |
entrypoints | The set of possible methods by which a package can be loaded. More information available here. | yes |
license | The relative path to the license file under which the defined package is intended to be released. | no |
metadata | Information about the package contents and the users who have contributed to it. | no |
dependencies | Any explicit dependencies on additional packages. More information available here. | no |
system | Information about the system conditions under which the package was generated. Presently this includes the version of the pakx Python package which was used to create the package. | no |
pipelines | Currently unused, this denotes any Stream Processor pipeline definitions within a package. | no |
databases | Currently unused, this denotes the databases that have been defined within a package. | no |
udfs.names | This denotes the tagged names which are searched when parsing the package text for user-defined functions. | yes |
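The required column of the table can be turned into a quick sanity check. The following is a minimal Python sketch, not part of the packaging API: the helper name `missing_required` is hypothetical, and it assumes that "required" means the key is present and non-empty.

```python
import json

# Top-level sections marked "yes" in the table; udfs.names is nested,
# so it is checked separately below.
REQUIRED_KEYS = ("name", "version", "entrypoints")


def missing_required(manifest: dict) -> list:
    """Return the required manifest sections that are absent or empty.

    Hypothetical helper for illustration only.
    """
    missing = [key for key in REQUIRED_KEYS if not manifest.get(key)]
    if not manifest.get("udfs", {}).get("names"):
        missing.append("udfs.names")
    return missing


manifest = json.loads("""
{
  "name": "qpackage",
  "version": "0.0.1",
  "entrypoints": {"default": "init.q"},
  "udfs": {"names": ["udf"]}
}
""")

print(missing_required(manifest))
```

An empty list indicates that every required section is defined; any other result names the sections that still need content.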
Entrypoints
Entrypoints define the q/Python files which can be used as the initialization script for a package. The entrypoint used when a package is loaded with no specific entrypoint defined is default, which is set to init.q on initialization. You can update this entrypoint to be any file relative to the package root, for example:
"entrypoints": {
"default": "src/init.q"
},
For more advanced usage you can specify multiple entrypoints for a package, allowing sub-sections of a code-base to be loaded independently. This is particularly useful when splitting code by the area of an application in which it is intended to be used. For example, the following defines separate entrypoints for the pipelines, data access processes and aggregators:
"entrypoints": {
"default": "init.q",
"sp": "src/sp.q",
"data-access": "src/da.q",
"aggregator": "src/agg.q"
},
The API which provides you with the ability to load specific entrypoints is defined in q here and in Python here.
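To make the lookup concrete, the following Python sketch shows how a named entrypoint could be resolved to a file relative to the package root. It is an illustration only: `resolve_entrypoint` and the `/pkgs/ml` root are hypothetical and are not the q or Python loading API referenced above.

```python
from pathlib import PurePosixPath


def resolve_entrypoint(manifest: dict, package_root: str, name: str = "default") -> str:
    """Return the file that the named entrypoint points at.

    Hypothetical helper: entrypoint paths are treated as relative to the
    package root, as described in the text.
    """
    entrypoints = manifest.get("entrypoints", {})
    if name not in entrypoints:
        raise KeyError(f"entrypoint {name!r} is not defined in the manifest")
    return str(PurePosixPath(package_root) / entrypoints[name])


manifest = {
    "entrypoints": {
        "default": "init.q",
        "sp": "src/sp.q",
        "data-access": "src/da.q",
        "aggregator": "src/agg.q",
    }
}

print(resolve_entrypoint(manifest, "/pkgs/ml"))
print(resolve_entrypoint(manifest, "/pkgs/ml", "sp"))
```

Requesting a name that is not listed raises a `KeyError`, which mirrors the idea that only entrypoints declared in the manifest can be loaded.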
Warning
The use of Python entrypoints is currently a beta feature and still in active development. It is supported only when using the Python API independently of kdb Insights Enterprise. When developing packages for use within kdb Insights Enterprise, entrypoints must at present be defined with a *.q extension.
Dependencies
The dependencies section of the manifest.json file lists any external packages on which the package being defined explicitly depends. The expected structure for defining dependencies is as follows:
"dependencies": {
"package": {
"location" : "",
"repo": "",
"version": ""
}
}
The keys within this dependency structure relate to the following:
key | description |
---|---|
package | The name of the package to be retrieved as a dependency. |
location | The storage location from which a package is to be retrieved, one of local, github, gitlab or kx-nexus. |
repo | The repository/path location from which the dependency is to be retrieved. |
version | The version of the package dependency-name which is to be retrieved as a dependency. |
For completeness, each location option is outlined separately below, along with the underlying structure of the request made when retrieving the requested dependency.
Required environment variables for the github location:
- GITHUB_TOKEN: this token is required to allow a user to download artifacts from GitHub and can be generated by following the instructions outlined here.
The following is an example request which would download the package test-package-1.0.0.kxi based on a tagged release 1.0.0 of the repository github.com/test_user/test_repo.
"test-package": {
"location": "github",
"repo": "test_user/test_repo",
"version": "1.0.0"
}
The underlying URL against which this request is executed is as follows:
https://github.com/{package.repo}/releases/download/{package.version}/{package.name}-{package.version}.kxi
Required environment variables for the gitlab location:
- GITLAB_TOKEN: this token is required to allow a user to download artifacts from GitLab and can be generated by following the instructions outlined here.
The following is an example request which would download the package test-package-1.0.0.kxi based on a tagged release 1.0.0 of the repository https://gitlab.com/test_user/test_repo.
"test-package": {
"location": "gitlab",
"repo": "test_user/test_repo",
"version": "1.0.0"
}
The underlying URL against which this request is executed is as follows:
https://gitlab.com/api/v4/projects/{package.repo}/packages/generic/{package.name}/{package.version}/{package.name}-{package.version}.kxi
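As an illustration of how the template above expands, the following Python sketch fills it in for the example dependency. The function name `gitlab_package_url` and the `encode_repo` flag are hypothetical. GitLab's REST API generally expects a namespaced project path to be URL-encoded, but whether the packaging tooling applies that encoding is not stated here, so the flag is off by default and the template is filled in exactly as documented.

```python
from urllib.parse import quote

# Template copied from the text, with {package.x} shortened to {x}.
GITLAB_TEMPLATE = (
    "https://gitlab.com/api/v4/projects/{repo}/packages/generic/"
    "{name}/{version}/{name}-{version}.kxi"
)


def gitlab_package_url(name: str, repo: str, version: str, encode_repo: bool = False) -> str:
    """Expand the documented URL template for one dependency entry.

    Hypothetical helper; encode_repo is an assumption, not documented
    behavior of the packaging tooling.
    """
    if encode_repo:
        # Percent-encode "/" so the project path is a single URL segment.
        repo = quote(repo, safe="")
    return GITLAB_TEMPLATE.format(repo=repo, name=name, version=version)


print(gitlab_package_url("test-package", "test_user/test_repo", "1.0.0"))
print(gitlab_package_url("test-package", "test_user/test_repo", "1.0.0", encode_repo=True))
```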
Required environment variables for the kx-nexus location:
- KX_NEXUS_USER: the username associated with a user's access to the KX External Nexus.
- KX_NEXUS_PASS: the password associated with a user's access to the KX External Nexus.
The following is an example request which would download the package test-package-1.0.0.kxi based on a tagged release 1.0.0 stored at the location test_root/test_package within the packages store for the KX Nexus.
"test-package": {
"location": "kx-nexus",
"repo": "test_user/test_repo",
"version": "1.0.0"
}
Additional dependency options can be configured within the ~/.insights/pakx-config configuration file, which allows for modifications to the retrieval locations and the addition of sources.
Adding Local dependencies
When adding local dependencies there are a few constraints:
- Only kxi files can be added as local dependencies.
- These can be referenced using path, which should be the absolute filepath.
The following is an example request which would find version 1.0.0 of test-package, stored as test-package-1.0.0.kxi at the location path/to/ on the local host.
"test-package": {
"path": "path/to/test-package-1.0.0.kxi"
}
Note
The version in the manifest.json takes precedence over the version in the filepath.
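The two constraints on local dependencies can be checked before a package is built. The following Python sketch is illustrative only: `check_local_dependency` is a hypothetical helper, and the `/opt/packages` location is an assumed absolute path chosen for the example.

```python
from pathlib import PurePosixPath


def check_local_dependency(entry: dict) -> list:
    """Return a list of problems found in a local dependency entry.

    Hypothetical helper covering the two constraints listed above.
    """
    path = entry.get("path")
    if not path:
        return ["missing 'path' key"]
    problems = []
    candidate = PurePosixPath(path)
    if candidate.suffix != ".kxi":
        problems.append("only kxi files can be added as local dependencies")
    if not candidate.is_absolute():
        problems.append("path should be the absolute filepath")
    return problems


dependency = {"path": "/opt/packages/test-package-1.0.0.kxi"}
print(check_local_dependency(dependency))
```

The check deliberately ignores the version embedded in the file name, since the version in the manifest.json is the one that takes precedence.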
User-Defined Functions
User-Defined Functions (UDFs) are functions written in Python or q which have special meaning within kdb Insights Enterprise. They are used in the deployment of named functions from a package to a pipeline. The addition of UDFs is motivated by the need for many users to define analytics in a streaming context while abstracting the underlying implementation logic and language used to define the UDF. This can be particularly useful in organizations with limited numbers of either q or Python developers who wish to make the most of their development resources by allowing experts in these languages to define functionality that can be used by users of the other.
Within kdb Insights Enterprise, UDFs are presently supported for use within a pipeline as the input to any of the function nodes (map, filter, merge, split, etc.), allowing a user to specify persisted custom logic to be associated with a pipeline.
Defining a UDF
You can define UDFs within packages through the use of comments in q and decorators in Python. These constructs provide an association between the configuration of a UDF and the function linked with the UDF. In each case the following general construct is used:
// @udf.*
from kxi.packages.decorators import udf
@udf.*
Where in each case * within the definition @udf.* can be one of the following:
value | description | required | default |
---|---|---|---|
name | The name by which the underlying UDF will be associated when referenced by Insights APIs. | yes | N/A |
description | A user supplied description allowing a user to discern the motivation for the UDF. | no | "" |
tag | A user specified tag outlining where in an Insights deployment the UDF is to be used. This information is not actioned but defined to allow segmentation of user code. | no | "" |
category | A user specified category/list of categories which can be used to define within a tagged section of the Insights deployment where the UDF is to be deployed, for example @udf.category(["map", "filter"]) to define usage within a map and filter node of a Pipeline. | no | "" |
The following provides examples of a number of fully defined UDFs within each language:
q
// @udf.name("custom_map")
// @udf.description("Custom map function providing filtering against incoming data for a specified column and maximum threshold.")
// @udf.tag("sp")
// @udf.category("map")
.test.my_custom_udf:{[table;params]
  ?[table;enlist(>;params`column;params`threshold);0b;()]
  }

// @udf.name("custom_map")
.test.my_custom_udf:{[table;params]
  ?[table;enlist(>;params`column;params`threshold);0b;()]
  }
Python
import numpy as np
from kxi.packages.decorators import udf

@udf.name('custom_py_map')
@udf.description('Custom Python UDF making use of numpy')
@udf.tag('sp')
@udf.category('map')
def py_udf(table, params):
    mod_column = table[params['column']]
    # Multiply the content of the column to be modified by random values between 0 and 1
    table[params['column']] = mod_column * np.random.random_sample(len(mod_column))
    return table

import numpy as np
from kxi.packages.decorators import udf

@udf.name('custom_py_map')
def py_udf(table, params):
    mod_column = table[params['column']]
    # Multiply the content of the column to be modified by random values between 0 and 1
    table[params['column']] = mod_column * np.random.random_sample(len(mod_column))
    return table
Usage
As noted above, UDFs can presently be used within a Stream Processor pipeline. This is supported within kdb Insights Enterprise either in the drag-and-drop pipeline UI or via the definition of pipelines in the Query window.
Within the context of the pipeline, UDFs are retrieved using the .qsp.udf and qsp.udf functions in q and Python respectively.
For examples of their usage see the kdb Insights Enterprise quickstart guide here.
Constraints
The definition of your UDFs comes with the following constraints:
- A UDF must take between two and eight parameters.
- The final parameter in the UDF is a reserved parameter used to modify the UDF behavior for execution, so the maximum number of non-reserved user parameters is seven. When loading a UDF within a pipeline, this parameter is auto-populated as an empty dictionary unless otherwise specified.
- If defined in q, the function to be registered as a UDF must appear directly beneath the comment block with which it is associated and must be assigned using its full namespace, as the following correct and incorrect definitions illustrate:
\d .test
pi:3.14
square:{x wsum x}
// @udf.name("test")
// @udf.description("This is correct as UDF will be resolved in correct namespace")
.test.user_defined_function:{[data;params]pi*square data}
\d .test
pi:3.14
square:{x wsum x}
// @udf.name("test")
// @udf.description("This is incorrect as UDF will not resolve .test namespace")
user_defined_func:{[data;params]pi*square data}
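For Python UDFs, the parameter-count constraint above can be checked with the standard library before a function is packaged. This is a minimal sketch rather than a check performed by the packaging tooling: `check_udf_signature` is a hypothetical helper, and it counts every declared parameter, including the reserved final one.

```python
import inspect


def check_udf_signature(func) -> int:
    """Return the parameter count if it lies between two and eight.

    Hypothetical helper; raises ValueError when the constraint is broken.
    """
    count = len(inspect.signature(func).parameters)
    if not 2 <= count <= 8:
        raise ValueError(
            f"{func.__name__} takes {count} parameter(s); a UDF must take between 2 and 8"
        )
    return count


def py_udf(table, params):
    # The final parameter is the reserved one described above.
    return table


print(check_udf_signature(py_udf))
```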
Custom UDF definitions
In the above examples, all UDFs, in both q and Python, have been defined using the syntax @udf. udf is the default keyword used to define UDFs; however, it is possible to define UDFs using a custom keyword, for example @myudf.
Here are some examples:
// @myudf.name("custom_map")
// @myudf.description("Custom map function providing filtering against incoming data for a specified column and maximum threshold.")
// @myudf.tag("sp")
// @myudf.category("map")
.test.my_custom_udf:{[table;params]
  ?[table;enlist(>;params`column;params`threshold);0b;()]
  }

import numpy as np
from kxi.packages.decorators import udf as myudf

@myudf.name('custom_py_map')
@myudf.description('Custom Python UDF making use of numpy')
@myudf.tag('sp')
@myudf.category('map')
def py_udf(table, params):
    mod_column = table[params['column']]
    # Multiply the content of the column to be modified by random values between 0 and 1
    table[params['column']] = mod_column * np.random.random_sample(len(mod_column))
    return table
In order to list and load UDFs defined using custom keywords, a udf_sym (or list of such symbols) needs to be passed to the listing functions alongside the path. Further details are described in the API sections for Python and q.
Note
All keywords used to define UDFs within a package must be added to the udfs section in the package's manifest file. This is important for deployment, as any UDFs defined using keywords that are not listed in the manifest file are not retrievable.
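To show why the keyword list matters, the following Python sketch scans q source text for annotation comments and only recognizes those whose keyword is in a supplied list. It is an assumption about how such a search could work, not the implementation used by the packaging tooling: `find_udfs`, both regular expressions and the sample source are hypothetical, and annotation values are kept as raw text.

```python
import re

# Matches comment lines such as: // @myudf.name("custom_map")
ANNOTATION = re.compile(r'^//\s*@(\w+)\.(\w+)\((.*)\)\s*$')
# Matches the start of a q assignment such as: .test.my_custom_udf:{...}
DEFINITION = re.compile(r'^([.\w]+)\s*:')


def find_udfs(source: str, keywords=("udf",)) -> list:
    """Collect annotated functions whose keyword appears in `keywords`.

    Hypothetical helper: annotations must sit directly above the
    definition they describe.
    """
    found, pending = [], {}
    for line in source.splitlines():
        line = line.strip()
        match = ANNOTATION.match(line)
        if match and match.group(1) in keywords:
            pending[match.group(2)] = match.group(3).strip().strip("\"'")
            continue
        definition = DEFINITION.match(line)
        if pending and definition:
            found.append({"function": definition.group(1), **pending})
        # Any other line ends the current annotation block.
        pending = {}
    return found


source = """
// @myudf.name("custom_map")
// @myudf.tag("sp")
.test.my_custom_udf:{[table;params] table}
// @udf.name("other")
.test.other:{[table;params] table}
"""

print(find_udfs(source, keywords=("myudf",)))
print(find_udfs(source, keywords=("udf", "myudf")))
```

With only `myudf` supplied, the function annotated with `@udf` is skipped, which mirrors the note above: a UDF whose keyword is not listed is not found.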