Get started with KX for Databricks
This page explains how to deploy and get started with KX for Databricks.
1. Install PyKX
Install PyKX on the cluster using one of the following methods:
Create a new notebook and install using pip:
pip install pykx
Create a new notebook and install using conda:
conda install -c kx pykx
Download the Python Wheel file from the Download Portal and store it on the cluster:
> ls /path/to/folder
pykx-2.5.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Create a new notebook and install from the wheel file:
pip install /path/to/folder/pykx-2.5.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Tip: After installing the package, you may need to restart the Python kernel by running the following command:
dbutils.library.restartPython()
2. Licensing
To use the full PyKX functionality, you need to download and install a kdb Insights license.
- Choose a license type:
  - For personal usage, navigate to Personal License Download and complete the form.
  - For commercial usage, contact your KX sales representative, email sales@kx.com, or apply through Book demo.
  More on the license installation process.
- On receipt of an email from KX, download and save the kc.lic license file to a secure location on your cluster:
  > ls /path/to/folder
  kc.lic
- Set an environment variable QLIC pointing to the folder containing the license file:
  import os
  os.environ['QLIC'] = '/path/to/folder'
That's it! To verify that you successfully installed PyKX, run the following:
import pykx as kx
print('PyKX Version: ', kx.__version__)
You should see the following output:
PyKX Version: 2.5.1
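Beyond printing the version, you can also confirm that PyKX started in licensed mode. A minimal sketch, assuming the kx.licensed flag available in recent PyKX releases:
import pykx as kx

# kx.licensed is True when PyKX found a valid kc.lic via QLIC (assumption: PyKX 2.x)
print('Licensed mode:', kx.licensed)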
Note: The license requires internet connectivity.
Update the workspace security policies to whitelist the following:
- kdb+ on-demand is a CDN-fronted service, so remote servers should use DNS-based filtering rather than fixed IP allowlists.
- The current list of hostnames required for kdb+ on-demand operation is {h,g}.kdbod.{com,net,org}.
3. Load data using Spark DataFrames
a. From the Databricks Unity Catalog:
dfTrades = spark.table('<name of catalog>.<name of schema>.<name of table>')
dfQuotes = spark.table('<name of catalog>.<name of schema>.<name of table>')
b. Directly from a CSV file:
dfTrades = spark.read.format("csv").option("header", "true").option("delimiter", ",").load("/path/to/file")
dfQuotes = spark.read.format("csv").option("header", "true").option("delimiter", ",").load("/path/to/file")
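CSV columns load as strings by default, so you will usually want to cast them to proper types before handing the data to pandas and q. A minimal sketch, assuming hypothetical column names time and price and the spark session predefined in Databricks notebooks:
from pyspark.sql import functions as F

# Cast string columns to usable types before converting downstream
dfTrades = (
    spark.read.format("csv")
    .option("header", "true")
    .load("/path/to/file")
    .withColumn("time", F.col("time").cast("timestamp"))  # 'time' is a hypothetical column
    .withColumn("price", F.col("price").cast("double"))   # 'price' is a hypothetical column
)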
To convert the Spark DataFrame to a pandas DataFrame, use the toPandas() function:
pandas_trade = dfTrades.toPandas()
pandas_quote = dfQuotes.toPandas()
Convert the pandas DataFrame to a q table with the toq() function:
pykx_trades = kx.toq(pandas_trade)
pykx_quotes = kx.toq(pandas_quote)
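To sanity-check the conversion, PyKX objects can be converted back to pandas with the .pd() method; a quick sketch:
print(type(pykx_trades))        # pykx.Table
trades_back = pykx_trades.pd()  # .pd() converts a PyKX object back to a pandas DataFrame
print(trades_back.head())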
4. PyKX Pythonic syntax vs. q magic (%%q)
Reference q functions in PyKX with standard Python syntax:
kx.q('til 10')
This produces a sequence of numbers from 0 to 9:
pykx.LongVector(pykx.q('0 1 2 3 4 5 6 7 8 9'))
Another way to achieve this is to use native q by leveraging the q magic (%%q):
%%q
til 10
You will get the same output:
0 1 2 3 4 5 6 7 8 9
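q functions defined through kx.q can also be called with Python arguments, which PyKX converts to q types automatically. A small sketch:
add = kx.q('{x + y}')  # a q lambda held as a Python callable
result = add(2, 3)     # PyKX converts the Python ints to q longs
print(result)          # pykx.LongAtom(pykx.q('5'))
print(result.py())     # .py() converts back to a native Python object: 5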
Simple load and query example using PyKX
Consider a samples catalog with a finance schema that has a trades table. Load data from the table into a Spark DataFrame dfTrade:
dfTrade = spark.table('samples.finance.trades')
Next, convert the Spark DataFrame to a pandas DataFrame panda_trade using the toPandas() function:
panda_trade = dfTrade.toPandas()
Use the toq() function to convert the pandas DataFrame to a q table pykx_trade:
pykx_trade = kx.toq(panda_trade)
To perform a simple OHLC query on the data, use regular Python syntax:
kx.q("{select Open:first price,High:max price,Low:min price,Close:last price by bucket:0D01 xbar time,sym from x}", pykx_trade).head()
This should produce the following output:
bucket                        sym | Open     High     Low      Close
----------------------------------| -----------------------------------
2017.08.31D00:00:00.000000000 A   | 347.8271 349.256  346.0935 348.2853
2017.08.31D00:00:00.000000000 AAN | 347.4742 349.1463 346.2195 348.3558
2017.08.31D00:00:00.000000000 ACA | 347.4005 348.4313 346.5526 347.2568
2017.08.31D00:00:00.000000000 ACP | 346.8897 350.9313 345.8342 347.8243
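The same aggregation can also be written through PyKX's qsql interface instead of a q lambda; a sketch using the column names from this example:
# Equivalent OHLC aggregation via PyKX's qsql interface
ohlc = kx.q.qsql.select(
    pykx_trade,
    columns={'Open': 'first price', 'High': 'max price',
             'Low': 'min price', 'Close': 'last price'},
    by={'bucket': '0D01 xbar time', 'sym': 'sym'},
)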
You can achieve the same result with q magic by first storing the table in q's memory space:
kx.q['pykx_trade'] = pykx_trade
and performing the same query:
%%q
select Open:first price,High:max price,Low:min price,Close:last price by bucket:0D01 xbar time,sym from pykx_trade
Again, you get the same output:
bucket sym | Open High Low Close
----------------------------------| -----------------------------------
2017.08.31D00:00:00.000000000 A | 347.8271 349.256 346.0935 348.2853
2017.08.31D00:00:00.000000000 AAN | 347.4742 349.1463 346.2195 348.3558
2017.08.31D00:00:00.000000000 ACA | 347.4005 348.4313 346.5526 347.2568
2017.08.31D00:00:00.000000000 ACP | 346.8897 350.9313 345.8342 347.8243
2017.08.31D00:00:00.000000000 ACRE| 349.8609 351.621 345.0562 345.9471
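If you want to persist the result back in Databricks, one option is to convert the q table to pandas and then to a Spark DataFrame. A minimal sketch, assuming you hold the result in a Python variable (e.g. from the kx.q call above) and a placeholder target table:
# Run the query via the Python interface so the result is bound to a variable
ohlc = kx.q('select Open:first price,High:max price,Low:min price,Close:last price by bucket:0D01 xbar time,sym from pykx_trade')

# .pd() yields a pandas DataFrame keyed on bucket/sym; reset_index makes the keys ordinary columns
dfOhlc = spark.createDataFrame(ohlc.pd().reset_index())
dfOhlc.write.mode('overwrite').saveAsTable('<name of catalog>.<name of schema>.<name of table>')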