PyKX introduction notebook

The purpose of this notebook is to introduce you to PyKX capabilities and functionality.

For the best experience, read the "What is PyKX" page and the quickstart guide first.

To follow along, we recommend downloading the notebook.

Now let's go through the following sections:

1. Import PyKX
2. Basic PyKX data structures
3. Access and create PyKX objects
4. Run analytics on PyKX objects

1. Import PyKX

To access PyKX and its functions, import it in your Python code as follows:

In [2]:

import pykx as kx
kx.q.system.console_size = [10, 80]

Tip: We recommend always using import pykx as kx. The shortened import name kx makes the code more readable and is standard for the PyKX library.

Below we load additional libraries used throughout this notebook:

In [3]:

import numpy as np
import pandas as pd

2. Basic PyKX data structures

Central to your interaction with PyKX are the data types supported by the library. PyKX is built atop the q programming language, which provides small-footprint data structures for analytic calculations and for building high-performance databases. The types shown below are generated from their Python equivalents.

This section describes the basic elements in the PyKX library and explains why/how they are different:

2.1 Atom
2.2 Vector
2.3 List
2.4 Dictionary
2.5 Table
2.6 Other data types

2.1 Atom

In PyKX, an atom is a single irreducible value of a specific type. For example, you may come across pykx.FloatAtom or pykx.DateAtom objects, which can be generated from an equivalent Python representation as follows:

In [4]:

kx.FloatAtom(1.0)

Out[4]:

pykx.FloatAtom(pykx.q('1f'))

In [5]:

from datetime import date
kx.DateAtom(date(2020, 1, 1))

Out[5]:

pykx.DateAtom(pykx.q('2020.01.01'))

2.2 Vector

Like PyKX atoms, PyKX Vectors hold values of a single type, but a vector contains multiple elements. These objects, along with the lists described below, form the basis for most of the other important data structures that you will encounter, including dictionaries and tables.

Vector objects offer significant performance benefits over Python lists when applying analytics. Like NumPy, PyKX gains from the underlying speed of its analytic engine when operating on these strictly typed objects.

Vector objects are always 1-D and can be indexed along a single axis.

In the following example, we create PyKX vectors from equivalent NumPy and pandas objects:

In [6]:

kx.IntVector(np.array([1, 2, 3, 4], dtype=np.int32))

Out[6]:

pykx.IntVector(pykx.q('1 2 3 4i'))

In [7]:

kx.toq(pd.Series([1, 2, 3, 4]))

Out[7]:

pykx.LongVector(pykx.q('1 2 3 4'))

2.3 List

A PyKX List is an untyped vector object. Unlike vectors, which are optimised for the performance of analytics, lists are mostly used for storing reference information or matrix data.

Unlike vector objects, which are 1-D in shape, lists can be ragged N-dimensional objects. This makes them useful for storing complex data structures, but limits their performance in data-access and data-modification tasks.

In [8]:

kx.List([[1, 2, 3], [1.0, 1.1, 1.2], ['a', 'b', 'c']])

Out[8]:

pykx.List(pykx.q('
1 2   3  
1 1.1 1.2
a b   c  
'))
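As a rough analogy for the difference between a typed vector and an untyped list, the sketch below uses only Python's standard library. It does not call PyKX: array.array stands in for a strictly typed vector and a plain Python list stands in for a PyKX List. Both stand-ins are assumptions chosen for illustration.

```python
from array import array

# A typed container: every element must be a signed int (type code 'i').
typed = array('i', [1, 2, 3, 4])

# A plain list accepts mixed, ragged content, much like an untyped list.
untyped = [[1, 2, 3], [1.0, 1.1, 1.2], ['a', 'b', 'c']]

try:
    typed.append('a')      # rejected: wrong element type
    append_failed = False
except TypeError:
    append_failed = True

print(typed.typecode, append_failed)   # i True
```

The typed container can reject bad input early and store its values compactly, which is the trade-off the vector and list descriptions above refer to.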

2.4 Dictionary

A PyKX Dictionary is a mapping from keys to values. The list of keys and the list of values must have the same count. Although you can think of a dictionary as a set of key-value pairs, it is physically stored as a pair of lists.

In [9]:

kx.Dictionary({'x': [1, 2, 3], 'x1': np.array([1, 2, 3])})

Out[9]:



x	1 2 3
x1	1 2 3
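The "pair of lists" storage model can be pictured with a few lines of plain Python. This is only an illustrative sketch, not PyKX's implementation, and the lookup helper is a hypothetical name.

```python
keys = ['x', 'x1']
values = [[1, 2, 3], [1, 2, 3]]

# Keys and values must have the same count.
assert len(keys) == len(values)

def lookup(key):
    """Find the position of key in the key list, then index the value list."""
    return values[keys.index(key)]

# Zipping the two lists gives the familiar key-value view.
as_dict = dict(zip(keys, values))

print(lookup('x1'))   # [1, 2, 3]
```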

2.5 Table

PyKX Tables are a first-class typed entity which lives in memory. They're a collection of named columns implemented as a dictionary. This mapping construct means that PyKX tables are column oriented. This makes analytic operations on columns much faster than for a relational database equivalent.
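To picture what "column oriented" means, here is a minimal plain-Python sketch that stores a table as a dictionary of equal-length columns. It is an analogy only and does not reflect how PyKX lays out memory.

```python
# Column-oriented: one list per column, all with the same length.
columns = {
    'col1': [1, 2, 3],
    'col2': [2, 3, 4],
    'col3': ['a', 'b', 'c'],
}

# A column aggregate reads a single list.
col2_total = sum(columns['col2'])

# Rebuilding rows requires visiting every column.
rows = list(zip(*columns.values()))

print(col2_total)   # 9
print(rows[0])      # (1, 2, 'a')
```

Aggregating a column touches only that column's list, while assembling a row touches all of them, which is why column analytics benefit from this layout.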

PyKX Tables come in many forms, but the key table types are as follows:

pykx.Table
pykx.KeyedTable
pykx.SplayedTable
pykx.PartitionedTable

In this section we exemplify the first two, which are the in-memory data table types.

pykx.Table

In [10]:

print(kx.Table([[1, 2, 'a'], [2, 3, 'b'], [3, 4, 'c']], columns = ['col1', 'col2', 'col3']))

col1 col2 col3
--------------
1    2    a   
2    3    b   
3    4    c

In [11]:

print(kx.Table(data = {'col1': [1, 2, 3], 'col2': [2 , 3, 4], 'col3': ['a', 'b', 'c']}))

col1 col2 col3
--------------
1    2    a   
2    3    b   
3    4    c

Without print, the same tables render in tabular form:

In [12]:

kx.Table([[1, 2, 'a'], [2, 3, 'b'], [3, 4, 'c']],
         columns = ['col1', 'col2', 'col3'])

Out[12]:

	col1	col2	col3

0	1	2	a
1	2	3	b
2	3	4	c

In [13]:

kx.Table(data = {
         'col1': [1, 2, 3],
         'col2': [2 , 3, 4],
         'col3': ['a', 'b', 'c']})

Out[13]:

	col1	col2	col3

0	1	2	a
1	2	3	b
2	3	4	c

pykx.KeyedTable

In [14]:

kx.Table(data = {'x': [1, 2, 3], 'x1': [2, 3, 4], 'x2': ['a', 'b', 'c']}
         ).set_index(['x'])

Out[14]:

	x1	x2
x
1	2	a
2	3	b
3	4	c

2.6 Other data types

Below we outline some of the important PyKX data types that you will run into throughout the rest of this notebook.

pykx.Lambda

A pykx.Lambda is the most basic kind of function within PyKX. Lambdas take between 0 and 8 parameters and are the building blocks for most analytics that users write when interacting with data from PyKX.

In [15]:

pykx_lambda = kx.q('{x+y}')
type(pykx_lambda)

Out[15]:

pykx.wrappers.Lambda

In [16]:

pykx_lambda(1, 2)

Out[16]:

pykx.LongAtom(pykx.q('3'))

pykx.Projection

Like functools.partial, functions in PyKX can have some of their parameters set in advance, resulting in a new function, which is called a projection. When you call this projection, the set parameters are no longer required and cannot be provided.

If the original function has n parameters and m of them are provided, the result is a projection that requires the remaining n-m parameters.

In [17]:

projection = kx.q('{x+y}')(1)
projection

Out[17]:

pykx.Projection(pykx.q('{x+y}[1]'))

In [18]:

projection(2)

Out[18]:

pykx.LongAtom(pykx.q('3'))
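Since the text compares projections with functools.partial, the same example can be sketched in pure Python. The names add and add_one are hypothetical and chosen for illustration; the snippet does not use PyKX.

```python
from functools import partial

def add(x, y):
    return x + y

# Fix x=1 in advance; of the n=2 parameters, m=1 is set, so 1 remains.
add_one = partial(add, 1)

print(add_one(2))   # 3
```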

3. Access and create PyKX objects

Now that you're familiar with the PyKX object types, let's see how they work in real-world scenarios, such as:

3.1 Create PyKX objects from Pythonic data types
3.2 Random data generation
3.3 Run q code to generate data
3.4 Read data from a CSV file
3.5 Query external processes via IPC

3.1 Create PyKX objects from Pythonic data types

One of the most common ways to generate PyKX data is by converting equivalent Pythonic data types. PyKX natively supports conversions to and from the following common Python data formats:

Python
Numpy
Pandas
PyArrow

You can generate PyKX objects by using the kx.toq PyKX function:

In [19]:

pydict = {'a': [1, 2, 3], 'b': ['a', 'b', 'c'], 'c': 2}
kx.toq(pydict)

Out[19]:



a	1 2 3
b	`a`b`c
c	2

In [20]:

nparray = np.array([1, 2, 3, 4], dtype = np.int32)
kx.toq(nparray)

Out[20]:

pykx.IntVector(pykx.q('1 2 3 4i'))

In [21]:

pdframe = pd.DataFrame(data = {'a':[1, 2, 3], 'b': ['a', 'b', 'c']})
kx.toq(pdframe)

Out[21]:

	a	b

0	1	a
1	2	b
2	3	c

3.2 Random data generation

PyKX provides a module to create random data of user-specified PyKX types or their equivalent Python types. The creation of random data helps in prototyping analytics.

As a first example, generate a vector of 1,000,000 random floating-point values between 0 and 1 as follows:

In [22]:

kx.random.random(1000000, 1.0)

Out[22]:

pykx.FloatVector(pykx.q('0.3927524 0.5170911 0.5159796 0.4066642 0.1780839 0.3017723 0.785033 0.534709..'))

If you wish to choose values randomly from a list, use the list as the second argument to your function:

In [23]:

kx.random.random(5, [kx.LongAtom(1), ['a', 'b', 'c'], np.array([1.1, 1.2, 1.3])])

Out[23]:

pykx.List(pykx.q('
1.1 1.2 1.3
1
1.1 1.2 1.3
1
`a`b`c
'))

Random data does not only come in 1-Dimensional forms. To create multi-Dimensional PyKX Lists, turn the first argument into a list. The following examples include a PyKX trick that uses nulls/infinities to generate random data across the full allowable range:

In [24]:

kx.random.random([2, 5], kx.GUIDAtom.null)

Out[24]:

pykx.List(pykx.q('
9b19ab9c-b26d-d6b3-a8fa-267ba0620848 d8d6c050-964e-6247-e2cd-bf9435389b9a 1c4..
a68f5b00-754e-9863-04aa-8b59cc4e3122 72969cc8-4445-451b-9266-7770a60c3120 0c7..
'))

In [25]:

kx.random.random([2, 3, 4], kx.IntAtom.inf)

Out[25]:

pykx.List(pykx.q('
1837510540 373968399  35818431  1421474592  424239201  1727064393 250148680 1..
1566069007 1773121422 2104411811 1441846567 103906494  315107819  931560883  ..
'))

Finally, to make the generated objects reproducible, set the seed for the random data generation explicitly. You can do this globally or for individual function calls:

In [26]:

kx.random.seed(10)
kx.random.random(10, 2.0)

Out[26]:

pykx.FloatVector(pykx.q('0.1782082 1.669039 0.7243899 1.999868 0.7675971 1.723838 0.1836728 0.5061767 ..'))

In [27]:

kx.random.random(10, 2.0, seed = 10)

Out[27]:

pykx.FloatVector(pykx.q('0.1782082 1.669039 0.7243899 1.999868 0.7675971 1.723838 0.1836728 0.5061767 ..'))
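The two outputs above are identical because both calls start from the same seed. The same principle can be sketched with Python's standard random module. This is an analogy only: the draw helper is hypothetical, and the values it produces will not match those generated by PyKX.

```python
import random

def draw(n, seed):
    # A dedicated generator keeps the seed local to this call,
    # leaving the global random state untouched.
    rng = random.Random(seed)
    return [rng.uniform(0.0, 2.0) for _ in range(n)]

first = draw(10, seed=10)
second = draw(10, seed=10)

print(first == second)   # True
```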

3.3 Run q code to generate data

PyKX is an entry point to the vector programming language q. This means that PyKX users can execute q code directly via PyKX within a Python session, by calling kx.q.

For example, to create q data, run the following command:

In [28]:

kx.q('0 1 2 3 4')

Out[28]:

pykx.LongVector(pykx.q('0 1 2 3 4'))

In [29]:

kx.q('([idx:desc til 5]col1:til 5;col2:5?1f;col3:5?`2)')

Out[29]:

	col1	col2	col3
idx
4	0	0.8619188	ol
3	1	0.09183638	mg
2	2	0.2530883	cm
1	3	0.2504566	cc
0	4	0.7517286	jg

Next, apply arguments to a user-specified function x+y:

In [30]:

kx.q('{x+y}', kx.LongAtom(1), kx.LongAtom(2))

Out[30]:

pykx.LongAtom(pykx.q('3'))

3.4 Read data from a CSV file

A lot of data that you run into for data analysis tasks comes in the form of CSV files. PyKX, like Pandas, provides a CSV reader called via kx.q.read.csv. In the next cell we create a CSV that can be read in PyKX:

In [31]:

import csv

with open('pykx.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    field = ["name", "age", "height", "country"]
    
    writer.writerow(field)
    writer.writerow(["Oladele Damilola", "40", "180.0", "Nigeria"])
    writer.writerow(["Alina Hricko", "23", "179.2", "Ukraine"])
    writer.writerow(["Isabel Walter", "50", "179.5", "United Kingdom"])

In [32]:

kx.q.read.csv('pykx.csv', types = {'age': kx.LongAtom, 'country': kx.SymbolAtom})

Out[32]:

	name	age	height	country

0	"Oladele Damilola"	40	180e	Nigeria
1	"Alina Hricko"	23	179.2e	Ukraine
2	"Isabel Walter"	50	179.5e	United Kingdom
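The types argument above assigns a type to selected columns by name. As a loose analogy using only the standard library, the sketch below applies per-column converters while reading CSV text. The converters mapping is a hypothetical construct for illustration and is not part of the PyKX API.

```python
import csv
import io

raw = io.StringIO(
    "name,age,height,country\n"
    "Oladele Damilola,40,180.0,Nigeria\n"
    "Alina Hricko,23,179.2,Ukraine\n"
)

# Hypothetical per-column converters; unlisted columns stay as strings.
converters = {'age': int, 'height': float}

rows = [
    {col: converters.get(col, str)(val) for col, val in record.items()}
    for record in csv.DictReader(raw)
]

print(rows[0]['age'] + 1)   # 41
```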

In [33]:

import os
os.remove('pykx.csv')

3.5 Query external processes via IPC

One of the most common usage patterns in organizations with access to data in kdb+/q is to query data from an external server process. For the example below you need to install q.

First, start a q/kdb+ server on port 5000. A later cell populates it with a table named tab:

In [34]:

import subprocess
import time

try:
    with kx.PyKXReimport():
        proc = subprocess.Popen(
            ('q', '-p', '5000')
        )
    time.sleep(2)  # give the server time to start listening
except Exception as err:
    raise kx.QError('Unable to create q process on port 5000') from err

Once a q process is available, connect to it for synchronous query execution:

In [35]:

conn = kx.SyncQConnection(port = 5000)

You can now run q commands against the q server:

In [36]:

conn('tab:([]col1:100?`a`b`c;col2:100?1f;col3:100?0Ng)')
conn('select from tab where col1=`a')

Out[36]:

	col1	col2	col3

0	a	0.01974141	ddb87915-b672-2c32-a6cf-296061671e9d
1	a	0.5611439	580d8c87-e557-0db1-3a19-cb3a44d623b1
2	a	0.8685452	2d948578-e9d6-79a2-8207-9df7a71f0b3b
3	a	0.3460797	cddeceef-9ee9-3847-9172-3e3d7ab39b26
4	a	0.5046331	1c22a468-9492-2173-9e4f-9003a23d02b7
5	a	0.765905	5e9cd21b-88c5-bbf5-7215-6409e115a2a4
6	a	0.8794685	3462beab-42ee-ccad-989b-8d69f070dffc
7	a	0.02487862	bc150163-c551-0eba-8871-9767f5c0e3d5
8	a	0.3664924	dd6b4a2b-c046-e464-a0b9-efb96ed5f0eb
...	...	...	...
36	a	0.9929108	03a9b290-95c8-c3b8-fb9a-9ac9874763b8

37 rows × 3 columns

Alternatively, use the PyKX query API:

In [37]:

conn.qsql.select('tab', where=['col1=`a', 'col2<0.3'])

Out[37]:

	col1	col2	col3

0	a	0.01974141	ddb87915-b672-2c32-a6cf-296061671e9d
1	a	0.02487862	bc150163-c551-0eba-8871-9767f5c0e3d5
2	a	0.2073435	ee853957-d502-d30d-5945-bf8c97022332
3	a	0.2188574	d9a3e171-b1cf-0271-507a-0fba0b52e6ff
4	a	0.1451855	ea4d0269-375c-d73b-96f0-6bb6334ca423
5	a	0.1497004	1cce6bdd-e34b-ba4f-8c01-31d098d81221
6	a	0.166486	6417d4b3-3fc6-e35a-1c34-8c5c3327b1e8
7	a	0.2643322	f294c3cb-a6da-e15d-c8e0-3a848d2abf10
8	a	0.07841939	020715aa-8ffa-e1d3-9c68-3ad7919d4f5e
9	a	0.08077328	65b2f5b0-918c-b87b-4fc4-4aa24b192476

Or use PyKX's context interface to run SQL server side if you have access to it:

In [38]:

conn(r'\l s.k_')  # raw string keeps the backslash in the q load command
conn.sql('SELECT * FROM tab where col2>=0.5')

Out[38]:

	col1	col2	col3

0	a	0.5611439	580d8c87-e557-0db1-3a19-cb3a44d623b1
1	a	0.8685452	2d948578-e9d6-79a2-8207-9df7a71f0b3b
2	b	0.7716917	52cb20d9-f12c-9963-2829-3c64d8d8cb14
3	a	0.5046331	1c22a468-9492-2173-9e4f-9003a23d02b7
4	c	0.6014692	7ea4d431-4dec-3017-3d13-cc9ef7f1c0ee
5	c	0.5000071	782c5346-f5f7-b90e-c686-8d41fa85233b
6	c	0.8392881	245f5516-0cb8-391a-e1e5-fadddc8e54ba
7	b	0.5938637	e30bab29-2df0-3fb0-535f-58d1e7bd83c0
8	a	0.765905	5e9cd21b-88c5-bbf5-7215-6409e115a2a4
...	...	...	...
55	b	0.8236115	f2c41bca-67df-aa6c-4730-bca38cbd6825

56 rows × 3 columns

Finally, shut down the q server used for this demonstration:

In [39]:

proc.kill()

4. Run analytics on PyKX objects

Like many Python libraries (including NumPy and pandas), PyKX provides several ways to run analytics on its data, using both functions built into the library and functions you define yourself. Let's explore the following:

4.1 Use in-built methods on PyKX Vectors
4.2 Use in-built methods on PyKX Tables
4.3 Use PyKX/q native functions

4.1 Use in-built methods on PyKX Vectors

When you interact with PyKX Vectors, you may wish to gain insights into these objects through the application of basic analytics such as calculation of the mean/median/mode of the vector:

In [40]:

q_vector = kx.random.random(1000, 10.0)

In [41]:

q_vector.mean()

Out[41]:

pykx.FloatAtom(pykx.q('4.984157'))

In [42]:

q_vector.max()

Out[42]:

pykx.FloatAtom(pykx.q('9.998212'))

The above is useful for basic analysis. For bespoke analytics on these vectors, use the apply method:

In [43]:

def bespoke_function(x, y):
    return x*y

q_vector.apply(bespoke_function, 5)

Out[43]:

pykx.FloatVector(pykx.q('31.74132 38.3376 46.40922 10.17963 38.73944 48.33864 41.12562 45.44382 32.290..'))

4.2 Use in-built methods on PyKX Tables

In addition to the vector processing capabilities of PyKX, it's important to be able to manage tables. The methods of the Pandas-like API, described in depth in its documentation, let you apply functions and gain insights into your data in a familiar way.

The example below uses combinations of the most used elements of this Table API operating on the following table:

In [44]:

N = 1000000
example_table = kx.Table(data = {
    'sym' : kx.random.random(N, ['a', 'b', 'c']),
    'col1' : kx.random.random(N, 10.0),
    'col2' : kx.random.random(N, 20)
    }
)
example_table

Out[44]:

	sym	col1	col2

0	b	7.782944	6
1	c	0.5899977	17
2	c	2.580528	8
3	b	5.651351	10
4	b	2.336329	11
5	b	2.87167	17
6	c	9.705893	9
7	a	5.729889	8
8	c	1.482026	14
...	...	...	...
999999	c	8.862285	6

1,000,000 rows × 3 columns

You can search for and filter data within your tables using loc similarly to how this is achieved by Pandas:

In [45]:

example_table.loc[example_table['sym'] == 'a']

Out[45]:

	sym	col1	col2

0	a	5.729889	8
1	a	4.396508	13
2	a	0.7636906	19
3	a	9.904306	17
4	a	1.439738	10
5	a	2.898631	19
6	a	2.360396	2
7	a	1.932728	12
8	a	4.877998	4
...	...	...	...
332823	a	6.653308	18

332,824 rows × 3 columns

The same filtering works when you index the table directly, which goes through the __getitem__ method:

In [46]:

example_table[example_table['sym'] == 'b']

Out[46]:

	sym	col1	col2

0	b	7.782944	6
1	b	5.651351	10
2	b	2.336329	11
3	b	2.87167	17
4	b	2.917054	2
5	b	7.093562	18
6	b	1.715391	10
7	b	4.231884	0
8	b	4.727296	2
...	...	...	...
333014	b	9.361253	17

333,015 rows × 3 columns

Next, you can set the index columns of a table. In PyKX, this means converting the table from a pykx.Table object to a pykx.KeyedTable object:

In [47]:

example_table.set_index('sym')

Out[47]:

	col1	col2
sym
b	7.782944	6
c	0.5899977	17
c	2.580528	8
b	5.651351	10
b	2.336329	11
b	2.87167	17
c	9.705893	9
a	5.729889	8
c	1.482026	14
...	...	...
c	8.862285	6

1,000,000 rows × 3 columns

Or you can apply basic data manipulation operations such as mean and median:

In [48]:

print('mean:')
display(example_table.mean(numeric_only = True))

print('median:')
display(example_table.median(numeric_only = True))

mean:



col1	4.998412
col2	9.497452

median:



col1	4.996685
col2	9f

Next, use the groupby method to group PyKX tabular data for analysis.

In the first example, group the dataset by the sym column and calculate the mean of each remaining column within each sym group:

In [49]:

example_table.groupby('sym').mean()

Out[49]:

	col1	col2
sym
a	5.00519	9.49375
b	5.000742	9.501077
c	4.989338	9.497527
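What the grouped mean computes can be spelled out in plain Python on a tiny made-up dataset. The values below are hypothetical and the snippet does not use PyKX; it only shows the split-then-aggregate logic.

```python
from collections import defaultdict

# Small hypothetical columns standing in for sym and col1.
sym = ['a', 'b', 'a', 'c', 'b']
col1 = [1.0, 2.0, 3.0, 4.0, 6.0]

# Split: collect the values that belong to each key.
groups = defaultdict(list)
for key, value in zip(sym, col1):
    groups[key].append(value)

# Aggregate: one mean per key.
means = {key: sum(vals) / len(vals) for key, vals in sorted(groups.items())}

print(means)   # {'a': 2.0, 'b': 4.0, 'c': 4.0}
```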

To extend the above groupby, consider a more complex example which uses numpy to run calculations on the PyKX data. You will notice later that you can simplify this specific use-case further.

In [50]:

def apply_func(x):
    nparray = x.np()
    return np.sqrt(nparray).mean()

example_table.groupby('sym').apply(apply_func)

Out[50]:

	col1	col2
sym
a	2.109397	2.859095
b	2.108571	2.860037
c	2.105694	2.859527

For time-series specific joining of data, use merge_asof joins. In this example, you have two tables with temporal information, namely a trades table and a quotes table:

In [51]:

trades = kx.Table(data={
    "time": [
        pd.Timestamp("2016-05-25 13:30:00.023"),
        pd.Timestamp("2016-05-25 13:30:00.023"),
        pd.Timestamp("2016-05-25 13:30:00.030"),
        pd.Timestamp("2016-05-25 13:30:00.041"),
        pd.Timestamp("2016-05-25 13:30:00.048"),
        pd.Timestamp("2016-05-25 13:30:00.049"),
        pd.Timestamp("2016-05-25 13:30:00.072"),
        pd.Timestamp("2016-05-25 13:30:00.075")
    ],
    "ticker": [
       "GOOG",
       "MSFT",
       "MSFT",
       "MSFT",
       "GOOG",
       "AAPL",
       "GOOG",
       "MSFT"
   ],
   "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
   "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03]
})
quotes = kx.Table(data={
   "time": [
       pd.Timestamp("2016-05-25 13:30:00.023"),
       pd.Timestamp("2016-05-25 13:30:00.038"),
       pd.Timestamp("2016-05-25 13:30:00.048"),
       pd.Timestamp("2016-05-25 13:30:00.048"),
       pd.Timestamp("2016-05-25 13:30:00.048")
   ],
   "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
   "price": [51.95, 51.95, 720.77, 720.92, 98.0],
   "quantity": [75, 155, 100, 100, 100]
})

print('trades:')
display(trades)
print('quotes:')
display(quotes)

trades:

	time	ticker	bid	ask

0	2016.05.25D13:30:00.023000000	GOOG	720.5	720.93
1	2016.05.25D13:30:00.023000000	MSFT	51.95	51.96
2	2016.05.25D13:30:00.030000000	MSFT	51.97	51.98
3	2016.05.25D13:30:00.041000000	MSFT	51.99	52f
4	2016.05.25D13:30:00.048000000	GOOG	720.5	720.93
5	2016.05.25D13:30:00.049000000	AAPL	97.99	98.01
6	2016.05.25D13:30:00.072000000	GOOG	720.5	720.88
7	2016.05.25D13:30:00.075000000	MSFT	52.01	52.03

quotes:

	time	ticker	price	quantity

0	2016.05.25D13:30:00.023000000	MSFT	51.95	75
1	2016.05.25D13:30:00.038000000	MSFT	51.95	155
2	2016.05.25D13:30:00.048000000	GOOG	720.77	100
3	2016.05.25D13:30:00.048000000	GOOG	720.92	100
4	2016.05.25D13:30:00.048000000	AAPL	98f	100

When applying the asof join, you can additionally use named arguments to make a distinction between the tables that the columns originate from. In this case, suffix with _trades and _quotes:

In [52]:

trades.merge_asof(quotes, on='time', suffixes=('_trades', '_quotes'))

Out[52]:

	time	ticker_trades	bid	ask	ticker_quotes	price	quantity

0	2016.05.25D13:30:00.023000000	GOOG	720.5	720.93	MSFT	51.95	75
1	2016.05.25D13:30:00.023000000	MSFT	51.95	51.96	MSFT	51.95	75
2	2016.05.25D13:30:00.030000000	MSFT	51.97	51.98	MSFT	51.95	75
3	2016.05.25D13:30:00.041000000	MSFT	51.99	52f	MSFT	51.95	155
4	2016.05.25D13:30:00.048000000	GOOG	720.5	720.93	AAPL	98f	100
5	2016.05.25D13:30:00.049000000	AAPL	97.99	98.01	AAPL	98f	100
6	2016.05.25D13:30:00.072000000	GOOG	720.5	720.88	AAPL	98f	100
7	2016.05.25D13:30:00.075000000	MSFT	52.01	52.03	AAPL	98f	100
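The matching rule behind the result above can be sketched in plain Python: for each trade, take the last quote whose time is less than or equal to the trade time. The snippet assumes the quote times are sorted and encodes the timestamps as milliseconds after 13:30:00. The helper asof_index is a hypothetical name, and this is not how PyKX implements the join.

```python
from bisect import bisect_right

# Milliseconds after 13:30:00, copied from the trades and quotes tables.
trade_times = [23, 23, 30, 41, 48, 49, 72, 75]
quote_times = [23, 38, 48, 48, 48]            # must be sorted
quote_tickers = ['MSFT', 'MSFT', 'GOOG', 'GOOG', 'AAPL']

def asof_index(sorted_times, t):
    """Index of the last entry with time <= t, or None if there is none."""
    pos = bisect_right(sorted_times, t)
    return pos - 1 if pos else None

matched = []
for t in trade_times:
    i = asof_index(quote_times, t)
    matched.append(None if i is None else quote_tickers[i])

print(matched)
# ['MSFT', 'MSFT', 'MSFT', 'MSFT', 'AAPL', 'AAPL', 'AAPL', 'AAPL']
```

The printed tickers agree with the ticker_quotes column in the joined table, including the trade at .048, which picks up the last of the three quotes sharing that timestamp.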

4.3 Use PyKX/q native functions

While the Pandas-like API and the methods provided on PyKX Vectors are an effective way to apply analytics to PyKX data, the most efficient and performant way to run analytics on your data is to use the PyKX/q primitives available through the kx.q module.

These include functionality for calculating moving averages, asof/window joins, column reversal, and more. Now let's see a few examples of how you can use these functions, grouped into the following sections:

4.3.1 Mathematical functions
4.3.2 Iteration functions
4.3.3 Table functions

4.3.1 Mathematical functions

mavg

Calculate a series of average values across a list using a rolling window:

In [53]:

kx.q.mavg(10, kx.random.random(10000, 2.0))

Out[53]:

pykx.FloatVector(pykx.q('1.469756 1.029263 0.7352848 0.5950915 0.7071875 0.8486546 0.910078 0.95322 1...'))
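A rolling-window average can be written out in a few lines of plain Python. This sketch assumes that the first n-1 results average only the values seen so far, which is consistent with the output above, where the early entries vary more than the later ones. The function moving_average is a hypothetical stand-in and ignores null handling.

```python
def moving_average(n, values):
    """Mean of each value and up to n-1 predecessors (shorter windows at the start)."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

print(moving_average(3, [1, 2, 3, 4, 5]))   # [1.0, 1.5, 2.0, 3.0, 4.0]
```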

cor

Calculate the correlation between two lists:

In [54]:

kx.q.cor([1, 2, 3], [2, 3, 4])

Out[54]:

pykx.FloatAtom(pykx.q('1f'))

In [55]:

kx.q.cor(kx.random.random(100, 1.0), kx.random.random(100, 1.0))

Out[55]:

pykx.FloatAtom(pykx.q('0.02687833'))
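For reference, the Pearson correlation coefficient can be computed by hand as below. This is a plain-Python sketch under the assumption that cor returns the Pearson coefficient, which matches the result of 1 for the two perfectly linear lists above. The function name pearson is hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

print(pearson([1, 2, 3], [2, 3, 4]))   # 1.0
```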

prds

Calculate the cumulative product across a supplied list:

In [56]:

kx.q.prds([1, 2, 3, 4, 5])

Out[56]:

pykx.LongVector(pykx.q('1 2 6 24 120'))
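The same cumulative product can be reproduced with Python's standard library, which may help when checking results. This snippet is an equivalent calculation in plain Python, not a PyKX call.

```python
from itertools import accumulate
from operator import mul

# Each element is the product of all inputs up to and including that position.
running = list(accumulate([1, 2, 3, 4, 5], mul))

print(running)   # [1, 2, 6, 24, 120]
```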

4.3.2 Iteration functions

each

Supplied both as a standalone primitive and as a method on PyKX Lambdas, each lets you pass the individual elements of a PyKX object to a function:

In [57]:

kx.q.each(kx.q('{prd x}'), kx.random.random([5, 5], 10.0, seed=10))

Out[57]:

pykx.FloatVector(pykx.q('1033.597 377.1784 7126.713 418.3232 89.97531'))

In [58]:

kx.q('{prd x}').each(kx.random.random([5, 5], 10.0, seed=10))

Out[58]:

pykx.FloatVector(pykx.q('1033.597 377.1784 7126.713 418.3232 89.97531'))
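In plain Python, the closest analogue is applying a function to each element of an outer list. The sketch below takes the product of every row of a small hypothetical nested list. It mirrors the pattern above without using PyKX.

```python
from math import prod

# A ragged nested list: each inner list is one element of the outer list.
rows = [[1, 2, 3], [4, 5], [6]]

# Apply prod to each inner list in turn.
per_row = [prod(row) for row in rows]

print(per_row)   # [6, 20, 6]
```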

4.3.3 Table functions

meta

Retrieve metadata information about a table:

In [59]:

qtab = kx.Table(data = {
    'x' : kx.random.random(1000, ['a', 'b', 'c']).grouped(),
    'y' : kx.random.random(1000, 1.0),
    'z' : kx.random.random(1000, kx.TimestampAtom.inf)
})

In [60]:

kx.q.meta(qtab)

Out[60]:

	t	a
c
x	"s"	g
y	"f"
z	"p"

xasc

Sort a table in ascending order by the contents of a specified column:

In [61]:

kx.q.xasc('z', qtab)

Out[61]:

	x	y	z

0	c	0.2660419	2000.09.17D00:27:33.222932480
1	b	0.2378591	2001.02.01D19:58:48.496586752
2	c	0.05802967	2001.05.29D15:29:16.181340160
3	c	0.9474748	2003.03.24D08:12:02.975653888
4	b	0.02726729	2004.01.31D07:25:21.959215104
5	b	0.08927731	2004.12.31D23:50:54.425055232
6	c	0.2256163	2005.07.12D10:45:38.423119872
7	b	0.1675316	2006.04.19D21:31:40.507750400
8	b	0.8185412	2006.05.28D15:22:24.331161600
...	...	...	...
999	a	0.4414727	2292.03.15D06:41:24.638662656

1,000 rows × 3 columns
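Sorting whole rows by one column can be pictured with Python's built-in sorted on a list of records. The records below are hypothetical, and the snippet is a plain-Python analogy rather than a description of how xasc is implemented.

```python
records = [
    {'x': 'c', 'z': 3},
    {'x': 'a', 'z': 1},
    {'x': 'b', 'z': 3},
    {'x': 'a', 'z': 2},
]

# Rows move as a unit; Python's sorted keeps tied rows in their original order.
by_z = sorted(records, key=lambda r: r['z'])

print([r['x'] for r in by_z])   # ['a', 'a', 'c', 'b']
```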

You can find the full list of the functions and some examples of their usage here.