# PyKX Introduction Notebook

The purpose of this notebook is to provide an introduction to the capabilities and functionality made available to you with PyKX.

To follow along, download this notebook using the following <a href="./PyKX%20Introduction%20Notebook.ipynb" download>link</a>.

This notebook is broken into the following sections:

1. [How to import PyKX](#How-to-import-PyKX)
1. [The basic data structures of PyKX](#The-basic-data-structures-of-PyKX)
1. [Accessing and creating PyKX objects](#Accessing-and-creating-PyKX-objects)
1. [Running analytics on objects in PyKX](#Running-analytics-on-objects-in-PyKX)

## Welcome to PyKX!

PyKX is a Python library built and maintained for interfacing seamlessly with the world's fastest time-series database technology, kdb+, and its underlying vector programming language, q.

Its aim is to provide you, and all Python data engineers and data scientists, with an interface to efficiently apply analytics on large volumes of on-disk and in-memory data, in a fraction of the time of competitor libraries.

## How to import PyKX

To access PyKX and its functions, import it in your Python code as follows:

In [None]:
import os
os.environ['PYKX_Q_LOADED_MARKER'] = '' # Only used here for running Notebook under mkdocs-jupyter during document generation.


In [None]:
import pykx as kx
kx.q.system.console_size = [10, 80]

The shortening of the import name to `kx` is done for readability of code that uses PyKX and is the intended standard for the library. As such we recommend that you always use `import pykx as kx` when using the library.

Below we load additional libraries used throughout this notebook.

In [None]:
import numpy as np
import pandas as pd

## The basic data structures of PyKX

Central to your interaction with PyKX are the various data types supported by the library. Fundamentally, PyKX is built atop a fully featured functional programming language, `q`, which provides small-footprint data structures that can be used in analytic calculations and the creation of highly performant databases. The types shown below are generated from equivalent Python types but, as you will see through this notebook, this is only one of several ways to create them.

In this section we will describe the basic elements which you will come in contact with as you traverse the library and explain why/how they are different.

### PyKX Atomic Types

In PyKX an atom denotes a single irreducible value of a specific type. For example, you may come across `pykx.FloatAtom` or `pykx.DateAtom` objects, which can be generated as follows from an equivalent Pythonic representation.

In [None]:
kx.FloatAtom(1.0)

In [None]:
from datetime import date
kx.DateAtom(date(2020, 1, 1))

### PyKX Vector Types

Similar to atoms, vectors are a data structure composed of a collection of multiple elements of a single specified type. These objects in PyKX along with lists described below form the basis for the majority of the other important data structures that you will encounter including dictionaries and tables.

Typed vector objects provide significant benefits over Python lists, for example, when it comes to applying analytics. Similar to NumPy, PyKX gains from the underlying speed of its analytic engine when operating on these strictly typed objects.

Vector type objects are always 1-D and as such can be indexed along a single axis.

In the following example we are creating PyKX vectors from common Python equivalent `numpy` and `pandas` objects.

In [None]:
kx.IntVector(np.array([1, 2, 3, 4], dtype=np.int32))

In [None]:
kx.toq(pd.Series([1, 2, 3, 4]))

### PyKX Lists

A `List` in PyKX can loosely be described as an untyped vector object. Unlike vectors which are optimised for the performance of analytics, lists are more commonly used for storing reference information or matrix data.

Unlike vector objects which are by definition 1-D in shape, lists can be ragged N-Dimensional objects. This makes them useful for the storage of some complex data structures but limits their performance when dealing with data-access/data modification tasks.

In [None]:
kx.List([[1, 2, 3], [1.0, 1.1, 1.2], ['a', 'b', 'c']])
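
As a loose analogy in plain Python (this is not PyKX itself), the standard library's `array` module illustrates the trade-off described above: a typed container rejects values of another type, while an ordinary list accepts ragged, mixed content. The variable names are illustrative only.

```python
from array import array

# A typed, 1-D container: every element must be a C int
typed = array('i', [1, 2, 3, 4])

try:
    typed.append(1.5)  # a float is rejected by the typed container
    rejected = False
except TypeError:
    rejected = True

# An untyped list can hold ragged, mixed-type content
untyped = [[1, 2, 3], [1.0, 1.1], ['a', 'b', 'c']]
lengths = [len(row) for row in untyped]
```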

### PyKX Dictionaries

A dictionary in PyKX is defined as a mapping between a list of keys and a list of values, which must have the same count. While it can be thought of as a set of key-value pairs, it is physically stored as a pair of lists.

In [None]:
print(kx.Dictionary({'x': [1, 2, 3], 'x1': np.array([1, 2, 3])}))
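
To make the "pair of lists" idea concrete, here is a minimal plain-Python sketch (not how PyKX is implemented internally; the `lookup` helper is hypothetical) in which keys and values live in two parallel lists of equal count.

```python
# Keys and values held as two parallel lists of equal count
keys = ['x', 'x1']
values = [[1, 2, 3], [1, 2, 3]]
assert len(keys) == len(values)

def lookup(key):
    """Return the value stored at the same position as `key`."""
    return values[keys.index(key)]

# Rebuilding an ordinary Python dict from the pair of lists
as_dict = dict(zip(keys, values))
```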

### PyKX Tables

Tables in PyKX are first-class typed entities which live in memory. They can be fundamentally described as a collection of named columns implemented as a dictionary. This mapping construct means that tables in PyKX are column-oriented, which makes analytic operations on specified columns much faster than would be the case for a relational database equivalent.

Tables in PyKX come in many forms but the key table types are as follows:
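
The difference between column and row orientation can be sketched in plain Python. This is an illustration of the layout only, under the assumption that a table is modelled as a dictionary of equal-length lists; it says nothing about PyKX's actual memory representation.

```python
# Column-oriented: one list per column, keyed by column name
columns = {
    'col1': [1, 2, 3],
    'col2': [2, 3, 4],
    'col3': ['a', 'b', 'c'],
}

# Row-oriented: one record per row
rows = [dict(zip(columns, record)) for record in zip(*columns.values())]

# Aggregating a column reads a single list directly ...
col_total = sum(columns['col2'])

# ... whereas the row layout must visit every record to pull out one field
row_total = sum(row['col2'] for row in rows)
```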

- `pykx.Table` 
- `pykx.KeyedTable`
- `pykx.SplayedTable`
- `pykx.PartitionedTable`

In this section we will deal only with the first two of these, which are the in-memory table types. As will be discussed in later sections, `Splayed` and `Partitioned` tables are memory-mapped on-disk data structures derived from the `pykx.Table` and `pykx.KeyedTable` type objects.

#### `pykx.Table`

In [None]:
print(kx.Table([[1, 2, 'a'], [2, 3, 'b'], [3, 4, 'c']], columns = ['col1', 'col2', 'col3']))

In [None]:
print(kx.Table(data = {'col1': [1, 2, 3], 'col2': [2 , 3, 4], 'col3': ['a', 'b', 'c']}))

#### `pykx.KeyedTable`

In [None]:
kx.Table(data = {'x': [1, 2, 3], 'x1': [2, 3, 4], 'x2': ['a', 'b', 'c']}).set_index(['x'])

### Other Data Types

The above types outline the majority of the important type structures in PyKX, but there are many others which you will encounter as you use the library. Below we outline some of the important ones that you will run into through the rest of this notebook.

#### `pykx.Lambda`

A `pykx.Lambda` is the most basic kind of function within PyKX. Lambdas take between 0 and 8 parameters and are the building blocks for most analytics written by users when interacting with data from PyKX.

In [None]:
pykx_lambda = kx.q('{x+y}')
type(pykx_lambda)

In [None]:
pykx_lambda(1, 2)

#### `pykx.Projection`

Similar to [functools.partial](https://docs.python.org/3/library/functools.html#functools.partial), functions in PyKX can have some of their parameters fixed in advance, resulting in a new function, which is called a projection. When this projection is called, the fixed parameters are no longer required, and cannot be provided.

If the original function had `n` total parameters, and it had `m` provided, the result would be a function (projection) that requires a user to input `n-m` parameters.

In [None]:
projection = kx.q('{x+y}')(1)
projection

In [None]:
projection(2)
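
For comparison, the same `n-m` idea expressed with `functools.partial` in plain Python. The `add` function is a stand-in for the q lambda `{x+y}`. Note one difference: `functools.partial` allows keyword arguments to be overridden at call time, whereas the text above says a projection's fixed parameters cannot be supplied again.

```python
from functools import partial

def add(x, y):
    return x + y

# Fix the first of n=2 parameters (m=1), leaving n-m=1 to supply
add_one = partial(add, 1)
result = add_one(2)
```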

---

## Accessing and creating PyKX objects

Now that we have seen some of the PyKX object types that you will encounter, practically speaking how will they be created in real-world scenarios?

### Creating PyKX objects from Pythonic data types

One of the most common ways that PyKX data is generated is through conversions from equivalent Pythonic data types. PyKX natively supports conversions to and from the following common Python data formats.

- Python
- Numpy
- Pandas
- PyArrow

In each of the above cases generation of PyKX objects is facilitated through the use of the `kx.toq` PyKX function.

In [None]:
pydict = {'a': [1, 2, 3], 'b': ['a', 'b', 'c'], 'c': 2}
kx.toq(pydict)

In [None]:
nparray = np.array([1, 2, 3, 4], dtype = np.int32)
kx.toq(nparray)

In [None]:
pdframe = pd.DataFrame(data = {'a':[1, 2, 3], 'b': ['a', 'b', 'c']})
kx.toq(pdframe)

### Random data generation

PyKX provides users with a module for the creation of random data of user specified PyKX types or their equivalent Python types. The creation of random data is useful in prototyping analytics and is used extensively within our documentation when creating test examples.

As a first example, you can generate a list of 1,000,000 random floating point values between 0 and 1 as follows:

In [None]:
kx.random.random(1000000, 1.0)

If instead you wish to choose values randomly from a list, you can do so by passing the list as the second argument to the function:

In [None]:
kx.random.random(5, [kx.LongAtom(1), ['a', 'b', 'c'], np.array([1.1, 1.2, 1.3])])

Random data does not only come in 1-dimensional forms, however: passing a list as the first argument allows you to create multi-dimensional PyKX Lists. The examples below additionally use a PyKX trick where nulls/infinities can be used to generate random data across the full allowable range.

In [None]:
kx.random.random([2, 5], kx.GUIDAtom.null)

In [None]:
kx.random.random([2, 3, 4], kx.IntAtom.inf)

Finally, you can set the seed for random data generation explicitly, giving you consistency over the generated objects. This can be done globally or for individual function calls.

In [None]:
kx.random.seed(10)
kx.random.random(10, 2.0)

In [None]:
kx.random.random(10, 2.0, seed = 10)

### Running q code to generate data

As mentioned in the introduction, PyKX provides an entrypoint to the vector programming language q. As such, users of PyKX can execute q code directly within a Python session through calls to `kx.q`.

Create some q data:

In [None]:
kx.q('0 1 2 3 4')

In [None]:
kx.q('([idx:desc til 5]col1:til 5;col2:5?1f;col3:5?`2)')

Apply arguments to a user-specified function `x+y`:

In [None]:
kx.q('{x+y}', kx.LongAtom(1), kx.LongAtom(2))

### Read data from a CSV file

Much of the data you run into for data analysis tasks comes in the form of CSV files. Similar to Pandas, PyKX provides a CSV reader, called via `kx.q.read.csv`. In the following cell we create a CSV file to be read in using PyKX:

In [None]:
import csv

with open('pykx.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    field = ["name", "age", "height", "country"]
    
    writer.writerow(field)
    writer.writerow(["Oladele Damilola", "40", "180.0", "Nigeria"])
    writer.writerow(["Alina Hricko", "23", "179.2", "Ukraine"])
    writer.writerow(["Isabel Walter", "50", "179.5", "United Kingdom"])

In [None]:
kx.q.read.csv('pykx.csv', types = {'age': kx.LongAtom, 'country': kx.SymbolAtom})

In [None]:
import os
os.remove('pykx.csv')

### Querying external Processes via IPC

One of the most common usage patterns you will encounter in organisations with access to data in kdb+/q is querying that data from an external server process. The example below assumes that you have q installed in addition to PyKX; see [here](https://kx.com/kdb-insights-personal-edition-license-download/) to install q alongside the license access for PyKX.

First we start a q/kdb+ server listening on port 5000. It will be populated with some data, in the form of a table `tab`, once a connection has been established.

In [None]:
import subprocess
import time

try:
    proc = subprocess.Popen(
        ('q', '-p', '5000'),
        stdin=subprocess.PIPE,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    time.sleep(2)
except Exception as err:
    # Chain the original error so the root cause remains visible
    raise kx.QError('Unable to create q process on port 5000') from err
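
The fixed `time.sleep(2)` above is a simple way to give the server time to start. As an alternative sketch, assuming the server accepts plain TCP connections on its port, you could poll until the port is reachable. The helper name `wait_for_port` and its parameters are hypothetical and not part of PyKX.

```python
import socket
import time

def wait_for_port(port, host='127.0.0.1', timeout=5.0, interval=0.1):
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Open and immediately close a connection to probe the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

With the server above, a call such as `wait_for_port(5000)` could stand in for the fixed sleep, returning as soon as the port is listening.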

Once a q process is available you can establish a connection to it for synchronous query execution as follows

In [None]:
conn = kx.SyncQConnection(port = 5000)

You can now run q commands against the q server

In [None]:
conn('tab:([]col1:100?`a`b`c;col2:100?1f;col3:100?0Ng)')
conn('select from tab where col1=`a')

Or use the PyKX query API

In [None]:
conn.qsql.select('tab', where=['col1=`a', 'col2<0.3'])

Or use PyKX's context interface to run SQL server side if it's available to you

In [None]:
conn('\\l s.k_')  # escape the backslash so Python passes `\l s.k_` to q unchanged
conn.sql('SELECT * FROM tab where col2>=0.5')

Finally the q server used for this demonstration can be shut down

In [None]:
proc.stdin.close()
proc.kill()

---

## Running analytics on objects in PyKX

Like many Python libraries, including NumPy and Pandas, PyKX provides a number of ways to apply analytics to its data, both those defined within the library and those you have written yourself.

### Using in-built methods on PyKX Vectors

When you are interacting with PyKX Vectors you may wish to gain insights into these objects through the application of basic analytics, such as calculating the `mean`/`median`/`mode` of the vector:

In [None]:
q_vector = kx.random.random(1000, 10.0)

In [None]:
q_vector.mean()

In [None]:
q_vector.max()

The above is useful for basic analysis but will not be sufficient for more bespoke analytics on these vectors. To give you more control over the analytics run, you can also use the `apply` method.

In [None]:
def bespoke_function(x, y):
    return x*y

q_vector.apply(bespoke_function, 5)

### Using in-built methods on PyKX Tables

In addition to the vector processing capabilities of PyKX, your ability to operate on tabular structures is also important. Highlighted in greater depth within the Pandas-Like API documentation [here](../user-guide/advanced/Pandas_API.ipynb), these methods allow you to apply functions and gain insights into your data in a way that is familiar.

In the example below you will use combinations of the most commonly used elements of this Table API, operating on the following table:

In [None]:
N = 1000000
example_table = kx.Table(data = {
    'sym' : kx.random.random(N, ['a', 'b', 'c']),
    'col1' : kx.random.random(N, 10.0),
    'col2' : kx.random.random(N, 20)
    }
)
example_table

You can search for and filter data within your tables using `loc`, similarly to how this is achieved in Pandas:

In [None]:
example_table.loc[example_table['sym'] == 'a']

This behavior is also available when retrieving data from a table through indexing, which calls the `__getitem__` method, as you can see here:

In [None]:
example_table[example_table['sym'] == 'b']

You can additionally set the index columns of the table. When dealing with PyKX tables, this converts the table from a `pykx.Table` object to a `pykx.KeyedTable` object.

In [None]:
example_table.set_index('sym')

In addition to basic data manipulation such as index setting, you also get access to analytic capabilities such as the calculation of the `mean` and `median`, as demonstrated here:

In [None]:
print('mean:')
print(example_table.mean(numeric_only = True))

print('median:')
print(example_table.median(numeric_only = True))

You can make use of the `groupby` method to group PyKX tabular data, after which analytics can be applied to each group.

As a first example, group the dataset on the `sym` column and then calculate the `mean` of each column for each `sym`:

In [None]:
example_table.groupby('sym').mean()
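
Conceptually, a grouped mean collects the values that share a key and then reduces each collection. The plain-Python sketch below uses small made-up data rather than `example_table`, and only illustrates the idea; it is not how PyKX evaluates `groupby`.

```python
from collections import defaultdict

# Made-up stand-ins for the `sym` and `col1` columns
sym = ['a', 'b', 'a', 'c', 'b']
col1 = [1.0, 2.0, 3.0, 4.0, 6.0]

# Collect the values belonging to each group key
groups = defaultdict(list)
for key, value in zip(sym, col1):
    groups[key].append(value)

# Reduce each group to its mean
group_means = {key: sum(vals) / len(vals) for key, vals in groups.items()}
```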

As an extension to the above groupby, you can now consider a more complex example which makes use of `numpy` to run some calculations on the PyKX data. You will see later that this can be simplified further in this specific use case.

In [None]:
def apply_func(x):
    nparray = x.np()
    return np.sqrt(nparray).mean()

example_table.groupby('sym').apply(apply_func)

Time-series specific joining of data can be completed using `merge_asof` joins. In this example two tables containing temporal information, `trades` and `quotes`, are created:

In [None]:
trades = kx.Table(data={
    "time": [
        pd.Timestamp("2016-05-25 13:30:00.023"),
        pd.Timestamp("2016-05-25 13:30:00.023"),
        pd.Timestamp("2016-05-25 13:30:00.030"),
        pd.Timestamp("2016-05-25 13:30:00.041"),
        pd.Timestamp("2016-05-25 13:30:00.048"),
        pd.Timestamp("2016-05-25 13:30:00.049"),
        pd.Timestamp("2016-05-25 13:30:00.072"),
        pd.Timestamp("2016-05-25 13:30:00.075")
    ],
    "ticker": [
        "GOOG",
        "MSFT",
        "MSFT",
        "MSFT",
        "GOOG",
        "AAPL",
        "GOOG",
        "MSFT"
    ],
    "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
    "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03]
})
quotes = kx.Table(data={
    "time": [
        pd.Timestamp("2016-05-25 13:30:00.023"),
        pd.Timestamp("2016-05-25 13:30:00.038"),
        pd.Timestamp("2016-05-25 13:30:00.048"),
        pd.Timestamp("2016-05-25 13:30:00.048"),
        pd.Timestamp("2016-05-25 13:30:00.048")
    ],
    "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
    "price": [51.95, 51.95, 720.77, 720.92, 98.0],
    "quantity": [75, 155, 100, 100, 100]
})

print('trades:')
display(trades)
print('quotes:')
display(quotes)

When applying the asof join you can additionally use named arguments to distinguish which table each overlapping column originates from. In this case the columns are suffixed with `_trades` and `_quotes`:

In [None]:
trades.merge_asof(quotes, on='time', suffixes=('_trades', '_quotes'))
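
The core of an asof join on a single time column is "for each left row, take the latest right row at or before its time". The sketch below shows that rule in plain Python with `bisect`; the helper `asof_lookup` and the small data set are made up for illustration, and it covers only the backward-looking match, not everything `merge_asof` supports.

```python
from bisect import bisect_right

def asof_lookup(left_times, right_times, right_values):
    """For each left time, take the value of the latest right row at or before it.

    `right_times` must be sorted ascending; None marks "no earlier row".
    """
    matched = []
    for t in left_times:
        idx = bisect_right(right_times, t) - 1
        matched.append(right_values[idx] if idx >= 0 else None)
    return matched

# Millisecond offsets standing in for timestamps (made-up values)
trade_times = [23, 30, 41, 48]
quote_times = [23, 38, 48]
quote_prices = [51.95, 51.96, 720.77]

matched_prices = asof_lookup(trade_times, quote_times, quote_prices)
```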

### Using PyKX/q native functions

While the Pandas-Like API and the methods provided by PyKX Vectors offer an effective way of applying analytics to PyKX data, the most efficient and performant way to run analytics on your data is through the PyKX/q primitives available through the `kx.q` module.

These include functionality for the calculation of moving averages, application of asof/window joins, column reversal etc. A full list of the available functions and some examples of their usage can be found [here](../api/pykx-execution/q.md).

Here are a few examples of how you can use these functions, broken into sections for convenience.

#### Mathematical functions

##### mavg

Calculate a series of average values across a list using a rolling window

In [None]:
kx.q.mavg(10, kx.random.random(10000, 2.0))
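
To show what a rolling-window average computes, here is a plain-Python sketch. It assumes that positions near the start of the list, where fewer than `window` items are available, average only the items seen so far; the function name `moving_average` is hypothetical and null handling is not modelled.

```python
def moving_average(window, values):
    """Mean of the last `window` items at each position (fewer at the start)."""
    averages = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages

smoothed = moving_average(2, [1, 2, 3, 4])
```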

##### cor

Calculate the correlation between two lists

In [None]:
kx.q.cor([1, 2, 3], [2, 3, 4])

In [None]:
kx.q.cor(kx.random.random(100, 1.0), kx.random.random(100, 1.0))
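
For reference, the Pearson correlation coefficient can be written out in plain Python as below. This is a sketch of the standard formula rather than q's implementation, and the `pearson` helper is hypothetical. The two lists from the first example differ by a constant, so they are perfectly correlated.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r = pearson([1, 2, 3], [2, 3, 4])
```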

##### prds

Calculate the cumulative product across a supplied list

In [None]:
kx.q.prds([1, 2, 3, 4, 5])
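
The equivalent running product in plain Python, using `itertools.accumulate` with multiplication, gives a feel for the expected result:

```python
from itertools import accumulate
from operator import mul

# Each element is the product of all inputs up to and including that position
running_products = list(accumulate([1, 2, 3, 4, 5], mul))
```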

#### Iteration functions

##### each

Supplied both as a standalone primitive and as a method for PyKX Lambdas, `each` allows you to pass the individual elements of a PyKX object to a function one at a time.

In [None]:
kx.q.each(kx.q('{prd x}'), kx.random.random([5, 5], 10.0, seed=10))

In [None]:
kx.q('{prd x}').each(kx.random.random([5, 5], 10.0, seed=10))
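
In plain Python terms, applying a function to each element corresponds to a comprehension or `map` over the outer list. The sketch below mirrors `{prd x}` applied to each row, using made-up data:

```python
from math import prod

rows = [[1, 2, 3], [4, 5], [6]]

# Apply the product function to each inner list in turn
row_products = [prod(row) for row in rows]
```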

#### Table functions

##### meta

Retrieve metadata about a table

In [None]:
qtab = kx.Table(data = {
    'x' : kx.random.random(1000, ['a', 'b', 'c']).grouped(),
    'y' : kx.random.random(1000, 1.0),
    'z' : kx.random.random(1000, kx.TimestampAtom.inf)
})

In [None]:
kx.q.meta(qtab)

##### xasc

Sort a table in ascending order by the values of a specified column

In [None]:
kx.q.xasc('z', qtab)