Cookbook/IntegratingWithR


kdb+ and R

Introduction

kdb+ and R are complementary technologies. kdb+ is the world’s leading timeseries database and incorporates a programming language called q. R is a programming language and environment for statistical computing and graphics. Both are tools used by data scientists to interrogate and analyze data, and their feature sets overlap: each provides an interactive environment with a rich set of built-in analytical functions.

They do, however, have different strengths: kdb+ is designed to store, manage and query very large volumes of data, while R provides an extensive library of statistical routines and graphics.

When used together, kdb+ and R provide an excellent platform for easily performing advanced statistical analysis and visualization on large volumes of data.

R can be called as a server from q, and vice versa.

kdb+ and R Working Together

Given the complementary characteristics of the two languages, it is important to utilize their respective strengths. All the analysis could be done in kdb+; the q language is sufficiently flexible and powerful to replicate any of the pre-built R functions. As a general guideline, however, heavy aggregation and filtering should be performed in kdb+, close to the data, with the reduced dataset then passed to R for specialized statistical analysis and visualization. Where exactly the line is drawn between kdb+ and R will depend on both the system architecture and the strengths of the data scientists using the system.
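As a simple illustration of that flexibility, R's sd function (the sample standard deviation, which uses an n-1 denominator) can be replicated in one line of q. This is a minimal sketch; note that q's built-in dev keyword computes the population (n denominator) deviation, so a small wrapper is used here to match R's default:

q)sd:{sqrt (sum d*d:x-avg x)%-1+count x}  / sample standard deviation, as computed by R's sd()
q)sd 1 2 3 4 5f
1.581139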

Interfaces

There are four ways to interface kdb+ with R:

  1. R can connect to kdb+ and extract data - loads a shared library into R, connects to kdb+ via TCP/IP
  2. Embed R inside kdb+ and invoke R routines - loads the R library into kdb+, instantiates R
  3. kdb+ can connect to a remote instance of R via TCP/IP and invoke R routines remotely
  4. kdb+ can load the R maths library and invoke the R maths routines locally

From the point of view of the user, kdb+ and R may be running on the same machine, or either may be hosted remotely. Considering the potential size of the data, it is probably more likely that the kdb+ installation containing the data will be hosted remotely from the user. Points to consider when selecting the integration method include where the bulk of the data resides, how much data must be shipped between kdb+ and R, and the processing and memory overhead of each approach.

Performance Testing

Generally speaking, running analysis in kdb+ will be faster than in R, either by virtue of the implementation or because the data is local to kdb+. Both languages have similar facilities for timing code. The code samples below measure the time taken to multiply two identical 100x100 matrices 1,000 times. kdb+ timings are reported in milliseconds, R timings in seconds. For this operation kdb+ is more than 5 times faster than R (238ms against 1240ms). In q:

q)a:100 100#`float$til 100000
q)b:100 100#`float$til 100000
q)\t do[1000;a mmu b]
238

and in R:

> a<-matrix(0:99999,ncol=100,nrow=100)
> b<-matrix(0:99999,ncol=100,nrow=100)
> system.time(replicate(1000,a%*%b))
user system elapsed
    0.252 0.554 1.240

Calling kdb+ from R

Connecting from R to kdb+ (a.k.a. Q Server for R)

This is the most common scenario: users who are comfortable in R connect to a kdb+ database to extract partially analyzed data into R for further local manipulation, analysis and display. A pre-built interface from R to kdb+, Q Server for R, is available from Kx and currently supports Linux (64-bit), Windows (32-bit and 64-bit) and OS X. The interface allows R to connect to a kdb+ database and send it a request, which can optionally return a result. There are three methods available:

  1. open_connection(host, port, "username:password") - open a connection to a kdb+ process (the credentials string may be NULL), returning a connection handle
  2. execute(connection, query) - send a query over the connection, optionally returning the result
  3. close_connection(connection) - close a connection

To open and initialize a connection from R to a kdb+ process on localhost listening on port 5001, with a trade table loaded:

> source("c:/r/qServer.R") # change this path to be whatever is correct on your installation
> h<-open_connection("127.0.0.1",5001,"testusername:testpassword") 

To request data and plot it:

> execute(h,"select count i by date from trade")
date x
1 2014-01-09 5125833
2 2014-01-10 5902700
3 2014-01-13 4419596
4 2014-01-14 4106744
5 2014-01-15 6156630
> # Setting TZ and retrieving time as timestamp simplifies time conversion
> Sys.setenv(TZ = "GMT")
> res<-execute(h,"select tradecount:count i, sum size, last price, vwap: size wavg price by time:0D00:05 xbar date+time from trade where date=2014.01.14,sym=`GOOG,time within 09:30 16:00")
> head(res)
                 time tradecount   size    price     vwap
1 2014-01-14 09:30:00       1471 142868 1132.000 1136.227
2 2014-01-14 09:35:00       1053  65599 1130.250 1132.674
3 2014-01-14 09:40:00       1019  77808 1132.422 1130.405
4 2014-01-14 09:45:00        563  39372 1130.846 1130.835
5 2014-01-14 09:50:00        586  38944 1129.312 1129.999
6 2014-01-14 09:55:00        573  44591 1131.100 1129.720
> plot(res$time,res$price,type="l",xlab="time",ylab="price")

which produces the plot in Figure 1:

Figure 1: Last traded price plot drawn from R

More comprehensive graphing is available in additional R packages which can be freely downloaded. For example, using the xts package:

> library(xts)
# extract the HLOC buckets in 5 minute intervals
> res<-execute(h,"select high:max price,low:min price,open:first price, close:last price by time:0D00:05 xbar date+time from trade where date=2014.01.14,sym=`GOOG,time within 09:30 16:00")
# create an xts object from the returned data frame
# order on the time column
> resxts <-xts(res[,-1], order.by=res[,'time'])
# create a vector of colours for the graph
# based on the relative open and close prices
> candlecolors <- ifelse(resxts[,'close'] > resxts[,'open'], 'GREEN', 'RED')
# display the candle graph with appropriate labels
> plot.xts(resxts,type='candles',width=100,candle.col=candlecolors,bar.col='BLACK',xlab="time",ylab="price",main="GOOG HLOC")

produces the plot in Figure 2:

Figure 2: Candlestick plot using xts package

Another popular package is quantmod, which contains the chartSeries function.

> library(quantmod)
# extract the last closing price in 30 second buckets
> res<-execute(h,"select close:last price by time:0D00:00:30 xbar date+time from trade where date=2014.01.14,sym=`GOOG,time within 09:30 16:00")
# create the closing price xts object
> closes <-xts(res[,-1], order.by=res[,'time'])
# chart it. Add Bollinger Bands to the main graph
# add additional Relative Strength Indicator and Rate Of Change graphs
> chartSeries(closes,theme=chartTheme("white"),TA=c(addBBands(),addTA(RSI(closes)),addTA(ROC(closes))))

This produces the plot shown in Figure 3:

Figure 3: Chart example from quantmod package

Close the connection when done: 

> close_connection(h)
[1] 0

ODBC

Although it is not the preferred method, the kdb+ ODBC driver can be used to connect to kdb+ from R if R is running on Windows. Details of how to install the ODBC driver can be found at

http://code.kx.com/wiki/Cookbook/ODBC/qserver

The RODBC package should be installed in R. An example is given below.

# install RODBC
> install.packages("RODBC")
# load it
> library(RODBC)
# create a connection to a predefined DSN
> ch <- odbcConnect("localhost:5000")
# run a query
# s.k should be installed on the kdb+ server to enable standard SQL
# However, all statements can be prefixed with q) to run standard q.
> res <- sqlQuery(ch, paste('q)select count i by date from trade'))

Calling R from kdb+

There are three interfaces which allow you to invoke R from kdb+, for both 32- and 64-bit builds. If the appropriate build is not available for your target system, it can be built from source by following the instructions in the associated README; the source is available in svn.

Embedding R inside kdb+ (a.k.a. R server for Q)

A shared library can be loaded which brings R into the kdb+ memory space, meaning all of R's statistical routines and graphing capabilities can be invoked directly from kdb+. Using this method means data is not passed between remote processes. The library has five methods; the three used in the examples below are Rcmd (execute an R command, returning no result), Rget (execute an R command and return the result to kdb+) and Rset (copy a kdb+ object into a variable in the R memory space).

The R_HOME environment variable must be set before starting kdb+. To find out what it should be, run R from the bash shell and check the result of R.home():

> R.home()
[1] "/Library/Frameworks/R.framework/Resources"

and then set it accordingly in your environment (e.g. for OS X with a bash shell):

$ export R_HOME=/Library/Frameworks/R.framework/Resources

Optional additional environment variables are R_SHARE_DIR, R_INCLUDE_DIR, LD_LIBRARY_PATH (for libR.so).
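Once these are set, a quick sanity check (a minimal sketch, assuming rinit.q is on your load path as in the examples below) is to load the interface and evaluate something trivial in the embedded R:

q)\l rinit.q
q)Rget"pi"  / confirms the embedded R is up and returning values to kdb+
,3.141593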

An example is outlined below, using kdb+ to subselect some data and then passing it to R for graphical display.

q)select count i by date from trade
date      | x       
----------| --------
2014.01.07| 29205636
2014.01.08| 30953246
2014.01.09| 30395962
2014.01.10| 29253110
2014.01.13| 32763630
2014.01.14| 29721162
2014.01.15| 30035729
..

/- extract mid prices in 5 minute bars
q)mids:select mid:last .5*bid+ask by time:0D00:05 xbar date+time from quotes where date=2014.01.15,sym=`IBM,time within 09:30 16:00
q)mids
time                         | mid     
-----------------------------| --------
2014.01.15D09:30:00.000000000| 185.92  
2014.01.15D09:35:00.000000000| 185.74  
2014.01.15D09:40:00.000000000| 186.11  
2014.01.15D09:45:00.000000000| 186.36  
2014.01.15D09:50:00.000000000| 186.5   
2014.01.15D09:55:00.000000000| 186.98  
2014.01.15D10:00:00.000000000| 187.45  
2014.01.15D10:05:00.000000000| 187.48  
2014.01.15D10:10:00.000000000| 187.66  
..

/- Load in R
q)\l rinit.q
/- Pass the table into the R memory space
q)Rset["mids";mids]
0i
/- Graph it
q)Rcmd["plot(mids$time,mids$mid,type=\"l\",xlab=\"time\",ylab=\"price\")"]
0i

This will produce a plot as shown in Figure 4:

Figure 4: Quote mid price plot drawn from kdb+

To close the graphics window, use dev.off() rather than the close button on the window.

q)Rcmd["dev.off()"]
0i

Alternatively, the plot can be written directly to a file:

q)Rcmd["pdf(\"test.pdf\")"]
0i
q)Rcmd["plot(mids$time,mids$mid,type=\"l\",xlab=\"time\",ylab=\"price\")"]
0i
q)Rcmd["dev.off()"]
0i

If the kdb+ and R installations are running remotely from the user on a linux machine, the graphics can be seen locally using X11 forwarding over ssh.

Aside from using R's powerful graphics, this mechanism also allows you to call R analytics from within kdb+. Using a rather simple example of an average:

q)\l rinit.q
q)prices:10?100
q)Rset["prices";prices]
0i
q)Rcmd["meanPrices<-mean(prices)"]
0i
q)Rget"meanPrices"
,55.6
q)avg prices  / check the R result agrees with kdb+
55.6

This is a trivial example, but it demonstrates the mechanism by which you can leverage the 5,000+ packages available to R from within kdb+.
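As a slightly richer sketch of the same mechanism (illustrative only: the variable names are arbitrary, and the exact q types returned depend on how the interface maps R objects), a linear regression can be fitted in the embedded R on data prepared in q:

q)Rset["x";til 100]
0i
q)Rset["y";3+0.5*til 100]  / y = 3 + 0.5x, an exact linear relationship
0i
q)Rcmd["fit<-lm(y~x)"]
0i
q)Rget"as.numeric(coef(fit))"  / R recovers the intercept and slope
3 0.5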

Remote R: Rserve

Rserve allows applications to connect remotely to an R instance over TCP/IP. The methods are the same as those outlined in the previous section (Embedding R inside kdb+), the difference being that all data is passed over TCP/IP rather than shared within the same memory space. Every connection to Rserve has a separate workspace and working directory, which means user-defined variables and functions with clashing names will not overwrite each other. This differs from the embedded approach, where two users sharing the same kdb+ process can overwrite each other's variables in both the kdb+ and R workspaces.

R Math Library

R contains a maths library which can be compiled standalone. The functions can then be exposed to kdb+ by wrapping them in C code which handles the mapping between R datatypes and kdb+ datatypes (K objects). See r-mathlib for an example of integrating kdb+ with the R API (i.e. making use of some statistical functions from q).

 q) \l rmath.q
 q) x:rnorm 1000     / create 1000 normal variates
 q) summary x        / simple statistical summary of x
 q) hist[x;10]       / show histogram (bin count) with 10 bins
 q) y:scale x        / y = (x - mean(x))/sd(x)
 q) quantile[x;.5]   / calculate the 50% quantile
 q) pnorm[0;1.5;1.5] / cdf value for 0 for a N(1.5,1.5) distribution
 q) dnorm[0;1.5;1.5] / normal density at 0 for N(1.5,1.5) distribution

Those interested in an embedded math lib may also be interested in Andrey's qml at http://althenia.net/qml

Example: Correlating Stock Price Returns

R has an in-built operation to produce a correlation matrix of aligned datasets. kdb+ does not, but one can easily be built, as shown below in the Correlations in kdb+ section. In this example we will investigate different approaches for calculating the correlation of time-bucketed returns for a set of financial instruments. Possible approaches are:

  1. Extract raw data from kdb+ into R for each instrument and calculate the correlation
  2. Extract bucketed data from kdb+ into R, align data from different instruments, and correlate
  3. Extract bucketed data with prices for different instruments aligned from kdb+ into R, and correlate
  4. Calculate the correlations in kdb+

We will use a randomly created set of equities data, with prices ticking between 8am and 5pm.

q)select count i by date from trade where date.month=2014.01m
date      | x       
----------| --------
2014.01.07| 29205636
2014.01.08| 30953246
2014.01.09| 30395962
2014.01.10| 29253110
2014.01.13| 32763630
2014.01.14| 29721162
2014.01.15| 30035729
q)select count i by date from trade where date.month=2014.01m,sym=`GOOG
date      | x    
----------| -----
2014.01.07| 37484
2014.01.08| 31348
2014.01.09| 28573
2014.01.10| 32228
2014.01.13| 38461
2014.01.14| 39853
2014.01.15| 30136

We will use the R interface described above (Q Server for R). The R and kdb+ installations are on the same host, so data extract timings do not include network transfer time but do include the standard serialization and de-serialization of data. We will load the interface and connect to kdb+ with:

> source("c:/r/qServer.R")
> h <- open_connection("127.0.0.1",9998,NULL)

Extract Raw Data into R

Retrieving raw data into R and doing all the processing on the R side is the least efficient way to approach this task in relation to processing time, network utilization, and memory usage. To illustrate, we can time how long it takes to extract one day’s worth of data for one symbol from kdb+ to R.

> system.time(res <- execute(h, "select time,sym,price from trade where date=2014.01.09,sym=`GOOG"))
   user  system elapsed 
  0.011   0.001   0.049

Given we wish to process over multiple dates and instruments, we will discount this method.

Extract Aggregated Data into R

The second approach is to extract aggregated statistics from kdb+ to R. The required statistics in this case are the price returns between consecutive time buckets for each instrument. The following q function extracts time bucketed data:

timebucketedstocks:{[startdate; enddate; symbols; timebucket]
 data:select last price by date,sym,time:timebucket xbar date+time from trade where date within (startdate;enddate),sym in symbols;  / extract the time bucketed data
 () xkey update return:1^price%prev price by sym from data /  calculate returns between prices in consecutive buckets and return the results unkeyed
 }
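The return calculation (1^price%prev price by sym) can be seen in isolation on a toy table (a minimal sketch with made-up symbols and prices):

q)update return:1^price%prev price by sym from ([]sym:`A`A`A`B`B;price:10 11 12 20 22f)
sym price return
----------------
A   10    1
A   11    1.1
A   12    1.090909
B   20    1
B   22    1.1

The first bucket for each sym has no previous price, so its null return is filled with 1 (i.e. no change).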

An example is:

q)timebucketedstocks[2014.01.09;2014.01.13;`GOOG`IBM;0D00:05]
date       sym  time                         price    return   
----------------------------------------------------------------
2014.01.09 GOOG 2014.01.09D04:00:00.000000000 1142     1        
2014.01.09 GOOG 2014.01.09D04:05:00.000000000 1142.5   1.000438 
2014.01.09 GOOG 2014.01.09D04:10:00.000000000 1142     0.9995624
2014.01.09 GOOG 2014.01.09D04:30:00.000000000 1143.99  1.001743 
2014.01.09 GOOG 2014.01.09D04:35:00.000000000 1144     1.000009 
2014.01.09 GOOG 2014.01.09D04:55:00.000000000 1144     1       
..

Once the data is in R it needs to be aligned and correlated. To align the data we will use a pivot function defined in the reshape package.

# Reduce the dataset as much as possible
# only extract the columns we will use
> res <- execute(h,"select time,sym,return from timebucketedstocks[2014.01.09; 2014.01.15; `GOOG`IBM`MSFT; 0D00:05]")
> head(res)
                 time  sym    return
1 2014-01-09 09:30:00 GOOG 1.0000000
2 2014-01-09 09:35:00 GOOG 0.9975051
3 2014-01-09 09:40:00 GOOG 0.9966584
4 2014-01-09 09:45:00 GOOG 1.0005061
5 2014-01-09 09:50:00 GOOG 1.0004707
6 2014-01-09 09:55:00 GOOG 0.9988128
> install.packages('reshape')
> library(reshape)
# Pivot the data using the re-shape package
> p <- cast(res, time~sym)
Using return as value column. Use the value argument to cast to override this choice
> head(p)
                 time      GOOG       IBM      MSFT
1 2014-01-09 09:30:00 1.0000000 1.0000000 1.0000000
2 2014-01-09 09:35:00 0.9975051 1.0006143 1.0002096
3 2014-01-09 09:40:00 0.9966584 1.0001588 1.0001397
4 2014-01-09 09:45:00 1.0005061 0.9998941 0.9986034
5 2014-01-09 09:50:00 1.0004707 0.9965335 1.0019580
6 2014-01-09 09:55:00 0.9988128 0.9978491 1.0022334

# And generate the correlation matrix
> cor(p)
          GOOG       IBM      MSFT
GOOG 1.0000000 0.2625370 0.1577429
IBM  0.2625370 1.0000000 0.2568469
MSFT 0.1577429 0.2568469 1.0000000

An interesting consideration is the timing for each of the steps and how that changes when the dataset gets larger.

> system.time(res <- execute(h,"select time,sym,return from timebucketedstocks[2014.01.09; 2014.01.15; `GOOG`IBM`MSFT; 0D00:05]"))
   user  system elapsed 
  0.001   0.001   0.145 
> system.time(replicate(10,p<-cast(res,time~sym)))
   user  system elapsed 
  0.351   0.012   0.357 
> system.time(replicate(100,cor(p)))
   user  system elapsed 
   0.04    0.00    0.04 

We can see that the initial extract accounts for most of the time, and that the bulk of its elapsed time is spent in the kdb+ query itself; timing the query directly in q gives:

q)\t select time,sym,return from timebucketedstocks[2014.01.09; 2014.01.15; `GOOG`IBM`MSFT; 0D00:05]
134

We can also analyze how these figures change as the dataset grows. If we choose a more granular time period for bucketing, the data set will be larger. In our case we will use 10-second buckets rather than 5-minute buckets, meaning the result set will be 30x larger.
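The 30x figure is simply the ratio of the two bucket sizes, which can be checked directly in q:

q)0D00:05 % 0D00:00:10
30f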

> system.time(res <- execute(h,"select time,sym,return from timebucketedstocks[2014.01.09; 2014.01.15; `GOOG`IBM`MSFT; 0D00:00:10]"))
user system elapsed
  0.015   0.008   0.234
> system.time(p<-cast(res,time~sym))
Using return as value column. Use the value argument to cast to override this choice
user system elapsed
  0.950   0.048   0.998 


We can see that the time to extract the data increases by ~90ms. The kdb+ query time increases by 4ms, so the majority of the increase is due to shipping the larger dataset from kdb+ to R.

q)\t select time,sym,return from timebucketedstocks[2014.01.09; 2014.01.15; `GOOG`IBM`MSFT; 0D00:00:10]
138

The pivot time on the larger dataset grows from ~40ms to ~1000ms, giving a total time to do the analysis of approximately 2300ms. As the dataset grows, the time to pivot the data in R starts to dominate the overall time.

Align Data in kdb+

Given the pivot performance in R, an alternative is to pivot the data on the kdb+ side. This has the added benefit of reducing the volume of data transported, since the time and sym identification columns can be dropped once the data is aligned. The q function below pivots the data; an explanation of the technique can be found at http://code.kx.com/wiki/Pivot

timebucketedpivot:{[startdate; enddate; symbols; timebucket]
 / Extract the time bucketed data
 data:timebucketedstocks[startdate;enddate;symbols;timebucket];
 / Get the distinct list of column names (the instruments)
 colheaders:value asc exec distinct sym from data;
 / Pivot the table, filling with 1 (no value in a bucket means the price has stayed the same),
 / and return the results unkeyed
 () xkey 1^exec colheaders#(sym!return) by time:time from data
 }

An example is:

q)timebucketedpivot[2014.01.09;2014.01.13;`GOOG`IBM;0D00:05]
time                          GOOG      IBM      
-------------------------------------------------
2014.01.09D09:30:00.000000000 1         1        
2014.01.09D09:35:00.000000000 0.9975051 1.000614 
2014.01.09D09:40:00.000000000 0.9966584 1.000159 
2014.01.09D09:45:00.000000000 1.000506  0.9998941
2014.01.09D09:50:00.000000000 1.000471  0.9965335
2014.01.09D09:55:00.000000000 0.9988128 0.9978491
2014.01.09D10:00:00.000000000 1.000775  0.9992017
..

Using the larger dataset example, we can then do

> system.time(res <- execute(h,"delete time from timebucketedpivot [2014.01.09; 2014.01.15; `GOOG`IBM`MSFT; 0D00:00:10]"))
user system elapsed
  0.003   0.004   0.225 
> cor(res)
          GOOG        IBM       MSFT
GOOG 1.0000000 0.15336531 0.03471400
IBM  0.1533653 1.00000000 0.02585773
MSFT 0.0347140 0.02585773 1.00000000

thus reducing the total query time from 2300ms to 860ms and additionally reducing the network usage.

Correlations in kdb+

A final approach is to calculate the correlations in kdb+, meaning that R is not used for any statistical analysis. The function below utilizes timebucketedpivot, defined above, to create the correlation matrix.

correlationmatrix:{[startdate; enddate; symbols; timebucket]
 / Extract the pivoted data
 data:timebucketedpivot[startdate;enddate;symbols;timebucket];
 / Make sure the symbol list is distinct
 / and only contains values present in the data (the pivoted column names)
 symbols:asc distinct symbols inter (cols data) except `time;
 / Calculate the list of pairs to correlate
 pairs:raze {first[x],/:1 _ x}each {1 _ x}\[symbols];
 / Function to return the correlation for a pair:
 / calculate two rows for each pair, with the same value in each
 correlatepair:{[data;pair]([]s1:pair;s2:reverse pair;correlation:cor[data pair 0;data pair 1])};
 paircor:raze correlatepair[flip delete time from data] each pairs;
 / Pivot the data to give a matrix 
 pivot:exec symbols#s1!correlation by sym:s2 from paircor;
 / Unkey and fill the diagonal with 1
 () xkey 1f^pivot
 }
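To see how the pair list is built, the pairs expression from the function above can be evaluated on its own for the symbol list `GOOG`IBM`MSFT:

q)raze {first[x],/:1 _ x}each {1 _ x}\[`GOOG`IBM`MSFT]
GOOG IBM
GOOG MSFT
IBM  MSFT

Each pair appears once; correlatepair then produces both orderings (s1,s2 and s2,s1), so the final exec fills both halves of the matrix and the 1f^ fill supplies the diagonal.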

The full correlationmatrix function can be run like this:

q)correlationmatrix[2014.01.09; 2014.01.15; `GOOG`IBM`MSFT; 0D00:00:10]
sym  GOOG      IBM        MSFT
------------------------------------
GOOG 1         0.1533653  0.034714  
IBM  0.1533653 1          0.02585773
MSFT 0.034714  0.02585773 1     
q)\t correlationmatrix[2014.01.09; 2014.01.15; `GOOG`IBM`MSFT; 0D00:00:10]
181

This solution executes fastest and with the least network usage, as the resultant correlation matrix returned to the user is small.

Summary

It can be seen from the above examples that it is most efficient to calculate the correlation matrix directly in kdb+, with the performance gains increasing as the size of the dataset increases. The kdb+-only approach also simplifies the architecture, as it uses a single technology. It is up to the user to decide which approach best fits their use case and expertise, although some amount of initial work should always be done on the kdb+ side to avoid raw data extracts.

Example: Working with Smart Meter Data

To demonstrate the power of kdb+, an example using randomly generated smart meter data has been developed. This can be downloaded from http://code.kx.com/wsvn/code/cookbook_code/tutorial/. By following the instructions in the README, an example database can be built. The default database contains information on 100,000 smart meter customers from different sectors and in different regions over 61 days, with 9.6m records per day and 586m rows in total. A set of example queries is provided, along with a tutorial to step through the queries and test the performance of kdb+. Users are encouraged to experiment with the example database and queries.

The data can be extracted from kdb+ into R for further analysis or visualization. As an example, the code below will generate an average daily usage profile for each customer type (res = residential, com = commercial, ind = industrial) over a 10-day period.

# load the xtsExtra package
# this will overwrite some of the implementations
# loaded from the xts package (if already loaded)
> install.packages("xtsExtra", repos="http://r-forge.r-project.org") # for R 3.1 you may need an additional parameter type="source"
> library(xtsExtra)
# load the connection library
> source("c:/r/qServer.R")
> h <- open_connection("127.0.0.1",9998,NULL)
# pull back the profile data
# customertypeprofiles takes 3 parameters
# [start date; end date; time bucket size]
> d<-execute(h,"customertypeprofiles[2013.08.01;2013.08.10;15]")
> dxts<-xts(d[,-1],order.by=d[,'time'])
# plot it
> plot.xts(dxts, screens=1, ylim=c(0,500000), auto.legend=TRUE, main=" Usage Profile by Customer Type")

which produces the plot in Figure 5:

Figure 5: Customer usage profiles generated in kdb+ and drawn in R

Timezones

Note that R's timezone setting affects date transfers between R and q. In R:

> Sys.timezone()               # reads current timezone
> Sys.setenv(TZ = "GMT")       # sets GMT ("UTC" is the same)

For example, in the R server:

q)Rcmd "Sys.setenv(TZ=\"GMT\")"
0
q)Rget "date()"
"Fri Feb  3 06:33:43 2012"
q)Rcmd "Sys.setenv(TZ=\"EST\")"
0
q)Rget "date()"
"Fri Feb  3 01:33:57 2012"