
# Data Miner

Sample Data Miner in kdb+. Available from https://github.com/kxcontrib/jloveless/tree/master/data_miner.

## Overview

Kdb+ is well suited to data-mining applications. Application logic is written in q, and data queries run directly against kdb+, with direct access and manipulation. The components of queries are created in matrix form for easy manipulation, then passed directly to the data via functional select. The following lab walks through this code and the major features of the application. The application uses a genetic-algorithm framework to maximize the sum of the column FIT. This is known as the maximum subarray sums problem. The problem is simple to state: maximize the sum of the column FIT using any combination of intervals of the remaining arrays.

Lower dimension solutions: http://www-static.cc.gatech.edu/people/home/kalyan/papers/mss-ppl95.pdf#search=%22k%20maximum-sum%20subarrays%20problem%22

We will examine the problem in N dimensions.
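As a warm-up, here is a minimal sketch of the one-dimensional case in q. The helper name `kadane` is ours, not part of the application: it is the classic scan-based algorithm for the maximum sum over a contiguous interval, keeping a running "best sum ending here" clipped at zero.

```q
//maximum sum over any contiguous interval (empty interval allowed)
kadane:{max 0,{0|x+y}\[0;x]}
kadane 1 -2 3 4 -1   / 7 (from the interval 3 4)
```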

## Setup and Example Table

The table under study must be unkeyed, must be named **db**, and must contain a column **FIT** (as in fitness). The system will optimize combinations towards the column **FIT** using the variables allowed in the list **il** (independent list).

```q
//create a sample database: notice that time can be used, as time is simply a float
q)db:flip `time`var1`var2`var3`var4`FIT!({1000?x}each(12:00:00;10;10;10;5.0;100?(-1*(100?10)),(50?10)))
//define our available attributes to use- these are the columns we will allow solutions
//to be created from. We will refer to these as attributes.
q)il:`time`var1`var2`var3`var4
```

To begin, we will define some initial parameters. The most important here is the **bckts** parameter. This specifies how many buckets to place the values of attributes in. We do this to reduce dimensionality, at the cost of specificity.

```q
//system inputs
gen:5;                          //number of generations
complx:(floor (.5*count il));   //maximum complexity allowed for a random choice, e.g. 1/2 the number of variables
recssize:rndsize:10*mutesize:shftsize:elitesize:5*crossover:joinsize:5000;  //sets sizes for components
sft:{1};                        //how much should a point move on shift? e.g. if 2 the shift can be 0 1 2
bckts:5;                        //number of buckets to place variables in
```
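Bucketing is done with q's built-in `xrank`, which splits values into roughly equal-sized groups by rank; a minimal illustration:

```q
q)5 xrank til 10     /assign each of 10 values to one of 5 buckets
0 0 1 1 2 2 3 3 4 4
```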

We must now define some utility functions, including a fitness function. This is the function that is applied to solutions to determine their quality. In this case we will simply sum FIT at the solution indices.

```q
getfit:{(+/)db[`FIT]@x};  //get fitness for a given set
comb:{m:x:key x;do[y-1;x:{raze{y,/:(1+max y)_x}[y]each x}[x;m]];x};  //permutations
gettop25:{keepNdis[2;sm@distinct raze{where 3=x}each 4 xrank'sm (`FIT`cntbi)]};  //top quartile for each component of fitness
gettop50:{keepNdis[2;sm@distinct raze{where 1<x}each 4 xrank'sm (`FIT`cntbi)]};  //top 2 quartiles for each component of fitness
getbot50:{keepNdis[2;sm@distinct raze{where 1>x}each 4 xrank'sm (`FIT`cntbi)]};  //bottom 2 quartiles for each component of fitness
keepNdis:{[n;t](+)(cols t)!flip where n&count each group flip value flip t};  //keep n distinct records from t
memclr:{![`.;();0b;enlist x]};  //clear memory used
indx2eng:{{pairs[x@0]x@1}@/:x};  //look up the english version of a solution using its index
```

We now create a list of lists, where each individual list component is a collection of operator;attribute;value. We call this mnA as it is the **m**i**n**imum component for any **a**ttribute in our il.

```q
q)mnA:distinct each((>=),\:/:il),/:'{asc value min each(x@(=)distinct xrank[bckts;x])}peach db[il]
q)show mnA
((~<;`time;00:30:26);(~<;`time;00:32:16);(~<;`time;06:41:35);(~<;`time;08:29:..
((~<;`var1;0);(~<;`var1;3);(~<;`var1;6);(~<;`var1;9))
((~<;`var2;1);(~<;`var2;2);(~<;`var2;3);(~<;`var2;5);(~<;`var2;6))
((~<;`var3;4);(~<;`var3;5);(~<;`var3;6);(~<;`var3;8))
((~<;`var4;0.7500965);(~<;`var4;1.205486);(~<;`var4;1.415407);(~<;`var4;3.695..
```

We perform the same calculation for the maximum, then collect these two solution sets into **pairs**. Pairs represents the components of the search space; the search space itself is all combinations.

```q
q)count pairs
5
q)show pairs
(((~<;`time;00:30:26);(~>;`time;00:30:26));((~<;`time;00:30:26);(~>;`time;00:..
(((~<;`var1;0);(~>;`var1;0));((~<;`var1;0);(~>;`var1;3));((~<;`var1;0);(~>;`v..
(((~<;`var2;1);(~>;`var2;1));((~<;`var2;1);(~>;`var2;2));((~<;`var2;1);(~>;`v..
(((~<;`var3;4);(~>;`var3;4));((~<;`var3;4);(~>;`var3;5));((~<;`var3;4);(~>;`v..
(((~<;`var4;0.7500965);(~>;`var4;0.7500965));((~<;`var4;0.7500965);(~>;`var4;..
q)show pairs@0
~< time 00:30:26 ~> time 00:30:26
~< time 00:30:26 ~> time 00:32:16
~< time 00:30:26 ~> time 06:41:35
~< time 00:30:26 ~> time 08:29:10
~< time 00:30:26 ~> time 08:37:17
~< time 00:32:16 ~> time 00:30:26
..
q)
```

We now calculate the index for each value in **pairs**, remove those intervals which have no index, then calculate the fitness for each individual value in pairs. We perform this last calculation using **peach**, **p**arallel **each**. Peach is simply the application of each across the N worker threads started with -s. Each index set is passed to getfit on the -s threads; we are assured the data returns in order.

```q
idx:{{?[db;x;();`i]}each x}each pairs;  //calculate the index for each pair value
c:({where not 0=x}each({count each x}each idx));idx:idx@'c;pairs:pairs@'c;  //only keep good indices
c:where not 0=(count each idx);idx:idx@c;pairs:pairs@c;badil:il[(til count il)except c];il:il@c;
//![`db;();0b;badil]   //if it didn't have an idx- get rid of it. This removes bad variables
fit:{{getfit x}peach x}@/:idx;
```
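Peach's ordering guarantee can be seen in a toy example (start q with e.g. `-s 4` to give peach worker threads):

```q
q){x*x} peach til 8   /same result and order as each, but spread over -s threads
0 1 4 9 16 25 36 49
```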

Finally we clean up and tabulate our starting information.

```q
a:raze {x#y}'[count each pairs;til count pairs];v:raze iasc each pairs;
sm:`FIT xdesc (+) `av`FIT`cntbi`src!(enlist each av:{{x,y}'[x;y]}'[a;raze v];raze fit;count each raze idx;(count (raze fit))#`init);
memclr@/:(`a`v`av);
mxcnt:count each pairs
```

At this point we define a general function for testing fitness. For a given set of attributes and values, **dofit** will return the fitness. Dofit performs the intersection in parallel.

```q
dofit:{[av;src]
  chk:distinct asc sm[`av];av:av[where not (asc av) in chk];  //don't run it again
  bi:(inter/)peach({{idx[x[0]]@x[1]}each x} each av);         //find intersections
  gfit:{getfit x}peach bi;
  `FIT xdesc (+) `av`FIT`cntbi`src!(av;gfit;count each bi;(count gfit)#src)};
```
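The `(inter/)` step folds `inter` over each solution's per-attribute index lists, leaving only the rows that satisfy every condition; for example:

```q
q)(inter/)(0 1 2 3;1 2 3 4;2 3 9)   /rows present in all three index lists
2 3
```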

We now have a general framework for running search algorithms, in 45 lines. The simplest algorithm is a random search: it randomly selects attribute-value pairs from the list, runs dofit, and keeps the result.

```q
//build a random generation
randgen:{[rndsize]
  a:{asc (neg x)?count pairs}each (1+rndsize?complx);  //initial attribute population: pure random
  v:raze each{{1?x}each count each pairs[x]}each a;    //initial intervals for the chosen attributes (family): pure random
  av:{{x,y}'[x;y]}'[a;v];dofit[av;`rand]};             //join the attributes and values together, update sm
```

We then define a number of subsequent search methods (crossover, mutation, elite selection, and shifting). Each algorithm (function) is run from our main execution loop, writing its results back to a central location. This allows each search to benefit from the results of the other searches' previous executions. Which order is most effective on line 97 is left as an exercise for the reader.

```q
94 while[0<gen;
96   status,::0!select gen:gen,cnt:count i,maxFIT:max FIT,avgFIT:avg FIT by src from sm;
97   {sm::keepNdis[2;(sm,value x)]}each("randgen[rndsize]";"shftgen[shftsize]";"joingen[joinsize]";"crssgen[crossover]";"elitgen[elitesize]";"recsgen[recssize]");
98   show select distinct maxFIT by src from status;gen-::1;show .z.Z;];
```

And finally the system runs, showing our progress as generations execute. This setup included 1 million records of data and the 5 variables from the example. Since this was executing on x64, the algorithm takes big chunks each generation, e.g.

```q
//pretty big chunks, x64 only
recssize:rndsize:10*mutesize:shftsize:elitesize:5*crossover:joinsize:5000

//32 bit ok- why are you running on 32 bit?
//Increase the number of generations to make up for the decreased chunk.
recssize:rndsize:10*mutesize:shftsize:elitesize:5*crossover:joinsize:50
```

The printout of solutions corresponds to the underlying algorithms that generated them. For 32-bit, start with 100,000 records (maximum memory with this setup should be about 900 MB). You will notice the code utilizes the -s threads, e.g. if you start with -s 4, make sure you have 4 CPUs. The heavy peach is within the intersection in dofit[av;src] - this is possible because peach returns data in order.

```q
begin fit calc
2006.09.17T13:21:04.266
2006.09.17T13:21:04.391
end fit calc             //calculate each individual fitness- 130 ms.
starting loop
5
2006.09.17T13:21:04.406
src | maxFIT
----| ------
init| 5                  //this is the best singleton solution- individual array

//each generation- we get a printout
2006.09.17T13:22:33.672
src  | maxFIT
-----| ------
crss | 5
elite| 12
init | 5
join | 9
rand | 9
recs | 0
shift| 9
```

Look at the table status: what queries can you perform that yield insights? Could you maximize this table?

```q
q)show select deltas maxFIT by src from status
src  | maxFIT
-----| ---------
crss | 5 4 0 0
elite| 12 0 0 0
init | 5 0 0 0 0
join | 9 0 0 0
rand | 9 0 0 0
recs | 0 5 0 3
shift| 9 0 0 3
q)show select deltas avgFIT by src from status
src  | avgFIT
-----| ---------------------------------------------
crss | -36954.38 23140.04 10356.84 3372.479
elite| -30832.09 17745.94 10026.67 2984.438
init | -510952.4 446901.8 55039.71 8545.832 430.8909
join | -8305.019 2350.709 4566.641 1296.76
rand | -41040.33 16490.83 18310.04 6100.395
recs | 0 -17837.18 14283.83 2650.61
shift| -40908.76 25909.88 12929.72 2018.11
q)
```
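One suggestion along these lines (assuming the `status` table built in the loop above): aggregate the per-source deltas to see which algorithm contributed the largest single-generation improvement.

```q
q)select biggestGain:max deltas maxFIT by src from status
```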

## Practical Use

- Flag good trades as true (FIT).
- Add a ton of variables.
- Find combinations of variables that define good trades.
- Better would be to have FIT as the profit if the position was taken.

OR

- NBBO
- Flag FIT as 1b where not inside
- Add a ton of variables describing that moment.
- Find combinations that should yield outside trades.

## Limits

It will use some RAM; db size is not an issue. Population sizes and bucket counts determine run time. 300 variables is no problem.

## Problems

This is an example application; as such, you can see it converges quickly.