Contrib/DataMiner

Data Miner

Sample Data Miner in kdb+. Available from https://github.com/kxcontrib/jloveless/tree/master/data_miner.

Overview

Kdb+ is ideal for data mining applications. Application logic is written in q, and data queries run directly in kdb+, with direct access to and manipulation of the data. The components of queries are created in matrix form for easy manipulation, then passed directly to the data via functional select. The following lab walks through this code and the major features of the application. The application uses a genetic algorithm framework to maximize the sum of the column FIT. This is a version of the maximum subarray sum problem: maximize the sum of the column FIT using any combination of intervals over the remaining arrays.

Lower dimension solutions: http://www-static.cc.gatech.edu/people/home/kalyan/papers/mss-ppl95.pdf#search=%22k%20maximum-sum%20subarrays%20problem%22

We will examine the problem in N dimensions.
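
For intuition, the one-dimensional case can be solved with a single scan in q. This snippet is purely illustrative (Kadane's algorithm over a made-up vector xs) and is not part of the data miner:

q)xs:2 -3 4 -1 5 -8 3
q)max 0 {0|x+y}\xs  / best sum ending at each index, floored at 0; the interval 4 -1 5 sums to 8
8

The data miner generalizes this idea: instead of one contiguous interval over one array, it searches over combinations of intervals across many attributes.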

Setup and Example Table

The table under study should be unkeyed, must be named db, and must contain a column FIT (as in fitness). The system will optimize combinations towards the column FIT using the attributes allowed in the list il (the independent list).

//create a sample database: notice that time can be used, as time is simply a float
q)db:flip `time`var1`var2`var3`var4`FIT!({1000?x}each(12:00:00;10;10;10;5.0;100?(-1*(100?10)),(50?10)))
//define our available attributes to use- these are the columns we will allow solutions
//to be created from. We will refer to these as attributes.
q)il:`time`var1`var2`var3`var4

To begin, we will define some initial parameters. The most important here is the bckts parameter, which specifies how many buckets to place the values of each attribute in. We do this to reduce the dimensionality, at the cost of specificity.

//system inputs
gen:5; //number of generations
complx:floor .5*count il; //maximum complexity allowed for a random choice, e.g. half the number of variables
recssize:rndsize:10*mutesize:shftsize:elitesize:5*crossover:joinsize:5000; //sets sizes for the components
sft:{1}; //how far a point moves on shift; e.g. if 2, the shift can be 0, 1 or 2
bckts:5; //number of buckets to place variable values in
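
As an aside, this is what the bucketing looks like in isolation (illustrative only; x here is a throwaway vector): xrank assigns each value to one of bckts roughly equal-sized buckets, and per-bucket minima and maxima become the interval boundaries used below.

q)x:10?100.0
q)xrank[bckts;x]                   / bucket label 0..bckts-1 for each value
q)min each x@group xrank[bckts;x]  / one lower boundary candidate per bucket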

We must now define some utility functions, including a fitness function: the function applied to solutions to determine their quality. In this case we simply sum FIT at the solution indices.

getfit:{(+/)db[`FIT]@x}; //get fitness for a given set
comb:{m:x:key x;do[y-1;x:{raze{y,/:(1+max y)_x}[y]each x}[x;m]];x}; //y-element combinations of til x
gettop25:{keepNdis[2;sm@distinct raze{where 3=x}each 4 xrank'sm (`FIT`cntbi)]}; //top quartile for each component of fitness
gettop50:{keepNdis[2;sm@distinct raze{where 1<x}each 4 xrank'sm (`FIT`cntbi)]}; //top 2 quartiles for each component of fitness
getbot50:{keepNdis[2;sm@distinct raze{where 1>x}each 4 xrank'sm (`FIT`cntbi)]}; //bottom quartile for each component of fitness
keepNdis:{[n;t](+)(cols t)!flip where n&count each group flip value flip t}; //keep n distinct records from t
memclr:{![`.;();0b;enlist x]}; //clear memory used
indx2eng:{{pairs[x@0]x@1}@/:x}; //look up the English form of a solution from its (attribute;interval) indices
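
A quick usage example, assuming the definitions above (the output of getfit depends on the random db, so it is not shown):

q)getfit 0 1 2  / fitness of the candidate row set 0 1 2, i.e. sum db[`FIT]@0 1 2
q)comb[4;2]     / all 2-element combinations drawn from til 4
0 1
0 2
0 3
1 2
1 3
2 3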

We now create a list of lists, where each individual component is a triple of operator;attribute;value. We call this mnA, as it holds the minimum component for each attribute in our il.

q)mnA:distinct each((>=),\:/:il),/:'{asc value min each(x@(=)distinct xrank[bckts;x])}peach db[il]
q)show mnA
((~<;`time;00:30:26);(~<;`time;00:32:16);(~<;`time;06:41:35);(~<;`time;08:29:..
((~<;`var1;0);(~<;`var1;3);(~<;`var1;6);(~<;`var1;9))
((~<;`var2;1);(~<;`var2;2);(~<;`var2;3);(~<;`var2;5);(~<;`var2;6))
((~<;`var3;4);(~<;`var3;5);(~<;`var3;6);(~<;`var3;8))
((~<;`var4;0.7500965);(~<;`var4;1.205486);(~<;`var4;1.415407);(~<;`var4;3.695..

We perform the same calculation for the maximum, then collect the two solution sets into pairs. pairs represents the components of the search space; the search space itself is all combinations of these components.
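
The maximum calculation and the pairing step are not shown in the original text. A plausible sketch, mirroring the mnA definition above (the name mxA and the cross-join form are assumptions inferred from the output that follows):

//assumed sketch: per-bucket maxima paired with the not-greater operator (<=)
mxA:distinct each((<=),\:/:il),/:'{asc value max each(x@(=)distinct xrank[bckts;x])}peach db[il];
pairs:{raze x{(x;y)}/:\:y}'[mnA;mxA]; //every (lower bound;upper bound) combination per attribute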

q)count pairs
5
q)show pairs
(((~<;`time;00:30:26);(~>;`time;00:30:26));((~<;`time;00:30:26);(~>;`time;00:..
(((~<;`var1;0);(~>;`var1;0));((~<;`var1;0);(~>;`var1;3));((~<;`var1;0);(~>;`v..
(((~<;`var2;1);(~>;`var2;1));((~<;`var2;1);(~>;`var2;2));((~<;`var2;1);(~>;`v..
(((~<;`var3;4);(~>;`var3;4));((~<;`var3;4);(~>;`var3;5));((~<;`var3;4);(~>;`v..
(((~<;`var4;0.7500965);(~>;`var4;0.7500965));((~<;`var4;0.7500965);(~>;`var4;..
q)show pairs@0
~< time 00:30:26 ~> time 00:30:26
~< time 00:30:26 ~> time 00:32:16
~< time 00:30:26 ~> time 06:41:35
~< time 00:30:26 ~> time 08:29:10
~< time 00:30:26 ~> time 08:37:17
~< time 00:32:16 ~> time 00:30:26
..
q)

We now calculate the row index set for each value in pairs, remove those intervals which match no rows, then calculate the fitness of each individual value in pairs. We perform this last calculation using peach, parallel each: the application of each across the N threads set by the -s command-line option. Each index set is passed to getfit on the -s threads, and we are assured the data returns ordered.

idx:{{?[db;x;();`i]}each x}each pairs; //calculate index for each pair value
c:({where not 0=x}each({count each x}each idx));idx:idx@'c;pairs:pairs@'c; //only keep good indices
c:where not 0=(count each idx);idx:idx@c;pairs:pairs@c;badil:il[(til count il)except c];il:il@c; //drop attributes with no surviving index (optionally ![`db;();0b;badil] to delete the bad columns)
fit:{{getfit x}peach x}@/:idx;
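
A quick sanity check that the filtering kept everything aligned (these identities follow from the code above):

q)count[pairs]~count idx
1b
q)(count each idx)~count each fit
1b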

Finally we clean up and tabulate our starting information.

a:raze {x#y}'[count each pairs;til count pairs];v:raze iasc each pairs; //an attribute index and an interval index for every pair value
sm:`FIT xdesc (+) `av`FIT`cntbi`src!(enlist each av:{{x,y}'[x;y]}'[a;raze v];raze fit;count each raze idx;(count (raze fit))#`init);memclr@/:(`a`v`av); //initial solution table, tagged `init
mxcnt:count each pairs //number of intervals available for each attribute
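
sm holds one row per scored candidate: its attribute/interval pairs (av), its fitness (FIT), how many rows it matched (cntbi), and which algorithm produced it (src):

q)cols sm
`av`FIT`cntbi`src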

At this point we define a general function for testing fitness. For a given set of attribute/value candidates, dofit returns the fitness of each. dofit performs the intersections in parallel.

dofit:{[av;src]
  chk:distinct asc sm[`av];av:av[where not (asc av) in chk]; //don't run it again
  bi:(inter/)peach({{idx[x[0]]@x[1]}each x} each av); //find intersections
  gfit:{getfit x}peach bi;`FIT xdesc (+) `av`FIT`cntbi`src!(av;gfit;count each bi;(count gfit)#src)};
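
A usage sketch; the candidates below are made up for illustration. Each candidate is a list of (attribute index;interval index) pairs, and the result can be merged into sm just as the main loop does:

q)cand:((0 1;1 0);(0 2;1 0;2 3))  / e.g. candidate 1 spans attributes 0 and 1
q)sm:keepNdis[2;sm,dofit[cand;`demo]]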

We now have a general framework for running search algorithms- in 45 lines. The simplest algorithm is a random search: randomly select attribute/value pairs from the list, run dofit, and keep the result.

//build a random generation
randgen:{[rndsize]
  a:{asc (neg x)?count pairs}each (1+rndsize?complx); //initial attribute population: pure random
  v:raze each{{1?x}each count each pairs[x]}each a; //initial intervals for the chosen attributes (family): pure random
  av:{{x,y}'[x;y]}'[a;v];dofit[av;`rand]}; //join the attributes and values together and score them via dofit
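
The remaining generators are not listed in this text. To give a flavour of their shape, here is a plausible sketch of the shift generator; only the name shftgen and the use of sft come from the source, the body is an assumption:

//assumed sketch: nudge the interval index of already-scored candidates by
//sft[] in a random direction, clamped to the valid range for that attribute
shftgen:{[shftsize]
  cand:shftsize?sm[`av]; //resample candidates already in sm
  mv:{s:(-1 1)@first 1?2;(x[0];0|(mxcnt[x[0]]-1)&x[1]+s*sft[])};
  dofit[distinct{mv each x}each cand;`shift]};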

We then define a number of subsequent search methods (crossover, mutation, elite selection, and shifting), in the spirit of the shftgen sketch above. Each algorithm (function) is run from our main execution loop, writing its results back to a central location. This allows each search to benefit from the results of other searches' earlier executions. Which ordering of the calls on line 97 is optimal is left as an exercise for the reader.

94 while[0<gen;
96 status,::0!select gen:gen,cnt:count i,maxFIT:max FIT,avgFIT:avg FIT by src from sm;
97 {sm::keepNdis[2;(sm,value x)]}each("randgen[rndsize]";"shftgen[shftsize]";"joingen[joinsize]";"crssgen[crossover]";"elitgen[elitesize]";"recsgen[recssize]");
98 show select distinct maxFIT by src from status;gen-::1;show .z.Z;];

And finally the system runs, showing our progress as generations execute. This setup used 1 million records of data and the 5 variables from the example. Since this was executing on x64, the algorithm can take big chunks each generation, e.g.

//pretty big chunks, x64 only
recssize:rndsize:10*mutesize:shftsize:elitesize:5*crossover:joinsize:5000

// 32 bit ok- but why are you running on 32 bit? Increase the number of generations to make up for the decreased chunk size.
recssize:rndsize:10*mutesize:shftsize:elitesize:5*crossover:joinsize:50

The printout of solutions corresponds to the underlying algorithms that generated them. For 32-bit, start with 100,000 records (maximum memory with this setup should be about 900 MB). You will notice the code utilizes the -s threads; e.g. if you start with -s 4, make sure you have 4 CPUs. The heavy peach is within the intersection step of dofit; this is possible because peach returns data in order.
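
A quick check of that ordering guarantee (illustrative):

q)({x*x}peach til 8)~{x*x}each til 8
1b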

begin fit calc
2006.09.17T13:21:04.266
2006.09.17T13:21:04.391
end fit calc

//calculate each individual fitness- 130 ms.

starting loop
5
2006.09.17T13:21:04.406
src | maxFIT
----| ------
init| 5

//this is the best singleton solution- individual array

//each generation- we get a printout
2006.09.17T13:22:33.672
src  | maxFIT
-----| ------
crss | 5
elite| 12
init | 5
join | 9
rand | 9
recs | 0
shift| 9

Look at the table status: what queries can you perform that yield insights? Could you maximize this table?

q)show select deltas maxFIT by src from status
src  | maxFIT
-----| ---------
crss | 5 4 0 0
elite| 12 0 0 0
init | 5 0 0 0 0
join | 9 0 0 0
rand | 9 0 0 0
recs | 0 5 0 3
shift| 9 0 0 3
q)show select deltas avgFIT by src from status
src  | avgFIT
-----| ---------------------------------------------
crss | -36954.38 23140.04 10356.84 3372.479
elite| -30832.09 17745.94 10026.67 2984.438
init | -510952.4 446901.8 55039.71 8545.832 430.8909
join | -8305.019 2350.709 4566.641 1296.76
rand | -41040.33 16490.83 18310.04 6100.395
recs | 0 -17837.18 14283.83 2650.61
shift| -40908.76 25909.88 12929.72 2018.11
q)

Practical Use

Operations research (OR).

Limits

The system will use some RAM, but db size is not an issue. Population sizes and bucket counts determine the run time. 300 variables is no problem.

Problems

This is an example application; as such, you can see that it converges quickly.
