# Data analysis walkthrough

A simple workflow suitable for data analysis.

## Pre-flight

The first thing we need to do is create a repository. Repositories allow you to save versions of your work in a shared repository to share with other people and allow them to work on items in the repository as well. People who have access to the repository can push changes to it and pull changes from it. Right-click in the left tree pane of the window. A context menu will appear. Select New > Repository… from the context menu.

The New Repository dialog will appear. If this walkthrough is being done collaboratively, use a name that is unlikely to have been used before and is unique - for example: usernameDataWalkthrough, replacing username with your Analyst username. Going forward, the walkthrough will refer to the repository as exampleRepository.

The repository will now appear in the tree.

## Create a module in the workspace

Now let’s create a module in the repository. Right-click on the repository and select New > Module… from the menu. Modules are packages of artifacts and may contain code functions, code scripts, transformations, visualizations, and more.

Type .myModule in the dialog box and press OK.

The module will appear in the search pane. The small green plus sign indicates that the module has been added to the workspace but not yet committed or pushed to the repository. All entities are managed by Analyst and saved into the Analyst repository.

Right-click on the repository and select Git > Push from the context menu.

The Commit Message dialog will appear. This dialog allows you to enter a commit message, as well as set the Git author and Git email fields for this commit. Since you may be pushing changes to the repository on a regular basis, it makes sense to provide a meaningful commit message. Leaving the author and email fields blank, enter a commit message and click OK.

You will notice that the icon disappears from .myModule. This means that anyone else who has access to the repository is now able to pull the repository and get the new module.

## Cloning a Git repository

Now let’s clone a different Git repository from an external source. Right click in the tree and select Git > Clone. This opens the Credentials Dialog, where you can specify your credentials and the origin of the Git repository to clone.

Enter the following information into the Credentials Dialog, and then click OK.

The Pull Repository dialog appears. It allows you to pull any branch of the repository. Click OK.

This has cloned the remote Analyst Training repository into our workspace. Open the A.Tutorial.Data module in the analyst-training repository. In the next few steps, we will load temp.csv into memory,

## Import a dataset from CSV, JSON, ODBC or kdb+

Select Tools > Table Importer from the main menu.

In the open Table Importer dialog, select CSV for Source Format, then click Browse.

Click the WS button. This ensures you are in your Workspace directory. Double-click the A.Tutorial.Data directory. This will move you to the A.Tutorial.Data directory.

Select temp.csv and press Select.

The dialog will disappear. The file name will appear in the input field and you will see a preview of the contents of the file in the previewer at the bottom of the Table Importer. Press Next to move to the Schema tab.

The Schema tab is where the column names and types can be manipulated before importing. Here, we will want to change the types for the machine and sensor columns to Symbol. The signal, and active columns should be float. All of the remaining columns should be set to Long.

Pressing Next takes you to the Import tab where a final action can be selected. We could directly import the table, but instead we want to fix messy columns first. To do this, select Transform in the Import Type select. Finally, pressing Finish will launch into the Table Transformer.

## Table Transformer (KX Analyst only)

Note if not using KX Analyst, skip to the Data Simulation section to generate the sensors data.

Using the transformer, we can do more complex data transformations before importing the data. The transformer works on a sample of the data to build up a workflow of actions that will transform the data into a usable format. Once the transform has been built, it can be executed on the current data or saved into a compiled kdb+ function and parameterized.

For this transform we will add an Action node to the transform add a moving column.

First, right-click the node and select New > Action. Now right click the signal column and select Add Column....

Add Column
------------------------------------------------
New Column             Value
ma               4 mavg signal


After pressing Ok the add column action will appear in the action list for the given node. Finally, we can add an output node to the workflow to capture the data. Right-click the action node and select New > Output. This will open the Table Exporter to configure the output node. For now, we will import this data into our current process. To do this, select the KDB format with the table type of Memory and press Finish

Table Exporter
-------------------------------------------------------------------------------
Select: KDB - kdb+ Database

Table Format    : Memory
Table Name      : signals
If Table Exists : ☑ Overwrite  ☐ Append


The transformation can now be run using Transform > Run or alternatively it can be saved and executed like a function.

### ImportSensors

To save time, a more complete transformation has already been provided in A.Tutorial.IO/ImportSensors. Before opening this transform, ensure that the process is in the workspace directory.

This transform takes three types of sensor data from csv files, reads the contents, joins them together, summarizes and writes the data to tables in memory. The table below details the different nodes in this transformation.

Node Type Description
temp Source Temperature sensor data in a CSV file
pressure Source Pressure sensor data in a CSV file
weight Source Weight sensor data in a CSV file
cleanup-temp Action Add some features to the temperature table
cleanup-pressure Action Add some features to the pressure table
cleanup-weight Action Add some features to the weight table
pressure_temp Join Performs an insert on the pressure and temperature tables
signals Join Performs an insert on the joined table using weight sensors
aggregate-signals Action Adds some extra aggregations across all sensors
filter-signals Action Filters out signals to a single machine sensor combination
summarize Function Performs a custom summary aggregation
Graphic Graphic Displays a filtered signal reading plot
summary Output An output table of the summarized signals
signals Output An output table of sensor signal data

Select the Graphic node to display the graphs in the transformation.

Next select Transform > Run to perform the importer. The transform will take ~30 seconds to ingest the data on a reasonable laptop. The graphic that has been selected is showing a sensor signal plot of one sensor.

Alternatively, the transformation can be run as a function by invoking the transform.

// In a transformation, the first argument is for the inputs to the transform
// and the last argument is for the outputs. Providing null for both uses the defaults
// that were configured in the UI. A dictionary can be provided for each input or output
// that maps the name of the node in the UI to a new input or output configuration.
ImportSensors[::;::];

// Example with explicit input and output locations
ImportSensors[
temppressure!(:A.Tutorial.Data/temp.csv; :A.Tutorial.Data/Pressure.csv);

// Output to a csv file called signals.csv - we need to change the output format
// from a kdb+ table to a csv file by creating a new .im.io source descriptor
enlist[signals]!enlist .im.io.with.target[:signals_out.csv] .im.io.create csv
];

// Now there is a new file called signals_out.csv


## Data simulation

Important

In the interest of time and space, the data that we have imported is very small so that it could easily be shared. For the remainder of the walkthrough, we will switch to using a simulated version of the data that is much larger than the data we have just imported.

// Running the simulation will take a bit of time but will produce far more data than
// what was imported. Once complete, there will be around ~10M simulated sensor readings
// in the new sensors table.
sensors: sim.system[];

// Check the count of the sensors table
count sensors


## Visual Inspector

Now that the data has been generated, we can start exploring it. We can use the Visual Inspector to visualize and tumble the data. Back on the main IDE page, we can use the Process View to see which tables were imported. Select the Process tab in the search pane view to see functions and data in our process. Under the global tables tab, we should find the tables that were just imported.

└── . (Global)
└── Tables
├── sensors
├── signals
└── summary


Right-click on the sensors table and select Inspect. Alternatively, in a Scratchpad, type sensors and right-click the line and choose Inspect or press CTRL+I (Windows) or ⌘I (macOS).

sensors


To get a sense of the distribution of the sensors across their machines, we can plot a histogram of signal values. To add another dimension to the data, set the Fill Colour to be the average active rate.

Visual Inspector
-------------------------------------------------------------------------------
query: sensors
-------------------------------------------------------------------------------
Chart Type: Histogram                     | X Axis Column    : machine
| ...
| ✓ Enable Fill Colour
| Fill Aggregation : Average
| Fill Column      : active


Now that we understand the distribution of the activity per sensor, we can look at the average sensor readings over time. A visualization has been saved in A.Tutorial.Plots/ReadingsOverTime. Double-click the plot to view the readings over time.

└── analyst-training
└── A.Tutorial.Plots/Plots


Let's turn our focus to the individual sensors readings. Let's look at all the distinct sensor readings over time. Double-click the SignalBySensor plot in A.Tutorial.Plots to view all of the discrete signals. This will be a messy plot but will allow us to visualize the entire dataset.

└── analyst-training
└── A.Tutorial.Plots
└── SignalBySensor


In the visual, one signal seems to go above the others. Click on it brings up a tooltip that indicates the machine and signal as the id, that seem to be affected. Let's take a closer look at this signal. We can use the quick plotting library that is included with Grammar of Graphics to plot the signal for this sensor.

// Select only sensor that is failing. Replace the query keys with the ones from
// the tooltip in the visual inspector.
//
// Replace the mach_g value with the machine name and the pressure_a with the
// sensor name from the tooltip in the Visual Inspector
failing: select from sensors where machine=mach_g, sensor=pressure_a;
.qp.go[800; 500] plotSignal failing


This likely indicates to us that we need to investigate this sensor further. There may be a faulty sensor or an actual issue with the machine itself. We would have expected to see a complete signal produced from this sensor. Ideally, something more like the following.

// Select the first machine, sensor combo
firstId: select from sensors where id = first id;

// Plot the signal
.qp.go[800;500] plotSignal firstId


To clean this plot up and see more of the underlying patterns, we can use a simple smoothing function to reduce the noise in the data. Here we can use the smoothReadings function to help reduce the noise in the plot.

smooth: smoothReadings[10] firstId;
.qp.go[800;500] plotSignal smooth;


Let's take a look at one last facet of the data. We can use the Grammar of Graphics to express much more complex visualizations. For example, let's suppose we want to look at all of the temperature sensors for a given machine in our system. To get a high level comparison of all of these sensors, we can apply standard aggregations such as max, min, mean, etc. We can take all of these aggregations and plot them using a parallel line chart to represent each of the individual sensors. This gives us a high level comparison of the performance of these sensors and illustrates the complex visuals that can be generated from GG.

// Select all temperature sensors
allTemp: select from sensors where machine = first machine, sensor like "temp*";

// Plot all temperature sensor values as an aggregated rollup in parallel coordinates
.qp.go[1000;400] plotParallel allTemp


## Summary

In this quick data analysis project, we took some simulated sensor data and imported it using using the Table Transformer. We then performed some ad-hoc visualizations using the Visual Inspector tool. We then built some custom visualizations using the Grammar of Graphics.

While performing our visualization on the sensors data, we identified a potential issue with one of the sensors in the system. We then created some custom visualization to investigate this further.

And finally, we will take the first 100 individual sensor signals by machine and plot them in a single visual summary.

// Select the first 100 sensors and filter the time dimension 1:100
first100: 100#value id xgroup select from sensors where 0 = i mod 100;

// Theme for sensor grid
theme: legend_useaxis_use_yaxis_use_xtitle_fontsizetitle_padding!(0b; 0b; 0b; 12; 12);

// Plot the first 100 sensors in a grid with 5 across by 20 down
.qp.go[1000;2000]
.qp.title["Sensor Signal by Machine"]
.qp.theme[theme]
.qp.grid[5 0N]
{[sensor]
.qp.title[" " sv string first each (sensormachine;sensorsensor)]
plotSignal flip sensor
} each first100