kdb Insights Enterprise Product Tour
This product tour lets you try out selected features of kdb Insights Enterprise such as querying sample data, configuring code in an easy-to-use interface and visualizing your data.
Free Trial
The product tour is only available with our free trial. If you don't have the free trial, it's easy to sign up.
Please note that resources in free trial are only provisioned for the activities described in these instructions. Be aware that if you perform additional analysis you may encounter issues and unexpected behaviour due to these constraints.
Aim
The aim of this tour is to show you the power and speed of the data analytic capabilities provided by kdb Insights Enterprise. It guides you on how to gain insights from your data by running queries on the NYC taxi dataset showcasing the use of SQL, q, and Python.
This tour is divided into the following sections:
- The dataset - which introduces the data being analysed.
- The Query tab - which introduces you to the Query tab and how it is used to explore the data.
- Analysing the data - which guides you through querying the data to answer questions relating to daily trips, average fares, passenger counts, tipping behaviour and more.
- Next steps - which provides links to other useful resources.
The dataset
This product tour uses a taxi dataset which contains historical data from New York's yellow taxi network, provided by the NYC Taxi & Limousine Commission. Using this data you will gain insights into the journeys being taken.
The data contains approximately 3 million records from trips taken during the month of December 2021. This data records attributes such as pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
The Query tab
In kdb Insights Enterprise data analysis is done in the Query tab. This is accessible by clicking [+] on the ribbon bar and selecting Query, or by clicking Query on the Overview tab, as shown below.
The Query tab explained
- Get Data: The top part of the screen is used for extracting the data for analysis. You can extract the data using SQL, q, or a set of controls in the UI.
- Scratchpad: The middle section of the screen is used for analyzing the extracted data. You can use python or q, whichever you prefer, choosing from the language option at the bottom right of this section.
- View Data: The lower part of the screen is used for visualizing the analyzed data. The data can be visualized in three different ways using the Console, Table, or Visual tabs.
The following screenshot displays the Query tab.
Explore the data's shape
In this section we query how many records there are in the dataset.
-
Open the Query tab and select either SQL or Q, in the top part of the screen. Add the appropriate code depending on your query method.
-- how many records are there? SELECT COUNT(*) FROM trips
When using the
Q
option, first selectdatabase=taxi
andinstance=hdb
before running Get Data.// how many records are there? count trips
There around 3 million records in the dataset.
Session initialization
Please note the first query you run with Get Data may take a little longer while the session is initialized.
-
Enter a text name for the Output variable.
- Click Get Data.
Next, we query how many columns are in the dataset and what are the datatypes.
-
In the Get Data section of the screen, choose the
Q
tab; selectdatabase=taxi
andinstance=hdb
, and add the following code.// what are the columns and datatypes? meta trips
Syntax explanation
The
meta
operator allows you to inspect the schema of the table. This returns a table of the available columns, their types, and other information. The following columns are returned:- c: column name
- t: column type
- f: foreign keys
- a: attributes: modifiers applied for performance characteristics
-
Enter a text name for the Output variable.
- Click Get Data.
The following screenshot displays the output of meta trips run using q
.
You can see the columns that are part of the dataset and their types. At this stage you have a better understanding of the size and shape of the data, which will assist with the later analysis.
Analysing the data
You are now ready to start running some queries on this data. You will find out:
- What is the daily trip count?
- What is the daily average fare?
- What is the hourly passenger count?
- What is the passengers tipping behaviour?
- What is the relationship between tips and distance?
- What is the breakdown of tips by rate type?
- How many trips where there to and from the airport each day?
Daily trip count
In the Get Data part of the Query tab choose either SQL or Q
and enter the relevant query to find out the daily trip count.
-- how many trips per day?
SELECT date, COUNT(*) AS num_trips
FROM trips
WHERE date BETWEEN '2021-12-01' AND '2021-12-31'
GROUP BY date
Select database=taxi
and instance=hdb
.
// how many trips per day?
select num_trips:count i by date from trips where date within (2021.12.01;2021.12.31)
Syntax explanation
The above virtual column i
maps to a record index in the table.
A simple aggregation can be obtained by taking the count of this virtual column.
To assign the resulting aggregation to a new column name in the resulting table, we simply use a colon :
.
The following screenshot shows the daily trip count query run using q
.
How many trips occur each day?
Switching to the visual tab, in the lower part of the screen, you see that around 100,000 trips were taken per day during the month and that trips taken reduced towards the end of the month, with the lowest day being Christmas day.
This analysis could help taxi companies decide how many drivers are required on different days of the year.
Daily average fare
In the Get Data part of the Query tab choose either SQL or Q and enter the relevant query to find out the daily average fare.
-- what is the average fare each day?
SELECT date, avg(fare)
FROM trips
WHERE date BETWEEN '2021-12-01' AND '2021-12-31'
GROUP BY date
SQL
.
Select database=taxi
and instance=hdb
.
// what is the average fare each day?
select avg fare by date from trips where date within (2021.12.01;2021.12.31)
Q
.
What is the average fare each day?
Again using the visual tab, it is clear the number of trips per day was lower around Christmas, however, you can also see that the average fare amount was higher for these days.
This analysis could help taxi companies to determine which days are most lucrative possibly indicating higher demand.
Hourly passenger count
In the Get Data part of the Query tab choose Q and enter the following query to find out the hourly passenger count.
Select database=taxi
and instance=hdb
.
// 1. group the pickup time into 60 minute buckets using xbar
// 2. get the average number of passengers for the trips in each bucket
select avg passengers by buckets:60 xbar pickup_datetime.minute from trips where date within (2021.12.01;2021.12.31)
Syntax explanation
The xbar
function is used to group the data on the pickup datetime column into 60 minute buckets.
To assign the resulting buckets to a new column name in the resulting table, we simply use a colon :
.
The following screenshot shows the hourly passenger count query using q
.
What is the average number of passengers per trip each hour?
In the later hours of the day, trips have a higher average number of passengers than in the early morning hours.
This analysis could help taxi companies optimise the deployment of their drivers based on the seat capacity of the cars in their fleet.
Tipping behavior
In the Get Data part of the Query tab choose Q and enter the following query to find out the tipping behaviour.
Select database=taxi
and instance=hdb
.
// what is the average tip for each 15-minute timespan?
select avg tip by buckets:15 xbar pickup_datetime.minute from trips where date within (2021.12.01;2021.12.31)
The following screenshot displays the tipping behaviour query.
When are people giving the most tips?
Throughout the day, there is usually a consistent amount of tips, with a slight decline in the middle of the night. This decline is followed by a massive increase during the hours of 4am to around 7am.
This analysis could help taxi companies encourage their drivers to work shifts that include those early morning hours.
Compare tips and distance
For this example you can switch to the scratchpad for analyzing the extracted data in further detail.
-
In the top part of the Query tab choose Q and enter the following query and save it as a variable called
tip_buckets
.Select
database=taxi
andinstance=hdb
.// get the average tips in 15 minute intervals 0!select avg tip by buckets:15 xbar pickup_datetime.minute from trips where date within (2021.12.01;2021.12.31)
-
In the Get Data part of the Query tab choose Q and enter the following query and save it as a variable called
distance_buckets
.Select
database=taxi
andinstance=hdb
.// get the average distances in 60 minute intervals 0!select avg distance by buckets:60 xbar pickup_datetime.minute from trips where date within (2021.12.01;2021.12.31)
-
In the Scratchpad, in the middle part of the Query screen, select Python. Add the following python code to the scratchpad editor:
# join the tables 'tip_buckets' and 'distance_buckets' on the time column 'buckets' # this is joining tables with varying time granularity import pandas as pd pd.merge_asof(tip_buckets.pd(), distance_buckets.pd(), left_on="buckets", right_on="buckets")
Cursor position for Python
You may need to move the cursor to the final line of code and run
Ctrl+Enter
to visualize in Python. -
In the Visual tab, switch the chart type to Line.
-
Click Run Scratchpad. The line chart showing the results of the python query, on tip and distance travelled, is displayed, as shown in the following screenshot.
Syntax explanation
We need to import the
pandas
module to access thepandas.merge_asof
function to merge the tables together. When importing this module, we alias it aspd
.This
pd.merge_asof
function does not work on pykx tables so we need to convert this data structure into a pandas dataframe. This is done by calling the.pd()
function on the pykx table.In this example, we use the
pd.merge_asof
function to join two tables together of varying temporal granularity. -
The same output can be generated using Q in the scratchpad:
// join the tables 'tip_buckets' and 'distance_buckets' on the time column 'buckets' // this is joining tables with varying time granularity aj[`buckets;tip_buckets;distance_buckets]
Syntax explanation
aj
is a powerful timeseries join, also known as an asof join, where a time column in the first argument specifies corresponding intervals in a time column of the second argument.In this example, we use it to join two tables together of varying temporal granularity.
Does the distance of journey have an impact on tips earned?
Trips that are a longer distance tend to earn higher tips for drivers.
This analysis might influence drivers to choose to take longer distance fares as they can earn a higher tip.
Breakdown trips by rate type
In the top part of the Query tab choose either SQL or Q and enter the relevant query to find out the breakdown of trips by rate type.
-- how many trips fell under each rate type?
SELECT rate_type, COUNT(*) AS num_trips
FROM trips
WHERE date BETWEEN '2021-12-01' AND '2021-12-31'
GROUP BY rate_type
ORDER BY num_trips
Select database=taxi
and instance=hdb
.
// how many trips fell under each rate type?
`num_trips xdesc select num_trips:count i by rate_type from trips where date within (2021.12.01;2021.12.31)
How many trips occur for each area?
You can see here that most of the trips fall into Standard rate
, while there are also a good number of trips falling under the airport rates JFK
, Newark
, and Nassau or Westchester
.
This analysis could help taxi companies gauge how many of the trips their drivers did of each rate type so they can manage the fare rates associated with each rate type.
The following screenshot shows the number of trips under each rate type, as returned by the q query.
Daily airport trips
In the Get Data part of the Query tab choose either SQL or Q and enter the relevant query to find the number of daily airport trips.
-- how many trips to and from the airport were there each day?
SELECT date, rate_type, COUNT(*) AS num_trips
FROM trips
WHERE date BETWEEN '2021-12-01' AND '2021-12-31'
AND rate_type in ('JFK', 'Newark', 'Nassau or Westchester')
GROUP BY date, rate_type
SQL
query.
Select database=taxi
and instance=hdb
.
select num_trips:count i by date, rate_type from trips where date within (2021.12.01;2021.12.31), rate_type in `$("JFK"; "Newark"; "Nassau or Westchester")
q
query.
How many trips to and from the airport were there each day?
You can see here that the number of trips to and from the airport was relatively constant and that Christmas had a minimal effect on the daily airport journeys.
This analysis could help taxi companies ensure that they have enough drivers working during this period despite the holidays.
Timestamp data for Trip data
If you are using the Query
get data filters, please be aware that the taxi data covers the period from 2021.12.31D00:00:00.000000000, for duration of the day.
Next steps
- Learn how to build databases and pipelines to ingest, query and visualize data using the guided walkthrough.
- Build a real-time trading application or learn how to apply predictive analytics for manufacturing using the industry tutorials.