Partitioning

This page explains data partitioning and parallel ingestion in KDB.AI, including the data types and index types that support partitioning.

In KDB.AI, data partitioning involves breaking down a large dataset into smaller, more manageable parts. This is essential for maintaining performance and scalability.

Key advantages of partitioning:

  • Improved Efficiency: Partitioning distributes the workload across multiple shards on disk, enhancing query performance and resource utilization.
  • Enhanced Scalability: Partitioned data lets vector databases handle larger datasets effectively and scale horizontally by adding shards as data volumes grow.

Common partitioning strategies:

  • Date-Based Partitioning: Data is partitioned by time interval, such as day, month, or year. This improves query performance for time-series data and simplifies data management, allowing old data to be archived or removed efficiently.
  • Category-Based Partitioning: Data is partitioned by geographic location, category, or another relevant criterion, optimizing data retrieval and processing.
  • Similarity-Based Partitioning: Data is partitioned by vector similarity, grouping similar vectors together. This reduces the time spent searching across shards on disk and improves query efficiency.
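The date-based and category-based strategies above both reduce to computing a partition key per record. The following is an illustrative Python sketch of that idea, not the KDB.AI API; the record fields and helper names are assumptions for the example.

```python
from datetime import date

# Hypothetical quote records; a real database handles routing internally.
records = [
    {"sym": "AAPL", "time": date(2020, 10, 4), "price": 116.5},
    {"sym": "MSFT", "time": date(2020, 10, 4), "price": 206.2},
    {"sym": "AAPL", "time": date(2020, 10, 6), "price": 113.2},
]

def date_partition_key(record):
    # Date-based: one partition per day, named like a kdb+ date directory.
    return record["time"].strftime("%Y.%m.%d")

def category_partition_key(record):
    # Category-based: one partition per symbol.
    return record["sym"]

def partition(records, key_fn):
    parts = {}
    for r in records:
        parts.setdefault(key_fn(r), []).append(r)
    return parts

by_date = partition(records, date_partition_key)
by_sym = partition(records, category_partition_key)
print(sorted(by_date))  # ['2020.10.04', '2020.10.06']
print(sorted(by_sym))   # ['AAPL', 'MSFT']
```

A query filtered on the partition key then only needs to read the matching group, which is the efficiency gain described above.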

Partitioning in KDB.AI includes the following components:

  • Partitioned Table: A table whose rows are divided across multiple partitions, each stored as a splayed table in its own directory.
  • Partitioned Index: The index associated with each partition in a kdb+ database.
  • Partition Column: The table column on which the table is partitioned.

In a partitioned table, the data is divided into multiple smaller splayed tables, each stored in its own directory. This structure allows for more efficient data management and querying, especially for large datasets.

For example, consider a partitioned quotes table. Instead of storing all the data in a single table, the data is split into separate splayed tables based on a specific column, such as the date. Each splayed table is then stored in its own directory. In this case, you might have one splayed table for the quotes on October 4th, 2020, stored in a directory named 2020.10.04, and another splayed table for the quotes on October 6th, 2020, stored in a directory named 2020.10.06.

db
├── 2020.10.04
│   └── quotes
│       ├── .d
│       ├── price
│       ├── sym
│       └── time
├── 2020.10.06
│   └── quotes
│       └── ..
└── sym

This way, each day’s data is isolated, making it easier and faster to query specific dates without scanning the entire dataset. This partitioning method enhances performance by reducing the amount of data that needs to be read during queries, especially when dealing with time-series data or other large datasets.

Note: A partitioned database has a single sym file at its root, which stores the enumeration domain for symbol columns; symbol values in each partition are saved as integers that index into this file.
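As a minimal sketch of this layout, the snippet below builds date-named partition directories, each containing a `quotes` subdirectory with one file per column and a `.d` file listing the column order. It writes plain text purely for illustration; kdb+ stores binary column files and an enumerated sym file, and the paths here are assumptions for the example.

```python
import os
import tempfile

# Two days of quotes, keyed by the partition (date) value.
quotes = {
    "2020.10.04": {"sym": ["AAPL", "MSFT"], "price": ["116.5", "206.2"], "time": ["09:30", "09:31"]},
    "2020.10.06": {"sym": ["AAPL"], "price": ["113.2"], "time": ["09:30"]},
}

db = tempfile.mkdtemp(prefix="db-")
for day, cols in quotes.items():
    tdir = os.path.join(db, day, "quotes")
    os.makedirs(tdir)
    # .d records the column order, as in a splayed table.
    with open(os.path.join(tdir, ".d"), "w") as f:
        f.write("\n".join(cols))
    # One file per column.
    for name, values in cols.items():
        with open(os.path.join(tdir, name), "w") as f:
            f.write("\n".join(values))

# A query for one date only touches that partition's directory.
print(sorted(os.listdir(db)))  # ['2020.10.04', '2020.10.06']
print(sorted(os.listdir(os.path.join(db, "2020.10.04", "quotes"))))
```

Reading a single day now means opening one directory, never scanning the other partitions.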

Supported data types for partitioning

KDB.AI users can partition data on any metadata column, provided the column matches one of the following data types:

  • Date: Useful for time-series data, where data is divided into partitions based on time intervals (e.g., daily, monthly).
  • Integer (int): Suitable for numerical data that can be logically segmented.
  • Symbol: Ideal for categorical data, such as customer IDs or product types.
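The rule above can be expressed as a simple type check. This is a hypothetical helper, not part of the KDB.AI client; it models symbols as Python strings and is only meant to mirror the supported-type list.

```python
from datetime import date

# Assumed mapping of the supported partition-column types onto Python types:
# date -> datetime.date, int -> int, symbol -> str.
SUPPORTED_TYPES = (date, int, str)

def can_partition_on(values):
    """Return True if every value has a type valid for a partition column."""
    # bool is excluded explicitly because it is a subclass of int in Python.
    return all(isinstance(v, SUPPORTED_TYPES) and not isinstance(v, bool)
               for v in values)

print(can_partition_on([date(2020, 10, 4), date(2020, 10, 6)]))  # True
print(can_partition_on(["AAPL", "MSFT"]))                        # True
print(can_partition_on([1.5, 2.5]))                              # False: floats unsupported
```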

Supported index types

KDB.AI supports partitioning on almost all index types for both local KDB.AI tables and external kdb+ tables. This flexibility allows you to optimize your data storage and retrieval strategies based on your specific use cases.

Parallel ingestion

To speed up and enhance the efficiency of data ingestion, especially when dealing with large volumes of data, KDB.AI supports parallel ingestion across multiple partitions.
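Because partitions are independent on disk, batches destined for different partitions can be ingested concurrently. The sketch below illustrates that pattern with a thread pool; `ingest` is a stand-in for a real per-partition insert call, and the batch layout is an assumption for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# One batch of rows per partition (keyed by the partition value).
batches = {
    "2020.10.04": [{"sym": "AAPL", "price": 116.5}, {"sym": "MSFT", "price": 206.2}],
    "2020.10.06": [{"sym": "AAPL", "price": 113.2}],
}

def ingest(partition, rows):
    # Stand-in for inserting `rows` into one partition; returns a row count.
    return partition, len(rows)

# Each partition's batch is handled by its own worker, so writes to
# different partitions proceed in parallel.
with ThreadPoolExecutor(max_workers=len(batches)) as pool:
    results = dict(pool.map(lambda item: ingest(*item), batches.items()))

print(results)  # {'2020.10.04': 2, '2020.10.06': 1}
```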

Scalability metrics

KDB.AI provides metrics that demonstrate the scalability of each index type for local KDB.AI tables. These metrics help you understand the performance characteristics and scalability limits of your partitioned data structures.

Next steps