Data preprocessing utilities for kdb Insights publish operations.
Used internally by DBPublisher and HTTPPublisher to read and chunk datasets from files, URLs, and DataFrames before publishing.
Classes:
- ColumnType – Target column types for pandas type casting during preprocessing.
- DataFormat – Supported file formats for data ingestion.
- DataPreprocessor – Reads and chunks datasets from files, directories, or URLs.
ColumnType
Bases: AutoNameEnum
Target column types for pandas type casting during preprocessing.
Attributes:
- DATETIME – Convert column values to pandas.Timestamp (datetime).
- TIMEDELTA – Convert column values to pandas.Timedelta.
- NUMERIC – Convert column values to numeric (int or float).
DataFormat
Bases: AutoNameEnum
Supported file formats for data ingestion.
Attributes:
- JSON – JSON file (parsed as a single object).
- JSON_RECORDS – Newline-delimited JSON records format.
- CSV – CSV file.
- PARQUET – Apache Parquet file.
Functions:
- from_filename – Infer the data format from a filename's extension.
from_filename
from_filename(file_name)
Infer the data format from a filename's extension.
Parameters:
- file_name (str | None) – Filename or path string to check.
Returns:
- Matching DataFormat value, the raw suffix string if unrecognised, or None if file_name is None.
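The three return cases described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `SketchDataFormat`, `infer_format`, and the suffix table are hypothetical stand-ins, and the specific extensions (in particular `.jsonl` for JSON records) and the choice to return the suffix without its leading dot are assumptions.

```python
from enum import Enum
from pathlib import Path
from typing import Optional, Union


class SketchDataFormat(Enum):
    """Hypothetical stand-in for the documented DataFormat enum."""

    JSON = "JSON"
    JSON_RECORDS = "JSON_RECORDS"
    CSV = "CSV"
    PARQUET = "PARQUET"


# Assumed extension table; the real mapping is not given in this reference.
_SUFFIX_TO_FORMAT = {
    "json": SketchDataFormat.JSON,
    "jsonl": SketchDataFormat.JSON_RECORDS,
    "csv": SketchDataFormat.CSV,
    "parquet": SketchDataFormat.PARQUET,
}


def infer_format(file_name: Optional[str]) -> Union[SketchDataFormat, str, None]:
    """Return a format member, the raw suffix if unrecognised, or None."""
    if file_name is None:
        return None
    # Path.suffix keeps the leading dot, so strip it before the lookup.
    suffix = Path(file_name).suffix.lstrip(".").lower()
    # Fall back to the suffix string itself when no format matches.
    return _SUFFIX_TO_FORMAT.get(suffix, suffix)


known = infer_format("data/trades.csv")
unknown = infer_format("notes.txt")
missing = infer_format(None)
```

Because the return type mixes enum members with plain strings, a caller would likely need to check for a string result before treating the value as a supported format.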
DataPreprocessor
Reads and chunks datasets from files, directories, or URLs.
Used internally by DBPublisher and HTTPPublisher.
When neither chunksize nor format is set, files are yielded as raw bytes.
When either is set, files are read into pandas DataFrames in chunks.
Functions:
- iter_data – Iterate over data from a file path, directory, or URL.
- iter_df_data – Yield chunked DataFrames from a file path or directory.
- iter_raw_data – Yield raw file bytes from a file path or directory.
- map_types – Cast DataFrame columns to target types.
iter_data
iter_data(path, chunksize=None, format=None)
Iterate over data from a file path, directory, or URL.
When neither chunksize nor format is set, yields raw bytes.
When either is set, reads into pandas.DataFrame chunks.
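The raw-versus-DataFrame switch can be illustrated with a small sketch. It is not the library's code: `iter_data_sketch` is a hypothetical name, it handles only local CSV files, and it ignores format auto-detection and URL sources. The `DEFAULT_CHUNK_SIZE` value is the one quoted in the parameter description.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Optional, Union

import pandas as pd

DEFAULT_CHUNK_SIZE = 10_000_000  # value quoted in the parameter description


def iter_data_sketch(
    path: Path, chunksize: Optional[int] = None, format: Optional[str] = None
) -> Iterator[Union[bytes, pd.DataFrame]]:
    """Yield raw bytes, or DataFrame chunks when chunksize or format is set."""
    if chunksize is None and format is None:
        # Neither option set: hand the file back untouched.
        yield path.read_bytes()
        return
    rows = chunksize if chunksize is not None else DEFAULT_CHUNK_SIZE
    # Assumption for brevity: the file is CSV regardless of `format`.
    with pd.read_csv(path, chunksize=rows) as reader:
        yield from reader


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "trades.csv"
    csv_path.write_bytes(b"sym,price\nA,1.0\nB,2.0\nC,3.0\n")
    raw = list(iter_data_sketch(csv_path))
    chunks = list(iter_data_sketch(csv_path, chunksize=2))
    chunk_lengths = [len(chunk) for chunk in chunks]
```

Here `raw` holds a single bytes object, while `chunks` holds DataFrames of at most two rows each.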
Parameters:
- path (str | Path) – File path, directory, or URL (HTTP, S3, GCS).
- chunksize (int | None) – Rows per chunk. When set, yields pandas.DataFrame instead of raw bytes. Defaults to DEFAULT_CHUNK_SIZE (10 000 000) when a format is specified.
- format (DataFormat | str | None) – File format override. Auto-detected from extension if not set.
Yields:
- Raw bytes when neither chunksize nor format is set; otherwise pandas.DataFrame chunks.
iter_df_data
iter_df_data(path, chunksize=None, format=None)
Yield chunked DataFrames from a file path or directory.
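As a rough illustration of chunked reading across a file or a directory's direct children, the sketch below uses the chunked readers built into pandas. `iter_df_chunks` is a hypothetical helper, not the library's method; the `.jsonl` suffix for JSON records, the sorted file order, and the omission of Parquet and single-object JSON are all assumptions made to keep the example short.

```python
import tempfile
from pathlib import Path
from typing import Iterator

import pandas as pd


def iter_df_chunks(path: Path, chunksize: int) -> Iterator[pd.DataFrame]:
    """Yield DataFrame chunks from one file or a directory's direct children."""
    if path.is_dir():
        # Direct children only: subdirectories are not descended into.
        files = sorted(p for p in path.iterdir() if p.is_file())
    else:
        files = [path]
    for file in files:
        if file.suffix == ".csv":
            reader = pd.read_csv(file, chunksize=chunksize)
        elif file.suffix == ".jsonl":  # assumed suffix for JSON records
            reader = pd.read_json(file, lines=True, chunksize=chunksize)
        else:
            raise ValueError(f"unsupported file type in this sketch: {file.name}")
        # Both readers are iterable context managers that close the file on exit.
        with reader:
            yield from reader


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.csv").write_bytes(b"sym,price\nA,1.0\nB,2.0\nC,3.0\n")
    (root / "b.jsonl").write_bytes(
        b'{"sym": "D", "price": 4.0}\n{"sym": "E", "price": 5.0}\n'
    )
    lengths = [len(chunk) for chunk in iter_df_chunks(root, chunksize=2)]
```

Chunks never span files in this sketch, so a file whose row count is not a multiple of `chunksize` ends with a shorter chunk.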
Parameters:
- path (Path) – File path or directory. For directories, processes each direct child file.
- chunksize (int | None) – Number of rows per chunk. Defaults to DEFAULT_CHUNK_SIZE (10 000 000).
- format (DataFormat | str | None) – File format. Auto-detected from file extension if not set.
Yields:
- pandas.DataFrame chunks of up to chunksize rows each.
iter_raw_data
iter_raw_data(path)
Yield raw file bytes from a file path or directory.
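A minimal sketch of the same idea with only the standard library is shown below. `iter_raw_bytes` is a hypothetical name, and the sorted iteration order is an assumption; the reference does not say in what order a directory's files are yielded.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def iter_raw_bytes(path: Path) -> Iterator[bytes]:
    """Yield the bytes of one file, or of each direct child file of a directory."""
    if path.is_dir():
        for child in sorted(path.iterdir()):
            # Skip subdirectories: only direct child files are read.
            if child.is_file():
                yield child.read_bytes()
    else:
        yield path.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.bin").write_bytes(b"one")
    (root / "b.bin").write_bytes(b"two")
    (root / "nested").mkdir()
    (root / "nested" / "c.bin").write_bytes(b"three")
    from_dir = list(iter_raw_bytes(root))
    from_file = list(iter_raw_bytes(root / "a.bin"))
```

The file under `nested` is not yielded, which mirrors the "direct child file" wording in the parameter description.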
Parameters:
- path (Path) – File path or directory. For directories, yields each direct child file.
Yields:
- Raw bytes of each file.
map_types
map_types(df, type_map)
Cast DataFrame columns to target types.
Parameters:
- df (DataFrame) – DataFrame whose columns to cast.
- type_map (dict | None) – Dict mapping column names to target type strings: "timedelta", "datetime", or "numeric".
Returns:
- DataFrame – DataFrame with cast columns and a reset index.
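One plausible way to perform such casts uses the standard pandas converters. The sketch below is an illustration under assumptions, not the library's implementation: `map_types_sketch` and `_CASTERS` are hypothetical names, and the choices to copy the input and to drop the old index on reset are guesses.

```python
from typing import Optional

import pandas as pd

# Each documented type string mapped to a pandas conversion function.
_CASTERS = {
    "datetime": pd.to_datetime,
    "timedelta": pd.to_timedelta,
    "numeric": pd.to_numeric,
}


def map_types_sketch(df: pd.DataFrame, type_map: Optional[dict]) -> pd.DataFrame:
    """Cast the listed columns and return a frame with a fresh 0..n-1 index."""
    out = df.copy()  # assumption: leave the caller's frame unmodified
    for column, target in (type_map or {}).items():
        out[column] = _CASTERS[target](out[column])
    # Assumption: the old index is discarded rather than kept as a column.
    return out.reset_index(drop=True)


frame = pd.DataFrame(
    {"time": ["2024-01-01", "2024-01-02"], "lag": ["1s", "2s"], "price": ["1.5", "2"]},
    index=[10, 11],
)
cast = map_types_sketch(
    frame, {"time": "datetime", "lag": "timedelta", "price": "numeric"}
)
```

An unknown type string raises `KeyError` in this sketch; how the real method handles that case is not stated here.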