You use data sharding to improve your data ingestion performance. Before implementing this type of sharding, use RDB sharding and HDB clustering within a single pipeline.
If the writing and sorting of data is the limiting factor for the data volumes in a single pipeline, you can shard the data and query it across multiple pipelines. If the speed of data persistence is not an issue, you can instead scale the instances counts in the pipeline definition.
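For example, scaling within a single pipeline might look like the following fragment (a sketch assuming the realtime process schema used later in this section; the instance counts are illustrative):

```yaml
processes:
  rdb:
    instances: 4   # more query processes in one pipeline, no sharding required
  hdb:
    instances: 4
```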
Implement data sharding¶
You implement data sharding by creating a table schema that contains two or more taxonomy elements, which exposes the table to multiple pipelines. The taxonomies must share the same region and data-source but must differ in data-class and sub-class.
Note
The pipelines sharing the same table schema must each maintain a distinct feed handler, tickerplant, and so on. They must also store distinct id-col values for the table. For example, pipeline 1 handles id-col values `A -> `M and pipeline 2 handles id-col values `N -> `Z.
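That id-col split can be sketched in q. The handle names, the route helper, and the use of the conventional kdb+tick .u.upd update function are illustrative assumptions, not part of the platform API; the ports are taken from the tickerplant definitions in the sample pipelines below.

```q
/ Open a handle to each pipeline's tickerplant
h1:hopen 41221;                        / stream pipeline tp
h2:hopen 45499;                        / streamSecond pipeline tp

/ Pick a handle by the first letter of the sym: `A -> `M to h1, `N -> `Z to h2
route:{$[(first string x) in 13#.Q.A;h1;h2]}

/ A feed handler could then publish each update to the owning pipeline, e.g.
/ route[`AAPL]ance(".u.upd";`streamTab;data)
```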
You can query the data contained in the table schema on each pipeline simultaneously from a gateway client when the multi-pipeline-route parameter is enabled within the entrypoint schema for the system.
Sample entrypoint pipeline¶
pipeline:
  name: "DefaultEntrypoint"
  type: "entrypoint"
  enable-legacy-routing: false
  expose-to-gw: true
  expose-to-qr: true
  enable-monitoring: true
  proc-layout:
    -
      all: all-processes
  taxonomy:
    region: all
    data-source: all
    data-class: all
    sub-class: all
  processes:
    gw:
      instances: 1
      enable-analyst: true
      additional-q-libraries:
        - gwClient
      multi-pipeline-route: true
    udfp:
      instances: 1
      enable-analyst: true
In the following example, the streamTab table schema details the taxonomies of the two pipelines to which it is exposed.
Sample table schema¶
table:
  name: streamTab
  id-col: sym
  time-col: time
  intra-persist-type: splay
  end-persist-type: date-partition
  taxonomy:
    -
      region: global
      data-source: stream
      data-class: stream
      sub-class: stream
    -
      region: global
      data-source: stream
      data-class: stream2
      sub-class: stream2
  columns:
    -
      name: time
      data-type: timestamp
      attribute: sorted
    -
      name: sym
      data-type: symbol
      attribute: grouped
    -
      name: price
      data-type: float
    -
      name: volume
      data-type: long
First sample pipeline¶
pipeline:
  name: stream
  type: realtime
  expose-to-gw: true
  taxonomy:
    region: global
    data-source: stream
    data-class: stream
    sub-class: stream
  proc-layout:
    -
      all: all-processes
  processes:
    tp:
      pub-mode: timer
      pub-freq-ms: 100
      log-to-journal: true
      rollover-mode: daily-at-time
      rollover-time: "00:00:00.001"
      subscribe-from-delta-messaging: true
      enable-analyst: true
      port: 41221
    rdb:
      timeout: 0
      enable-analyst: true
      instances: 2
    hdb:
      timeout: 0
      enable-analyst: true
      instances: 2
    ipdb:
      write-freq: 3600000
      write-row-limit: 0
      enable-analyst: true
      port: 41224
    epdb:
      timeout: 0
Second sample pipeline¶
pipeline:
  name: streamSecond
  type: realtime
  expose-to-gw: true
  taxonomy:
    region: global
    data-source: stream
    data-class: stream2
    sub-class: stream2
  proc-layout:
    -
      all: all-processes
  processes:
    tp:
      pub-mode: timer
      pub-freq-ms: 100
      log-to-journal: true
      rollover-mode: daily-at-time
      rollover-time: "00:00:00.001"
      subscribe-from-delta-messaging: true
      enable-analyst: true
      port: 45499
    rdb:
      timeout: 0
      enable-analyst: true
      instances: 2
    hdb:
      timeout: 0
      enable-analyst: true
      instances: 2
    ipdb:
      write-freq: 3600000
      write-row-limit: 0
      enable-analyst: true
      timeout: 0
    epdb:
      timeout: 0
You can query data from both pipelines simultaneously by specifying only the dataType parameter (and omitting dataSource and dataClass) within the server API calls. The data is returned as a single table.
Sample queries¶
// Target stream pipeline
.gwClient.query.sync[`getTicks;(`dataType`dataSource`dataClass`startDate`endDate`idList!(`streamTab;`stream;`stream;.z.d;.z.d;`))]
// Target streamSecond pipeline
.gwClient.query.sync[`getTicks;(`dataType`dataSource`dataClass`startDate`endDate`idList!(`streamTab;`stream;`stream2;.z.d;.z.d;`))]
// Target both pipelines
.gwClient.query.sync[`getTicks;(`dataType`startDate`endDate`idList!(`streamTab;.z.d;.z.d;`))]
Note
If the multi-pipeline-route parameter within the entrypoint pipeline schema is false or excluded, the query targeting both pipelines fails with a GWNoRouteException.
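A client can guard against that failure with q's dot-apply error trap. This is a hedged sketch: the error-handling lambda and the empty-list fallback are illustrative choices, not a platform convention.

```q
/ Trap a routing failure and fall back to an empty result
res:.[.gwClient.query.sync;
  (`getTicks;`dataType`startDate`endDate`idList!(`streamTab;.z.d;.z.d;`));
  {[err] -1 "query failed: ",err; ()}]
```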