You use data sharding to improve your data ingestion performance. Before implementing this type of sharding, you should first apply RDB sharding and HDB clustering within a single pipeline.

If the writing and sorting of data is the limiting factor for the data volumes in a single pipeline, you can shard the data and query across multiple pipelines. If the speed of data persistence is not an issue, you can instead scale by setting the instances parameter in the pipeline definition.

Implement data sharding

You implement data sharding by creating a table schema that contains two or more taxonomy elements, which exposes the table to multiple pipelines. The taxonomies must share the same region and data-source but must differ in data-class and sub-class.

Note

The pipelines sharing the same table schema must each maintain a distinct feed handler, tickerplant, and so on. They must also store distinct id-col values for the table. For example, pipeline 1 handles id-col values `A to `M, and pipeline 2 handles id-col values `N to `Z.
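As a sketch, the id-col split between pipelines could be enforced in a feed handler with a small q routing function. The pipeline names and the `N split point below are illustrative only, not part of the platform API:

```q
// Hypothetical helper: pick the target pipeline for a record by comparing
// its sym against the split point. Symbols compare lexicographically in q,
// so syms from `A up to (but not including) `N go to the first pipeline.
splitPoint:`N
route:{[s] $[s<splitPoint;`stream;`streamSecond]}
route each `AAPL`MSFT`ZM    // first two route to `stream, `ZM to `streamSecond
```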

You can query the data contained within the table schema on each pipeline simultaneously from a gateway client when the multi-pipeline-route parameter is enabled within the entrypoint schema for the system.

Sample entrypoint pipeline

pipeline:
  name: "DefaultEntrypoint"
  type: "entrypoint"

  enable-legacy-routing: false
  expose-to-gw: true
  expose-to-qr: true
  enable-monitoring: true

  proc-layout:
    -
      all: all-processes

  taxonomy:
    region: all
    data-source: all
    data-class: all
    sub-class: all

  processes:
    gw:
      instances: 1
      enable-analyst: true
      additional-q-libraries:
       - gwClient
      multi-pipeline-route: true
    udfp:
      instances: 1
      enable-analyst: true


In the following example, the table schema streamTab details the taxonomy of the two pipelines it is exposed to.

Sample table schema

table:
  name: streamTab
  id-col: sym
  time-col:  time
  intra-persist-type: splay
  end-persist-type: date-partition

  taxonomy:
    -
      region: global
      data-source: stream
      data-class: stream
      sub-class: stream
    -
      region: global
      data-source: stream
      data-class: stream2
      sub-class: stream2

  columns:
    -
      name: time
      data-type: timestamp
      attribute: sorted
    -
      name: sym
      data-type: symbol
      attribute: grouped
    -
      name: price
      data-type: float
    -
      name: volume
      data-type: long

First sample pipeline

pipeline:
  name: stream
  type: realtime

  expose-to-gw: true

  taxonomy:
    region: global
    data-source: stream
    data-class: stream
    sub-class: stream

  proc-layout:
    -
      all: all-processes

  processes:
    tp:
      pub-mode: timer
      pub-freq-ms: 100
      log-to-journal: true
      rollover-mode: daily-at-time
      rollover-time: "00:00:00.001"
      subscribe-from-delta-messaging: true
      enable-analyst: true
      port: 41221
    rdb:
      timeout: 0
      enable-analyst: true
      instances: 2
    hdb:
      timeout: 0
      enable-analyst: true
      instances: 2
    ipdb:
      write-freq: 3600000
      write-row-limit: 0
      enable-analyst: true
      port: 41224
    epdb:
      timeout: 0

Second sample pipeline

pipeline:
  name: streamSecond
  type: realtime

  expose-to-gw: true

  taxonomy:
    region: global
    data-source: stream
    data-class: stream2
    sub-class: stream2

  proc-layout:
    -
      all: all-processes

  processes:
    tp:
      pub-mode: timer
      pub-freq-ms: 100
      log-to-journal: true
      rollover-mode: daily-at-time
      rollover-time: "00:00:00.001"
      subscribe-from-delta-messaging: true
      enable-analyst: true
      port: 45499
    rdb:
      timeout: 0
      enable-analyst: true
      instances: 2
    hdb:
      timeout: 0
      enable-analyst: true
      instances: 2
    ipdb:
      write-freq: 3600000
      write-row-limit: 0
      enable-analyst: true
      timeout: 0
    epdb:
      timeout: 0

You can query data from both pipelines simultaneously by specifying only the dataType parameter within the server API calls. The data is returned as a single table.

Sample queries

// Target stream pipeline
.gwClient.query.sync[`getTicks;(`dataType`dataSource`dataClass`startDate`endDate`idList!(`streamTab;`stream;`stream;.z.d;.z.d;`))]

// Target streamSecond pipeline
.gwClient.query.sync[`getTicks;(`dataType`dataSource`dataClass`startDate`endDate`idList!(`streamTab;`stream;`stream2;.z.d;.z.d;`))]

// Target both pipelines
.gwClient.query.sync[`getTicks;(`dataType`startDate`endDate`idList!(`streamTab;.z.d;.z.d;`))]
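Because the multi-pipeline query returns one combined table, you can work with the result directly in q. For example, assuming the getTicks call above succeeds:

```q
// Query both pipelines, then summarise the combined result per sym.
t:.gwClient.query.sync[`getTicks;(`dataType`startDate`endDate`idList!(`streamTab;.z.d;.z.d;`))];
select rows:count i, lastPrice:last price by sym from t
```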

Note

If the multi-pipeline-route parameter within the entrypoint pipeline schema is false or excluded, the query targeting both pipelines fails with a GWNoRouteException.
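If you cannot guarantee that multi-pipeline-route is enabled in every environment, a defensive pattern is to wrap the call in a q protected evaluation. The error-handling policy below is illustrative only:

```q
// Catch a GWNoRouteException (or any other error) instead of aborting:
// @[f;arg;handler] runs f[arg] and calls handler with the error string on failure.
res:@[.gwClient.query.sync[`getTicks;];
      (`dataType`startDate`endDate`idList!(`streamTab;.z.d;.z.d;`));
      {[err] -2 "query failed: ",err; ()}]
```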