Data Channels

Data channels map between an external data storage mechanism and a standard interface used by scoring flows to access data. This abstraction allows scoring flows to be agnostic of the source or destination data formats, protocols, and storage mechanisms.

There are two kinds of data channels: a data source and a data sink. A data source provides input data to a scoring flow. A data sink consumes output data from a scoring flow.

A scoring pipeline consists of a data source, one or more scoring flows, and a data sink. A deployed data channel can participate in multiple scoring pipelines.
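
As a rough illustration of this composition, the sketch below models a pipeline as a source, one or more flows, and a sink. All names and types are hypothetical and not part of any actual API; they only mirror the concepts described above.

    from dataclasses import dataclass

    # Hypothetical model of the concepts above; illustrative only.

    @dataclass(frozen=True)
    class ChannelSignature:
        name: str    # name given when the channel is deployed
        format: str  # schema of the records it handles

    @dataclass
    class ScoringPipeline:
        source: ChannelSignature  # provides input data
        flows: list[str]          # one or more scoring flows
        sink: ChannelSignature    # consumes output data

    # One deployed data channel can participate in several pipelines:
    source_a = ChannelSignature("A", "trip-record-v1")
    p1 = ScoringPipeline(source_a, ["fare-model"], ChannelSignature("B", "score-v1"))
    p2 = ScoringPipeline(source_a, ["eta-model"], ChannelSignature("B", "score-v1"))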

Architecture

Identification

Data channels are uniquely identified by:

  • the kind of channel, either a source or a sink.
  • the type of data handled, e.g. File, Kafka, or RDBMS.
  • the name specified when the data channel is deployed.
  • the format (schema) of the data, specified during channel configuration.

The name and format together form the signature used to uniquely identify a data channel at runtime.

When a data sink and data source are configured for a pipeline, only the name and format are specified. This ensures that pipelines are independent of the actual type of data handled by a data channel.
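
For example, a pipeline configuration might reference its channels as in the hypothetical sketch below; note that no channel type (File, Kafka, RDBMS) appears.

    # Hypothetical pipeline configuration: channels are referenced only by
    # their signature (name and format); the channel type is deliberately absent.
    pipeline_config = {
        "source": {"name": "A", "format": "trip-record-v1"},
        "sink":   {"name": "B", "format": "score-v1"},
    }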

Lifecycle

Data channels are started and stopped independently of scoring pipelines. When a scoring pipeline is deployed, any data channels required by that scoring pipeline must already be running.

A data channel can be part of multiple scoring pipelines simultaneously.

When a scoring pipeline exits, it has no impact on the data channels it used; the data channels continue to run. If a data channel is terminated, however, any active pipelines using it are also terminated, because they no longer have a data source or data sink available.

When a data channel starts, it automatically registers its metadata (name and format) with the data channel registry. When a channel stops, its metadata is removed from the registry.

The data channel registry provides discovery of running data channels available for use by scoring pipelines.
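
A minimal in-memory sketch of this registration and discovery behaviour is shown below. The real registry is a separate service; the function names and keying scheme here are assumptions.

    # Minimal sketch of the data channel registry; illustrative only.
    registry: dict[tuple[str, str], str] = {}  # (name, format) -> channel address

    def on_channel_start(name: str, fmt: str, address: str) -> None:
        registry[(name, fmt)] = address  # metadata registered on start

    def on_channel_stop(name: str, fmt: str) -> None:
        registry.pop((name, fmt), None)  # metadata removed on stop

    def discover(name: str, fmt: str) -> str | None:
        # Used by scoring pipelines to find a running data channel.
        return registry.get((name, fmt))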

Environments

Data channels are deployed into an environment. These data channels are available for use by any scoring pipelines also deployed into that environment.

It is also possible to expose a data channel to other environments, making it visible to pipelines running in those environments as well. Exposure does not create a second instance: there is still only a single data channel running; it is simply a shared resource across environments.

In both cases, deployment and exposure, the data channel must be approved for the environment.

The diagram above shows an example of a data source named A and a data sink named B deployed into an environment named Data Channels. These data channels are also exposed to the Development and Production environments.

Data Transfer

All data channels have a common internal API that is used by scoring pipelines to communicate with data sinks and sources. For both sources and sinks, the scoring pipeline acts as the client, i.e. the scoring pipeline always initiates the connection to a data channel.

Data transfer is done using a WebSocket connection. The calling sequence for a scoring pipeline to communicate with a data channel is:

  1. POST /login - establish a login session.
  2. WebSocket /streaming - send (to a data sink) and receive (from a data source) WebSocket messages.
  3. POST /logout - terminate the login session.
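
A minimal client sketch of this sequence is shown below, using the Python requests and websockets libraries. Only the three endpoints come from the sequence above; the base URL, credentials, response shape, and token handling are assumptions.

    import asyncio
    import requests
    import websockets

    BASE = "https://channel.example.com"  # hypothetical data channel address

    async def stream_records(limit: int) -> None:
        # 1. POST /login - establish a login session.
        resp = requests.post(f"{BASE}/login", json={"user": "pipeline"})
        resp.raise_for_status()
        token = resp.json()["token"]  # assumed shape of the login response

        # 2. WebSocket /streaming - receive records from a data source
        #    (a sink client would call ws.send(...) instead). Passing the
        #    token as a query parameter is an assumption about the real API.
        async with websockets.connect(f"wss://channel.example.com/streaming?token={token}") as ws:
            for _ in range(limit):
                print("received:", await ws.recv())

        # 3. POST /logout - terminate the login session.
        requests.post(f"{BASE}/logout", json={"token": token})

    asyncio.run(stream_records(limit=10))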

Data channels support all of the record field types, subject to possible channel-specific restrictions.

When a data source sends a record, it sends the record to all connected scoring flows, i.e. data sources use a fire-and-forget sending paradigm. There is no queueing of data, so if a scoring flow is not connected when a record is sent, it never receives the record.

When a data sink receives a record from any connected scoring flow it attempts to deliver it to the target storage mechanism. If the data cannot be delivered successfully by the data sink, it is discarded. There is no guarantee that connected scoring pipelines are returned an error in this case. From the perspective of a scoring flow, data sinks are unreliable.
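
The source side of this behaviour can be sketched as a broadcast over the currently connected flows: no queue is kept, so a flow that is not connected simply misses the record. The names below are illustrative; in a real server the handler would be passed to websockets.serve.

    import websockets

    connected = set()  # scoring flows connected right now

    async def on_flow_connected(ws):
        # Track each connected scoring flow for the lifetime of its socket.
        connected.add(ws)
        try:
            await ws.wait_closed()
        finally:
            connected.discard(ws)

    async def broadcast(record: str) -> None:
        # Fire-and-forget: send to whoever is connected now; never queue or retry.
        for ws in list(connected):
            try:
                await ws.send(record)
            except websockets.ConnectionClosed:
                pass  # a flow that has dropped simply misses this record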

Result Sets

Data sources may optionally support sending result sets. A result set is a delineated set of records sent to scoring flows. Data sources that support result sets send a start-of-result-set marker, followed by one or more records, and then an end-of-result-set marker.

When a scoring flow receives the end of a result set, the pipeline automatically exits. This provides a mechanism for processing batch-style input data, for example from a file.

Note: the Scoring Flow deployment page allows users to select "Run forever". However, once a File Data Source has processed a single file, the flow is stopped, marked "Complete", and no longer accepts new input.
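
This document does not specify the wire format of the result set markers; purely as an illustration, a file-backed result set could be framed like this:

    # Hypothetical message framing for a result set; the actual format
    # used by data channels is not specified here.
    result_set = [
        {"type": "start_of_result_set"},
        {"type": "record", "data": {"id": 1, "amount": 9.50}},
        {"type": "record", "data": {"id": 2, "amount": 4.25}},
        {"type": "end_of_result_set"},  # receiving this makes the pipeline exit
    ]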

Deployment

Data channel deployment is done by the scheduling server using a Helm chart.

This means that all scheduling features are also available for data channels, specifically:

  • run immediately
  • run at a scheduled time in the future
  • run periodically
  • run for a specific duration, including forever

In addition to the scheduling details, the following information is provided when deploying a data channel:

  • name: data channel name
  • specification: Helm chart name
  • version: Helm chart version
  • values: Helm chart values
  • environment: target environment
  • other environments: also expose in these other environments
  • deploy parameters: additional parameters to add to the Helm chart values
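
Put together, a deployment request might look like the sketch below; the field names and values are assumptions that mirror the list above, not the actual API schema.

    # Illustrative deployment request; not the actual API schema.
    deploy_request = {
        "name": "taxi-source",                  # data channel name
        "specification": "file-data-source",    # Helm chart name
        "version": "1.4.2",                     # Helm chart version
        "values": {"path": "/data/trips.csv"},  # Helm chart values
        "environment": "Production",            # target environment
        "other_environments": ["Development"],  # also expose here
        "deploy_parameters": {"replicas": 1},   # added to the chart values
        "schedule": {"run": "immediately", "duration": "forever"},  # scheduling details
    }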

Sequence diagram

The deploy sequence diagram is below:
