Using the BSort Operator

Introduction

  The BSort operator performs an approximate sort of an input stream. It is useful when you anticipate that the input stream might be slightly out of order with respect to some field. Of course, a complete sort is not possible over a potentially unlimited stream of data. However, you can use the BSort operator to order the tuples that are temporarily stored in its buffer, as data flows through the application.

You sort on a specified input tuple field, using one of these two methods:

  • Sort based on the values of the specified input field. This option is available only if the sort field is numeric.

  • Sort on a specified number of tuples.

How you use these options depends on how you want to balance latency and ordering. Setting a high range of values or large buffer size favors more complete ordering, at the expense of higher latency while input is buffered. Conversely, a low setting favors low latency at the expense of making the output less ordered.

An optional third method is to further reorder tuples by grouping on one or more input data fields.

You might use the BSort operator to order all tuples as they arrive, or only those meeting a certain criteria. For example, for a stock ticker feed that contains a time field and a symbol field, you could sort on time for all stocks, or sort on time for stocks of each symbol.

The BSort operator accepts a single input stream and produces a single output stream whose output fields match the fields of the input tuples. The BSort operator is not order sensitive.

Properties: General Tab

Name: Use this field to specify or change the component's name, which must be unique in the application. The name must contain only alphabetic characters, numbers, and underscores, and no hyphens or other special characters. The first character must be alphabetic or an underscore.

Enable Error Output Port: Select this check box to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports and Error Streams to learn about Error Ports.

Description: Optionally enter text to briefly describe the component's purpose and function. In the EventFlow canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.

Properties: Sort Settings Tab

Use this tab to the field by which to sort, and the set of data in the BSort buffer that is to be sorted. See the Introduction for a general description of sorting.

Field By Which to Sort Tuples

In this property, enter the name of the field on which tuples should be sorted. For example, you could sort inbound tuples according to a Time field.

Sort by

Choose the kind of data that you want to sort by, and specify the bounds of the sort.

Values, with a range of n

Choose this property if the selected sort field is numeric (for example, not a string), and if you want to sort on a range of values in the sort field. Specify the range as a number that matches the data type of the selected sort field. For example, if the sort field is an integer, express the range as an integer.

Each time a tuple arrives, BSort evaluates its order among tuples in the buffer, based on the values in their sort field. If the new tuple would cause the range of sort field values to exceed the specified range, BSort dequeues the lowest tuple or tuples in the group, as required to keep the remaining values within the specified range. Consider the following example:

  • Your BSort's sort field, Count, is an integer. You choose the option to sort by Values, with a range of 4.

  • Two tuples arrive whose Count fields have the values 2 and 3. These values are within the specified range, so they are stored in the BSort buffer, and nothing is dequeued.

  • Next, a tuple arrives with a Count field of 20.

  • BSort dequeues the first two tuples (in order), because they would cause the range of sort field values to exceed the specified limit.

There is no limit to the number of tuples in the buffer as long as their sort fields are within the specified range.

Tuples, using a buffer size of n

Choose this property if you want the sort to be bounded by the size of the buffer instead of the value of the sort field. Specify the buffer size as an integer. To see how a BSort by tuple works, step through the Example in this topic.

Note that for both types of Bsort, tuples with duplicate key fields are stored in the buffer just like any other tuple. As sorting occurs, duplicate tuples are released in the order they arrived.

The following figure shows a selection for sorting on the Time field, with a buffer size of 3:

Properties: Group Options Tab

Optionally, you can use the Group Options tab to specify on or more groups of sorted tuples in your output. (For a description of sorting, see the Introduction section.) Instead of sorting all tuples as a single group, the Bsort operator sorts each group separately.

In the Group By table, add a row for each group that you want to sort on. Identify the group in in the Expression column. The definition is frequently the name of an input field.

For example, suppose your input schema included a timestamp and stock symbol field, and you selected the timestamp field to sort on (as shown in the preceding Sort Settings tab. If you did nothing else, BSort would sort all of your input in timestamp order. Now, suppose you still wanted to order by timestamp, but you were interested in the order of each separate stock symbol. To accomplish this, you could use the Group Options tab, specifying the Symbol field as shown in the following figure:

Properties: Concurrency Tab

Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.

Caution

Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.

Example

The following series of diagrams traces the flow of unsorted data through the BSort operator. In this example, assume that we want to sort by tuples, with a buffer size of three. No grouping is specified, so all fields are sorted together. In each step, the single box on the left represents a tuple enqueued on the input port of our BSort operator. When a sort occurs, the output tuple is shown flowing out of the BSort operator's output port on the right.

Note

At each step the diagram shows the contents of the buffer after each tuple arrives and after any tuples are released. Also note that for simplicity, we show only the field by which we are sorting, and we sometimes refer to a tuple by the sort field's value.

  1. The first step shows the first three tuples arriving and filling the buffer. Notice that there are two tuples with duplicate values. Until the buffer is full, no sort event occurs and there is no output:

  2. When the fourth tuple, with a sort field of 5, is enqueued, the buffer limit of 3 is exceeded, so a sort event occurs: the new tuple's sort field is compared with the tuples already in the buffer. The tuple with the lowest sort field (4) it is emitted on the output port, and new the tuple is stored in the buffer. The diagram shows the buffer after the emitted tuple has been flushed:

  3. Next, a fifth tuple enters the BSort operator, and again its value is compared with the other values in the buffer. Its key field, 2, is less than any of the other tuple values in the buffer. Therefore, it is emitted immediately and never stored in the buffer:

  4. Now a tuple with a sort field of 10 enters the BSort operator. It is larger than either of the other tuples, so again the tuple with the lowest value (5) is released, and the new tuple is stored in the buffer.

  5. When a tuple with a key field of 12 is enqueued, another Bsort occurs. Because the existing two tuples both have a key field of 6, the one that arrived first is released.

    When the last tuple arrives and triggers a Bsort, the second duplicate tuple is released.

The preceding steps demonstrate how this BSort operation compares each inbound tuple with the current values in the buffer and, if the buffer is full, emits the tuple with the lowest value.

Finally, compare the order of inputs on the left (by reading the sequence from the top down) to the order of outputs on the right. The order is approximate because, on a potentially unlimited stream of data, the inbound tuples are compared only with the tuple values currently in the BSort buffer. Also notice the latency: no output occurs until the fourth tuple is input. Keep in mind that in your application, sort results will depend on the size of the buffer, and the degree to which the incoming values are unsorted.