HDFS CSV File Reader Input Adapter

Introduction

The TIBCO StreamBase® CSV File Reader For Apache HDFS is an embedded adapter that reads comma-separated value (CSV) files from a Hadoop Distributed File System resource.

An embedded adapter is an adapter that runs in the same process as StreamBase Server. The HDFS CSV File Reader reads records from a CSV file, creates tuples from these records, then sends these tuples to the operator downstream from it in its StreamBase application. A record typically consists of a line in the CSV file. If quoted, however, a record can span more than one line in the file.
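For example, the following sketch (not the adapter's implementation, and ignoring escaped quote characters) shows how quote-aware record splitting lets a record span more than one line:

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedRecordSplitter {
    // Split CSV text into records, treating newlines inside double-quoted
    // fields as part of the field rather than as record boundaries.
    public static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // The second field of the first record spans two lines.
        String csv = "1,\"multi\nline\",3\n4,5,6";
        for (String r : splitRecords(csv)) {
            System.out.println("RECORD: " + r.replace("\n", "\\n"));
        }
    }
}
```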

The HDFS CSV File Reader is similar to an input stream that supplies its own input from a CSV file. As with an input stream, a schema needs to be specified for the HDFS CSV File Reader. The schema used by the HDFS CSV File Reader is specified in the Edit Schema tab of the Properties view in StreamBase Studio.

Note

When using the HDFS core libraries on Windows, you may see a console message similar to Failed to locate the winutils binary in the hadoop binary path. The Apache Hadoop project acknowledges that this is an error to be fixed in a future Hadoop release. In the meantime, you can either ignore the error message as benign, or you can download, build, and install the winutils executable as discussed on numerous Internet resources.

An embedded adapter that reads from a CSV file differs from an external data source in that it consumes its input file as rapidly as it can. This means the rate at which it consumes records and produces tuples is governed only by the speed at which it can read records from the Hadoop HDFS file system and create tuples from them. This would not typically be true of an external data source, and it may not be the desired behavior. The Period property of the HDFS CSV File Reader governs the rate at which the adapter consumes records. The period is the amount of time the HDFS CSV File Reader pauses between consuming records. That is, the adapter reads one record, processes it to completion, pauses for the specified period, and then reads another record.
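The read-process-pause loop can be sketched as follows; the class and method names are illustrative, not part of the adapter's API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PeriodicReader {
    // Consume records one at a time, pausing periodMillis between reads,
    // mirroring the adapter's Period property (0 means no pause).
    public static List<String> consume(Iterator<String> records, long periodMillis) {
        List<String> emitted = new ArrayList<>();
        while (records.hasNext()) {
            emitted.add(records.next());   // read one record, process to completion
            if (periodMillis > 0 && records.hasNext()) {
                try {
                    Thread.sleep(periodMillis); // pause before the next record
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("r1", "r2", "r3");
        System.out.println(consume(input.iterator(), 10));
    }
}
```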

The name of the CSV file is specified as a property of the HDFS CSV File Reader.

The size of a CSV file may be limited by practical considerations, and it may not be practical to provide the desired amount of data in a single file. One possible solution is to iterate over one CSV file a number of times, which is provided for by the Repeat property. If 0 is specified for Repeat, then the HDFS CSV File Reader iterates over the CSV file indefinitely.

The HDFS CSV File Reader allows you to specify a string that, when encountered in an incoming CSV field, will be translated into a null tuple field value. The default string is null, but you can specify any string in the NULL String property.

The HDFS CSV File Reader can read files compressed in the zip or gzip formats, automatically extracting the file to be read from the zip or gzip archive file. For this to work, the adapter requires the target file to have the extension .zip, .gz, or .bz2, and expects to find exactly one CSV file inside each compressed file. This feature allows the adapter to read market data files provided by a market data vendor in compressed format, without needing to uncompress the files in advance.
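Extension-based decompression can be sketched in plain Java as follows (an illustration, not the adapter's code; the .bz2 case requires a third-party codec and is omitted here):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipInputStream;

public class CompressedCsvOpener {
    // Choose a decompressing stream based on the file extension, following
    // the same naming convention the adapter expects (.zip, .gz).
    public static BufferedReader open(Path path) throws IOException {
        InputStream in = Files.newInputStream(path);
        String name = path.getFileName().toString().toLowerCase();
        if (name.endsWith(".gz")) {
            in = new GZIPInputStream(in);
        } else if (name.endsWith(".zip")) {
            ZipInputStream zin = new ZipInputStream(in);
            zin.getNextEntry(); // expect exactly one CSV entry inside
            in = zin;
        }
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Write a small gzip-compressed CSV, then read it back.
        Path p = Files.createTempFile("sample", ".gz");
        try (Writer w = new OutputStreamWriter(
                new java.util.zip.GZIPOutputStream(Files.newOutputStream(p)),
                StandardCharsets.UTF_8)) {
            w.write("a,b,c\n1,2,3\n");
        }
        try (BufferedReader r = open(p)) {
            System.out.println(r.readLine()); // a,b,c
        }
        Files.delete(p);
    }
}
```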

The HDFS CSV File Reader considers lines starting with the number sign (#), also known as the hash character, to be comments and discards them.

HDFS CSV File Reader Properties

This section describes the properties you can set for this adapter, using the various tabs of the Properties view in StreamBase Studio.

In the tables in this section, the Property column shows each property name as found in one or more of the adapter properties tabs of the Properties view for this adapter.

Use the StreamSQL names of the adapter's properties when using this adapter in a StreamSQL program with the APPLY JAVA statement, or when specifying properties for a CSV-to-stream container connection (which uses this adapter's technology).

General Tab

Name: Use this field to specify or change the component's name, which must be unique in the application. The name must contain only alphabetic characters, numbers, and underscores, and no hyphens or other special characters. The first character must be alphabetic or an underscore.

Adapter: A read-only field that shows the formal name of the adapter.

Class: A field that shows the fully qualified class name that implements the functionality of this adapter. Use this class name when loading the adapter in StreamSQL programs with the APPLY JAVA statement. You can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.

Start with application: If this field is set to Yes or to a module parameter that evaluates to true, an instance of this adapter starts as part of the containing StreamBase Server. If this field is set to No or to a module parameter that evaluates to false, the adapter is loaded with the server, but does not start until you send an sbadmin resume command, or until you start the component with StreamBase Manager. With this option set to No or false, the adapter does not start even if the application as a whole is suspended and later resumed. The recommended setting is selected by default.

Enable Error Output Port: Select this check box to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports and Error Streams to learn about Error Ports.

Description: Optionally enter text to briefly describe the component's purpose and function. In the EventFlow canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.

Adapter Properties Tab

Property Data Type Default Description StreamSQL Property
File Name String None

The name of the CSV file to read. You must enter a file name in this field, or enable the Start Control Port, or both. If Start Control Port is disabled, the file specified in this field is the only file to be read by the current adapter instance. If Start Control Port is enabled, a file specified in this field is the default file to be read, as described below.

This adapter automatically uncompresses the input file before attempting to interpret the CSV content, if the input file was compressed with Zip and has the .zip extension, with Gzip and has the .gz extension, or with Bzip2 and has the .bz2 extension.

FileName
User String None The user name with which to access the HDFS file system if none is provided on the control input port. If no user name is provided by either the control port or this field, the user running the application is used.
Use Default Charset check box Selected If selected, the Java platform default character set is used. If cleared, a valid character set name must be specified in the Character Set property. UseDefaultCharset
Character Set string None The name of the character set encoding that the adapter is to use to read input or write output. Charset
Start Control Port check box Cleared

Select this check box to give this adapter instance an input port that you can use to control which CSV files to read, and in which order. The input schema for the Start Control Port must have at least one field of type string. You can optionally define a more complex schema for this port for use with the Map Control Port to Event Port option; in this case, the first field must be of type string and the second field, used for the user name, must also be of type string. The schema is typechecked as you define it.

If the File Name property is empty, the adapter begins reading when it receives a control tuple on this port. The path to the CSV file to be read is specified in the first field of the tuple and the user is optionally specified as the second field.

If the File Name property specifies a file name, there are two cases:

  1. If a control tuple received on this port has an empty or null string, the file specified in the File Name property is read or re-read.

  2. If a control tuple contains the path to a CSV file, then that specified file is read, as above, ignoring the File Name field.

StartControlPort
Start Event Port check box Cleared

Select this check box to create an output port that emits an informational tuple each time a CSV file is opened or closed. The informational tuple schema has five fields:

  • Type, string

  • Object, string

  • Action, string

  • Status, int

  • Info, string

For a file open event, the event port tuple's Type field is set to "Open", while the Object field is set to the path name of the CSV file being opened.

For a file close event, Type is set to "Close", Object is set to the path name of the CSV file being closed, and Status is set to the number of rows that were read from the CSV file. The Close event tuple is sent after the adapter processes the entire CSV file and emits data tuples for each record in the file.

If you enable the Map Control Port to Event Port option below, the event port tuple also includes a sixth field named ControlInfo of type tuple.

When running in Studio, remember that tuples from more than one output port may appear in the Application Output view in a different order than they are emitted from the adapter. Thus, you may see the Close event appear on the output of this event port while data tuples are still displaying.

StartEventPort
Map Control Port to Event Port check box Cleared

Select this check box to pass all information received on the control input port to the event output port. When enabled, this property adds a field of type tuple named ControlInfo to the tuple passed to the event output stream. The ControlInfo field contains the entire tuple of the input stream sent to the Control Port.

MapControlPort
Log Level drop-down list INFO Controls the level of verbosity the adapter uses to send notifications to the console. This setting can be higher than the containing application's log level. If set lower, the system log level will be used. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE, and ALL. LogLevel

Parsing Options Tab

Property Data Type Default Description StreamSQL Property
Field Delimiter string , (comma) The delimiter used to separate tokens in the input file. Control characters can be entered as &#ddd; where ddd is the character's ASCII value. For example, use &#09; for a tab character. A special exception also allows the \t character to be used in this field to represent a tab delimiter. Delimiter
String Quote Character string " (double quote) The optional quote character used in pairs to delimit string constants. QuoteChar
Timestamp Format string MM/dd/yyyy HH:mm:ss aa

The string format used to represent timestamp fields extracted from the input file. The format string, including the default shown, takes the form expected by the java.text.SimpleDateFormat class described in the Oracle Java Platform SE reference documentation.

If a timestamp value is read that does not match the specified format string, the entire record is discarded and a WARN message appears on the console that includes the text invalid timestamp value.

TimestampFormat
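The discard-on-mismatch behavior can be illustrated with java.text.SimpleDateFormat and the default pattern. This is a simplified sketch, not the adapter's implementation:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimestampFieldParser {
    // Parse a timestamp field with the adapter's default pattern; a value
    // that does not match would cause the adapter to discard the record
    // and log a WARN containing "invalid timestamp value".
    static final SimpleDateFormat FORMAT =
            new SimpleDateFormat("MM/dd/yyyy HH:mm:ss aa");

    public static Date parseOrNull(String value) {
        try {
            return FORMAT.parse(value);
        } catch (ParseException e) {
            return null; // signal: discard this record
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrNull("07/04/2020 10:30:00 AM"));
        System.out.println(parseOrNull("not-a-date")); // null
    }
}
```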
Lenient Parsing boolean Selected

Select this option to parse timestamp values that do not conform to the specified format by falling back to a set of default formats.

LenientTimestampParsing
NULL String string None The string which, if encountered in a CSV field when reading a file, is to be translated as a null tuple field value for the corresponding tuple field. If unspecified, the default string is null. You can designate any string to be considered the null value string. NullString
Preserve Whitespace boolean Cleared Set this to true to preserve leading and trailing white space in string fields. PreserveWhitespace
Header Type drop-down list No header

The type of header used in the CSV file. Choose one of the following:

No header

The CSV file contains no header and is to be parsed without a header.

Ignore header

The first line of the CSV file is to be considered the header. The first line is skipped and not read into the adapter as a tuple.

Read header

The first line of the CSV file is to be considered the header, and is compared against the schema used in your StreamBase application. Columns whose header fields do not match the schema are not parsed (including those columns' values in subsequent rows), and columns beyond the range of the header are not parsed. Field order does not matter, because the adapter reorders the CSV columns to fit the schema of the StreamBase application.

HeaderTypeOption
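The Read header reordering can be sketched as follows; the class and method names are illustrative, not the adapter's API:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class HeaderMapper {
    // Given a header row and the application schema's field names, compute
    // for each schema field the index of the matching CSV column, or -1 if
    // the column is absent (that tuple field is then left null).
    public static int[] mapHeaderToSchema(String[] header, String[] schemaFields) {
        Map<String, Integer> positions = new HashMap<>();
        for (int i = 0; i < header.length; i++) {
            positions.put(header[i], i);
        }
        int[] mapping = new int[schemaFields.length];
        for (int i = 0; i < schemaFields.length; i++) {
            mapping[i] = positions.getOrDefault(schemaFields[i], -1);
        }
        return mapping;
    }

    // Reorder one data row into schema order using the mapping.
    public static String[] reorder(String[] row, int[] mapping) {
        String[] out = new String[mapping.length];
        for (int i = 0; i < mapping.length; i++) {
            out[i] = (mapping[i] >= 0 && mapping[i] < row.length) ? row[mapping[i]] : null;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] m = mapHeaderToSchema(new String[]{"b", "a"}, new String[]{"a", "b"});
        System.out.println(Arrays.toString(reorder(new String[]{"2", "1"}, m)));
    }
}
```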
Incomplete Records radio button Populate with nulls

Specifies what to do when the adapter reads a record with fewer than the required number of fields.

Discard

Discard records with fewer than the required number of fields.

Populate with nulls

When a record with fewer than the required number of fields is encountered, process the record after populating the unspecified fields with nulls.

IncompleteRecordsMode
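The Populate with nulls behavior amounts to padding short records out to the schema width, as in this sketch (not the adapter's code):

```java
import java.util.Arrays;

public class IncompleteRecordHandler {
    // Pad a short record with nulls so it matches the schema width,
    // mirroring the "Populate with nulls" setting. Records that are
    // already wide enough are returned unchanged.
    public static String[] padWithNulls(String[] fields, int schemaWidth) {
        if (fields.length >= schemaWidth) {
            return fields;
        }
        return Arrays.copyOf(fields, schemaWidth); // copyOf fills the tail with null
    }

    public static void main(String[] args) {
        String[] padded = padWithNulls(new String[]{"1", "2"}, 4);
        System.out.println(Arrays.toString(padded)); // [1, 2, null, null]
    }
}
```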
Discard Empty Records check box Selected

A special case to handle empty lines. Leave this check box selected to discard empty lines while still emitting tuples for rows that contain at least one field. Clear it to emit empty tuples for empty lines.

DiscardEmptyRecords
Log Warning check box Cleared

Select this check box to log warning messages when incomplete records are encountered. If cleared, no warning messages are logged for records with fewer than the required number of fields.

LogWarningForIncomplete

Emit Options Tab

Property Data Type Default Description StreamSQL Property
Repeat int 1 The number of times to iterate over the CSV file. 0 specifies iterating indefinitely. Repeat
Emit Policy Radio button Periodic Specifies whether to emit tuples with a regular period or based on a field in the data.

Specify Periodic, the default setting, to use the Period property below. In this case, the two Time field properties are dimmed.

Specify Field based to use a field in the output tuple to control the tuple emission rate. In this case, the Period property is dimmed. Specify the field to use in the Time field property, and specify how to use that field with a selection in the Time field meaning property.

EmitTiming
Period int 0 Active only when Emit Policy is Periodic. Specifies the time, in milliseconds, to wait between the processing of records. Period
Time field meaning Drop-down list Emission times relative to the first record. Active only when Emit Policy is Field based. In the drop-down list, select one of the following options to specify how to use the time field named in the next property.
  • Absolute delays before the first record

  • Emission times relative to the first record

  • Emission times relative to zero

TimeBasedEmitMode
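One way to picture the relative-to-the-first-record option: record i is emitted t[i] - t[0] after the first record, so the adapter waits t[i] - t[i-1] between consecutive records. A sketch with illustrative names, not the adapter's implementation:

```java
import java.util.Arrays;

public class FieldBasedEmitTiming {
    // Compute the wait (ms) before emitting each record from its time-field
    // values, for the "relative to the first record" meaning: cumulative
    // waits reproduce the emission times t[i] - t[0].
    public static long[] interRecordDelays(long[] timeField) {
        long[] delays = new long[timeField.length];
        for (int i = 1; i < timeField.length; i++) {
            delays[i] = timeField[i] - timeField[i - 1]; // wait between records
        }
        return delays; // delays[0] == 0: the first record emits immediately
    }

    public static void main(String[] args) {
        long[] t = {1000, 1250, 2000};
        System.out.println(Arrays.toString(interRecordDelays(t)));
    }
}
```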
Time field string none Active only when Emit Policy is Field based. Specifies the name of a field in the output tuple whose values are used to control the tuple emission rate. TimeBasedEmitField
Capture Transform Strategy radio button FLATTEN The strategy to use when transforming capture fields for this operator: FLATTEN or NEST. CaptureTransformStrategy

HDFS Tab

Property Data Type Description
Buffer Size (Bytes) int The size of the read buffer to use. If empty, the default buffer size is used.
Configuration Files Strings The HDFS configuration files to use when connecting to the HDFS file system. For example, core-defaults.xml is the standard file to use.

Edit Schema Tab

Use the Edit Schema tab to specify the schema of the output tuple for this adapter. For general instructions on using the Edit Schema tab, see the Properties: Edit Schema Tab section of the Defining Input Streams page.

Concurrency Tab

Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.

Caution

Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.

Typechecking

Typechecking fails if the schema does not have at least one field, if the Delimiter is not a single-character string, if the QuoteChar is longer than one character, or if the TimestampFormat is malformed. The File Name field fails to typecheck only if it is blank and the Start Control Port option is not enabled.

A warning is emitted if the File Name property is empty and a null control tuple is received on the Start Control Port.

Suspend and Resume Behavior

On suspend, the HDFS CSV File Reader adapter finishes processing the current record, outputs the tuple, and then pauses. The input file remains open and the adapter retains its position in the file. The adapter stays paused until it is either shut down or resumed.

On resumption, the HDFS CSV File Reader adapter continues processing with the next record in the input file.
