The CSV File Reader for Apache HDFS is an embedded adapter that reads comma-separated value (CSV) files from a Hadoop Distributed File System resource.
An embedded adapter is an adapter that runs in the same process as StreamBase Server. The HDFS CSV File Reader reads records from a CSV file, creates tuples from these records, then sends these tuples to the operator downstream from it in its StreamBase application. A record typically consists of a line in the CSV file. If quoted, however, a record can span more than one line in the file.
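The record rule above (one line per record, unless a quoted field spans lines) can be illustrated with a minimal sketch. It uses Python's standard csv module rather than the adapter itself, and the sample data and the read_records helper are hypothetical.

```python
import csv
import io

# Hypothetical CSV content: the second record has a quoted field
# that contains a line break, so it spans two physical lines.
data = 'IBM,100,"plain note"\nTIBX,200,"note that spans\ntwo lines"\n'

def read_records(text):
    """Return one list of field strings per CSV record."""
    # newline="" leaves line endings untouched so the csv module
    # can handle line breaks inside quoted fields itself.
    return list(csv.reader(io.StringIO(text, newline="")))

records = read_records(data)
for fields in records:
    print(fields)
```

The three physical lines yield two records, which is the sense in which a quoted record can span more than one line.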
The HDFS CSV File Reader is similar to an input stream that supplies its own input from a CSV file. As with an input stream, a schema needs to be specified for the HDFS CSV File Reader. The schema used by the HDFS CSV File Reader is specified in the Edit Schema tab of the Properties view in StreamBase Studio.
Note
When using the HDFS core libraries on Windows, you may see a console message similar to Failed to locate the winutils binary in the hadoop binary path. The Apache Hadoop project acknowledges that this is an error to be fixed in a future Hadoop release. In the meantime, you can either ignore the error message as benign, or you can download, build, and install the winutils executable as discussed on numerous Internet resources.
                     
An embedded adapter that reads from a CSV file differs from an external data source in that it consumes its input file as rapidly as it can. This means the rate at which it consumes records and produces tuples is governed only by the speed at which it can read records from the HDFS file system and create tuples from them. This would not typically be true of an external data source, and it may not be the desired behavior. The Period property of the HDFS CSV File Reader governs the rate at which the adapter consumes records. The period is the amount of time that the HDFS CSV File Reader pauses between consuming records. That is, the HDFS CSV File Reader reads one record, processes it to completion, pauses for the specified period, and then reads another record.
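The read, process, pause cycle described above can be sketched as follows. This is an illustration only, not the adapter's implementation: the emit_with_period function and its parameters are hypothetical, and whether the adapter also pauses after the final record is an assumption of the sketch.

```python
import time

def emit_with_period(records, period_ms, emit, sleep=time.sleep):
    """Read one record, process it, pause for the period, then read the next."""
    for record in records:
        emit(record)                   # process the record to completion
        if period_ms > 0:
            sleep(period_ms / 1000.0)  # pause between records (period is in milliseconds)

# Usage with a recording stand-in for sleep, so the example runs instantly.
emitted, pauses = [], []
emit_with_period(["r1", "r2", "r3"], 250, emitted.append, sleep=pauses.append)
print(emitted, pauses)
```

With a period of 0 the loop never pauses, which corresponds to consuming the file as rapidly as possible.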
The name of the CSV file is specified as a property of the HDFS CSV File Reader.
The size of a CSV file may be limited by practical considerations, and it may not be practical to provide the desired amount of data in a single file. One possible solution is to iterate over one CSV file a number of times, which is provided for by the Repeat property. If 0 is specified for Repeat, then the HDFS CSV File Reader iterates over the CSV file indefinitely.
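The Repeat semantics (a positive count, or 0 for indefinite iteration) can be sketched as below. The passes and replay helpers are hypothetical names used for illustration, and the rows are held in memory rather than read from HDFS.

```python
import itertools

def passes(repeat):
    """Return an iterator of pass numbers; 0 means iterate indefinitely."""
    if repeat < 0:
        raise ValueError("repeat must be 0 or a positive integer")
    return itertools.count(1) if repeat == 0 else iter(range(1, repeat + 1))

def replay(rows, repeat):
    """Yield every row once per pass over the in-memory file contents."""
    for _ in passes(repeat):
        for row in rows:
            yield row

twice = list(replay(["a", "b"], 2))
# With repeat=0 the generator never ends, so take only the first five rows.
first_five = list(itertools.islice(replay(["a", "b"], 0), 5))
print(twice, first_five)
```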
The HDFS CSV File Reader allows you to specify a string that, when encountered in an incoming CSV field, will be translated into a null tuple field value. The default string is null, but you can specify any string in the NULL String property.
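A minimal sketch of the null translation described above follows. The to_nullable helper is hypothetical, and the exact, case-sensitive comparison it uses is an assumption; the source does not state how the adapter compares strings.

```python
def to_nullable(fields, null_string="null"):
    """Replace any field equal to the configured null string with None."""
    # Assumption: the match is exact and case-sensitive.
    return [None if field == null_string else field for field in fields]

print(to_nullable(["IBM", "null", "3"]))
print(to_nullable(["N/A", "null"], null_string="N/A"))
```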
                  
The HDFS CSV File Reader can read files compressed in the zip or gzip formats, automatically extracting the file to be read from the zip or gzip archive file. For this to work, the adapter requires the target file to have the extension .zip, .gz, or .bz2, and expects to find exactly one CSV file inside each compressed file. This feature allows the adapter to read market data files provided by a market data vendor in compressed format, without needing to uncompress the files in advance.
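Selecting a decompression step from the file extension can be sketched as follows, using only the Python standard library. The open_csv_text function is hypothetical; treating .bz2 as bzip2 and raising an error when an archive does not hold exactly one file are assumptions of this sketch, not documented adapter behavior.

```python
import bz2
import gzip
import io
import zipfile

def open_csv_text(name, raw_bytes, encoding="utf-8"):
    """Return the CSV text, decompressing according to the file extension."""
    lower = name.lower()
    if lower.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(raw_bytes)) as archive:
            members = archive.namelist()
            # Assumption: anything other than exactly one entry is an error.
            if len(members) != 1:
                raise ValueError("expected exactly one file in the archive")
            return archive.read(members[0]).decode(encoding)
    if lower.endswith(".gz"):
        return gzip.decompress(raw_bytes).decode(encoding)
    if lower.endswith(".bz2"):
        return bz2.decompress(raw_bytes).decode(encoding)
    return raw_bytes.decode(encoding)  # not compressed

print(open_csv_text("ticks.csv.gz", gzip.compress(b"IBM,100\n")))
```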
                  
The HDFS CSV File Reader considers lines starting with the number sign (#), also known as the hash character, to be comments and discards them.
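As a small illustration of the comment rule, the hypothetical helper below drops lines whose first character is #. How the adapter treats a # that follows leading whitespace is not stated in the source, so the sketch checks only the first character.

```python
def strip_comment_lines(lines):
    """Drop lines whose first character is the number sign (#)."""
    return [line for line in lines if not line.startswith("#")]

print(strip_comment_lines(["# vendor header", "IBM,100", "TIBX,200"]))
```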
                  
This section describes the properties you can set for this adapter, using the various tabs of the Properties view in StreamBase Studio.
In the tables in this section, the Property column shows each property name as it appears in the adapter properties tabs of the Properties view for this adapter.
Name: Use this required field to specify or change the name of this instance of this component. The name must be unique within the current EventFlow module. The name can contain alphanumeric characters, underscores, and escaped special characters. Special characters can be escaped as described in Identifier Naming Rules. The first character must be alphabetic or an underscore.
Adapter: A read-only field that shows the formal name of the adapter.
Class name: Shows the fully qualified class name that implements the functionality of this adapter. If you need to reference this class name elsewhere in your application, you can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.
Start options: This field provides a link to the Cluster Aware tab, where you configure the conditions under which this adapter starts.
Enable Error Output Port: Select this checkbox to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports to learn about Error Ports.
Description: Optionally, enter text to briefly describe the purpose and function of the component. In the EventFlow Editor canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.
| Property | Data Type | Default | Description | 
|---|---|---|---|
| File Name | String | None | The name of the CSV file to read. You must enter a file name in this field, or enable the Start Control Port, or both. If Start Control Port is disabled, the file specified in this field is the only file to be read by the current adapter instance. If Start Control Port is enabled, a file specified in this field is the default file to be read, as described below. This adapter automatically uncompresses the input file before attempting to interpret the CSV content, if the input file was compressed with Zip and has the | 
| User | String | None | The user name with which to access the HDFS file system if none is provided on the control input port. If no user name is provided by either the control port or this field, the user running the application is used. | 
| Read As Resource | check box | Selected | If selected and the given path is not absolute, the file is resolved as a resource file. | 
| Use Default Charset | check box | Selected | If selected, the Java platform's default character set is used. If cleared, a valid character set name must be specified for the Character Set property. | 
| Character Set | string | None | The name of the character set encoding that the adapter is to use to read input or write output. | 
| Start Control Port | check box | Cleared | Select this checkbox to give this adapter instance an input port that you can use to control which CSV files to read, and in which order. The input schema for the Start Control Port must have at least one field of type string. You can optionally define a more complex schema for this port for use with the Map Control Port to Event Port option; in this case, the first field must be of type string and the second field used for user must also be of type string. The schema is typechecked as you define it. If the File Name property is empty, the adapter begins reading when it receives a control tuple on this port. Specify the full, absolute path to the CSV file to be read in the first field of the tuple, and optionally specify the user as the second field. There is no need to surround the full path with quotes if the path contains spaces. If the File Name property specifies a file name, there are two cases: | 
| Start Event Port | check box | Cleared | Select this checkbox to create an output port that emits an informational tuple each time a CSV file is opened or closed. The informational tuple schema has five fields: For a file open event, the event port tuple's For a file close event, If an unexpected error occurs, If you enable the Map Control Port to Event Port option below, the event port tuple also includes a sixth field named When running in Studio, remember that tuples from more than one output port may appear in the Output Streams view in a different order than they are emitted from the adapter. Thus, you may see the Close event appear on the output of this event port while data tuples are still displaying. | 
| Map Control Port to Event Port | check box | Cleared | Select this checkbox to pass all information received on the control input port to the event output port. When enabled, this property adds a field of type tuple named | 
| Log Level | drop-down list | INFO | Controls the level of verbosity the adapter uses to send notifications to the console. This setting can be higher than the containing application's log level. If set lower, the system log level will be used. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE. | 
| Property | Data Type | Default | Description | 
|---|---|---|---|
| Field Delimiter | string | , (comma) | The delimiter used to separate tokens in the input file. Control characters can be entered as &#ddd; where ddd is the character's ASCII value. For example, use &#09; for a tab character. A special exception also allows the \t character to be used in this field to represent a tab delimiter. | 
| String Quote Character | string | " (double quote) | The optional quote character used in pairs to delimit string constants. | 
| Timestamp Format | string | yyyy-MM-dd HH:mm:ss.SSSZ | The string format used to represent timestamp fields extracted from the input file. The default and ideal is the form expected by the If a timestamp value is read that does not match the specified format string, the entire record is discarded and a WARN message appears on the console that includes the text | 
| Lenient Parsing | boolean | Selected | Set this to true if you would like to parse timestamp values that do not conform to the specified format using default formats. | 
| NULL String | string | None | The string which, if encountered in a CSV field when reading a file, is to be translated as a null tuple field value for the corresponding tuple field. If unspecified, the default string is null. You can designate any string to be considered the null value string. | 
| Preserve Whitespace | boolean | Cleared | Set this to true to preserve leading and trailing white space in string fields. | 
| Header Type | drop-down list | No header | The type of header used in the CSV file. Choose one of the following: | 
| Incomplete Records | radio button | Populate with nulls | Specifies what should be done when the adapter reads a record with fewer than the required number of fields. | 
| Discard Empty Records | check box | Selected | Handles the special case of empty lines. Leave this selected if records with at least some fields must produce output but empty lines must not. Clear this check box to send empty tuples for empty lines. | 
| Log Warning | check box | Cleared | Select this checkbox if warning messages are to be logged when incomplete records are encountered. If cleared, no warning messages are logged for records with fewer than the required number of fields. | 
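
Several of the parsing properties in the table above interact: the field delimiter, the quote character, the default Populate with nulls handling of incomplete records, and the Discard Empty Records option. The following minimal sketch shows one way those rules could combine. It uses Python's csv module, the parse_records function and its parameters are hypothetical, and representing an empty tuple as a row of nulls is an assumption of the sketch.

```python
import csv
import io

def parse_records(text, field_count, delimiter=",", quote_char='"', discard_empty=True):
    """Parse CSV text into rows, padding short rows with None."""
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=delimiter, quotechar=quote_char)
    tuples = []
    for fields in reader:
        if not fields and discard_empty:
            continue  # empty line: dropped when Discard Empty Records is selected
        missing = max(0, field_count - len(fields))
        tuples.append(fields + [None] * missing)  # populate missing fields with nulls
    return tuples

# Tab-delimited sample with one empty line and one incomplete record.
sample = "IBM\t100\tNYSE\n\nTIBX\t200\n"
print(parse_records(sample, 3, delimiter="\t"))
print(parse_records(sample, 3, delimiter="\t", discard_empty=False))
```
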
| Property | Data Type | Default | Description | 
|---|---|---|---|
| Repeat | int | 1 | The number of times to iterate over the CSV file. 0 specifies iterating indefinitely. Note that when this property is set to iterate indefinitely, a new file sent to the control port is not picked up. | 
| Emit Policy | Radio button | Periodic | Specifies whether to emit tuples with a regular period or based on a field in the data. Specify Periodic, the default setting, to use the Period property below. In this case, the two Time field properties are dimmed. Specify Field based to use a field in the output tuple to control the tuple emission rate. In this case, the Period property is dimmed. Specify the field to use in the Time field property, and specify how to use that field with a selection in the Time field meaning property. | 
| Period | int | 0 | Active only when Emit Policy is Periodic. Specifies the time, in milliseconds, to wait between the processing of records. | 
| Time field meaning | Drop-down list | Emission times relative to the first record. | Active only when Emit Policy is Field based. In the dropdown list, select one of the following options to specify how to use the time field named in the next property. | 
| Time field | string | none | Active only when Emit Policy is Field based. Specifies the name of a field in the output tuple whose values are used to control the tuple emission rate. | 
| Capture Transform Strategy | radio button | FLATTEN | The strategy to use when transforming capture fields for this operator: FLATTEN or NEST. | 
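
For the Field based emit policy, the default Time field meaning is emission times relative to the first record. One plausible reading of that option, offered here as an assumption rather than the documented behavior, is that each tuple is emitted at an offset equal to the difference between its time field and that of the first record. The emission_offsets helper is hypothetical, and the sketch assumes numeric time values in milliseconds.

```python
def emission_offsets(time_values):
    """Return each record's offset from the first record's time value."""
    if not time_values:
        return []
    first = time_values[0]
    return [value - first for value in time_values]

# Hypothetical time field values, in milliseconds.
print(emission_offsets([1000, 1250, 2000]))
```
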
| Property | Data Type | Description | 
|---|---|---|
| Buffer Size (Bytes) | int | The size of the buffer to be used. If empty, the default is used. | 
| Configuration Files | Strings | The HDFS configuration files to use when connecting to the HDFS file system. For example, use the standard file, core-defaults.xml. | 
Use the Edit Schema tab to specify the schema of the output tuple for this adapter. For general instructions on using the Edit Schema tab, see the Properties: Edit Schema Tab section of the Defining Input Streams page.
Use the settings in this tab to enable this operator or adapter for runtime start and stop conditions in a multi-node cluster. During initial development of the fragment that contains this operator or adapter, and for maximum compatibility with releases before 10.5.0, leave the Cluster start policy control in its default setting, Start with module.
Cluster awareness is an advanced topic that requires an understanding of StreamBase Runtime architecture features, including clusters, quorums, availability zones, and partitions. See Cluster Awareness Tab Settings on the Using Cluster Awareness page for instructions on configuring this tab.
Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.
Caution
Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.
Typechecking fails if the schema does not have at least one parameter, if the Delimiter is not a single character string, if the QuoteChar is longer than one character, or if the TimestampFormat is malformed. The File Name field fails to typecheck only if it is blank and you have not enabled the Start Control Port option.
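The typecheck conditions above can be summarized in a short sketch. The typecheck_errors function and its parameter names are hypothetical. The sketch assumes the delimiter has already been decoded from any &#ddd; or \t notation, and it omits the timestamp format check.

```python
def typecheck_errors(schema_fields, delimiter, quote_char, file_name, start_control_port):
    """Return a list of typecheck failure messages (empty if all checks pass)."""
    errors = []
    if not schema_fields:
        errors.append("schema must have at least one field")
    if len(delimiter) != 1:
        errors.append("Delimiter must be a single-character string")
    if len(quote_char) > 1:
        errors.append("QuoteChar must not be longer than one character")
    # A blank File Name is acceptable only when the Start Control Port is enabled.
    if not file_name.strip() and not start_control_port:
        errors.append("File Name is blank and Start Control Port is not enabled")
    return errors

print(typecheck_errors(["symbol"], ",", '"', "", start_control_port=True))
print(typecheck_errors([], ";;", "''", "", start_control_port=False))
```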
A warning is emitted if the File Name property is empty and a null control tuple is received on the Start Control Port.
The HDFS adapters can connect to and work with the Amazon S3 file system using the s3a://{bucket}/yourfile.txt URI style.
                  
For the HDFS adapters to access the Amazon S3 file system, you must also supply S3 authentication information through a configuration file or one of the other supported methods described in Authenticating with S3.
You must also add the following dependency to your pom file to include the Apache Hadoop Amazon Web Services Support:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>2.8.1</version>
</dependency>

The HDFS S3 sample gives an example of using a configuration file with the access key and secret key supplied.
On suspend, the HDFS CSV File Reader adapter finishes processing the current record, outputs the tuple, and then pauses. The input file remains open and the adapter retains its position in the file. The adapter stays paused until it is either shut down or resumed.
On resumption, the HDFS CSV File Reader adapter continues processing with the next record in the input file.
