HDFS Regular Expression File Reader Input Adapter

Introduction

The TIBCO StreamBase® Regular Expression File Reader For Apache Hadoop Distributed File System (HDFS) input adapter allows StreamBase applications to read custom-formatted text input files, parsing each line with a regular expression.

Note

When using the HDFS core libraries on Windows, you may see a console message similar to Failed to locate the winutils binary in the hadoop binary path. The Apache Hadoop project acknowledges that this is an error to be fixed in a future Hadoop release. In the meantime, you can either ignore the error message as benign, or you can download, build, and install the winutils executable as discussed on numerous Internet resources.

The application specifies an input file, the regular expression used to parse lines of the input file, options for how to time and repeat tuples, how to deal with malformed records, and the target output schema. The input file must be a text file with newlines delimiting records. The adapter parses each line of the file using the provided Java regular expression. Each capture group of the regular expression must correspond to a field of the output schema (the first capture group corresponds to the first schema field and so forth). The fields extracted from the file are coerced to the correct data types according to the schema and tuples are emitted.
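
For illustration, the following minimal Java sketch shows the parsing mechanism applied to each line: a pattern with two capture groups, each mapped to an output field and coerced to its schema type. The pattern, sample line, and field types are hypothetical, not the adapter's actual implementation.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLineParseSketch {
    public static void main(String[] args) {
        // Hypothetical two-field format: a string symbol, then a numeric price.
        Pattern format = Pattern.compile("([^,]*),([^,]*)");
        String line = "IBM,142.50"; // one line of a hypothetical input file

        Matcher m = format.matcher(line);
        if (m.matches()) {
            // Capture group 1 maps to the first schema field (string),
            // capture group 2 to the second, coerced here to a double.
            String symbol = m.group(1);
            double price = Double.parseDouble(m.group(2));
            System.out.println(symbol + " -> " + price);
        }
        // A non-matching line is either skipped or emitted as an all-null
        // tuple, depending on the Drop Mismatches property.
    }
}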

Because the input source of this adapter is finite and has no natural timing, this adapter allows the input file to be repeated and the inter-tuple timing to be specified.

The Regular Expression File Reader can read files compressed in the zip or gzip formats, automatically extracting the file to be read from the zip or gzip archive file. For this to work, the adapter requires the target file to have the extension .zip or .gz, and expects to find exactly one text file inside each compressed file. This feature allows the adapter to read market data files provided by a market data vendor in compressed format, without needing to uncompress the files in advance.
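
Conceptually, reading a .gz archive works like the following hedged Java sketch; the file name is hypothetical, and the adapter performs the decompression itself without any application code:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

public class GzipLineReadSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical compressed input file containing exactly one text file.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream("ticks.csv.gz"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Each decompressed line would then be parsed with the
                // Format regular expression, exactly as for a plain file.
                System.out.println(line);
            }
        }
    }
}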

HDFS Regular Expression File Reader Properties

This section describes the properties you can set for this adapter, using the various tabs of the Properties view in StreamBase Studio.

General Tab

Name: Use this required field to specify or change the name of this instance of this component, which must be unique in the current EventFlow module. The name must contain only alphabetic characters, numbers, and underscores, and no hyphens or other special characters. The first character must be alphabetic or an underscore.

Adapter: A read-only field that shows the formal name of the adapter.

Class name: Shows the fully qualified class name that implements the functionality of this adapter. If you need to reference this class name elsewhere in your application, you can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.

Start options: This field provides a link to the Cluster Aware tab, where you configure the conditions under which this adapter starts.

Enable Error Output Port: Select this check box to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports to learn about Error Ports.

Description: Optionally enter text to briefly describe the component's purpose and function. In the EventFlow Editor canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.

Adapter Properties Tab

Property Default Description
File Name none This control is a drop-down list showing eligible files in the current project. Use the drop-down selector to select the file to read and parse. The adapter reads this file one line at a time, parses each line using the Format property, and emits one tuple per line.
Default User none The default user to use if none is provided on the control input port.
Use Default Charset Selected If selected, the Java platform default character set is used. If cleared, a valid character set name must be specified in the Character Set property.
Character Set none The name of the character set encoding the adapter uses to read the input file.
Format none The regular expression used to parse the input file. This must be a Java regular expression as expected by the java.util.regex.Pattern class. For example, ([^,]*),([^,]*) could be used to parse a simple, two-field CSV file.
Period 0 An integer specifying the interval, in milliseconds, at which to read lines from the specified file and emit tuples. Specify 0 or omit this property to emit tuples as quickly as possible.
Repeat 1 An integer specifying the number of times to read the input file. If omitted or set to 1, the adapter reads the input file once and then stops emitting tuples. If set to 0, the adapter repeats the input file indefinitely.
Drop Mismatches Selected If selected, records that do not match the regular expression in the Format field are ignored and the next record is immediately examined. Otherwise, a tuple with all fields set to null is emitted when a non-matching input line is encountered.
Timestamp Format MM/dd/yyyy hh:mm:ss aa Specifies the format used to parse timestamp fields extracted from the input file. Specify a string in the form expected by the java.text.SimpleDateFormat class described in the Oracle Java Platform SE reference documentation. A parsing sketch using the default pattern appears after this property list.
Start Control Port Cleared

Select this check box to give this adapter instance an input port that you can use to control which files to read, and in which order. The input schema for the Start Control Port must have a single field of type string. The schema is typechecked as you define it.

If the File Name property is empty, the adapter begins reading when it receives a control tuple on this port. The path to the file to be read is specified in the only field of the tuple. The path can be absolute, or relative to the working directory of the StreamBase Server process.

If the File Name property specifies a file name, there are two cases:

  1. If a control tuple received on this port has an empty or null string, the file specified in the File Name property is read or re-read.

  2. If a control tuple contains the path to a file, that specified file is read, as above, and the File Name property is ignored.

Start Event Port Cleared

Select this check box to create an output port that emits an informational tuple each time a file is opened or closed. The informational tuple schema has five fields:

  • Type, string

  • Object, string

  • Action, string

  • Status, int

  • Info, string

For a file open event, the event port tuple's Type field is set to "Open", while the Object field is set to the path name of the file being opened.

For a file close event, Type is set to "Close", Object is set to the path name of the file being closed, and Status is set to the number of rows that were read from the file. The Close event tuple is sent after the adapter processes the entire file and emits data tuples for each record in the file.

If you enable the Map Control Port to Event Port option below, the event port tuple also includes a sixth field named ControlInfo of type tuple.

When running in Studio, remember that tuples from more than one output port may appear in the Output Streams view in a different order than they are emitted from the adapter. Thus, you may see the Close event appear on the output of this event port while data tuples are still displaying.

Map Control Port to Event Port Cleared

Select this check box to pass all information received on the control input port to the event output port. When enabled, this property adds a field of type tuple named ControlInfo to the tuple passed to the event output stream. The ControlInfo field contains the entire tuple of the input stream sent to the Control Port.

Log Level INFO Controls the level of verbosity the adapter uses to issue informational traces to the console. This setting is independent of the containing application's overall log level. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE.
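
As an illustration of the Timestamp Format property, the following hedged Java sketch parses a sample value with the adapter's default pattern; the input value is hypothetical:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimestampFormatSketch {
    public static void main(String[] args) throws ParseException {
        // The adapter's default Timestamp Format pattern.
        SimpleDateFormat fmt = new SimpleDateFormat("MM/dd/yyyy hh:mm:ss aa");
        // Hypothetical timestamp field value extracted from an input line.
        Date parsed = fmt.parse("03/15/2024 02:30:45 PM");
        System.out.println(parsed);
    }
}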

HDFS Tab

Property Data Type Description
Buffer Size (Bytes) int The size, in bytes, of the buffer used when reading from HDFS. If empty, the HDFS default buffer size is used.
Configuration Files Strings The HDFS configuration files to use when connecting to the HDFS file system: for example, the standard core-site.xml file.
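
To give a rough sense of how these settings map onto the underlying Hadoop client API, consider this hedged Java sketch; the cluster URI, file path, configuration resource, and buffer size are placeholders, and the adapter's internal calls may differ:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOpenSketch {
    public static void main(String[] args) throws Exception {
        // Load a hypothetical configuration resource, as with the
        // Configuration Files property.
        Configuration conf = new Configuration();
        conf.addResource(new Path("core-site.xml"));

        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        // Open with an explicit 64 KB read buffer, analogous to the
        // Buffer Size (Bytes) property.
        try (FSDataInputStream in = fs.open(new Path("/data/input.txt"), 64 * 1024)) {
            System.out.println(in.read());
        }
    }
}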

Edit Schema Tab

Use the Edit Schema tab to specify the schema of the output tuple for this adapter. For general instructions on using the Edit Schema tab, see the Properties: Edit Schema Tab section of the Defining Input Streams page.

Cluster Aware Tab

Use the settings in this tab to allow this operator or adapter to start and stop based on conditions that occur at runtime in a cluster with more than one node. During initial development of the fragment that contains this operator or adapter, and for maximum compatibility with TIBCO Streaming releases before 10.5.0, leave the Cluster start policy control in its default setting, Start with module.

Cluster awareness is an advanced topic that requires an understanding of StreamBase Runtime architecture features, including clusters, quorums, availability zones, and partitions. See Cluster Awareness Tab Settings on the Using Cluster Awareness page for instructions on configuring this tab.

Concurrency Tab

Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.

Caution

Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.

Typechecking and Error Handling

Typechecking fails if the Format property contains an invalid regular expression, if the number of fields in the output schema does not match the number of capture groups in the Format property, or if the Timestamp Format is malformed.

Malformed records (lines that do not match the Format regular expression) cause the adapter either to ignore the input line or to emit a tuple with all fields set to null, depending on the value of the Drop Mismatches property.

If a field extracted from the file cannot be coerced into the type specified for that field in the schema (for example, if "abc" is extracted where an int field is expected), that field is set to null in the output tuple. Likewise, if a capture group in the Format expression fails to match, but the overall regular expression does match, the corresponding field in the output tuple is set to null.
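
The per-group null behavior mirrors standard java.util.regex semantics, as this small hedged sketch illustrates; the pattern and input line are hypothetical:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NullGroupSketch {
    public static void main(String[] args) {
        // Group 2 is optional: it may not participate in a successful match.
        Pattern p = Pattern.compile("(\\w+),?(\\d+)?");
        Matcher m = p.matcher("abc");
        if (m.matches()) {
            System.out.println(m.group(1)); // "abc"
            // Group 2 did not match, so group(2) returns null; the
            // corresponding output tuple field would likewise be null.
            System.out.println(m.group(2)); // null
        }
    }
}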

Connecting to the Amazon S3 File System

The HDFS Adapters can connect to and work with the Amazon S3 file system using the s3a://{bucket}/yourfile.txt URI style.

For the HDFS adapters to access the Amazon S3 file system, you must also supply S3 authentication information via a configuration file or one of the other supported methods described in Authenticating with S3.

You must also add the following dependency to your pom file to include the Apache Hadoop Amazon Web Services support:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>2.8.1</version>
</dependency>

The HDFS S3 sample gives an example of using a configuration file with the access key and secret key supplied.
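
For illustration only, the following hedged Java sketch shows the equivalent programmatic setup using the standard Hadoop s3a configuration properties; the bucket name and credentials are placeholders, and this is not the adapter's own implementation:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AConnectSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder credentials; in practice, supply them through a
        // configuration file or another supported S3A mechanism.
        conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
        conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

        Path file = new Path("s3a://my-bucket/yourfile.txt");
        FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
        System.out.println(fs.exists(file));
    }
}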

Suspend and Resume Behavior

When suspended, the input file remains open and the adapter retains its position in the file. Upon resume, the adapter continues reading lines from the input file and emitting tuples.