The Spotfire Streaming CSV File Writer For Apache Hadoop Distributed File System (HDFS) is suitable for saving tuples to comma-separated value (CSV) format files.
There are a number of options that help you control how and when files are created, how big files can be, and how strings are saved in the file.
Tip
In the StreamBase application that contains the HDFS CSV File Writer adapter, if the output CSV file will be used by an application that requires a specific order of fields, and the fields in the stream's tuples do not match that order, you can use a Map operator to arrange the fields as needed.
Tip
If you write a CSV file containing timestamps and import it into Microsoft Excel, by default Excel does not display fractions of seconds. To display times with millisecond precision, in Excel assign timestamp columns a custom format that you define as hh:mm:ss.000.
Note
When using the HDFS core libraries on Windows, you may see a console message similar to "Failed to locate the winutils binary in the hadoop binary path". The Apache Hadoop project acknowledges that this is an error to be fixed in a future Hadoop release. In the meantime, you can either ignore the error message as benign, or you can download, build, and install the winutils executable as discussed on numerous Internet resources.
This section describes the properties you can set for this adapter, using the various tabs of the Properties view in StreamBase Studio.
In the tables in this section, the Property column shows each property name as it appears on the adapter's tabs in the Properties view.
Name: Use this required field to specify or change the name of this instance of this component. The name must be unique within the current EventFlow module. The name can contain alphanumeric characters, underscores, and escaped special characters. Special characters can be escaped as described in Identifier Naming Rules. The first character must be alphabetic or an underscore.
Adapter: A read-only field that shows the formal name of the adapter.
Class name: Shows the fully qualified class name that implements the functionality of this adapter. If you need to reference this class name elsewhere in your application, you can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.
Start options: This field provides a link to the Cluster Aware tab, where you configure the conditions under which this adapter starts.
Enable Error Output Port: Select this checkbox to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports to learn about Error Ports.
Description: Optionally, enter text to briefly describe the purpose and function of the component. In the EventFlow Editor canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.
Property | Data Type | Default | Description |
---|---|---|---|
File Name | string | none |
Name of file to write to. If this adapter is configured with a maximum file
size, then this filename is appended with the date and time when the
maximum file size is reached.
When using the Compress Data option
described below, StreamBase recommends using the |
User | string | none | The user name to use if none is provided on the control input port. |
Use Default Charset | check box | Selected | If selected, the Java platform's default character set is used. If cleared, a valid character set name must be specified for the Character Set property. |
Character Set | string | None | The name of the character set encoding that the adapter is to use to read input or write output. |
Include Header In File | check box | true (selected) | If selected, each file begins with an optional header row that contains the name of each column. |
If File Does not Exist | radio buttons | Create new file | Specifies the action to take if the specified CSV file does not exist when the adapter is started: Create new file or Fail. |
If File Exists | drop-down list | Append to existing file | Specifies the action to take if the specified CSV file already exists when the adapter is started: Append to existing file, Truncate existing file, or Fail. |
Open File During Initialization | check box | false (cleared) | When selected, the output file is created, or opened and truncated, even if the adapter is not configured to start with the application, or the container in which the adapter is running has not started. |
Compress data | check box | false (cleared) | If selected, the adapter compresses its output in gzip format. |
Start control port | check box | false (cleared) | Select this checkbox to give this adapter instance a control port you can use to specify a new output file name. The schema for the control port must begin with a field of type string used to convey the name of the new file to open. When a tuple is enqueued to this port, the existing file, if any, is closed, and the new file is opened. |
Start event port | check box | false (cleared) |
Select this checkbox to create an output port that emits an informational tuple each time a CSV file is opened or closed. The informational tuple schema has five fields:
For a file open event, the event port tuple's Type field is set to
For a file close event, the event port tuple's Type field is set to For both open and close operations, the Status field is set to 0 to indicate success or –1 to indicate failure. The Info field always contains a text message describing the event.
For data events, the event port tuple's Type field is set to |
Pass Through Data To Event Port | check box | false (cleared) | If enabled, when data tuples are passed in and a status event occurs, the data tuple is passed to the status event. |
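To illustrate how the Use Default Charset, Character Set, and Compress data properties in the table above might combine, the following is a minimal sketch using only the Java standard library. It is not the adapter's implementation: the class and method names are hypothetical, and it writes to an in-memory buffer rather than to HDFS.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
class GzipCsvSketch {
    // Mirrors the Use Default Charset / Character Set pair of properties.
    static Charset resolveCharset(boolean useDefault, String name) {
        return useDefault ? Charset.defaultCharset() : Charset.forName(name);
    }

    // Encodes each row with the given charset and gzip-compresses the result.
    static byte[] writeCompressed(List<String> rows, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(new GZIPOutputStream(sink), charset)) {
            for (String row : rows) {
                out.write(row);
                out.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Decompresses and decodes the bytes back into rows, as a consumer of the file would.
    static List<String> readCompressed(byte[] data, Charset charset) {
        List<String> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(data)), charset))) {
            String line;
            while ((line = in.readLine()) != null) {
                rows.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        Charset charset = resolveCharset(false, "UTF-8");
        byte[] gz = writeCompressed(Arrays.asList("id,name", "1,alpha"), charset);
        System.out.println(gz.length + " compressed bytes");
        System.out.println(readCompressed(gz, charset));
    }
}
```

Any tool that reads gzip streams can decompress output produced this way, which is why a consumer of the adapter's compressed files needs no adapter-specific reader.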
Property | Data Type | Default | Description |
---|---|---|---|
Field Delimiter | string | , (comma) | Specifies the character used to mark the end of one field and the beginning of another. Control characters can be entered as &#ddd; where ddd is the character's ASCII value. |
String Quote Character | string | " (double quote) | Specifies the character to use to quote strings when they contain the field delimiter. |
String Quote Option | drop-down list | Quote if necessary | Specifies when string fields are quoted in the CSV file: Quote if necessary, Always quote, or Never quote. |
Null Value Representation | string | null | Specifies the string to write when a field is null. |
Timestamp Format | string | yyyy-MM-dd HH:mm:ss.SSSZ | Determines the format of all timestamp objects written to the CSV output. |
Add Timestamp | drop-down list | None | Optionally prepend or append a timestamp to each CSV row of output. The column name in the header row will be Timestamp. |
Capture Transform Strategy | radio button | FLATTEN | The strategy to use when transforming capture fields for this operator: FLATTEN or NEST. |
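The formatting properties in the table above can be pictured with a short sketch of how one row might be assembled. This is an illustration under stated assumptions, not the adapter's code: the class, enum, and method names are hypothetical; the timestamp pattern is assumed to follow java.text.SimpleDateFormat conventions; null fields are assumed to be written unquoted; and escaping of quote characters embedded in a string is not shown because this page does not describe it.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper class, for illustration only.
class CsvRowSketch {
    // Mirrors the three String Quote Option choices.
    enum QuoteOption { QUOTE_IF_NECESSARY, ALWAYS_QUOTE, NEVER_QUOTE }

    // Formats one string field according to the quote option and null representation.
    static String formatField(String value, char delimiter, char quote,
                              QuoteOption option, String nullText) {
        if (value == null) {
            return nullText; // assumption: nulls are never quoted
        }
        boolean containsDelimiter = value.indexOf(delimiter) >= 0;
        boolean doQuote = option == QuoteOption.ALWAYS_QUOTE
                || (option == QuoteOption.QUOTE_IF_NECESSARY && containsDelimiter);
        return doQuote ? quote + value + quote : value;
    }

    // Joins formatted fields with the field delimiter.
    static String formatRow(String[] values, char delimiter, char quote,
                            QuoteOption option, String nullText) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                row.append(delimiter);
            }
            row.append(formatField(values[i], delimiter, quote, option, nullText));
        }
        return row.toString();
    }

    // Formats a timestamp with a SimpleDateFormat-style pattern in a fixed time zone.
    static String formatTimestamp(Date ts, String pattern, TimeZone zone) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ROOT);
        fmt.setTimeZone(zone);
        return fmt.format(ts);
    }

    public static void main(String[] args) {
        String[] fields = { "a,b", null, "c" };
        System.out.println(formatRow(fields, ',', '"', QuoteOption.QUOTE_IF_NECESSARY, "null"));
        System.out.println(formatTimestamp(new Date(0L), "yyyy-MM-dd HH:mm:ss.SSSZ",
                TimeZone.getTimeZone("UTC")));
    }
}
```

With Quote if necessary, only the first field is quoted because it is the only one that contains the delimiter. With Never quote, the same row would be ambiguous to a reader, which is the trade-off to keep in mind when choosing that option.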
Property | Data Type | Default | Description |
---|---|---|---|
Max File Size | int | 0 (no rollover) | Maximum size, in bytes, of the file on disk. If the file reaches this limit, it is renamed with the current timestamp and new data is written to the current name specified in the File Name property. This field must contain either 0 to disable file size rolling, or an integer greater than 65535. |
Max Roll Seconds | int | 0 (no rollover) | The maximum number of seconds before file names are rolled over as described in the previous row. The Roll Period and Max Roll Seconds properties are mutually exclusive. |
Roll Period | drop-down list | None | Select among None, Weekly, Daily, and Hourly to specify the time period for automatic file rollover. |
Roll Hour | int | 0 | Allows for selecting which hour (0-23) to perform the file roll when Daily is selected as the roll period. Defaults to 0, meaning 12:00:00 AM. |
Roll Minute | int | 0 | Allows for selecting which minute (0-59) to perform the file roll when Daily is selected as the roll period. Defaults to 0, meaning 12:00:00 AM. |
Roll Second | int | 0 | Allows for selecting which second (0-59) to perform the file roll when Daily is selected as the roll period. Defaults to 0, meaning 12:00:00 AM. |
Check for Roll at Startup | check box | false (cleared) | If selected, causes the adapter to roll the file at startup if, based on the file's last modification time, the configured roll period, and the current time, the file would have been rolled before the adapter was started. |
Flush Interval | int | 1 | Specifies how often, in seconds, to force tuples to disk. Set this value to zero to flush immediately. |
Sync on flush | check box | false (cleared) | If selected, StreamBase syncs operating system buffers to the file system on flush, to make sure that all changes are written. Using this option incurs a significant performance penalty. |
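The rollover decisions described in the table above can be sketched as follows. This is a hedged illustration, not the adapter's logic: the class and method names are hypothetical, only the Daily roll period is modeled, and the sketch assumes that a size-based roll triggers as soon as the file size is equal to or greater than Max File Size.

```java
import java.time.LocalDateTime;

// Hypothetical helper class, for illustration only.
class RolloverSketch {
    // Max File Size must be 0 (size rolling disabled) or greater than 65535.
    static boolean isValidMaxFileSize(long maxFileSize) {
        return maxFileSize == 0 || maxFileSize > 65535;
    }

    // Size-based roll: disabled when maxFileSize is 0.
    static boolean shouldRollBySize(long currentSize, long maxFileSize) {
        return maxFileSize > 0 && currentSize >= maxFileSize;
    }

    // Most recent scheduled Daily roll time at or before 'now',
    // built from the Roll Hour, Roll Minute, and Roll Second values.
    static LocalDateTime lastDailyRoll(LocalDateTime now, int hour, int minute, int second) {
        LocalDateTime today = now.toLocalDate().atTime(hour, minute, second);
        return today.isAfter(now) ? today.minusDays(1) : today;
    }

    // Check for Roll at Startup: roll if the file was last modified before
    // the most recent scheduled roll time, meaning a roll was missed.
    static boolean shouldRollAtStartup(LocalDateTime lastModified, LocalDateTime now,
                                       int hour, int minute, int second) {
        return lastModified.isBefore(lastDailyRoll(now, hour, minute, second));
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.of(2024, 3, 10, 8, 0, 0);
        LocalDateTime lastModified = LocalDateTime.of(2024, 3, 9, 23, 0, 0);
        System.out.println("Roll at startup: "
                + shouldRollAtStartup(lastModified, now, 2, 30, 0));
        System.out.println("Roll by size: " + shouldRollBySize(70000, 65536));
    }
}
```

In the main method, the file was last written at 23:00 the previous day and the adapter starts at 08:00 with a Daily roll configured for 02:30, so a roll was missed while the adapter was stopped and the startup check reports that the file should be rolled.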
Property | Data Type | Default | Description |
---|---|---|---|
Throttle Error Messages | check box | false (cleared) | If selected, any particular error message is shown only once. |
Log Level | drop-down list | INFO | Controls the level of verbosity the adapter uses to send notifications to the console. This setting can be higher than the containing application's log level. If set lower, the system log level will be used. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE. |
Property | Data Type | Description |
---|---|---|
Buffer Size (Bytes) | Int | The size of the buffer to be used. If empty, the default is used. |
Replication | Short | The required block replication for the file. If empty, the server default is used. This setting applies only during file creation. |
Block Size (Bytes) | Long | The default data block size. If empty, the server default is used. This setting applies only during file creation. |
Configuration Files | String | The HDFS configuration files to use when connecting to the HDFS file system. For example, use the standard file, core-defaults.xml. |
Use the settings in this tab to enable this operator or adapter for runtime start and stop conditions in a multi-node cluster. During initial development of the fragment that contains this operator or adapter, and for maximum compatibility with releases before 10.5.0, leave the Cluster start policy control in its default setting, Start with module.
Cluster awareness is an advanced topic that requires an understanding of StreamBase Runtime architecture features, including clusters, quorums, availability zones, and partitions. See Cluster Awareness Tab Settings on the Using Cluster Awareness page for instructions on configuring this tab.
Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.
Caution
Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.
Typechecking fails in the following circumstances:
- The File Name is null or a zero-length string.
- The Flush Interval is less than zero.
- The Max File Size is less than zero.
- More than one string quote character is specified.
- More than one field delimiter is specified.
- An illegal string quote option is specified.
- The Max Roll Seconds value is greater than zero and the selected Roll Period option is other than None.
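The typecheck conditions listed above can be expressed as a small validation routine. The following is a sketch only: the class name, method name, parameter names, and error message texts are hypothetical, and it assumes that the delimiter and quote character have already been decoded from any &#ddd; form into a single character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, for illustration only.
class TypecheckSketch {
    static final List<String> QUOTE_OPTIONS =
            Arrays.asList("Quote if necessary", "Always quote", "Never quote");

    // Returns one message per failed condition; an empty list means the settings pass.
    static List<String> check(String fileName, int flushInterval, long maxFileSize,
                              String quoteChar, String delimiter, String quoteOption,
                              int maxRollSeconds, String rollPeriod) {
        List<String> errors = new ArrayList<>();
        if (fileName == null || fileName.isEmpty()) {
            errors.add("File Name is null or a zero-length string");
        }
        if (flushInterval < 0) {
            errors.add("Flush Interval is less than zero");
        }
        if (maxFileSize < 0) {
            errors.add("Max File Size is less than zero");
        }
        if (quoteChar != null && quoteChar.length() > 1) {
            errors.add("More than one string quote character is specified");
        }
        if (delimiter != null && delimiter.length() > 1) {
            errors.add("More than one field delimiter is specified");
        }
        if (!QUOTE_OPTIONS.contains(quoteOption)) {
            errors.add("Illegal string quote option");
        }
        if (maxRollSeconds > 0 && !"None".equals(rollPeriod)) {
            errors.add("Max Roll Seconds and Roll Period are mutually exclusive");
        }
        return errors;
    }

    public static void main(String[] args) {
        // Both Max Roll Seconds and a Roll Period are set, so one error is reported.
        System.out.println(check("out.csv", 1, 0, "\"", ",",
                "Quote if necessary", 60, "Daily"));
    }
}
```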
On suspend, this adapter stops processing tuples, flushes all tuples to disk, and closes the current CSV file.
On resumption, it reopens the current CSV file and begins processing tuples again.