Configure Columns: Text Files

The Configure Columns dialog changes options depending on the file type specified in the Hadoop File properties dialog. The options described in this topic are available for text files.

Column configuration Description
Vertical/Horizontal File View

If the text file contains a large number of columns, you can click the switch icon, located in the top right corner, to change the display of the columns between vertical and horizontal. For files that have more than 300 columns, only the vertical view is available.

Escape and Quote Characters

Specify the escape and quote characters used in the file.

Delimiter

Select the delimiter from the list.

  • Comma
  • Tab
  • Semicolon
  • Space
  • Control-A
  • Other (Choosing Other specifies that a custom character is used as the delimiter.)
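To illustrate how the delimiter, quote, and escape settings work together, the following minimal Python sketch splits a small Control-A-delimited sample into fields. It is not the parser used by the application. The sample data and the read_rows helper are hypothetical, and the sketch assumes that Control-A means the ASCII character 0x01, that the quote character is a double quotation mark, and that the escape character is a backslash.

```python
import csv
import io

# Hypothetical sample: Control-A (\x01) as the delimiter, with one quoted
# field that itself contains the delimiter character.
SAMPLE = 'id\x01name\x01note\n1\x01"Smith\x01 Jane"\x01plain\n'


def read_rows(text, delimiter="\x01", quotechar='"', escapechar="\\"):
    """Split delimited text into a list of rows, each a list of fields."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=delimiter,
        quotechar=quotechar,
        escapechar=escapechar,
    )
    return [row for row in reader]


rows = read_rows(SAMPLE)
for row in rows:
    print(row)
```

Because the second field of the data row is quoted, the delimiter inside it is kept as part of the value instead of starting a new column.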

Headers

When TIBCO Data Science – Team Studio opens the Configure Columns dialog, it uses heuristics to determine whether the first row of data is a header row, and selects or clears the First row contains header control accordingly. You can also select or clear this property manually.

  • If TIBCO Data Science – Team Studio determines that the first row contains header information, then the contents of the row are used as the default column names, and the setting First row contains header is selected.
  • If the source data does not have a header row, then clear First row contains header.
  • If the file does not include headers, but the header information is available in a separate file, then you can set the header file. Click Load header from file and then browse to and select a file from the Hadoop file selector.
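The heuristics that the application uses are not documented here. As a rough illustration only, the following Python sketch shows one possible header test: the first row is treated as a header if none of its cells is numeric while at least one column in the remaining rows is entirely numeric. The function names are hypothetical.

```python
def looks_numeric(value):
    """Return True if the string can be read as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def first_row_is_header(rows):
    """Guess whether rows[0] is a header row (illustrative heuristic only)."""
    if len(rows) < 2:
        return False
    first, rest = rows[0], rows[1:]
    # A header row is assumed to contain labels, not numbers.
    if any(looks_numeric(cell) for cell in first):
        return False
    # Look for a column that is numeric in every remaining row.
    for col in range(len(first)):
        if all(len(row) > col and looks_numeric(row[col]) for row in rest):
            return True
    return False
```

A file in which every column is text gives this sketch nothing to compare against, so it returns False. Cases like that are the reason the setting can be changed manually.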
Data Columns

TIBCO Data Science – Team Studio attempts to infer the correct column names and data types by using a sample of the first few rows. When the dialog is displayed, each column is preceded by the inferred data type.

You can change these settings by providing new column names and data types.
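As a simplified illustration of inferring data types from a sample of rows, the following Python sketch tries progressively more permissive types for each column. It is an assumption about the general approach, not the application's actual logic. It distinguishes only int, double, and chararray, and the function names are hypothetical.

```python
def infer_type(values):
    """Return 'int', 'double', or 'chararray' for a sample of string values."""

    def all_parse(cast):
        # True only if every sampled value can be converted by cast.
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if not values:
        return "chararray"
    if all_parse(int):
        return "int"
    if all_parse(float):
        return "double"
    return "chararray"


def infer_columns(header, sample_rows):
    """Pair each column name with the type inferred from the sample rows."""
    columns = list(zip(*sample_rows))  # transpose rows into columns
    return [(name, infer_type(col)) for name, col in zip(header, columns)]


print(infer_columns(["id", "price", "name"],
                    [["1", "2.5", "a"], ["2", "3", "b"]]))
```

Because only a sample is examined, a later row can contain a value that does not fit the inferred type, which is one reason to review the inferred types before you accept them.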

The drop-down list box provides a list of standard data types.

  • chararray
  • int
  • long
  • float
  • double
  • bytearray
  • sparse
  • datetime
  • datetime yyyy-MM-dd'T'HH:mm:ss
  • datetime yyyyMMdd HH:mm
  • datetime yyyy-MM-dd
  • datetime HH:mm:ss
  • datetime yyyy-MM-dd'T'HH:mm:ss.SSSZ
  • datetime MM-dd-yyyy
  • datetime MM/dd/yyyy
  • datetime dd-MM-yyyy
  • datetime yyyy-MM-dd HH:mm:ss
  • datetime yyyy-MM-dd'T'HH:mm:ss.SSSZZ

You can change the data type for multiple columns. Set the view to horizontal format, select the checkboxes for the desired columns, and then click Configure Selected.

You can also filter the list of columns by using the filter field.

Note: For datetime data types, if the source data uses the ISO datetime format, select the basic datetime data type option to preserve the flexibility of the ISO formatting. ISO provides an international data-exchange format for datetime values, and all datetime values are converted to the number of milliseconds since January 1, 1970 (UTC). For more details, see ISO DateTime Format.

If the source data is not in ISO datetime format, you must select the specific datetime format of the imported data file from the list of predefined formats.
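The difference between the two cases can be sketched in Python. An ISO-formatted value can be parsed without a pattern, whereas a value such as 12/31/2019 needs an explicit format before it can be converted to milliseconds since 1970. This is an illustration only: the function names are hypothetical, values without a time zone are assumed to be UTC, and the `%m/%d/%Y` directive is Python's counterpart of the MM/dd/yyyy pattern, not the pattern syntax used by the application.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(value):
    """Convert a timezone-aware datetime to whole milliseconds since 1970."""
    return (value - EPOCH) // timedelta(milliseconds=1)


def parse_iso(text):
    """Parse an ISO-formatted value; no explicit pattern is required."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: values without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def parse_with_format(text, fmt):
    """Parse a non-ISO value by using an explicit format (assumed UTC)."""
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


print(to_epoch_millis(parse_iso("2020-01-01T00:00:00+00:00")))
print(to_epoch_millis(parse_with_format("12/31/2019", "%m/%d/%Y")))
```

If the explicit format does not match the data, strptime raises a ValueError. In the same way, a value that does not match the selected format cannot be read as a datetime.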

You can modify the list of specific datetime data type formats for the application by using Datetime Format Preferences.

Default datetime formats in TIBCO Data Science – Team Studio are listed in the drop-down list box.

Although a list of datetime formats is predefined, you can override the defaults at run time and specify a different datetime format (for a one-time Hadoop file import) by using Joda-Time API formatting.