Import Excel (HD)

Imports an Excel workbook sheet (or a portion of the sheet) as an HDFS input.

Information at a Glance

Category Load Data
Data source type HD
Sends output to other operators Yes1
Data processing tool
Note: The Import Excel (HD) operator is for Hadoop data only. For database data, use the Import Excel (DB) operator.

The Excel workbook can be stored in HDFS or in the current workspace.

Formula cells, styles, dates, currencies, percentages, and so on are supported, and are parsed as numeric values. Non-tabular data such as images and pivot tables are skipped. Hidden columns and protected sheets are parsed as normal.

Input

Import Excel (HD) is a source operator. No inputs are required.
Bad or Missing Values
Blank cells or empty cells are converted to null values.

If a datetime type is specified in the Column Metadata File parameter but the Excel sheet contains values that cannot be parsed with this format, null values are used.

Restrictions

Excel files are read on the Team Studio server. Depending on the memory available on your instance, loading very large Excel files on this server might require a large amount of memory and cause out-of-memory issues. For more information, see Apache POI limitations at https://poi.apache.org/spreadsheet/limitations.html.

Team Studio uses the configuration parameter custom_operators, set in the alpine.conf file, to avoid loading files that are too large. If the Excel file is bigger than this limit, it does not load and an error message is displayed. The default value is 30.0 (MB). The administrator of your Team Studio instance can modify the default value.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Data Source (Hadoop) The data source where the output should be stored (and where the Column Metadata Table is stored).
Hadoop File The Excel workbook if stored in HDFS.
Note: Supported formats are .xls, .xslx and .xlsm.

If this field is left blank, a work file must be identified (in the Work File parameter below).

Chorus Workfile The Excel workbook if stored in the current workspace. Only workbooks with the following extensions are displayed: .xls, .xlsx, and .xlsm.

If this field is left blank, a Hadoop file must be identified (in the Data Source (Hadoop) parameter above).

Sheet Number The number of the sheet to extract (the first sheet is 1).
Top Left Corner Cell The cell address that defines the top-left cell of the data portion to extract in the selected sheet (for example, B10).
Note: If your workbook sheet contains a header, skip it and select a cell on the following row. This is necessary because the header is read separately in the Column Metadata File (CSV) parameter, from a CSV file that includes both names and column types.
Right Cut-Off Column Letter An optional column letter that defines where the portion of data to extract should be cut at the right. If not specified, the data extraction is cut at the last defined cell of the first row selected (row number in Top Left Corner Cell).
Note: If this parameter is not specified and the first row selected is empty or not defined, an error is displayed.
Bottom Cut-Off Row Number An optional row number that defines where the portion of data to extract should be cut at the bottom. If not specified, the data extraction is cut at the last defined row in the sheet.
Column Metadata File (CSV) The HDFS file (the CSV file) that defines the header and column types for the output. The file should contain the following two rows of the same length.
  • The first row for column names.
  • The second row for column types (supported types are int, long, double, float, chararray, and datetime, with format specified (for example, datetime yyyy-MM-dd).

This file is read at runtime and the output schema is available to the subsequent operator only after the operator is run.

Output Directory The location to store the output files.
Output Name The name to contain the results.
Overwrite Output Specifies whether to delete existing data at that path.
  • Yes - if the path exists, delete that file and save the results.
  • No - fail if the path already exists.

Output

Summary tab
Summary of the parameters selected, as shown in the following image.



Output tab
Data preview of the data extracted from the Excel workbook sheet, as shown in the following image.



Data Output
A single tabular data set that is extracted from the sheet, and can be transmitted to subsequent operators only after the operator is run.
1 The full output schema is not available until you step-run the operator. After the operator is run, the output schema automatically updates, and subsequent operators are either validated or turn red depending on the structure of the output data.