Copy to Hadoop

Provides a mechanism for copying relational data into a Hadoop cluster.

Information at a Glance

Category: Load Data
Data source type: DB
Sends output to other operators: Yes
Data processing tool: Sqoop

The Copy to Hadoop operator usually creates a new file on the Hadoop file system for storing the copied data. The column and data type information associated with the database table is used to associate a structure with the Hadoop file.

If the destination file named by the user already exists, the operator can drop the file first, skip the operation, or produce an error. The operator might also be able to append the new data to the existing file, but only if the Hadoop cluster supports this operation.
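The existing-file handling described above can be sketched as a small decision function. This is only an illustration, not the operator's actual implementation: the function name, the lowercase policy strings, and the returned action labels are hypothetical, and the choice to raise an error when appending is requested but unsupported is an assumption, since this page does not say what happens in that case.

```python
def resolve_existing_file(file_exists, policy, append_supported):
    """Decide what to do with the destination file (illustrative sketch only).

    policy: "drop", "extend", "error", or "skip". These are hypothetical
    lowercase names for the Drop, Extend, Error, and Skip options.
    """
    if not file_exists:
        return "create"
    if policy == "drop":
        # Remove the existing file, then write the copied data to a new one.
        return "delete_then_create"
    if policy == "skip":
        # Leave the existing file untouched and do not copy anything.
        return "skip"
    if policy == "error":
        raise FileExistsError("destination file already exists")
    if policy == "extend":
        if not append_supported:
            # Assumption: the documentation does not define this case.
            raise RuntimeError("append is not supported by this Hadoop cluster")
        return "append"
    raise ValueError(f"unknown policy: {policy!r}")
```

Returning an action label rather than touching the file system keeps the sketch independent of any particular Hadoop client library.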

The copy process can be run in parallel mode or simple mode.

Input

A data set or an operator that produces results in a database.

Restrictions

Pig and Sqoop are used to copy your data to Hadoop. Pig does not accept some characters in column names. If a column name contains a character outside the set [A-Za-z0-9_], that character is replaced with an underscore (_) to create a valid column name. If the renamed columns would collide, an underscore and an integer (_1, _2, and so on) are appended to these column names. For example, consider a table with columns named column@a and column#a. Both columns are first renamed column_a, and then differentiated as column_a_1 and column_a_2.

Additionally, Pig requires the first character of a column name to be a letter. If it is not, Team Studio prepends the letter "a" to the column name. For example, a column named /column first becomes _column (the slash is replaced with an underscore) and is then renamed a_column.

Note: Because Pig and Sqoop both contain a bug that can cause errors when column names contain backslashes, you should not use backslashes in your column names.
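The renaming rules in this section can be illustrated with a short Python sketch. It is not the product's code: the function name is hypothetical, and the order in which the rules are applied (character replacement, then the leading "a", then collision suffixes) is an assumption chosen to reproduce the examples given above. It also does not handle the case where a suffixed name collides with a column that already exists.

```python
import re
from collections import Counter


def sanitize_column_names(names):
    """Rename columns following the rules described above (illustrative sketch)."""
    # Replace every character outside [A-Za-z0-9_] with an underscore.
    cleaned = [re.sub(r"[^A-Za-z0-9_]", "_", name) for name in names]
    # The first character must be a letter; otherwise prepend "a".
    cleaned = [c if re.match(r"[A-Za-z]", c) else "a" + c for c in cleaned]
    # Differentiate colliding names by appending _1, _2, and so on.
    totals = Counter(cleaned)
    seen = Counter()
    result = []
    for c in cleaned:
        if totals[c] > 1:
            seen[c] += 1
            result.append(f"{c}_{seen[c]}")
        else:
            result.append(c)
    return result


print(sanitize_column_names(["column@a", "column#a"]))
print(sanitize_column_names(["/column"]))
```

Running such a check against your own table's column names before copying can show which names will change on the Hadoop side.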

Configuration

Parameters and their descriptions:

Notes
Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.

Copy to
The data source connection through which the data is copied. Default value: CDH5.

Destination
The folder into which you want the data copied. Click Choose File to browse the existing Hadoop file structure and specify the destination location within it.

File Name
The name of the file in which the data is stored. Default value: tohd_0.

If File Exists
If the destination file specified by File Name already exists, select one of the following options.
  • Drop (the default) - Drop the existing file first.
  • Extend - Append the new data to the existing file. This option works only if the Hadoop version supports appending.
  • Error - Report an error and stop execution of the workflow.
  • Skip - Skip the operation.

Copy Mode
The copy method.
  • Parallel (the default) - Copy in parallel using the underlying Sqoop technology.
  • Simple - Copy using the batch-processing copy process.

Number of Copy Tasks
For Parallel copy mode only. The number of parallel processes to use for the Sqoop parallel-processing copy mode. Default value: 4.

Divide Up Work By
The database column to use for saving the data in the Hadoop file system structure. You must specify one column.

Advanced Parameters
Click Configure to display the Advanced Parameters Configuration Dialog Box and set the advanced configuration parameters for parallel copy with Sqoop.

Fetch Size
The number of entries to read from the database at once. This is equivalent to the --fetch-size Sqoop parameter. Default value: 20000.
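To relate the parallel copy settings to Sqoop itself, the following Python sketch assembles a Sqoop import command line. Only the Fetch Size to --fetch-size correspondence is stated on this page. Mapping Number of Copy Tasks to --num-mappers and Divide Up Work By to --split-by is an assumption based on standard Sqoop options, and the function name, JDBC URL, table name, and paths are hypothetical. The operator's actual invocation may differ.

```python
def build_sqoop_import_args(jdbc_url, table, target_dir,
                            num_tasks=4, split_by=None, fetch_size=20000):
    """Build an argument list for `sqoop import` (illustrative sketch only)."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC connection string of the source database
        "--table", table,                  # source table to copy
        "--target-dir", target_dir,        # destination directory on HDFS
        "--num-mappers", str(num_tasks),   # assumed mapping: Number of Copy Tasks
        "--fetch-size", str(fetch_size),   # stated mapping: Fetch Size
    ]
    if split_by:
        # Assumed mapping: Divide Up Work By.
        args += ["--split-by", split_by]
    return args


# Hypothetical values, using the defaults listed in the parameter descriptions.
command = build_sqoop_import_args(
    "jdbc:postgresql://db.example.com/sales", "orders", "/data/tohd_0",
    split_by="order_id",
)
print(" ".join(command))
```

Building the command as a list rather than a single string makes it safe to pass to a process launcher without shell quoting issues.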

Output

The output of the Copy to Hadoop operator can be used as the input to any operator that accepts Hadoop files.

Visual Output
A preview of the rows of the resulting copied data.
Data Output
A Hadoop file that corresponds to the destination file.