Copy To Hadoop
Provides a mechanism for copying relational data into a Hadoop cluster.
Information at a Glance
The Copy to Hadoop operator usually creates a new file on the Hadoop file system for storing the copied data. The column and data type information associated with the database table is used to associate a structure with the Hadoop file.
If the destination file named by the user already exists, the operator can drop the file first, skip the operation, or produce an error. The operator might also be able to append the new data to the existing file, but only if the Hadoop cluster supports this operation.
The copy process can be run in parallel mode or simple mode.
Restrictions
Pig and Sqoop are used to copy your data to Hadoop. Pig does not accept some characters in column names. If a column name contains a character outside the set [A-Za-z0-9_], each non-conforming character is replaced with an underscore character (_) to create a valid column name. If this replacement causes a name collision, an underscore and an integer (_1, _2, and so on) are appended to the colliding column names. For example, consider a table with columns named column@a and column#a. In this case, both columns are first renamed column_a, and then differentiated as column_a_1 and column_a_2.
Additionally, Pig requires the first character of a column name to be a letter. If it is not, the letter "a" is prepended to the column name. For example, a column named /column first becomes _column when the slash is replaced, and is then renamed a_column.
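
The renaming rules above can be sketched in a few lines of Python. This is an illustration only, not the operator's actual implementation: the function name sanitize_column_names is hypothetical, and the order of the steps (replace characters, then add the "a" prefix, then resolve collisions) is an assumption inferred from the examples in this section. The sketch also does not handle the case where a suffixed name such as column_a_1 already exists in the table.

```python
import re
from collections import Counter


def sanitize_column_names(names):
    """Return Pig-safe column names (hypothetical helper, assumed step order)."""
    cleaned = []
    for name in names:
        # Replace every character outside [A-Za-z0-9_] with an underscore.
        safe = re.sub(r"[^A-Za-z0-9_]", "_", name)
        # Pig requires a leading letter; prepend "a" when there is none.
        if not re.match(r"[A-Za-z]", safe):
            safe = "a" + safe
        cleaned.append(safe)

    # Append _1, _2, ... to every name that collides with another one.
    totals = Counter(cleaned)
    seen = Counter()
    result = []
    for safe in cleaned:
        if totals[safe] > 1:
            seen[safe] += 1
            result.append(f"{safe}_{seen[safe]}")
        else:
            result.append(safe)
    return result


if __name__ == "__main__":
    print(sanitize_column_names(["column@a", "column#a", "/column"]))
    # prints ['column_a_1', 'column_a_2', 'a_column']
```

Running a check like this against a table's column list before copying can help predict the column names that appear in the Hadoop file.
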
Configuration
Parameter | Description |
---|---|
Notes | Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator. |
Copy to | The data source connection through which the data is copied. Default value: CDH5. |
Destination | The folder into which you want the data copied. Click Choose File to browse the existing Hadoop file structure and specify the destination location within it. |
File Name | The name of the file in which the data is stored. Default value: tohd_0. |
If File Exists | If the destination file specified by File Name already exists, select whether to drop the existing file first, skip the operation, or produce an error. The operator can also append the new data to the existing file, but only if the Hadoop version supports this operation. |
Copy Mode | The copy method: parallel mode or simple mode. |
Number of Copy Tasks | For Parallel copy mode only. The number of parallel processes to use for the Sqoop parallel-processing copy mode. Default value: 4. |
Divide Up Work By | The database column to use for saving the data in the Hadoop file system structure. You must specify one column. |
Advanced Parameters | Click Configure to display the Advanced Parameters Configuration Dialog Box and set the advanced configuration parameters for parallel copy with Sqoop. |
Fetch Size | The number of entries to read from the database at once. This is equivalent to the --fetch-size Sqoop parameter. Default value: 20000. |
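
To make the parallel copy parameters more concrete, the following Python sketch assembles a comparable standalone Sqoop import command. It is an assumption-heavy illustration, not the command the operator actually runs: only the Fetch Size to --fetch-size correspondence is stated in the table above. Mapping Number of Copy Tasks to --num-mappers and Divide Up Work By to --split-by is an assumption, and the function name, JDBC URL, table name, and target directory are hypothetical.

```python
import shlex


def build_parallel_sqoop_args(jdbc_url, table, target_dir, split_by,
                              num_copy_tasks=4, fetch_size=20000):
    """Build a 'sqoop import' argument list (hypothetical helper).

    Defaults mirror the documented defaults: 4 copy tasks, fetch size 20000.
    """
    if not split_by:
        # The operator requires exactly one column to divide the work by.
        raise ValueError("split_by must name one database column")
    if num_copy_tasks < 1:
        raise ValueError("num_copy_tasks must be at least 1")
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        # Assumed mapping: Number of Copy Tasks -> --num-mappers
        "--num-mappers", str(num_copy_tasks),
        # Assumed mapping: Divide Up Work By -> --split-by
        "--split-by", split_by,
        # Documented mapping: Fetch Size -> --fetch-size
        "--fetch-size", str(fetch_size),
    ]


if __name__ == "__main__":
    # All connection details below are placeholders.
    args = build_parallel_sqoop_args(
        "jdbc:postgresql://dbhost/sales", "orders", "/tmp/tohd_0",
        split_by="order_id",
    )
    print(shlex.join(args))
```

Sqoop splits the value range of the --split-by column across its mappers, so a column with evenly distributed values, such as a numeric primary key, tends to balance the work across the copy tasks.
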