MapReduce
You can use the MapReduce activity to create and queue a standard or streaming MapReduce job.
General
In the General tab, you can specify the activity name, select the HCatalog connection, and configure the standard or streaming MapReduce job to create and queue.
The following table lists the fields in the General tab of the MapReduce activity:
Field | Module Property? | Description |
---|---|---|
Name | No | The name to be displayed as the label for the activity in the process. |
HCatalog Connection | Yes | The HCatalog Connection shared resource that is used to create a connection between the plug-in and HCatalog. Select an HCatalog Connection shared resource for this field. If no matching HCatalog Connection shared resources are found, click Create Shared Resource to create one. For more details, see Creating an HCatalog Connection. |
Streaming | No | Select this check box to create and run streaming MapReduce jobs. |
Input | Yes | The path of the input data in Hadoop. This field is displayed only when you select the Streaming check box. |
Output | Yes | The path of the output data. This field is displayed only when you select the Streaming check box. |
Mapper | Yes | The path of the mapper program in Hadoop. This field is displayed only when you select the Streaming check box. |
Reducer | Yes | The path of the reducer program in Hadoop. This field is displayed only when you select the Streaming check box. |
Jar Name | Yes | The name of the .jar file for the MapReduce activity to use. This field is displayed only when you clear the Streaming check box. |
Main Class | Yes | The name of the class for the MapReduce activity to use. This field is displayed only when you clear the Streaming check box. |
Lib Jars | Yes | A comma-separated list of .jar files to be included in the classpath. This field is displayed only when you clear the Streaming check box. |
Files | Yes | A comma-separated list of .jar files to be copied to the MapReduce cluster. This field is displayed only when you clear the Streaming check box. |
Status Directory | Yes | The directory where the status of MapReduce jobs is stored. |
Arguments | No | The program arguments. |
Define | No | The Hadoop configuration variables. Each variable consists of a name and a value. This field is displayed only when you clear the Streaming check box. |
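For a streaming job, the Mapper and Reducer fields point to programs stored in Hadoop. As a minimal sketch, assuming the usual Hadoop Streaming convention that such programs read records from standard input and write tab-separated key/value records to standard output, a word-count mapper and reducer might look like the following. The function names and the word-count task are illustrative only and are not part of the plug-in.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper step: emit one 'word<TAB>1' record per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_records(records):
    """Reducer step: sum the counts per key. Records must arrive sorted by key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run(step, stream_in=sys.stdin, stream_out=sys.stdout):
    """Hypothetical entry point: apply one step to stdin and write to stdout."""
    for record in step(stream_in):
        stream_out.write(record + "\n")
```

In a real streaming job, each step would live in its own executable script that calls `run(map_lines)` or `run(reduce_records)`, and the framework would sort the mapper output by key before passing it to the reducer.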
Input
The values that you specify in the Input tab override the ones that you specify in the corresponding fields in the General tab.
The following table lists the input elements in the Input tab of the MapReduce activity:
Input Item | Data Type | Description |
---|---|---|
Input | String | The path of the input data in Hadoop. This element is displayed only when you select the Streaming check box in the General tab. |
Output | String | The path of the output data. This element is displayed only when you select the Streaming check box in the General tab. |
Mapper | String | The path of the mapper program in Hadoop. This element is displayed only when you select the Streaming check box in the General tab. |
Reducer | String | The path of the reducer program in Hadoop. This element is displayed only when you select the Streaming check box in the General tab. |
JarName | String | The name of the .jar file for the MapReduce activity to use. This element is displayed only when you clear the Streaming check box in the General tab. |
ClassName | String | The name of the class for the MapReduce activity to use. This element is displayed only when you clear the Streaming check box in the General tab. |
Libjars | String | A comma-separated list of .jar files to be included in the classpath. This element is displayed only when you clear the Streaming check box in the General tab. |
Files | String | A comma-separated list of .jar files to be copied to the MapReduce cluster. This element is displayed only when you clear the Streaming check box in the General tab. |
Status Directory | String | The directory where the status of MapReduce jobs is stored. |
Arguments | String | The program arguments. |
Defines | String | The Hadoop configuration variables. Each variable consists of a name and a value. This element is displayed only when you clear the Streaming check box in the General tab. |
timeout | Long | The amount of time, in milliseconds, to wait for this activity to complete. If you do not specify a value, the activity does not time out. |
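The override rule stated for the Input tab can be pictured as a simple merge. The following is a minimal sketch, not the plug-in's implementation: the function name is hypothetical, settings are modeled as plain dictionaries keyed by the Input tab element names, and it assumes that an element left unset (`None`) falls back to the General tab value.

```python
def resolve_settings(general, input_tab):
    """Return the effective settings: Input tab values win over General tab values."""
    effective = dict(general)
    for name, value in input_tab.items():
        # Assumption: an element with no value does not override the General tab.
        if value is not None:
            effective[name] = value
    return effective


# Illustrative values only; the .jar and class names are made up.
general = {"JarName": "wordcount.jar", "ClassName": "example.WordCount"}
input_tab = {"JarName": "wordcount-v2.jar", "ClassName": None}
effective = resolve_settings(general, input_tab)
```

Here `effective["JarName"]` takes the Input tab value, while `effective["ClassName"]` keeps the General tab value because the Input tab left it unset.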
Output
In the Output tab, you can view the job ID of the MapReduce operation.
The following table lists the output element in the Output tab of the MapReduce activity:
Output Item | Data Type | Description |
---|---|---|
jobId | String | The job ID of the MapReduce operation. Note: You can use the WaitForJobCompletion activity to wait for the job to complete. The exitValue output element in the Output tab of the WaitForJobCompletion activity displays the exit value of the MapReduce execution. |
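Because the MapReduce activity only queues the job and returns its jobId, a later step has to wait for the result. The sketch below illustrates the general idea of waiting on a job ID with an optional timeout in milliseconds. It is not how the WaitForJobCompletion activity is implemented: `wait_for_exit_value` and the `fetch_exit_value` callback are hypothetical names, and the callback is assumed to return `None` while the job is still running.

```python
import time


def wait_for_exit_value(job_id, fetch_exit_value, timeout_ms=None, poll_ms=1000):
    """Poll until the job reports an exit value, or raise once timeout_ms elapses."""
    # No deadline when timeout_ms is None: wait indefinitely.
    deadline = None if timeout_ms is None else time.monotonic() + timeout_ms / 1000
    while True:
        exit_value = fetch_exit_value(job_id)
        if exit_value is not None:
            return exit_value
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not complete within {timeout_ms} ms")
        time.sleep(poll_ms / 1000)
```

A caller would pass the jobId from the Output tab together with a function that looks up the job's status, and would treat the returned value the same way as the exitValue element.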
Fault
In the Fault tab, you can view the error code and error message of the MapReduce activity. For a more detailed explanation of the errors, see Error Codes.
The following table lists the error schema elements in the Fault tab of the MapReduce activity: