MapReduce

The MapReduce activity creates and queues a standard MapReduce job or a streaming MapReduce job.

General

The General tab has the following fields.

Field | Module Property? | Description
Name | No | The name of the activity in the process definition.
HCatalog Connection | Yes | Click to select an HCatalog Connection shared resource. If no matching HCatalog Connection shared resource is found, click Create Shared Resource to create one.
Streaming | No | Select this check box to create and run MapReduce streaming jobs.
The following four fields are displayed when the Streaming check box is selected.
Input | Yes | Specifies the path of the input data in Hadoop.
Output | Yes | Specifies the path of the output data.
Mapper | Yes | Specifies the path of the mapper program in Hadoop.
Reducer | Yes | Specifies the path of the reducer program in Hadoop.
The following four fields are displayed when the Streaming check box is cleared.
Jar Name | Yes | Specifies the name of the .jar file for MapReduce to use.
Main Class | Yes | Specifies the name of the class for MapReduce to use.
Lib Jars | Yes | Specifies a comma-separated list of .jar files to be included in the classpath.
Files | Yes | Specifies a comma-separated list of .jar files to be copied to the MapReduce cluster.
Status Directory | Yes | Specifies the directory where the status of MapReduce jobs is stored.
Arguments | No | Specifies the program arguments.
  • If the Streaming check box is cleared, specify the arguments for the Java main class.
  • If the Streaming check box is selected, specify a list of space-separated strings to pass as arguments to the Hadoop streaming utility.

    For example:

    -files /user/hdfs/file
    -D mapred.reduce.tasks=0
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat
    -cmdenv info=wc-reducer
The following field is displayed when the Streaming check box is cleared.
Define | No | Specifies the Hadoop configuration variables. Each variable consists of a name and a value.
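
As an illustration of the Arguments and Define fields, the following Python sketch splits a streaming argument string into the space-separated tokens that would be handed to the Hadoop streaming utility, and splits one configuration variable into its name and value. The helper names split_arguments and parse_define are hypothetical, the name=value form for a Define entry is an assumption, and none of this is part of the plug-in itself.

```python
import shlex

# Example value for the Arguments field of a streaming job.
# The option names are standard Hadoop generic and streaming options;
# the path and values are illustrative only.
arguments = (
    "-files /user/hdfs/file "
    "-D mapred.reduce.tasks=0 "
    "-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat "
    "-cmdenv info=wc-reducer"
)


def split_arguments(value: str) -> list[str]:
    """Split a space-separated argument string into individual tokens."""
    return shlex.split(value)


def parse_define(value: str) -> tuple[str, str]:
    """Split one configuration variable, assumed to be written as name=value."""
    name, sep, val = value.partition("=")
    if not sep or not name:
        raise ValueError(f"expected name=value, got {value!r}")
    return name, val


tokens = split_arguments(arguments)
for token in tokens:
    print(token)
```

Each option and each option value becomes its own token, so the four options in the example yield eight tokens.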

Description

Provide a short description for the activity.

Input

The values specified in this tab take precedence over those in the corresponding fields on the General tab. The following table lists the possible input items of the activity.

Input Item | Data Type | Description
The following four fields are displayed when the Streaming check box is selected.
Input | string | Specifies the path of the input data in Hadoop.
Output | string | Specifies the path of the output data.
Mapper | string | Specifies the path of the mapper program in Hadoop.
Reducer | string | Specifies the path of the reducer program in Hadoop.
The following four fields are displayed when the Streaming check box is cleared.
JarName | string | Specifies the name of the .jar file for MapReduce to use.
ClassName | string | Specifies the name of the class for MapReduce to use.
Libjars | string | Specifies a comma-separated list of .jar files to be included in the classpath.
Files | string | Specifies a comma-separated list of .jar files to be copied to the MapReduce cluster.
StatusDirectory | string | Specifies the directory where the status of MapReduce jobs is stored.
Arguments | string | Specifies the program arguments.
The following field is displayed when the Streaming check box is cleared.
Defines | string | Specifies the Hadoop configuration variables. Each variable consists of a name and a value.
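
The precedence rule for this tab can be pictured as a simple merge. The following Python sketch is a conceptual model only: the function name effective_settings is hypothetical, the paths are made up, and it assumes that an Input tab item left empty falls back to the value from the General tab.

```python
def effective_settings(general: dict, input_tab: dict) -> dict:
    """Return the values used at run time.

    Values specified in the Input tab override the corresponding
    General tab values. Unset items (None or empty string) are assumed
    to fall back to the General tab value.
    """
    merged = dict(general)
    for key, value in input_tab.items():
        if value is not None and value != "":
            merged[key] = value
    return merged


# Hypothetical General tab values for a streaming job.
general = {
    "Input": "/user/hdfs/in",
    "Output": "/user/hdfs/out",
    "Mapper": "/bin/cat",
    "Reducer": "/usr/bin/wc",
}

# Only Output is specified in the Input tab; Mapper is left unset.
input_tab = {"Output": "/user/hdfs/out-run2", "Mapper": None}

settings = effective_settings(general, input_tab)
print(settings["Output"])
print(settings["Mapper"])
```

In this model the job writes to the Output path from the Input tab, while the Mapper keeps its General tab value.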

Output

The output of the activity is as follows.

Output Item | Data Type | Description
jobId | string | Returns the job ID of the MapReduce operation.
Note: The activity returns as soon as the job is queued. To wait for the job to complete, use the WaitForJobCompletion activity. The exitValue in the Output tab of the WaitForJobCompletion activity shows the exit value of the MapReduce execution.
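
The queue-then-wait pattern behind jobId and exitValue can be sketched as a polling loop. This is a conceptual Python sketch, not a description of how the WaitForJobCompletion activity is implemented: wait_for_exit_value and the fetch_exit_value callback are hypothetical names, and the callback is assumed to return None while the job is still running.

```python
import time
from typing import Callable, Optional


def wait_for_exit_value(
    job_id: str,
    fetch_exit_value: Callable[[str], Optional[int]],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> int:
    """Poll until the job reports an exit value, then return it.

    fetch_exit_value is a caller-supplied function (hypothetical) that
    returns None while the job is running and the exit value once the
    job has finished.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        exit_value = fetch_exit_value(job_id)
        if exit_value is not None:
            return exit_value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in time")
        time.sleep(poll_seconds)
```

A caller would pass the jobId returned by the MapReduce activity together with a function that looks up the job status, and would treat a nonzero exit value as a failed run.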

Fault

The Fault tab lists the exceptions that can be thrown by this activity.

HadoopException | Description
msg | The error message description returned by the plug-in.
msgCode | The error code returned by the plug-in.