MapReduce

You can use the MapReduce activity to create and queue a standard or streaming MapReduce job.

General

In the General tab, you can specify the activity name in the process, establish a connection to HCatalog, and create and queue a standard or streaming MapReduce job.

The following table lists the configurations in the General tab of the MapReduce activity:

Field Module Property? Description
Name No The name to be displayed as the label for the activity in the process.
HCatalog Connection Yes The HCatalog Connection shared resource that is used to create a connection between the plug-in and HCatalog. Click to select an HCatalog Connection shared resource.

If no matching HCatalog Connection shared resources are found, click Create Shared Resource to create one. For more details, see Creating an HCatalog Connection.

Streaming No Select this check box to create and run streaming MapReduce jobs.
Input Yes The path of the input data in Hadoop.

This field is displayed only when you select the Streaming check box.

Output Yes The path of the output data.

This field is displayed only when you select the Streaming check box.

Mapper Yes The path of the mapper program in Hadoop.

This field is displayed only when you select the Streaming check box.

Reducer Yes The path of the reducer program in Hadoop.

This field is displayed only when you select the Streaming check box.

Jar Name Yes The name of the .jar file for the MapReduce activity to use.

This field is displayed only when you clear the Streaming check box.

Main Class Yes The name of the class for the MapReduce activity to use.

This field is displayed only when you clear the Streaming check box.

Lib Jars Yes A comma-separated list of .jar files to include in the classpath.

This field is displayed only when you clear the Streaming check box.

Files Yes A comma-separated list of .jar files to copy to the MapReduce cluster.

This field is displayed only when you clear the Streaming check box.

Status Directory Yes The directory where the status of MapReduce jobs is stored.
Arguments No The program arguments.
  • If you select the Streaming check box, specify a space-separated list of program arguments to pass to the Hadoop streaming utility. For example:
    -files /user/hdfs/file
    -D mapred.reduce.tasks=0
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat
    -cmdenv info=wc-reducer
  • If you clear the Streaming check box, specify the Java main class arguments.
Define No The Hadoop configuration variables. Each variable consists of a name and a value.

This field is displayed only when you clear the Streaming check box.
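
The Arguments field takes its streaming options as one space-separated string. The sketch below shows how such a string breaks down into individual option tokens. It assumes shell-style splitting through Python's standard shlex module; how the plug-in itself tokenizes the field is not documented here, and the split_arguments helper is a hypothetical name used only for illustration.

```python
import shlex
from typing import List


def split_arguments(arguments: str) -> List[str]:
    """Split a space-separated argument string into individual tokens."""
    # shlex.split follows POSIX shell quoting rules, so a quoted value
    # that contains spaces stays together as a single token.
    # Assumption: the plug-in may tokenize the field differently.
    return shlex.split(arguments)


# A subset of the streaming options used as an example for the Arguments field.
streaming_arguments = (
    "-files /user/hdfs/file "
    "-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat "
    "-cmdenv info=wc-reducer"
)

tokens = split_arguments(streaming_arguments)
print(tokens)
```

Each option name and its value become separate tokens, which is a quick way to check that an option such as -inputformat has not been split by a stray space.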

Description

In the Description tab, you can enter a short description for the MapReduce activity.

Input

The values that you specify in the Input tab override the ones that you specify in the corresponding fields in the General tab.
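
The override rule can be pictured as a simple merge of two sets of values. The following is a minimal sketch, not the plug-in's implementation: the effective_settings helper and the paths are hypothetical, and it assumes that an input element left unset (None here) falls back to the General tab value.

```python
from typing import Dict, Optional


def effective_settings(
    general: Dict[str, str], input_tab: Dict[str, Optional[str]]
) -> Dict[str, str]:
    """Return the General tab values with Input tab values applied on top."""
    merged = dict(general)
    for name, value in input_tab.items():
        # Assumption: an unset input element does not override anything.
        if value is not None:
            merged[name] = value
    return merged


# Hypothetical values for illustration only.
general_tab = {"Input": "/user/hdfs/in", "Output": "/user/hdfs/out"}
input_tab = {"Input": None, "Output": "/user/hdfs/out-run2"}

settings = effective_settings(general_tab, input_tab)
print(settings)
```

In this sketch the Output path comes from the Input tab, while the Input path keeps its General tab value.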

The following table lists the input elements in the Input tab of the MapReduce activity:

Input Item Data Type Description
Input String The path of the input data in Hadoop.

This element is displayed only when you select the Streaming check box in the General tab.

Output String The path of the output data.

This element is displayed only when you select the Streaming check box in the General tab.

Mapper String The path of the mapper program in Hadoop.

This element is displayed only when you select the Streaming check box in the General tab.

Reducer String The path of the reducer program in Hadoop.

This element is displayed only when you select the Streaming check box in the General tab.

JarName String The name of the .jar file for the MapReduce activity to use.

This element is displayed only when you clear the Streaming check box in the General tab.

ClassName String The name of the class for the MapReduce activity to use.

This element is displayed only when you clear the Streaming check box in the General tab.

Libjars String A comma-separated list of .jar files to include in the classpath.

This element is displayed only when you clear the Streaming check box in the General tab.

Files String A comma-separated list of .jar files to copy to the MapReduce cluster.

This element is displayed only when you clear the Streaming check box in the General tab.

Status Directory String The directory where the status of MapReduce jobs is stored.
Arguments String The program arguments.
Defines String The Hadoop configuration variables. Each variable consists of a name and a value.

This element is displayed only when you clear the Streaming check box in the General tab.

timeout Long The amount of time, in milliseconds, to wait for this activity to complete.

If you do not specify a value, the activity does not time out.
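
For streaming jobs, the Mapper and Reducer elements point to programs that read lines from standard input and write tab-separated key and value pairs to standard output. The sketch below shows a word-count pair written as plain Python functions so that it can be tried locally. The function names and sample data are hypothetical; in a real job each function would live in its own script, read sys.stdin, and print its results.

```python
from itertools import groupby
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_pairs(sorted_pairs: Iterable[str]) -> Iterator[str]:
    """Reducer: sum the counts of consecutive records that share a key.

    Hadoop streaming delivers mapper output to the reducer sorted by key,
    so grouping consecutive records is sufficient.
    """
    parsed = (pair.rstrip("\n").split("\t", 1) for pair in sorted_pairs)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation of the map, sort, and reduce phases.
mapped = list(map_lines(["to be or not", "to be"]))
reduced = list(reduce_pairs(sorted(mapped)))
print(reduced)
```

The sorted() call stands in for the shuffle and sort that the cluster performs between the two phases.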

Output

In the Output tab, you can view the job ID of the MapReduce operation.

The following table lists the output element in the Output tab of the MapReduce activity:

Output Item Data Type Description
jobId String The job ID of the MapReduce operation.
Note: You can use the WaitForJobCompletion activity to wait for the job to complete. The exitValue output element in the Output tab of the WaitForJobCompletion activity displays the exit value of the MapReduce execution.

Fault

In the Fault tab, you can view the error code and error message of the MapReduce activity. For a more detailed explanation of the errors, see Error Codes.

The following table lists the error schema elements in the Fault tab of the MapReduce activity:

Error Schema Element Data Type Description
msg String The error message description that is returned by the plug-in.
msgCode String The error code that is returned by the plug-in.