Python Execute (HD)

Runs a Jupyter notebook stored in your current workspace from a workflow in Team Studio.

Information at a Glance

Category                           Tools
Data source type                   HD
Sends output to other operators    Yes
Data processing tool               PySpark
Note: The Python Execute (HD) operator is for Hadoop data only. For database data, use the Python Execute (DB) operator.
Notebook setup: For a notebook to be usable with the Python Execute operator, it must have the automatically generated tag Ready For Python Execute visible in your workspace. This attribute is set if the following conditions are met.
  • At least one input or output is specified in the notebook with argument use_input_substitution = True or use_output_substitution = True.
  • The execution_label arguments of the notebook inputs are distinct, and each is exclusively one of the following strings: "1", "2", or "3".
  • For example, your Notebook code might look something like this:

    df_account = cc.read_input_table(table_name='account', schema_name='demo', database_name='miner_demo', use_input_substitution=True, execution_label="1")

  • Inputs and outputs defined for substitution (use_input_substitution = True or use_output_substitution = True) must all be Hadoop data; in this case, the notebook is usable with the Python Execute (HD) operator.
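The execution-label rules above can be sketched as a small validation. The helper below is illustrative only (it is not part of Team Studio); it assumes the labels have already been collected from the notebook's cc.read_input_table calls:

```python
def labels_are_valid(execution_labels):
    """Check the Ready For Python Execute label rules:
    labels must be distinct and drawn only from "1", "2", "3"."""
    allowed = {"1", "2", "3"}
    distinct = len(execution_labels) == len(set(execution_labels))
    in_range = set(execution_labels) <= allowed
    return distinct and in_range

# Labels gathered from a notebook's input definitions (illustrative)
print(labels_are_valid(["1", "2"]))   # distinct and allowed: valid
print(labels_are_valid(["1", "1"]))   # duplicate label: invalid
print(labels_are_valid(["1", "4"]))   # "4" is not an allowed label: invalid
```

A notebook whose inputs fail either check does not receive the Ready For Python Execute tag and does not appear in the operator's Notebook list.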

Input

Zero to three inputs to use as substitutes for the selected notebook's inputs, depending on how many inputs the notebook's configuration allows for substitution.

You can substitute up to three inputs, or omit substitutions to use the inputs defined in the notebook. To run Python Execute, each substituted input must contain a superset of the columns in the corresponding notebook input, with compatible data types. The operator can produce one data set as output, or zero outputs if it is a terminal operator in your workflow.
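The column-superset requirement can be sketched as follows. The schemas here are hypothetical dicts mapping column names to type names; compatibility is simplified to exact type equality for illustration:

```python
def can_substitute(notebook_input, substitute_input):
    """A substituted input must carry a superset of the notebook
    input's columns, with compatible (here: identical) types."""
    for col, dtype in notebook_input.items():
        if substitute_input.get(col) != dtype:
            return False
    return True

# Hypothetical schemas: the workflow input has an extra column, which is fine
notebook_schema = {"account_id": "int", "balance": "double"}
workflow_schema = {"account_id": "int", "balance": "double", "region": "string"}
print(can_substitute(notebook_schema, workflow_schema))  # extra columns allowed
```

A substitute that is missing a notebook column, or carries it with an incompatible type, would fail this check and the operator would not run.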

Depending on the notebook configuration for inputs and output, the operator can be a source operator (if no inputs are selected for substitution in the notebook), or a terminal operator (if no output is specified in the notebook). If a single output is specified, the operator transmits this output to subsequent operators.

Bad or Missing Values
Missing values are not removed if present in the input(s). They should be handled directly in the notebook or in preceding steps of the workflow.

Restrictions

If the selected notebook has no tabular output defined with argument use_output_substitution = True, the Python Execute operator transmits no data to subsequent operators and is considered a terminal operator. Although subsequent operators cannot run, you can still draw a connection to them.

Parquet and Avro inputs are supported only with PySpark notebooks (that is, the cc.read_input_file method in the notebook should have the sqlContext argument specified).

If the selected notebook is set up to transmit an output that contains variables in datetime format, the Python Execute operator transmits those as string variables to the next operator. (You can then convert them back to the correct format in a Variables operator.)
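Because of the behavior above, a downstream step may need to parse the transmitted strings back into datetimes. A minimal stdlib sketch; the format string is an assumption and must match the actual string format of your data:

```python
from datetime import datetime

# A datetime column value arrives downstream as a plain string
transmitted = "2023-05-01 14:30:00"

# Parse it back to a datetime (e.g., in a later notebook step);
# the format string here is an assumption, not a Team Studio default
parsed = datetime.strptime(transmitted, "%Y-%m-%d %H:%M:%S")
print(parsed.year, parsed.month, parsed.day)
```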

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Notebook Select the Python/PySpark notebook to run in your current workspace. To appear in this list, notebooks must be set up for use with Python Execute.
Note: Clicking Open Notebook Selected opens the notebook in a new browser tab.
Substitute Input 1 Optional. Select the connected input to use as a substitute for notebook input with the argument execution_label = 1. If the notebook contains such input and you do not select a substitute in your workflow, it runs with the input defined in the notebook.
Substitute Input 2 Optional. Select the connected input to use as a substitute for notebook input with the argument execution_label = 2. If the notebook contains such input and you do not select a substitute in your workflow, it runs with the input defined in the notebook.

Substitute Input 3 Optional. Select the connected input to use as a substitute for notebook input with the argument execution_label = 3. If the notebook contains such input and you do not select a substitute in your workflow, it runs with the input defined in the notebook.
Data Source (HD) Select the Hadoop data source in which to store output from notebook execution (if defined). If inputs are connected to the Python Execute operator, Data Source (HD) must match the data source of your inputs.

Output Directory The location to store the output files.
Output Name The name of the output file that contains the results.
Overwrite Output Specifies whether to delete existing data at that path.
  • Yes - if the path exists, delete that file and save the results.
  • No - fail if the path already exists.
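The Overwrite Output choices correspond to the following behavior, sketched here with local files standing in for HDFS paths (the helper is illustrative, not Team Studio's implementation):

```python
import os
import tempfile

def save_results(path, data, overwrite):
    """Mimic the Overwrite Output setting: with overwrite=True an
    existing file is replaced; with overwrite=False the run fails."""
    if os.path.exists(path):
        if not overwrite:
            raise FileExistsError(f"Output path already exists: {path}")
        os.remove(path)
    with open(path, "w") as f:
        f.write(data)

out = os.path.join(tempfile.mkdtemp(), "results.csv")
save_results(out, "a,b\n1,2\n", overwrite=False)   # path is new: succeeds
save_results(out, "a,b\n3,4\n", overwrite=True)    # replaces the existing file
```

With overwrite=False, a second run against the same existing path raises an error instead of silently clobbering earlier results.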

Output

Visual Output
The operator result panel displays one or two tabs, depending on whether the operator is terminal.
  • Output (only available if the operator is not terminal).
  • Summary of the parameters selected and notebook execution results.
Data Output
If the notebook contains an output with argument use_output_substitution = True, the operator transmits a tabular data set to subsequent operators.

If no output is defined in the notebook, this operator is terminal.

Example