Initializing PySpark

You can initialize and use PySpark in your Jupyter Notebooks for Team Studio.

Perform this task from the Notebooks environment in Team Studio.

Prerequisites

This prerequisite update applies only if you created a notebook using the Initialize Pyspark for Cluster function in a version of Team Studio earlier than 6.5.0. The update is required to accommodate Spark upgrades in the system.
  1. Regenerate the PySpark context by clicking Data > Initialize Pyspark for Cluster.

  2. Change the previously-generated code to the following:
    os.environ['PYSPARK_SUBMIT_ARGS'] = (
        "--master yarn-client --num-executors 1 --executor-memory 1g "
        "--packages com.databricks:spark-csv_2.10:1.5.0,"
        "com.databricks:spark-avro_2.11:3.0.1 pyspark-shell"
    )
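Assembled into a single notebook cell, the regenerated initialization might look like the following. This is a sketch based on the arguments above; the executor count and memory values are examples, not requirements.

```python
import os

# Submit arguments for a YARN cluster. The --packages list pins the CSV and
# Avro connector versions required after the Spark upgrade.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--master yarn-client --num-executors 1 --executor-memory 1g "
    "--packages com.databricks:spark-csv_2.10:1.5.0,"
    "com.databricks:spark-avro_2.11:3.0.1 pyspark-shell"
)

# PySpark reads this variable when the first SparkContext is created, so it
# must be set before that happens in the notebook.
print(os.environ['PYSPARK_SUBMIT_ARGS'])
```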

If you do not have access to a Hadoop cluster, you can run your PySpark job in local mode. Before running PySpark in local mode, set the following environment variables.

  1. Set the PYSPARK_SUBMIT_ARGS environment variable as follows:
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local pyspark-shell'
  2. Set the YARN_CONF_DIR environment variable as follows:
    os.environ['YARN_CONF_DIR'] = ''
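The two settings above can be combined into one cell, shown here as a minimal sketch. The note about the SparkContext bootstrap assumes pyspark is installed in the notebook kernel.

```python
import os

# Local-mode configuration: both variables must be set before the
# SparkContext is created, because PySpark reads PYSPARK_SUBMIT_ARGS when
# it launches the JVM gateway.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local pyspark-shell'
os.environ['YARN_CONF_DIR'] = ''  # no YARN configuration in local mode

# With the environment prepared, the usual bootstrap would follow, e.g.
#   from pyspark import SparkContext
#   sc = SparkContext()
print(os.environ['PYSPARK_SUBMIT_ARGS'])
```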

Procedure

  1. Create a new notebook.
  2. Click Data > Initialize PySpark For Cluster.


  3. Choose an existing data source with which to run Spark.


    Note: Spark cannot be configured to connect to two clusters at the same time. Ensure that only one cluster is initialized for PySpark in your notebook; otherwise, an error occurs.
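Because only one cluster can be initialized per notebook, a quick sanity check before starting PySpark can catch a doubly-initialized configuration. This is an illustrative sketch, not part of Team Studio; it only inspects the submit arguments set earlier.

```python
import os

# Confirm that exactly one --master is configured. Running the cluster
# initialization twice in one notebook would leave conflicting arguments.
args = os.environ.get('PYSPARK_SUBMIT_ARGS', '--master local pyspark-shell')
master_count = args.split().count('--master')
if master_count != 1:
    raise RuntimeError("PySpark must be initialized for exactly one cluster")
print("Spark submit arguments look consistent:", args)
```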