Settings for Spark-Enabled Operators
The easiest and quickest way to change Spark settings is from the operator itself.
- You can enable Automatic Configuration for the Spark parameters. Team Studio selects default values to run the operator.
- You can edit these parameters directly in the operator parameter dialog box: set Advanced Settings Automatic Optimization to No, and then click Edit Settings to open the Advanced Settings dialog box.
Additional parameters might be available, depending on the operator. You can also add any parameter listed in the official Spark documentation.
As an example, suppose that you are parsing a large number of files using the Text Extractor, and the Spark job keeps failing or runs very slowly. Depending on your input data, you can take one of the following actions to correct the problem.
- If you have a large number of small or medium-sized files to parse (hundreds of thousands of files smaller than 40 MB) and the job is failing, try increasing the driver memory and the number of executors.
- If you have larger files to parse (larger than 90 MB) and the Spark job is failing, increase the executor memory so that a single executor can parse each of the larger files. Also increase the driver memory.
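The two cases above can be summarized as a small decision helper. This is only an illustrative sketch, not part of Team Studio: the function name is hypothetical, and it assumes that the settings map to the standard Spark properties spark.driver.memory, spark.executor.memory, and spark.executor.instances. The 40 MB and 90 MB thresholds come from the text. The text gives no guidance for files between those sizes, so the sketch returns nothing for that range.

```python
from typing import List

# Thresholds taken from the guidance above (in megabytes).
SMALL_FILE_MB = 40
LARGE_FILE_MB = 90


def settings_to_increase(typical_file_size_mb: float) -> List[str]:
    """Suggest which Spark properties to raise for a failing parsing job.

    Hypothetical helper: it only names the properties to increase and
    does not choose values, because suitable values depend on the cluster.
    """
    if typical_file_size_mb > LARGE_FILE_MB:
        # Larger files: each file must fit in a single executor.
        return ["spark.executor.memory", "spark.driver.memory"]
    if typical_file_size_mb < SMALL_FILE_MB:
        # Many small or medium files: more driver memory and more executors.
        return ["spark.driver.memory", "spark.executor.instances"]
    # The guidance above does not cover files between 40 MB and 90 MB.
    return []
```

For example, calling settings_to_increase(120) names the executor memory and driver memory properties, which matches the second case above.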
Data Source configuration
Spark settings can be changed on the data source itself. To do this, you must have access to the Hadoop cluster with Spark installed.
Tips and tricks
For more information about Spark optimization, see the following resources.
- Performance Tuning for SparkSQL
- Spark on YARN Parameters
- Spark Tuning Cheat-Sheet (mentioned in the video lecture)
- Top 5 Mistakes When Writing Spark Applications
- Advanced Settings Dialog Box
When Spark is enabled for an operator, you can apply the Automatic configuration for the Spark parameters, setting the default values to run the operator. However, you can edit these parameters directly.
- alpine.conf Spark Settings
The following settings are added to the Spark submission. You can edit Spark settings, Team Studio-specific Spark settings, and YARN settings. All of these values can be found and edited in the file alpine.conf. Any Spark tuning you define at the operator level takes precedence.
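The precedence rule can be pictured as a dictionary merge in which operator-level values override the defaults from alpine.conf. This is a minimal sketch of the idea only, not how Team Studio implements it: the function name and the example values are hypothetical, and the property names are standard Spark properties used purely for illustration.

```python
from typing import Dict


def effective_settings(alpine_conf: Dict[str, str],
                       operator_level: Dict[str, str]) -> Dict[str, str]:
    """Combine two layers of Spark settings.

    Entries unpacked later win, so any value defined at the operator
    level replaces the value that came from alpine.conf.
    """
    return {**alpine_conf, **operator_level}


# Hypothetical values, for illustration only.
defaults = {"spark.driver.memory": "2g", "spark.executor.memory": "4g"}
overrides = {"spark.executor.memory": "8g"}
merged = effective_settings(defaults, overrides)
```

In this example, merged keeps the driver memory from the defaults and takes the executor memory from the operator-level overrides.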
Copyright © 2021. Cloud Software Group, Inc. All Rights Reserved.