Spark Autotuning

Tuning Spark parameters can be confusing. To improve the performance of Spark jobs, Team Studio includes automatic optimization.

Based on the size of your cluster, the resources available in your queue, the size of the input data, and what is known about the operator, Team Studio can assign Spark parameters dynamically at runtime. Spark autotuning is currently available on the following operators:

  • Aggregation
  • Alpine Forest Classification
  • Alpine Forest Regression
  • ARIMA Time Series
  • Association Rules
  • Batch Aggregation
  • Classification Threshold Metrics
  • Collapse
  • Column Filter
  • Correlation
  • Correlation Filter
  • Distinct
  • Fuzzy Join
  • Gradient Boosting Classification
  • Gradient Boosting Regression
  • Join
  • K-Means
  • LDA Predictor
  • LDA Trainer
  • Linear Regression
  • Logistic Regression
  • N-Gram Dictionary Builder
  • Naive Bayes
  • Neural Network
  • Normalization
  • Null Value Replacement
  • Numeric to Text
  • Pivot
  • Replace Outliers
  • Resampling
  • Row Filter
  • Set Operations
  • Sort by Multiple Columns
  • Stability Selection
  • Summary Statistics
  • Text Extractor
  • Text Featurizer
  • Transpose
  • Unpivot
  • Variable
  • Window Functions - Aggregate
  • Window Functions - Lag/Lead
  • Window Functions - Rank

No action is required to enable Spark autotuning; these operators apply Automatic Optimization by default. For a greater degree of control, edit the advanced configuration for each of the Spark settings.

Team Studio sets the following Spark parameters (a hand-configured example follows the list).

  • spark.executor.memory
  • spark.driver.memory
  • spark.executor.cores
  • spark.default.parallelism and spark.sql.shuffle.partitions
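
These are standard Spark configuration keys. The sketch below (in Scala) shows what setting them by hand looks like; the values are purely illustrative, not ones Team Studio would necessarily compute.

    import org.apache.spark.SparkConf

    // Illustrative values only; Team Studio computes these per job at runtime.
    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")         // heap size of each executor
      .set("spark.driver.memory", "2g")           // heap size of the driver
      .set("spark.executor.cores", "2")           // concurrent tasks per executor
      .set("spark.default.parallelism", "200")    // default partition count for RDD operations
      .set("spark.sql.shuffle.partitions", "200") // partition count after DataFrame shuffles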

Additionally, Team Studio can detect whether dynamic allocation is enabled on the cluster and, if so, use it to choose the maximum number of executors (spark.dynamicAllocation.maxExecutors and spark.dynamicAllocation.enabled). If dynamic allocation is not enabled on the cluster, Team Studio sets spark.executor.instances based on your cluster size, input data, and the current operator.
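
The fallback can be pictured roughly as follows. This is a hypothetical sketch, not Team Studio's implementation; the function name, parameters, and sizing formulas are invented for illustration.

    // Hypothetical sketch only: the names and formulas here are invented.
    def chooseExecutorSettings(
        dynamicAllocationEnabled: Boolean, // detected from the cluster configuration
        clusterCores: Int,                 // total cores available in the queue
        inputSizeGb: Double                // size of the operator's input data
    ): Map[String, String] = {
      if (dynamicAllocationEnabled) {
        // Let the cluster scale executors up and down, but cap the maximum.
        Map(
          "spark.dynamicAllocation.enabled" -> "true",
          "spark.dynamicAllocation.maxExecutors" -> (clusterCores / 4).toString
        )
      } else {
        // No dynamic allocation: fix the executor count up front.
        val executors = math.max(2, math.min(clusterCores / 4, math.ceil(inputSizeGb / 2).toInt))
        Map("spark.executor.instances" -> executors.toString)
      }
    }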

To override the settings, set Automatic Optimization to No, and then edit the settings provided in the Advanced Settings dialog box or add your own key/value pairs. Team Studio always uses a setting provided by the user instead of computing its own.
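
In other words, user-supplied pairs take precedence over autotuned values. A minimal sketch of this assumed merge behavior:

    // Assumed precedence behavior; the maps here are illustrative.
    val autotuned    = Map("spark.executor.memory" -> "4g", "spark.executor.cores" -> "2")
    val userProvided = Map("spark.executor.memory" -> "8g") // from the Advanced Settings dialog
    val effective    = autotuned ++ userProvided            // right-hand operand wins on conflicts
    // effective("spark.executor.memory") == "8g"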