Accessing data from Apache Spark SQL and Databricks
You can access data from Spark SQL and Databricks systems in Spotfire.
Before you begin
- The Apache Spark SQL connector requires a driver on the computer running Spotfire. See Drivers and data sources in Spotfire.
- To make sure that your database is supported, see the system requirements for the Apache Spark SQL connector.
Procedure
- Open the Files and data flyout, and click Connect to.
- In the list of data sources, select Apache Spark SQL or Databricks.
- In the panel on the right, choose whether you want to create a new connection or add data from a shared data connection.
- Connector for Apache Spark SQL — Features and settings
You can connect to and access data from Spark SQL databases and Databricks with the data connector for Apache Spark SQL. On this page, you can find information about the capabilities, available settings, and things to keep in mind when you work with data connections to Apache Spark SQL.
Working with and troubleshooting Apache Spark SQL data connections
Prerequisite: Spark Thrift Server
To access data in Apache Spark SQL with the Spotfire connector for Apache Spark SQL, the Spark Thrift Server must be installed on your cluster. Spark Thrift Server provides access to Spark SQL via ODBC, and it might not be included by default on some Hadoop distributions.
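If you administer the cluster yourself, the Thrift server is typically started with the script that ships with the Spark distribution. A minimal sketch, assuming a standard Spark installation under $SPARK_HOME (the master URI and port below are examples; adjust them to your environment):
$SPARK_HOME/sbin/start-thriftserver.sh --master yarn --hiveconf hive.server2.thrift.port=10000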
Prerequisite: spark.shuffle.service.enabled
If you use the in-database load method when connecting to Apache Spark 2.1 or later, and you encounter errors in your analysis, the option spark.shuffle.service.enabled might have to be enabled on the Spark server.
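As a sketch of what enabling the option can look like, you could set it in spark-defaults.conf on the Spark server (the exact configuration procedure depends on your distribution and cluster manager):
spark.shuffle.service.enabled true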
Connecting to Databricks SQL Analytics
You can also create an Apache Spark SQL connection to run Databricks SQL Analytics queries. To connect to Databricks, you must install the Databricks ODBC driver. Check the system requirements for the Apache Spark SQL connector, and see Drivers and data sources in Spotfire to find the right driver.
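If you configure the driver manually, for example in an ODBC DSN, a Databricks connection typically specifies the workspace host, an HTTP path, and token authentication. A hypothetical sketch (property names follow the Databricks ODBC driver documentation; verify them against your driver version, and replace the placeholders with values from your workspace):
Host=<workspace-host>; Port=443; HTTPPath=<http-path>; ThriftTransport=2; SSL=1; AuthMech=3; UID=token; PWD=<personal-access-token>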
Databricks cluster that is not running
When you connect to a Databricks cluster that is not already running, the first connection attempt triggers the cluster to start, which can take several minutes. The Database selection menu is populated once Spotfire has connected successfully. If the connection times out, you might have to click Connect again.
Apache Spark SQL temporary views and tables in custom queries
If you are creating a custom query and you want to use data from an Apache Spark SQL temporary table or view, you must refer to those objects by their qualified names, which specify both the name and the location of the object. Qualified names have the following format:
databaseName.tempViewName
By default, global temporary views are stored in the global_temp database. The database name can vary, and you can see it in the hierarchy of available database tables in Spotfire.
To select all columns from a global temporary view named myGlobalTempView that is stored in the global_temp database:
SELECT * FROM global_temp.myGlobalTempView
Temporary views/tables (listed in Spotfire under 'Temporary views' or 'Temporary tables') are always located in the #temp database. To select all columns in a temporary view named myTempView:
SELECT * FROM #temp.myTempView
User agent tagging
If the ODBC driver that you use supports the UserAgentEntry option, Spotfire includes the following string as the UserAgentEntry in queries:
TIBCOSpotfire/<ProductVersion>
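For example, with a hypothetical Spotfire version 12.5.0, the entry sent by the driver would look like the following (the actual version string depends on your installation):
TIBCOSpotfire/12.5.0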