Data from Apache Spark SQL – Apache Spark SQL Connector


With the connector for Apache Spark SQL, you can access data from Apache Spark SQL databases and from Databricks.

Get started

For help getting started with accessing data from Apache Spark SQL and Databricks in Spotfire, see the following resources:

Connector features

The following functionality is available when you access data with the connector for Apache Spark SQL.

  • Load methods: Import (in-memory), External (in-database), and On-demand. Learn more...

  • Data types: Supported data types in connections to Apache Spark SQL. Learn more...

  • Functions: Supported functions for in-database data from Apache Spark SQL.

  • Custom queries: Yes. Learn more...

  • Stored procedures: Yes. Learn more...

  • Custom connection properties: Yes. Learn more...

  • Single sign-on with identity provider: Yes. Learn more...

  • Authoring in web client: Yes.

  • Supported on Linux Web Player: Yes.

Data source properties

The following are the supported data source properties that you can configure when you create a data connection with the connector for Apache Spark SQL. To learn more, see Properties in connection data sources.

Note: For more information about the properties and the corresponding settings in the driver software, see the official documentation from the driver vendor.

Server

The name of the server where your data is located. To include the port number that the Spark Thrift Server listens on, add it directly after the name, preceded by a colon. For example:
MyDatabaseServer:10001

Note: If you do not specify a port number, port 10000 is used, which is the default port that the Spark Thrift Server listens on.

For an example of how the server and port map to an ODBC connection string, see the sketch at the end of this section.

Authentication method

The authentication method to use when logging into the database. Choose from

  • No authentication

  • Kerberos

  • Username

  • Username and password

  • Microsoft Azure HDInsight Service

  • Identity provider (OAuth2)

Host FQDN

[Only applicable when Kerberos authentication is selected.]

The fully qualified domain name of the Spark Thrift Server host. For more information about the host FQDN, contact your Apache Spark SQL system administrator.

Service name

[Only applicable when Kerberos authentication is selected.]

The Kerberos service principal name of the Spark server. For example, "spark". For more information about the service name, contact your Apache Spark SQL system administrator.

Realm

[Only applicable when Kerberos authentication is selected.]

The realm of the Spark Thrift Server host. Leave blank if a default Kerberos realm has been configured for your Kerberos setup. For more information about the realm, contact your Apache Spark SQL system administrator.

Use secure sockets layer (SSL)

Select this check box to connect using SSL.

Note: By default, SSL is enabled.

   Allow common name host name mismatch

[Only applicable when Use secure sockets layer (SSL) is selected.]

Select this check box to allow a server certificate whose common name does not match the host name of the server.

   Allow self-signed server certificate

[Only applicable when Use secure sockets layer (SSL) is selected.]

Select this check box to allow self-signed certificates from the server.

Thrift transport mode

Select the transport mode to use when sending requests to the Spark Thrift Server. Choose from

  • Default (The Spark SQL ODBC driver will use either binary or SASL, depending on the Spark Server version you are connecting to.)

  • Binary

  • SASL

  • HTTP

Identity provider

[Only applicable when Identity provider (OAuth2) authentication is selected.]

Select the identity provider you want to use for logging in to the data source.

The options available in the drop-down menu are the identity providers you have added to the OAuth2IdentityProviders preference.

Scopes

[Only applicable when Identity provider (OAuth2) authentication is selected.]

Scopes determine what permissions Spotfire requests on your behalf when you log in to the data source.

Default

Use the default scopes that you have specified for your identity provider in the OAuth2IdentityProviders preference.

Custom

Enter scopes manually in the text box. Separate values with a space.

Scope_1 Scope_2  

HTTP Path

[Only available when Thrift transport mode HTTP is selected.]

Specify the partial URL that corresponds to the Spark server you are connecting to.

Note: The partial URL is appended to the host and port specified in the server field. For example, to connect to the HTTP address http://example.com:10002/gateway/default/spark, enter example.com:10002 as the server and /gateway/default/spark as the HTTP path.

Connection timeout (s)

The maximum time, in seconds, allowed for a connection to the database to be established.

The default value is 120 seconds.

Command timeout (s)

The maximum time, in seconds, allowed for a command to be executed.

The default value is 1800 seconds.

Catalog

The catalog to access data from.
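
To illustrate how several of the options above correspond to ODBC driver settings, the following minimal sketch uses Python's pyodbc module to build a connection string for the HTTP example given under HTTP Path. The driver name and keyword spellings (Host, Port, AuthMech, ThriftTransport, HTTPPath, SSL) are assumptions based on common Simba-based Spark ODBC drivers; Spotfire normally builds this string for you from the dialog settings, so treat this only as a sketch and check your driver vendor's documentation for the exact keywords.

    # A minimal sketch, not Spotfire's own implementation. The driver name and keyword
    # spellings below are assumptions based on common Simba-based Spark ODBC drivers;
    # verify them against the documentation for the driver you have installed.
    import pyodbc

    connection_string = ";".join([
        "DRIVER={Simba Spark ODBC Driver}",   # assumed driver name
        "Host=example.com",                   # the Server field, without the port
        "Port=10002",                         # 10000 is used if no port is specified
        "AuthMech=3",                         # 3 = username and password (assumed mapping)
        "UID=myuser",
        "PWD=mypassword",
        "ThriftTransport=2",                  # 2 = HTTP (assumed mapping)
        "HTTPPath=/gateway/default/spark",    # the HTTP Path option
        "SSL=1",                              # SSL is enabled by default in the connector
    ])

    # The timeout argument corresponds roughly to the Connection timeout (s) setting.
    connection = pyodbc.connect(connection_string, timeout=120)
    cursor = connection.cursor()
    cursor.execute("SELECT 1")
    print(cursor.fetchone())
    connection.close()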

Custom properties for Apache Spark SQL connection data sources

The following is the default list of driver settings that are allowed as custom properties in Apache Spark SQL connection data sources. To learn how to change the allowed custom properties, see Controlling what properties are allowed.

Default allowed custom properties

ADUserNameCase, AOSS_AuthMech, AOSS_CheckCertRevocation, AOSS_Min_TLS, AOSS_PWD, AOSS_TrustedCerts,
AOSS_UID, AOSS_UseSystemTrustStore, AsyncExecPollInterval, AutoReconnect, BinaryColumnLength,
Canonicalization, CheckCertRevocation, ClientCert, ClientPrivateKey, ClientPrivateKeyPassword,
ClusterAutostartRetry, ClusterAutostartRetryTimeout, DecimalColumnScale, DefaultStringColumnLength,
DelegateKrbCreds, DelegationUID, DriverConfigTakePrecedence, EnableAsyncExec, EnablePKFK,
EnableQueryResultDownload, EnableStragglerDownloadMitigation, EnableSynchronousDownloadFallback,
FastSQLPrepare, ForceSynchronousExec, HTTPAuthCookies, InvalidSessionAutoRecover, LCaseSspKeyName,
MaximumStragglersPerQuery, Min_TLS, ProxyHost, ProxyPort, ProxyPWD, ProxyUID, QueryTimeoutOverride,
RateLimitRetry, RateLimitRetryTimeout, RowsFetchedPerBlock, ServiceDiscoveryMode, ShowSystemTable,
SocketTimeout, StragglerDownloadMultiplier, StragglerDownloadPadding, StragglerDownloadQuantile,
ThrowOnUnsupportedPkFkRestriction, TrustedCerts, TwoWaySSL, UseNativeQuery, UseOnlySSPI, UseProxy,
UseSystemTrustStore, UseUnicodeSqlCharacterTypes
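
For example, two of these driver settings might be entered as name-value pairs when you add custom properties to a connection data source. The values below are placeholders; the meaning and valid ranges of each setting are defined by the driver, so consult the driver vendor's documentation.

    SocketTimeout = 60
    RowsFetchedPerBlock = 10000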

 

Working with and troubleshooting Apache Spark SQL data connections

The following is information specifically about working with data from an Apache Spark SQL connection.

Prerequisite: Spark Thrift Server

To access data in Apache Spark SQL with the Spotfire connector for Apache Spark SQL, the Spark Thrift Server must be installed on your cluster. Spark Thrift Server provides access to Spark SQL via ODBC, and it might not be included by default on some Hadoop distributions.

Prerequisite: spark.shuffle.service.enabled

If you use the in-database load method when connecting to Apache Spark 2.1 or later, and you encounter errors in your analysis, the option spark.shuffle.service.enabled might have to be enabled on the Spark server.
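
spark.shuffle.service.enabled is a standard Spark configuration property. As a sketch of what a Spark administrator would typically change, it can be set in spark-defaults.conf on the Spark server; the exact procedure depends on your distribution and cluster manager.

    spark.shuffle.service.enabled  true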

Connecting to Databricks SQL Analytics

You can also create an Apache Spark SQL connection for performing Databricks SQL Analytics queries. To be able to connect to Databricks, you must install the Databricks ODBC driver. Check the system requirements for the Apache Spark SQL connector, and see Drivers and data sources in Spotfire to find the right driver.
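
As a rough sketch of the pieces involved when reaching Databricks over ODBC, the following Python/pyodbc example uses the HTTP transport and a personal access token. The workspace host, HTTP path, and token are placeholders, and the keyword names are assumptions based on the Databricks (Simba-based) ODBC driver, so verify them against the driver documentation. In Spotfire, you enter the corresponding values in the connection dialog instead of writing a connection string yourself.

    # Sketch only: placeholder host, HTTP path, and token. Keyword names assume the
    # Databricks (Simba-based) ODBC driver; check the driver documentation.
    import pyodbc

    connection = pyodbc.connect(";".join([
        "DRIVER={Simba Spark ODBC Driver}",
        "Host=adb-1234567890123456.7.azuredatabricks.net",  # placeholder workspace host
        "Port=443",
        "SSL=1",
        "ThriftTransport=2",                    # HTTP transport
        "HTTPPath=/sql/1.0/warehouses/abc123",  # placeholder SQL warehouse path
        "AuthMech=3",                           # username and password
        "UID=token",                            # the literal word 'token' for token authentication
        "PWD=dapiXXXXXXXXXXXXXXXX",             # placeholder personal access token
    ]))
    print(connection.cursor().execute("SELECT 1").fetchone())
    connection.close()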

Databricks cluster that is not running  

When connecting to a Databricks cluster that is not already running, the first connection attempt will trigger the cluster to start. This can take several minutes. The Database selection menu will be populated once Spotfire is connected successfully. You may have to click Connect again if the connection times out.

See also:

Apache Spark SQL Data Types

Supported Functions