With the connector for Apache Spark SQL, you can access data from Apache Spark SQL databases and from Databricks.
For help getting started with accessing data from Apache Spark SQL and Databricks in Spotfire, see the following resources:
The Apache Spark SQL connector requires that you install a driver. See the system requirements to find the correct driver. See also Getting Started with Connectors to learn more about getting access to connectors in Spotfire.
To learn how to get started and create a new data connection, see Connecting to a data source.
To learn what data sources you can connect to, see Drivers and data sources in Spotfire.
To learn more about data connections and connectors, see What is a data connection?
The following functionality is available when you access data with the connector for Apache Spark SQL.
| Feature | Supported? |
| --- | --- |
| Load methods | |
| Data types | |
| Functions | Supported functions for in-database data from Apache Spark SQL |
| Custom queries | Yes |
| Stored procedures | Yes |
| Custom connection properties | Yes |
| Single sign-on with identity provider | Yes |
| Authoring in web client | Yes |
| Supported on Linux Web Player | Yes |
The following are the supported data source properties that you can configure when you create a data connection with the connector for Apache Spark SQL. To learn more, see Properties in connection data sources.
Note: For more information about the properties and the corresponding settings in the driver software, see the official documentation from the driver vendor.
| Option | Description |
| --- | --- |
| Server | The name of the server where your data is located. To include the port number that the Spark Thrift Server listens on, add it directly after the server name, preceded by a colon. Note: If you do not specify a port number, the port number 10000 is used, which is the default port that Spark Thrift Server listens on. |
| Authentication method | The authentication method to use when logging in to the database. Choose from the available methods in the drop-down menu. |
| Host FQDN | [Only applicable when Kerberos authentication is selected.] The fully qualified domain name of the Spark Thrift Server host. For more information about the host FQDN, contact your Apache Spark SQL system administrator. |
| Service name | [Only applicable when Kerberos authentication is selected.] The Kerberos service principal name of the Spark server. For example, "spark". For more information about the service name, contact your Apache Spark SQL system administrator. |
| Realm | [Only applicable when Kerberos authentication is selected.] The realm of the Spark Thrift Server host. Leave blank if a default Kerberos realm has been configured for your Kerberos setup. For more information about the realm, contact your Apache Spark SQL system administrator. |
| Use secure sockets layer (SSL) | Select this check box to connect using SSL. Note: SSL is enabled by default. |
| Allow common name host name mismatch | [Only applicable when Use secure sockets layer (SSL) is selected.] Select this check box to allow a server certificate whose common name does not match the host name of the server. |
| Allow self-signed server certificate | [Only applicable when Use secure sockets layer (SSL) is selected.] Select this check box to allow self-signed certificates from the server. |
| Thrift transport mode | Select the transport mode to use when sending requests to the Spark Thrift Server. Choose from the available modes in the drop-down menu. |
| Identity provider | [Only applicable when Identity provider (OAuth2) authentication is selected.] Select the identity provider you want to use for logging in to the data source. The options available in the drop-down menu are the identity providers you have added to the OAuth2IdentityProviders preference. |
| Scopes | [Only applicable when Identity provider (OAuth2) authentication is selected.] Scopes determine what permissions Spotfire requests on your behalf when you log in to the data source. Default: Use the default scopes that you have specified for your identity provider in the OAuth2IdentityProviders preference. Custom: Enter scopes manually in the text box, separating values with a space. For example: Scope_1 Scope_2 |
| HTTP Path | [Only available when Thrift transport mode HTTP is selected.] Specify the partial URL that corresponds to the Spark server you are connecting to. Note: The partial URL is appended to the host and port specified in the Server field. For example, to connect to the HTTP address http://example.com:10002/gateway/default/spark, enter example.com:10002 as the server and /gateway/default/spark as the HTTP path. |
| Connection timeout (s) | The maximum time, in seconds, allowed for a connection to the database to be established. The default value is 120 seconds. |
| Command timeout (s) | The maximum time, in seconds, allowed for a command to be executed. The default value is 1800 seconds. |
| Catalog | The catalog to access data from. |
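To see how the options above fit together, the following sketch builds an ODBC-style connection string for a Spark Thrift Server from the same settings (server, port, SSL, transport mode, HTTP path). The key names (HOST, PORT, SSL, ThriftTransport, HTTPPath) and the driver name are assumptions based on common Spark ODBC drivers; verify them against the documentation for the driver version you installed.

```python
def build_spark_odbc_connection_string(
    host,
    port=10000,          # default Spark Thrift Server port, per the table above
    use_ssl=True,        # SSL is enabled by default in the connector
    transport="binary",  # or "http"; with "http", also pass http_path
    http_path=None,
):
    """Sketch of a Spark ODBC connection string. Key names follow
    common Spark ODBC driver conventions and are assumptions here;
    check your driver vendor's documentation."""
    parts = {
        "Driver": "Simba Spark ODBC Driver",  # hypothetical; depends on your installation
        "HOST": host,
        "PORT": str(port),
        "SSL": "1" if use_ssl else "0",
        # Common driver convention: 0 = binary, 2 = HTTP (verify in the driver manual)
        "ThriftTransport": "2" if transport == "http" else "0",
    }
    if transport == "http" and http_path:
        parts["HTTPPath"] = http_path
    return ";".join(f"{k}={v}" for k, v in parts.items())

# Mirrors the HTTP Path example from the table above:
print(build_spark_odbc_connection_string(
    "example.com", 10002, transport="http", http_path="/gateway/default/spark"))
```

Note how the HTTP path is kept separate from the host and port, just as in the Server and HTTP Path fields in Spotfire.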
Custom properties for Apache Spark SQL connection data sources
The following is the default list of driver settings that are allowed as custom properties in Apache Spark SQL connection data sources. To learn how to change the allowed custom properties, see Controlling what properties are allowed.
Default allowed custom properties
ADUserNameCase, AOSS_AuthMech, AOSS_CheckCertRevocation, AOSS_Min_TLS, AOSS_PWD, AOSS_TrustedCerts,
AOSS_UID, AOSS_UseSystemTrustStore, AsyncExecPollInterval, AutoReconnect, BinaryColumnLength,
Canonicalization, CheckCertRevocation, ClientCert, ClientPrivateKey, ClientPrivateKeyPassword,
ClusterAutostartRetry, ClusterAutostartRetryTimeout, DecimalColumnScale, DefaultStringColumnLength,
DelegateKrbCreds, DelegationUID, DriverConfigTakePrecedence, EnableAsyncExec, EnablePKFK,
EnableQueryResultDownload, EnableStragglerDownloadMitigation, EnableSynchronousDownloadFallback,
FastSQLPrepare, ForceSynchronousExec, HTTPAuthCookies, InvalidSessionAutoRecover, LCaseSspKeyName,
MaximumStragglersPerQuery, Min_TLS, ProxyHost, ProxyPort, ProxyPWD, ProxyUID, QueryTimeoutOverride,
RateLimitRetry, RateLimitRetryTimeout, RowsFetchedPerBlock, ServiceDiscoveryMode, ShowSystemTable,
SocketTimeout, StragglerDownloadMultiplier, StragglerDownloadPadding, StragglerDownloadQuantile,
ThrowOnUnsupportedPkFkRestriction, TrustedCerts, TwoWaySSL, UseNativeQuery, UseOnlySSPI, UseProxy,
UseSystemTrustStore, UseUnicodeSqlCharacterTypes
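As a sketch of what custom properties look like in practice, each one is a name/value pair taken from the allowed list above. The values here are illustrative only; consult the driver vendor's documentation for valid values:

```
SocketTimeout=60
RowsFetchedPerBlock=10000
```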
Working with and troubleshooting Apache Spark SQL data connections
The following is information specifically about working with data from an Apache Spark SQL connection.
Prerequisite: Spark Thrift Server
To access data in Apache Spark SQL with the Spotfire connector for Apache Spark SQL, the Spark Thrift Server must be installed on your cluster. Spark Thrift Server provides access to Spark SQL via ODBC, and it might not be included by default on some Hadoop distributions.
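Because a missing or unreachable Spark Thrift Server explains many connection failures, a quick TCP reachability check can help narrow down the cause. This is a minimal sketch, assuming the default port 10000; a reachable port does not prove the service is healthy, but an unreachable one usually points to the Thrift Server not running or a firewall blocking it.

```python
import socket

def thrift_port_open(host, port=10000, timeout=5.0):
    """Return True if a TCP connection to the Spark Thrift Server
    port succeeds, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or host not found
        return False

# Example (hypothetical host name):
# thrift_port_open("sparkthrift.example.com")
```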
Prerequisite: spark.shuffle.service.enabled
If you use the in-database load method when connecting to Apache Spark 2.1 or later, and you encounter errors in your analysis, the option spark.shuffle.service.enabled might have to be enabled on the Spark server.
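The setting is a standard Spark configuration property, typically enabled on the server side; for example, in spark-defaults.conf:

```
# spark-defaults.conf (on the Spark server)
spark.shuffle.service.enabled  true
```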
Connecting to Databricks SQL Analytics
You can also create an Apache Spark SQL connection for performing Databricks SQL Analytics queries. To be able to connect to Databricks, you must install the Databricks ODBC driver. Check the system requirements for the Apache Spark SQL connector, and see Drivers and data sources in Spotfire to find the right driver.
Databricks cluster that is not running
When connecting to a Databricks cluster that is not already running, the first connection attempt will trigger the cluster to start. This can take several minutes. The Database selection menu will be populated once Spotfire is connected successfully. You may have to click Connect again if the connection times out.
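The retry behavior described above can be sketched generically: keep retrying the connection attempt while the cluster starts. The `connect` callable here is hypothetical and stands in for whatever the driver does when you click Connect; it is not Spotfire's actual implementation.

```python
import time

def connect_with_retry(connect, attempts=3, wait_s=60):
    """Retry a connection while a stopped Databricks cluster starts.
    `connect` is any zero-argument callable that returns a connection
    object or raises (e.g. a driver timeout) while the cluster is
    still starting."""
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(wait_s)  # give the cluster time to come up
    raise last_error
```

This mirrors the manual workflow in Spotfire: the first Connect triggers the cluster start, and a later attempt succeeds once the cluster is running.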