Data from Apache Spark SQL – Apache Spark SQL Connector


With the connector for Apache Spark SQL, you can access data from Apache Spark SQL databases and from Databricks.

Get started

For help getting started with accessing data from Apache Spark SQL and Databricks in Spotfire, see the following resources:

Connector features

The following functionality is available when you access data with the connector for Apache Spark SQL.

  • Load methods: Import (in-memory), External (in-database), and On-demand. Learn more...

  • Data types: Supported data types in connections to Apache Spark SQL. Learn more...

  • Functions: Supported functions for in-database data from Apache Spark SQL.

  • Custom queries: Yes. Learn more...

  • Stored procedures: Yes. Learn more...

  • Custom connection properties: Yes. Learn more...

  • Single sign-on with identity provider: Yes. Learn more...

  • Authoring in web client: Yes.

  • Supported on Linux Web Player: Yes.

Data source properties

The following are the supported data source properties that you can configure when you create a data connection with the connector for Apache Spark SQL. To learn more, see Properties in connection data sources.

Note: For more information about the properties and the corresponding settings in the driver software, see the official documentation from the driver vendor.

Server

The name of the server where your data is located. To include the port number that the Spark Thrift Server listens on, add it directly after the name, preceded by a colon. For example:
MyDatabaseServer:10001

Note: If you do not specify a port number, port 10000 is used, which is the default port that the Spark Thrift Server listens on.

For an example of how the server and port map to an ODBC connection string, see the sketch at the end of this section.

Authentication method

The authentication method to use when logging into the database. Choose from

  • No authentication

  • Kerberos

  • Username

  • Username and password

  • Microsoft Azure HDInsight Service

  • Identity provider (OAuth2)

Host FQDN

[Only applicable when Kerberos authentication is selected.]

The fully qualified domain name of the Spark Thrift Server host. For more information about the host FQDN, contact your Apache Spark SQL system administrator.

Service name

[Only applicable when Kerberos authentication is selected.]

The Kerberos service principal name of the Spark server. For example, "spark". For more information about the service name, contact your Apache Spark SQL system administrator.

Realm

[Only applicable when Kerberos authentication is selected.]

The realm of the Spark Thrift Server host. Leave blank if a default Kerberos realm has been configured for your Kerberos setup. For more information about the realm, contact your Apache Spark SQL system administrator.

Use secure sockets layer (SSL)

Select this check box to connect using SSL.

Note: By default, SSL is enabled.

   Allow common name host name mismatch

[Only applicable when Use secure sockets layer (SSL) is selected.]

Select this check box to allow a server certificate whose common name does not match the host name of the server.

   Allow self-signed server certificate

[Only applicable when Use secure sockets layer (SSL) is selected.]

Select this check box to allow self-signed certificates from the server.

Thrift transport mode

Select the transport mode to use when sending requests to the Spark Thrift Server. Choose from

  • Default (The Spark SQL ODBC driver will use either binary or SASL, depending on the Spark Server version you are connecting to.)

  • Binary

  • SASL

  • HTTP

Identity provider

[Only applicable when Identity provider (OAuth2) authentication is selected.]

Select the identity provider you want to use for logging in to the data source.

The options available in the drop-down menu are the identity providers you have added to the OAuth2IdentityProviders preference.

Scopes

[Only applicable when Identity provider (OAuth2) authentication is selected.]

Scopes determine what permissions Spotfire requests on your behalf when you log in to the data source.

Default

Use the default scopes that you have specified for your identity provider in the OAuth2IdentityProviders preference.

Custom

Enter scopes manually in the text box. Separate values with a space.

Scope_1 Scope_2  

HTTP Path

[Only available when Thrift transport mode HTTP is selected.]

Specify the partial URL that corresponds to the Spark server you are connecting to.

Note: The partial URL is appended to the host and port specified in the server field. For example, to connect to the HTTP address http://example.com:10002/gateway/default/spark, enter example.com:10002 as the server and /gateway/default/spark as the HTTP path.

Connection timeout (s)

The maximum time, in seconds, allowed for a connection to the database to be established.

The default value is 120 seconds.

Command timeout (s)

The maximum time, in seconds, allowed for a command to be executed.

The default value is 1800 seconds.

Catalog

The catalog to access data from.
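
To illustrate how several of the options above correspond to ODBC driver settings, the following minimal sketch uses Python's pyodbc module to build a connection string for the HTTP example given under HTTP Path. The driver name and keyword spellings (Host, Port, AuthMech, ThriftTransport, HTTPPath, SSL) are assumptions based on common Simba-based Spark ODBC drivers; Spotfire normally builds this string for you from the dialog settings, so treat this only as a sketch and check your driver vendor's documentation for the exact keywords.

    # A minimal sketch, not Spotfire's own implementation. The driver name and keyword
    # spellings below are assumptions based on common Simba-based Spark ODBC drivers;
    # verify them against the documentation for the driver you have installed.
    import pyodbc

    connection_string = ";".join([
        "DRIVER={Simba Spark ODBC Driver}",   # assumed driver name
        "Host=example.com",                   # the Server field, without the port
        "Port=10002",                         # 10000 is used if no port is specified
        "AuthMech=3",                         # 3 = username and password (assumed mapping)
        "UID=myuser",
        "PWD=mypassword",
        "ThriftTransport=2",                  # 2 = HTTP (assumed mapping)
        "HTTPPath=/gateway/default/spark",    # the HTTP Path option
        "SSL=1",                              # SSL is enabled by default in the connector
    ])

    # The timeout argument corresponds roughly to the Connection timeout (s) setting.
    connection = pyodbc.connect(connection_string, timeout=120)
    cursor = connection.cursor()
    cursor.execute("SELECT 1")
    print(cursor.fetchone())
    connection.close()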

Custom properties for Apache Spark SQL connection data sources

The following is the default list of driver settings that are allowed as custom properties in Apache Spark SQL connection data sources. To learn how to change the allowed custom properties, see Controlling what properties are allowed.

Default allowed custom properties

ADUserNameCase, AOSS_AuthMech, AOSS_CheckCertRevocation, AOSS_Min_TLS, AOSS_PWD, AOSS_TrustedCerts,
AOSS_UID, AOSS_UseSystemTrustStore, AsyncExecPollInterval, AutoReconnect, BinaryColumnLength,
Canonicalization, CheckCertRevocation, ClientCert, ClientPrivateKey, ClientPrivateKeyPassword,
ClusterAutostartRetry, ClusterAutostartRetryTimeout, DecimalColumnScale, DefaultStringColumnLength,
DelegateKrbCreds, DelegationUID, DriverConfigTakePrecedence, EnableAsyncExec, EnablePKFK,
EnableQueryResultDownload, EnableStragglerDownloadMitigation, EnableSynchronousDownloadFallback,
FastSQLPrepare, ForceSynchronousExec, HTTPAuthCookies, InvalidSessionAutoRecover, LCaseSspKeyName,
MaximumStragglersPerQuery, Min_TLS, ProxyHost, ProxyPort, ProxyPWD, ProxyUID, QueryTimeoutOverride,
RateLimitRetry, RateLimitRetryTimeout, RowsFetchedPerBlock, ServiceDiscoveryMode, ShowSystemTable,
SocketTimeout, StragglerDownloadMultiplier, StragglerDownloadPadding, StragglerDownloadQuantile,
ThrowOnUnsupportedPkFkRestriction, TrustedCerts, TwoWaySSL, UseNativeQuery, UseOnlySSPI, UseProxy,
UseSystemTrustStore, UseUnicodeSqlCharacterTypes
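
For example, two of these driver settings might be entered as name-value pairs when you add custom properties to a connection data source. The values below are placeholders; the meaning and valid ranges of each setting are defined by the driver, so consult the driver vendor's documentation.

    SocketTimeout = 60
    RowsFetchedPerBlock = 10000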

 

Working with and troubleshooting Apache Spark SQL data connections

The following is information specifically about working with data from an Apache Spark SQL connection.

Prerequisite: Spark Thrift Server

To access data in Apache Spark SQL with the Spotfire connector for Apache Spark SQL, the Spark Thrift Server must be installed on your cluster. Spark Thrift Server provides access to Spark SQL via ODBC, and it might not be included by default on some Hadoop distributions.

Prerequisite: spark.shuffle.service.enabled

If you use the in-database load method when connecting to Apache Spark 2.1 or later, and you encounter errors in your analysis, the option spark.shuffle.service.enabled might have to be enabled on the Spark server.
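
spark.shuffle.service.enabled is a standard Spark configuration property. As a sketch of what a Spark administrator would typically change, it can be set in spark-defaults.conf on the Spark server; the exact procedure depends on your distribution and cluster manager.

    spark.shuffle.service.enabled  true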

Connecting to Databricks SQL Analytics

You can also create an Apache Spark SQL connection for performing Databricks SQL Analytics queries. To be able to connect to Databricks, you must install the Databricks ODBC driver. Check the system requirements for the Apache Spark SQL connector, and see Drivers and data sources in Spotfire to find the right driver.
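
As a rough sketch of the pieces involved when reaching Databricks over ODBC, the following Python/pyodbc example uses the HTTP transport and a personal access token. The workspace host, HTTP path, and token are placeholders, and the keyword names are assumptions based on the Databricks (Simba-based) ODBC driver, so verify them against the driver documentation. In Spotfire, you enter the corresponding values in the connection dialog instead of writing a connection string yourself.

    # Sketch only: placeholder host, HTTP path, and token. Keyword names assume the
    # Databricks (Simba-based) ODBC driver; check the driver documentation.
    import pyodbc

    connection = pyodbc.connect(";".join([
        "DRIVER={Simba Spark ODBC Driver}",
        "Host=adb-1234567890123456.7.azuredatabricks.net",  # placeholder workspace host
        "Port=443",
        "SSL=1",
        "ThriftTransport=2",                    # HTTP transport
        "HTTPPath=/sql/1.0/warehouses/abc123",  # placeholder SQL warehouse path
        "AuthMech=3",                           # username and password
        "UID=token",                            # the literal word 'token' for token authentication
        "PWD=dapiXXXXXXXXXXXXXXXX",             # placeholder personal access token
    ]))
    print(connection.cursor().execute("SELECT 1").fetchone())
    connection.close()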

Databricks cluster that is not running  

When connecting to a Databricks cluster that is not already running, the first connection attempt will trigger the cluster to start. This can take several minutes. The Database selection menu will be populated once Spotfire is connected successfully. You may have to click Connect again if the connection times out.

See also:

Apache Spark SQL Data Types

Supported Functions