Working with In-Database Data


When you are working with data from an external data source (in-database or in-db data), there are a number of differences to keep in mind compared to working with data in memory. Some features work differently or are unavailable; see the table below for more information. If you are working with streaming data, there are some additional differences to keep in mind; see Working with Streaming Data to learn more.

One thing to keep in mind when working with in-db data is that changes to the underlying database schema are not automatically reflected in the Spotfire analysis. This means that if a column is added to a database table, you need to perform a Refresh Schema operation in Spotfire to see the new column in the analysis. Not all users have sufficient database privileges to perform a full schema refresh. However, most users can update changes to the actual data using a simple Reload/Refresh Data operation.

Note: Before you can work with in-db data in Spotfire, there are a few prerequisites that need to be met. For details, see TIBCO Spotfire® System Requirements.

Difference with in-memory

Why?

The table visualization and the Details-on-Demand can only display a limited number of rows (10 000 by default) from a relational in-db data source.

Virtual columns are not supported for in-db data.

There is a preference setting limiting the maximum number of rows that can be shown in table visualizations and Details-on-Demand, to prevent queries against large databases from overwhelming Spotfire.

The default limit is 10 000. A Spotfire administrator can configure the limit by editing the preference setting TableVisualizationExternalRowLimit in the Administration Manager.

 

A Spotfire administrator might configure a limit for how many rows can be retrieved and visualized, when working with in-database data. This limit affects, for example, visualizations and certain filter types.

Trying to visualize data in a way that exceeds this limit, for example by not aggregating the data enough, will result in an error.

List box filters, hierarchy filters, and item filters cannot be created for columns that contain too many unique values.

In the details view of the expanded Data in analysis flyout, you cannot see statistics for categorical columns that contain too many unique values.

When exploring large data tables, there is a risk of loading very large amounts of data. This can result in poor performance, because of the strain put on the network, the database and the system memory, particularly when using the web client.

To avoid these problems, there is a preference setting that can be used to limit the number of rows of data that can be retrieved and visualized, when working with in-database data.

A Spotfire administrator can configure the limit with the preference setting ConnectorExternalRowLimit in the Administration Manager.

For in-db data from cubes, you must manually select the columns to include in table visualizations.

Tip: You can enable this behavior for in-database data tables from connections to non-cube sources as well. Use the setting Manually add columns for table visualizations, available on the Performance tab of the Data Connection Settings dialog.

 

Using all available columns in a table visualization may result in overly large queries against the OLAP source, because the table view flattens the cube into a single table.

By selecting only the most important columns, and perhaps using table visualizations as details visualizations only, you can reduce the impact of queries from Spotfire against the data source.

The table visualization cannot be sorted if the data table source is a cube.

Cubes can only return data in the natural order and the reversed natural order, not in a sorted order. Therefore, sorting is disabled for all table visualizations based on cube data.

There is a limit on the number of items you can mark in a visualization when working with in-db data.

Because marking more items (for example, bar segments, pie sectors, or markers) results in a more complex query to the external data source, the limit affects the number of items you can mark. Different visualization types and configurations have varying complexity, so the number of items you can mark will differ between visualizations.

The default value of the limit is 1000. This preference setting can be configured by a Spotfire administrator.

Tip: If you want to include more items in a marking, you can add items to an existing marking by holding Ctrl or Shift and marking further items. This way, you can create markings exceeding the marking limit.

Note: Marking all items in a visualization will not result in a complex query. This means that all items can always be marked, no matter what the limit is.

When you mark items in a visualization, Spotfire generates a query to the external data source. Such a query can easily become very long and complex if the visualization contains a large number of items, especially if you also have filtered values in the analysis.

Complex queries can cause poor performance. Therefore, a preference setting is available, limiting the number of items that can be marked in a visualization when working with in-db data.
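As a rough illustration (the exact SQL depends on the connector and on how the visualization is configured; the table and column names below are hypothetical), marking two bar segments might add one condition per marked item to the generated query:

    SELECT region, year, sales
    FROM sales_facts
    WHERE (region = 'West' AND year = 2019)
       OR (region = 'East' AND year = 2020)

With hundreds of marked items, the WHERE clause grows by one condition per item, and any active filters add further conditions, which is why the limit exists.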

Marking in table visualizations is only supported if the view has a primary key defined. This can be defined by an administrator in the underlying database table, or, for some types of data connections, in the Views in Connection dialog. You cannot mark in tables based on cube data.

A primary key is required for Spotfire to know which rows to mark.

Marking is disabled in visualizations that you have configured with a floating point expression on a categorical axis.

If you want to use marking in a visualization that is based on in-database data, avoid using categorical expressions (written between angle brackets, '<>') that return floating point values on any axis.

Note: It is not only when you use custom expressions that you might encounter this limitation. For example, if you select a floating point column on the category axis of a bar chart, and configure the category axis to show a categorical scale, this is the same as writing a categorical custom expression that returns a floating point value on that axis.

Marking in visualizations based on in-database data that you have configured with a floating point expression on a categorical axis can produce unpredictable results.
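For example, assuming a Real column named [Profit] (a hypothetical name), the categorical axis expression <[Profit]> returns floating point values on a categorical scale and therefore disables marking. An expression that returns integers or strings instead, such as <Integer([Profit])>, should not run into this limitation.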

The summary table cannot be used with in-db data.

The summary table has not yet been adjusted to be able to handle in-db data.

The box plot cannot be used with in-db data.

The box plot requires both aggregated and non-aggregated data and has not yet been adjusted to handle in-db data.

Scatter plots and 3D scatter plots may need to be configured as aggregated visualizations when using in-db data.

The number of rows available in an external data source may be too large to handle within Spotfire.

Dendrograms can only be shown in heat maps if the X-axis is set to (None), when using in-db data. Highlighting and marking in the dendrogram is not supported.

Dendrograms are row-index based and in-db data tables do not have any row index.

Not all of the standard Spotfire aggregation methods and expressions are available on all external systems. On the other hand, there may be other methods available that Spotfire does not have by default.

The aggregation methods supported by the external data source determine which methods are available for in-db data.

 

OVER expressions must be written as post-aggregation expressions, or as custom expressions, where applicable.

OVER expressions must be performed on the aggregated view. See Using Expressions on Aggregated Data (the THEN Keyword) for more information.
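For example, a percent-of-total calculation could be written as a post-aggregation expression like the following sketch ([Sales] is an illustrative column name):

    Sum([Sales]) THEN [Value] / Sum([Value]) OVER (All([Axis.X]))

Here, [Value] refers to the result of the aggregation before the THEN keyword, so the OVER part operates on the aggregated view rather than on the underlying rows.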

Automatically created date and time hierarchies are not available for in-db data.

The shortcuts for setting up date and time hierarchies have not been adjusted to handle in-db data.

Transformations are not available for in-db data.

 

Because no data is stored within the analysis at runtime, it is not possible to apply any transformations to in-db data.

(However, data tables based on data connection views can also be imported, and once the data is in memory, you can apply transformations to it.) See Transforming Data for more information.

If the connection to the external data source is lost, no further analysis using that data source can be performed.

Since no external data is stored within the analysis at runtime, and each change triggers a new query to the external data source, the data source must remain available for as long as changes to the visualizations are to take place.

Tags are not available for in-db data.

Tags are dependent on row numbers, which are not available for in-db data.

You cannot delete columns or rows from in-db data tables.

In-db data tables contain no data at runtime, so nothing can be deleted.

(However, you can hide columns from the source tables when selecting tables in the Views in Connection dialog (for relational data sources) or the Data Selection in Connection dialog (for cube data sources).)

You cannot apply any of the following tools on in-db data tables:

* Add Columns

* Add Rows

* Add Binned Column

* Data Relationships

* K-means Clustering

* Line Similarity

* Data Functions

* Regression Modeling

* Classification Modeling

* Add Predicted Columns

In-db data tables contain no data at runtime and, hence, cannot incorporate any other data.

Add Calculated Column has the following limitations when working with in-db data:

* No cube connectors are supported.

* Some connectors (e.g., Cloudera, Hortonworks and Vertica) do not support aggregation methods.

* Binning is not supported.

* Post-aggregation expressions (the THEN keyword) are not supported, which means that OVER expressions are not supported.

The expression for the calculated column is evaluated in the underlying database management system. This means that only methods supported by the current connector will be available.
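Conceptually, the calculated column becomes part of the query that Spotfire sends to the database. For example, a calculated column defined as [Quantity] * [UnitPrice] might be translated roughly as follows (illustrative names and SQL):

    SELECT quantity, unitprice, quantity * unitprice AS linetotal
    FROM orders

Because the database evaluates the expression, only functions with a counterpart in the connector's SQL dialect can be used.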

Cube attribute or dimension columns cannot be used on numeric axes.

Since all cube dimensions are mapped as String when cube data is added to Spotfire, these columns cannot be used on any numeric axes, even if the name of a column indicates that its values are numeric.

By importing parts of the cube data it is possible to convert string data to other data types.
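For example, once the relevant slice of the cube has been imported, a calculated column such as Integer([Year]) (the column name is illustrative) converts the string dimension values to integers that can be used on numeric axes.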

On-demand on external data is not available directly when first adding the data connection to an analysis. Instead, you must add the data as External data and edit the On-demand settings from the data canvas to enable on-demand loading.

The Add data to analysis workflow always assumes that on-demand data should be imported.

On-demand is currently not supported for cube data sources (neither for imported cubes nor for in-database cubes).

On-demand has not yet been adjusted to be able to handle cube data.

In contrast to data connections from relational data sources, selections for cube data connections always result in a single data table.

The data selection steps for cube data sources are currently focused on limiting the flattened cube data, not on viewing different slices simultaneously.

Auto-bin Column only works for certain connectors, for example, Teradata, Oracle, PostgreSQL, and SAP HANA.

The ability to use the width_bucket function for auto-binning is only supported by some external data sources.
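For reference, width_bucket is a SQL function that assigns each value to one of a number of equal-width bins. A query using it might look like this sketch (illustrative table and column names):

    SELECT width_bucket(price, 0, 100, 10) AS price_bin
    FROM products

This places each price into one of 10 equal-width buckets between 0 and 100. Connectors for databases that lack such a function cannot offer Auto-bin Column.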

Export Data from Data Table cannot be used when exporting from in-db data tables.

In-db data tables contain no data at runtime, so nothing can be exported from the data table.

Tip: Use the Export Data from Visualization option when you want to export from in-db data tables. Note that for all visualizations other than the table and the cross table, you need to mark the items of interest before opening the Export Data dialog.

String comparisons between different systems may provide different results due to handling of trailing spaces.

External systems that adhere to the SQL-92 standard (e.g., SQL Server, MySQL, and Netezza) ignore trailing spaces when comparing strings.
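For example (illustrative table and data), on a system that ignores trailing spaces in comparisons, the predicate below also matches the value 'Smith', whereas a system that compares strings byte for byte would return no such rows:

    SELECT * FROM customers WHERE name = 'Smith '

Keep this in mind when combining in-db data with imported data, since an exact in-memory comparison treats 'Smith' and 'Smith ' as different values.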

You need to provide wildcards to obtain fuzzy matching in text filters when working with in-db data.

Typing a part of a word with in-db data will only search for an exact match of those letters. (With in-memory data, a wildcard is automatically appended to the entered letters.)
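For example, typing sea in a text filter on an in-db column only matches the exact string 'sea'; to match values such as 'Seattle' or 'Seaside', type sea* instead.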

By default, if you have multiple filters for an in-db data table, the values available for selection in a filter do not reflect any selections you have already made in other filters. This means that values that have already been filtered out remain available for selection, and you do not get the visual cue indicating which values have been filtered out by the other filters, the way you do with in-memory data tables.

To make filters for an in-database data table work more like filters for in-memory data tables, you can change the settings for the data connection to enable cascading filters for in-database tables.

 

For in-database data tables, hiding and graying out values that have been filtered out by other filters is a costly operation. Therefore, this option, called cascading filters, must be enabled manually.

Caution: Enabling the cascading behavior for filters for in-database data tables results in an additional query load on the external data source. It is the user's responsibility to ensure that the external data source can handle the extra query load.

To enable cascading filters for in-database data tables from a connection:

  1. On the menu bar, select Data > Data Connection Properties.

  2. In the Data Connection Properties dialog, select the data connection of interest and click Settings....

  3. In the Data Connection Settings dialog, click the Performance Settings tab, and then select Enable cascading filters for in-database data tables from this connection.

  4. To save your changes and close the dialog, click OK.

See also:

Data Overview

Working with Streaming Data