About Data Sampling

Random sampling of source data speeds up the indexing and discovery process, at the cost of some accuracy. In general, Discovery still finds the relationships that matter while not presenting false positives for review.

The need for data source sampling is driven by data volume, available memory, and the bandwidth available between source systems and TDV. If indexing takes too long, you might want to enable data sampling. With data sampling enabled, only a portion of the rows in the table or view is indexed.

How sampling works depends on what is being indexed and the type of data source being discovered.

For tables, data sampling is enabled and controlled via two Studio configuration parameters and the table cardinality. Discovery applies the data sampling algorithm only if both of these are true:

• Data sampling is enabled.

• The cardinality of the table exceeds the data sampling threshold specified by the Sampling Size configuration parameter.

With sampling enabled and the threshold exceeded, the number of rows indexed for a table is calculated using the formula:

For example, if you have a 1 million row table and Sampling Size is set to 100000, then 10% of the table will be indexed.
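
The formula itself does not appear in this text, but the example is consistent with the indexed fraction being Sampling Size divided by table cardinality. The following Python sketch assumes that relationship. The function and parameter names are hypothetical illustrations, not part of TDV or Studio.

```python
def table_sampling_fraction(cardinality: int,
                            sampling_size: int,
                            sampling_enabled: bool) -> float:
    """Return the fraction of a table's rows to index (1.0 = every row).

    Hypothetical helper: the names and the formula are assumptions
    inferred from the 1-million-row example, not a documented TDV API.
    """
    # Sampling applies only when it is enabled AND the table
    # cardinality exceeds the Sampling Size threshold.
    if not sampling_enabled or cardinality <= sampling_size:
        return 1.0
    # Assumed formula: Sampling Size / table cardinality.
    return sampling_size / cardinality


# 1,000,000 rows with Sampling Size 100000 gives a 10% sample.
fraction = table_sampling_fraction(1_000_000, 100_000, sampling_enabled=True)
print(f"{fraction:.0%} of the table would be indexed")
```

Under this assumption, the number of rows indexed for a large table stays close to the Sampling Size value regardless of how large the table is.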

Which rows are indexed is controlled by a random number generator.

If data sampling is not enabled, data sampling does not occur even if the table cardinality threshold is exceeded.

If a table resides in one of these data sources—Oracle, DB2, MySQL, Netezza, or Microsoft SQL Server—data sampling is pushed to the data source. Otherwise, all data is fetched into TDV for sampling there.

For views, data sampling is enabled and controlled by two Studio configuration parameters. See Configuring Data Sampling.

All rows are indexed up to the Sampling Size threshold. If sampling is enabled, Discovery then indexes on a decreasing scale: once the Sampling Size threshold is passed, it indexes half as many rows, and once (Sampling Size x 2) rows are reached, it indexes one quarter as many. Which rows are indexed is controlled by a random number generator.
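
The decreasing scale for views can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and the sketch assumes that the halving continues at every further multiple of Sampling Size, which the text above states only for the first two steps.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def index_probability(row_number: int, sampling_size: int) -> float:
    """Probability of indexing the row at 0-based position row_number.

    Rows below Sampling Size are always indexed (1.0); the next band of
    Sampling Size rows is indexed at 1/2, the band after that at 1/4.
    Continuing to halve beyond that point is an assumption.
    """
    band = row_number // sampling_size
    return 0.5 ** band


def sample_view_rows(rows: Iterable[T],
                     sampling_size: int,
                     sampling_enabled: bool = True,
                     rng: Optional[random.Random] = None) -> Iterator[T]:
    """Yield the rows that would be indexed under the assumed scheme."""
    rng = rng or random.Random()
    for row_number, row in enumerate(rows):
        # With sampling disabled, every row is indexed.
        if not sampling_enabled:
            yield row
        # rng.random() is in [0, 1), so probability 1.0 always passes.
        elif rng.random() < index_probability(row_number, sampling_size):
            yield row


# Usage: a 1000-row view with a hypothetical Sampling Size of 100.
indexed = list(sample_view_rows(range(1000), 100, rng=random.Random(42)))
print(len(indexed), "of 1000 rows indexed")
```

Because the view is streamed rather than counted up front, the sketch decides row by row instead of computing a single sampling fraction in advance.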

If data sampling is not enabled, data sampling does not occur even if the threshold is exceeded.