Using Statistica Data Miner with Extremely Large Data Sets

The entire Statistica family of products, and Statistica Data Miner in particular, is specifically optimized to efficiently process extremely large data sets with millions of observations (records) and millions of variables (fields).

Processing databases that are larger than the local storage device

Statistica Data Miner (and optionally other Statistica products) can process data in remote databases in place by using its highly optimized Streaming Database Connector technology. This technology combines the processing resources of the database server and the local computer: a) the queries are performed on the database server (using the server CPU) while, simultaneously, b) the fetched records are processed on the local machine (using the client CPU). In this way, databases that are too large to fit on the local machine can be processed, and significant performance gains can be achieved by saving the time that would otherwise be required to first import the data to the local device and only then process them there. Practically all common database formats are supported, and powerful tools are provided for defining the database connection (query).
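The general idea of processing records while they are still being fetched can be illustrated with a minimal Python sketch. This is not the Streaming Database Connector itself or any Statistica API; it only assumes a generic SQL connection (here an in-memory SQLite database standing in for a remote server) and hypothetical names such as stream_mean_variance and demo_connection. Records are pulled in small batches and folded into running statistics, so the full table never has to be stored locally.

```python
import sqlite3


def stream_mean_variance(conn, query, batch_size=1000):
    """Return (n, mean, sample variance) of the first column of `query`.

    Rows are fetched in batches and discarded after use, so memory use
    stays bounded by `batch_size` regardless of table size.
    """
    cursor = conn.execute(query)
    n, mean, m2 = 0, 0.0, 0.0
    while True:
        rows = cursor.fetchmany(batch_size)  # next batch from the database
        if not rows:
            break
        for (x,) in rows:
            # Welford's one-pass update of the mean and sum of squares
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance


def demo_connection():
    """Hypothetical stand-in for a remote database: values 1..100."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x REAL)")
    conn.executemany(
        "INSERT INTO t VALUES (?)", [(float(i),) for i in range(1, 101)]
    )
    return conn


if __name__ == "__main__":
    print(stream_mean_variance(demo_connection(), "SELECT x FROM t", 7))
```

The batch size only affects how many rows are held in memory at once; the computed statistics are the same for any batch size.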

Processing databases with extremely large numbers of variables (fields): The unique Feature Selection and Variable Screening Facilities

When the number of variables in the input data file is extremely large, Statistica Data Miner can automatically select subsets of variables for predictive data mining, even from among millions of candidate variables. The extremely fast and efficient algorithm selects the variables (features) that are likely to be the most relevant predictors in the current data set, without introducing biases into subsequent model building for predictive data mining. See Feature Selection and Variable Screening for details.
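As a rough illustration of what variable screening means in general, the following Python sketch ranks candidate predictors by the absolute value of their Pearson correlation with the target and keeps the top k. This is a generic univariate screen chosen for illustration only; the algorithm actually used by Statistica Data Miner is not described here, and the function names (pearson, screen_features) are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no linear information
    return sxy / math.sqrt(sxx * syy)


def screen_features(columns, target, k):
    """Return the names of the k columns most correlated with `target`.

    `columns` maps a variable name to its list of values.
    """
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    candidates = {
        "a": [1, 2, 3, 4, 5],
        "b": [5, 3, 4, 1, 2],
        "c": [1, 1, 1, 1, 1],
    }
    print(screen_features(candidates, [2, 4, 6, 8, 10], k=2))
```

Each variable is scored independently, so the cost grows linearly with the number of candidates; a screen of this kind only detects linear relationships and ignores interactions between predictors.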

Processing data files with extremely large numbers of cases (records): Flexible and efficient random sampling

Statistica products (including Statistica Data Miner) can process data files with practically unlimited numbers of cases (records), and Statistica's data access procedures are highly optimized. However, including all records in the analyses when the number of records is extremely large is

  1. entirely unnecessary,
  2. time consuming, and
  3. often impractical or impossible (in extreme cases, it could take hours merely to read all records).

To speed up the analytic process, Statistica Data Miner includes sophisticated tools for drawing representative, perfectly random samples from huge data sets (databases). You can quickly extract simple or systematic random samples of appropriate sizes, with or without replacement, from huge data sets (with many millions of records) for further analyses with sophisticated modeling tools that may require multiple passes through the data. The random sub-sampling is based on the validated Statistica random number generator. Note that Statistica is one of only a few commercially available software products that have passed the most advanced and most widely recognized tests for randomness.
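The two sampling schemes mentioned above can be sketched in Python as follows. This is an illustrative sketch only, not Statistica's implementation: it uses Python's standard random module rather than the Statistica random number generator, and the function names are hypothetical. The simple random sample without replacement is drawn with reservoir sampling, which needs a single pass and does not require the number of records to be known in advance; the systematic sample takes every step-th record after a random start.

```python
import random


def reservoir_sample(records, k, rng):
    """Simple random sample of k records without replacement, in one pass."""
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)  # fill the reservoir with the first k records
        else:
            j = rng.randint(0, i)  # uniform over 0..i, both ends inclusive
            if j < k:
                sample[j] = rec  # record i replaces a slot with probability k/(i+1)
    return sample


def systematic_sample(records, step, rng):
    """Every `step`-th record, starting from a random offset in [0, step)."""
    start = rng.randrange(step)
    return [rec for i, rec in enumerate(records) if i % step == start]


if __name__ == "__main__":
    rng = random.Random(12345)  # fixed seed so the run is reproducible
    print(reservoir_sample(iter(range(100_000)), 10, rng))
    print(systematic_sample(range(100), 10, rng))
```

Because both functions consume the records as an iterator, they can be fed directly from a file reader or a database cursor without loading the full data set.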