Streaming database connector FAQs
- How does streaming database connector work?
The Streaming DB Connector creates an object that implements the Statistica Spreadsheet COM interface and that wraps an instance of a Microsoft ADO Recordset object. Because all Statistica analyses access the source spreadsheet's data via this interface, this wrapper object looks just like any other spreadsheet to the analysis.
- Can I write data back to the external database?
Yes. With Statistica, you can write certain information computed by the program back to the original input data file or database and, thus, integrate computed statistics into an existing database or data warehouse. Specifically, using the Rapid Deployment of Models module, you can write computed statistics (predictions, predicted classifications, classification probabilities, residuals) back into the current input data file; this capability to, for example, merge classification probabilities computed by various models into an existing database or data warehouse is extremely useful in the context of data mining applications to deploy models for extremely large data sets. See the Query Options dialog box topic for additional details regarding the specific settings of options to enable Statistica to write results to fields in an external database.
- Why does the streaming database connector sometimes only display one case when I set Preview the first n cases to some number greater than one?
When using a forward-only cursor, the streaming DB connector does not advance the Recordset beyond the first case (row) because then those first n rows would be lost and when an analysis requested that data; the query would have to be executed again. Preview the first n cases only applies when using a static cursor.
- What is the purpose of the "case count" method chosen via the Statistica Options dialog box?
Depending on the options you have chosen, the exact number of cases may not be known yet when an analysis requires that information. If this is the case, then one of two things will happen, depending on your setting for this option. Either a separate "count" query will be executed, or an arbitrary, user chosen upper bound will be returned to the analysis.
- When would the exact number of cases not be known?
In these situations, the exact case count is not known:
a) Using a forward only cursor and have not advanced to the end yet
b) Using a static cursor with asynchronous query and/or fetch and the query and/or fetch have not completed yet.
When using a static cursor with synchronous query and fetch, the case count will be known as soon as the query is executed.
- What are the implications/advantages/drawbacks of the Automatically determine the number of cases option?
When the number of cases is not known and an analysis requires that information, and when this method is in effect, a separate 'count' query will be executed, and the analysis will have to wait on the result. If the query is complicated, this can be slow. In addition, if changes have been made to the database between the time you first executed your original query and when the 'count' query is executed, the results can be inconsistent; i.e., if someone else added or deleted records then the number of records in your cursor will not match the number of records reported to the analysis. The advantage is, if the query is not too complicated and the database has not been changed, this gives you an exact case count without having to resort to a slower cursor type that would always have that information.
- What are the implications/advantages/drawbacks of the Assume fewer than [x] number of cases option?
This method avoids the overhead of running a separate query to determine the number of cases by always returning the same number for the case count, therefore it is much faster. If there are in fact more cases than this in the Recordset, the cases beyond this number will be ignored by analyses. If there are fewer cases than this, the "extra" cases that the analysis will expect to see but are not available will be treated as missing data. If there are many, many fewer cases in the Recordset than the chosen upper bound indicates, the analysis will waste time attempting to process the nonexistent cases.
- What is a cursor?
A cursor is a data structure that stores the results of a query. The type of cursor determines the functionality available. With some cursors you move forward through the results of your query, with others you move forward and backward. Common types of cursors are static, dynamic, forward-only, and keyset. The streaming database connector supports forward only and static cursors.
- What is a static cursor?
A static cursor supports scrolling forward and backward through the data, thus allowing random access to the data. This cursor type provides a "snapshot" of the result of your query - records modified, added to, or removed from the database after the cursor is populated will not be visible. A server side static cursor can place considerable strain on the database system. A client-side cursor is always static. If an analysis or other usage of the streaming database connector requires random access or multiple passes through the data, this type of cursor may be needed.
- What is a forward only cursor?
A forward only cursor, the simplest type of cursor, moves only forward through the results of your query. After you move beyond a record, it is no longer available in the cursor. This type of cursor provides fast access to the data, and places the least stress on the database system. In the context of the streaming database connector, if a user or analysis must make multiple passes through or needs random access to the data, then the query must be rerun to return to a previous record. If records were modified, added to, or removed from the database in the meantime, this re-query can result in different data or a different number of records. If a particular analysis or usage of streaming DB connector requires a single forward pass through the data, this type of cursor will most likely be the fastest
- What is a client side cursor?
The cursor is maintained on the client (user's) machine, with the Microsoft ActiveX Data Objects (ADO) Cursor Engine. This means that all records returned from the query are copied to the user's local machine. A client side cursor is a static cursor.
- What is a server side cursor?
The cursor is maintained on the database server. The number of records specified by the "cache size" will be copied to the user's local machine, while the rest are stored on the server. Whenever a record not in the local cache is requested, a new set of records will be copied from the server to the client machine. This type of cursor can place considerable strain on the database system, as it must store the results of all such queries.
- Which type and location of cursor is best?
The answer to this question depends on many factors, such as how you intend to use the data, how often the data in the database you are using changes, etc, so there is no "right" answer. In general, the fastest cursor is the server side, forward only type, but if you need to make multiple passes or random access, then the necessary re-queries may considerably degrade performance, and the data may change between passes. A client side cursor with asynchronous query and asynchronous fetch offers very good performance, allows random access to the data, and the data will never change during the course of an analysis. Just remember that all records returned from the query end up on your local machine.
- Can I use a Keyset or Dynamic cursor type with streaming database connector?
These cursor types, which are available with ADO, are not supported via the user interface of the streaming DB connector. These cursor types offer features such as allowing the user to update the database, or to see changes made by other users. However, these features are not necessary for the streaming DB connector since its purpose is not to update a database. In addition, the additional features provided by these cursor types make them much slower than forward-only and static cursors. It is possible, however, to use this cursor type programmatically when using the streaming database connector Object model.
- What is "cache size"?
When using a server side cursor, this number of records is cached locally on the client machine. When records not in the local cache are requested, the ADO Recordset must fetch more data from the remote database.
- What does "asynchronous query" mean?
The streaming database connector does not wait for the query to complete before returning control to Statistica. Before the query completes, nothing is known about the resulting Recordset; after it completes, the number of fields (variables) is known, along with their names. Depending on the cursor type, the number of records (cases) may be known as well. After the query completes, individual records can be retrieved from the cursor. When using this feature, you can continue using Statistica or even start an analysis on this streaming database connector before the query completes. Naturally, if you start an analysis before the query completes, the analysis itself will have to wait until the query completes before it can begin. For all but the simplest queries, it is recommended to use this option.
- What does "asynchronous fetch" mean?
This option applies when using a client side cursor (the ADO cursor engine). The streaming database connector does not wait for all records to be copied to the local machine before returning control to Statistica. You can continue using Statistica or even start an analysis on this streaming DB connector before the records are all fetched. An analysis begun on a Recordset that is still fetching will only have to wait when it needs records that have not yet been fetched. Except for queries that will only return a few records, this option is recommended, as copying hundreds of thousands of records to your local machine synchronously will tie up Statistica until the fetching is completed.
Fetch is the process of copying all records to the local client machine; therefore, this option only applies when using a client side cursor. When using this type of cursor, behind the scenes ADO runs the specified query and opens an internal forward only cursor. It makes a complete pass through that cursor copying all records to the local machine. When this is done synchronously, all other operations are suspended until all records have been transferred to the client. When done asynchronously, the fetch is done in a separate thread of execution so that processing can continue. Records are available as soon as they are fetched, while requests for records that have not yet been fetched will "block" (wait) until they are retrieved. Thus, when using this option, the streaming DB connector does not wait for all records to be copied to the local machine before returning control to Statistica. You can continue using Statistica or even start an analysis on this streaming DB connector before the records are all fetched. An analysis begun on a Recordset that is still being fetched will only have to wait when it needs records that have not yet been fetched. Except for queries that will return only a few records, this option is recommended, as copying hundreds of thousands of records to your local machine synchronously will tie up Statistica until the fetch is completed.
- Why does the "maximum number of records to return from query" option not always work?
This option is passed to ADO, which does not actually implement it but passes it on to the OLE DB provider. At this time, this feature is implemented for the OLE DB Provider for SQL Server, the SQL Server ODBC Driver, as well as the OLE DB Provider for Oracle and ODBC Driver for Oracle.
- I want to programmatically create my own ADO Recordset object and attach it to an instance of a streaming database connector so I can perform a Statistica analysis on it. Is this possible?
Yes. Use the DBSpreadsheet method CreateNew, which returns a spreadsheet object, and call that object's SetRecordset function to attach it to an existing ADO Recordset. Or use the DBTable interface's spreadsheet property to get the spreadsheet and set the Recordset on that.
- I am writing Statistica Visual Basic code and accessing a database through the Spreadsheet interface provided by the streaming DB connector. Some of the spreadsheet methods always seem to fail. Why?
Not every spreadsheet interface method/property is implemented on the StreamingDB spreadsheet. Many of them make no sense or cannot be implemented. In general, you cannot call any spreadsheet method or property to change any data in the spreadsheet; the data is read only.
See also, the Streaming Database Connector and Cursor definitions, Streaming Database Connector Technology (Technical Overview), and Statistica Query - Streaming of Data on Remote Servers.