Workspace Node: Text Mining - Specifications - Projects Tab

In the Text Mining node dialog box, under the Specifications heading, select the Projects tab to choose the internal database file that will be used to index the documents. For performance reasons, and as briefly described in the Introductory Overview, Statistica Text Mining incorporates advanced relational database components to maintain and update the index of words or terms by document. With large document collections and complex text, this index can become extremely large, so an efficient relational database scheme was chosen to store it. An additional advantage of this approach is that the database can be stored in a user-defined location on the hard disk and reused in later analyses.
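To make the idea of a relational index of terms by document concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the schema Statistica Text Mining uses internally, which is not documented on this page; the table names (documents, terms, term_counts), the function names, and the whitespace tokenizer are all assumptions made for illustration.

```python
import sqlite3
from collections import Counter

# Hypothetical schema: one row per document, one row per term,
# and one row per (document, term) pair holding the frequency.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (doc_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS terms (term_id INTEGER PRIMARY KEY, term TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS term_counts (
    doc_id INTEGER NOT NULL REFERENCES documents(doc_id),
    term_id INTEGER NOT NULL REFERENCES terms(term_id),
    frequency INTEGER NOT NULL,
    PRIMARY KEY (doc_id, term_id)
);
"""


def build_index(conn, documents):
    """Index a {name: text} mapping into the three tables."""
    conn.executescript(SCHEMA)
    for name, text in documents.items():
        conn.execute("INSERT OR IGNORE INTO documents (name) VALUES (?)", (name,))
        doc_id = conn.execute(
            "SELECT doc_id FROM documents WHERE name = ?", (name,)
        ).fetchone()[0]
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for term, frequency in Counter(text.lower().split()).items():
            conn.execute("INSERT OR IGNORE INTO terms (term) VALUES (?)", (term,))
            term_id = conn.execute(
                "SELECT term_id FROM terms WHERE term = ?", (term,)
            ).fetchone()[0]
            conn.execute(
                "INSERT OR REPLACE INTO term_counts (doc_id, term_id, frequency) "
                "VALUES (?, ?, ?)",
                (doc_id, term_id, frequency),
            )
    conn.commit()


def term_frequencies(conn, term):
    """Return {document name: frequency} for one indexed term."""
    rows = conn.execute(
        "SELECT d.name, c.frequency FROM term_counts c "
        "JOIN documents d ON d.doc_id = c.doc_id "
        "JOIN terms t ON t.term_id = c.term_id "
        "WHERE t.term = ?",
        (term,),
    )
    return dict(rows.fetchall())


# An in-memory database keeps the example self-contained; passing a file
# path to sqlite3.connect instead would make the index reusable later.
conn = sqlite3.connect(":memory:")
build_index(conn, {"a.txt": "data mining of text data", "b.txt": "text mining"})
```

Storing one row per document and term pair keeps the index sparse: a term that never occurs in a document takes no space, which matters when the collection is large.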

See also the Introductory Overview.

Note: Choosing an existing database; deploying databases. With the options on this tab, not only can you determine the location where the database containing the index of words/terms and documents is to be stored (note that this database can become quite large), but you can also select a specific existing database created during a previous analysis and use the information contained therein.

Databases created with Statistica Text Mining contain not only the list of indexed words and their frequencies in each document, but also a record of which indexed words were selected for the analysis (a word or term can be indexed but not selected, in which case it is ignored in subsequent results). They also contain the results of a singular value decomposition of the frequencies, or of other indices derived from them, for the selected words (for more details, see also the Introductory Overview or Singular Value Decomposition in Statistica Text Mining and Document Retrieval). Using the options available on this tab, you can either update an existing database with the words or terms found in new documents, or you can index a new set of documents using only the selected terms in an existing database.

This type of indexing based on words selected during previous analyses can be considered a form of "deployment" of the database, in the sense of deployment commonly used in the context of trained models in predictive data mining. Note that the program can also compute word coefficients and document scores based on results from singular value decomposition performed in a prior analysis and stored in the database. Hence, these options enable you to compute scores for new documents based on a previous analysis; this functionality may be critical if you want to use information extracted from text in predictive data mining projects based on numeric indices that were derived during training from unstructured text.
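The scoring step described above can be sketched as a projection of new term counts onto stored word coefficients. This is an illustration under stated assumptions, not the program's actual computation: the function names are hypothetical, the coefficient values are made up rather than taken from a real decomposition, and raw counts are used where a real analysis might first apply the same frequency transformation that was used during training.

```python
from collections import Counter


def count_selected(text, selected_terms):
    """Count only the previously selected terms in a new document."""
    counts = Counter(text.lower().split())  # naive tokenizer (assumption)
    return [counts[term] for term in selected_terms]


def score_documents(count_rows, word_coefficients):
    """Project count rows onto stored coefficients (scores = counts x V).

    word_coefficients holds one row per selected term and one column
    per retained component.
    """
    n_components = len(word_coefficients[0])
    return [
        [
            sum(c * coeffs[k] for c, coeffs in zip(counts, word_coefficients))
            for k in range(n_components)
        ]
        for counts in count_rows
    ]


# Hypothetical values standing in for what a stored project would provide.
selected_terms = ["mining", "text", "data"]
word_coefficients = [[0.5, 0.5], [0.5, -0.5], [1.0, 0.0]]

new_counts = [count_selected("Text mining finds new text", selected_terms)]
new_scores = score_documents(new_counts, word_coefficients)
```

Because the new document is described only by the terms selected during training, words such as "finds" and "new" do not affect its scores, and the resulting columns line up with the numeric indices that a predictive model was trained on.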

| Element Name | Description |
|---|---|
| Project | Use the options in this group box either to create a new database for indexing (project) or to select an existing database and index created in previous analyses. |
| Create new project | Select this option button to create a new index; the index and database will be created in the location indicated in the Active project (database file) box, described below. |
| Use existing project | Select this option button to use an existing database, for example, to deploy an existing database to score new documents based on the information extracted in prior analyses. You can then select the database file using the Select button, described below. |
| Active project (database file) | Specify here the name of an existing database or the name for a new database to hold the index and other information computed by Statistica Text Mining. Note that these database files can become quite large (with large collections of complex documents); hence, make sure to store this information on a hard disk with sufficient free space. |
| Select | Click this button to browse to an existing database file or to specify a new file name and location (depending on the selection in the Project group box). Clicking this button displays a standard file selection dialog box. |
| Existing project | Use the options in this group box to specify how to use the information in an existing database (project); these options are only available if the Use existing project option button is selected in the Project group box. |
| View/Modify (go to Results dialog) | Select this option button to go directly to the Results tabs, where you can review the information from previous analyses and update it by, for example, selecting different words or computing different indices. |
| Merge new documents into existing index | Select this option button to append the indexing results for a new set of documents to an existing index/database instead of overwriting it. |
| Deploy new documents | Select this option button to "score" the new documents, using the information contained in the current database and index. This option enables you to "deploy" the information in the existing database, in the sense of this term as it is commonly used in predictive data mining. You can use the information in the database to process the selected new documents, create results based on the previously selected words and terms, and compute word coefficients and document scores based on the singular value decomposition of results for documents used during "training." See also the Introductory Overview and the discussion of this topic at the top of this page for additional details. |
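
The difference between the Merge new documents into existing index and Deploy new documents options can be sketched with a small in-memory index. This is only an illustration: the function names and the dictionary layout are hypothetical, and treating deployment as a read-only operation on the stored index is an assumption of this sketch rather than documented behavior.

```python
from collections import Counter


def merge_documents(index, new_documents):
    """Return a copy of the index with the new documents appended.

    Terms not seen before simply become part of the merged index.
    """
    merged = dict(index)
    for name, text in new_documents.items():
        merged[name] = Counter(text.lower().split())
    return merged


def deploy_documents(selected_terms, new_documents):
    """Count only previously selected terms; the stored index is untouched."""
    results = {}
    for name, text in new_documents.items():
        counts = Counter(text.lower().split())
        results[name] = {term: counts[term] for term in selected_terms}
    return results


existing_index = {"a.txt": Counter({"text": 1, "mining": 1})}
new_documents = {"b.txt": "text analytics text"}

merged_index = merge_documents(existing_index, new_documents)
deployed = deploy_documents(["text", "mining"], new_documents)
```

In the merged index, "analytics" appears as a new term for b.txt, whereas the deployed result reports counts only for "text" and "mining", the terms chosen in the earlier analysis.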

Options / C. See Common Options.