Project Tab
Select the Project tab in the Text Mining dialog box to choose the internal database file used to index the documents. For performance reasons, and as briefly described in the Introductory Overview, Statistica Text Mining and Document Retrieval incorporates relational database components to maintain and update the index of words or terms by document. With large document collections and complex text, these databases can become extremely large, so an efficient relational database scheme was chosen to store this information. An additional advantage of this approach is that the database can be stored in a user-defined location on the hard disk and reused. Use the options on the Defaults tab to save or retrieve the settings for these options and to set the defaults for future analyses.
Note: Choosing an existing database and deploying databases. With the options on this tab, you can determine the location where the database containing the index of words/terms and documents is stored (note that this database can become quite large). You can also select an existing database created during a previous analysis and use the information it contains.
Databases created with Statistica Text Mining and Document Retrieval contain the list of indexed words and their frequencies in each document. They also record which of the indexed words were selected for the analysis (a word or term can be indexed but not selected, in which case it is ignored in subsequent results), along with the results of a singular value decomposition of the frequencies or other derived indices for the selected words. For more details, see also Singular Value Decomposition in Statistica Text Mining and Document Retrieval. Using the options available on this tab, you can either update an existing database with the words or terms found in new documents, or index a new set of documents using only the selected terms in an existing database.
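The difference between updating an existing index and indexing new documents against only the previously selected terms can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Statistica's implementation or database format: the `tokenize` and `index_documents` helpers, the in-memory dictionaries, and the sample documents are all hypothetical.

```python
from collections import Counter

PUNCTUATION = ".,;:!?()\"'"

def tokenize(text):
    # Hypothetical tokenizer: split on whitespace, strip punctuation, lowercase.
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def index_documents(docs, vocabulary=None):
    """Return {doc_id: Counter(term -> frequency)}.

    If `vocabulary` is given, only those terms are counted; all other
    words in the new documents are ignored.
    """
    index = {}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        if vocabulary is not None:
            counts = Counter({t: c for t, c in counts.items() if t in vocabulary})
        index[doc_id] = counts
    return index

# Index from a previous analysis, plus the terms selected in that analysis.
existing = index_documents({
    "d1": "The pump failed. The pump leaked.",
    "d2": "Valve leaked oil.",
})
selected_terms = {"pump", "leaked"}

new_docs = {"d3": "Pump seal leaked; gasket cracked."}

# Update: append the full index of the new documents to the existing one.
merged = dict(existing)
merged.update(index_documents(new_docs))

# Deploy: index the new documents using only the previously selected terms.
deployed = index_documents(new_docs, vocabulary=selected_terms)
```

In the merged index, the new document contributes previously unseen terms such as `gasket`, while the deployed index for the same document contains only `pump` and `leaked`.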
This type of indexing based on words selected during previous analyses can be considered a form of deployment of the database, in the sense of deployment commonly used in the context of trained models in predictive data mining. The program can also compute word coefficients and document scores based on results from singular value decomposition performed in a prior analysis and stored in the database. Hence, these options enable you to compute scores for new documents based on a previous analysis; this functionality may be critical if you want to use information extracted from text in predictive data mining projects based on numeric indices that were derived during training from unstructured text.
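Scoring new documents from a stored singular value decomposition can be sketched with NumPy as follows. This is a minimal illustration of the standard "fold-in" projection, assuming raw term frequencies and a documents-by-terms matrix; the exact transformations and stored quantities used by Statistica are not specified here, and the matrices and variable names are made up for the example.

```python
import numpy as np

# Training data: rows are documents, columns are the selected terms
# (hypothetical raw frequencies).
X_train = np.array([
    [2, 1, 0],
    [0, 1, 1],
    [1, 0, 2],
    [0, 2, 1],
], dtype=float)

# Singular value decomposition performed during the original analysis.
U, s, Vt = np.linalg.svd(X_train, full_matrices=False)

k = 2                                # number of components retained (assumption)
word_coefficients = Vt[:k].T         # shape: (n_terms, k)

# Document scores for the training documents.
train_scores = X_train @ word_coefficients   # equals U[:, :k] * s[:k]

# Deployment: new documents indexed on the same selected terms are projected
# onto the stored word coefficients; no new decomposition is computed.
X_new = np.array([
    [1, 1, 0],
    [0, 0, 3],
], dtype=float)
new_scores = X_new @ word_coefficients       # shape: (n_new_docs, k)
```

Because the new documents are scored with coefficients fixed during training, their scores are directly comparable to the training scores and can be used as numeric predictors in a downstream model.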
Option | Description |
---|---|
Project | Use the options in this group box either to create a new database for indexing (project) or to select an existing database and index created in previous analyses. |
Create new project | Select this option button to create a new index; the index and database will be created in the location indicated in the Active project (database file) box, described below. |
Use Existing project | Select this option button to use an existing database, for example, to deploy an existing database to score new documents based on the information extracted in prior analyses. You can then select the database file using the Select button, described below. |
Active project (database file or PMML for deployment) | Specify here the name of an existing database or the name for a new database to hold the index and other information computed by Statistica Text Mining and Document Retrieval. These database files can become quite large (with large collections of complex documents); hence, make sure to store this information on a hard disk with sufficient free space. |
Select | Click this button to display a standard file selection dialog box where you can browse to an existing database file (*.dbs) or specify a new file name and location (depending on the Project selection). You can also select a previously saved PMML file or a Statistica Enterprise-deployed PMML model; to do this, ensure that the Use existing project option button and the Deploy new documents option button are selected. |
Existing project | Use the options in this group box to specify how to use the information in an existing database (project); these options are only available if the Use existing project option button is selected in the Project group box. |
View / Modify (go to Results dialog) | Select this option button and click the View button to proceed directly to the Results dialog box, where you can review the results from previous analyses as stored in the current database. This option is useful for reviewing the information from previous analyses and updating it, for example, by selecting different words or computing different indices. |
Merge new documents into existing index | Select this option button to append the indexing results for a new set of documents to an existing index/database instead of overwriting it. |
Deploy new documents | Select this option button to score the new documents, using the information contained in the current database and index. This option enables you to deploy the information in the existing database, in the sense of this term as it is commonly used in predictive data mining. You can use the information in the database to process the selected new documents, create results based on the previously selected words and terms, and compute word coefficients and document scores based on the singular value decomposition of results for documents used during training. |