Program Overview

Statistica Text and Document Mining and Web Crawling is a general text mining tool for indexing text in various languages, i.e., for counting the number of times that terms occur in the input documents. The program includes a large number of options for stemming words (terms), for handling synonym lists and phrases, and for summarizing the results of the indexing using various indices and statistical techniques. Flexible options are available for finalizing a list of terms that can be "deployed," to quickly score ("numericize") new input texts. Efficient methods for searching indexed documents are also supported.

Input Documents

Statistica Text and Document Mining accepts input documents in a variety of formats, including MS Word® document files, rich text files (RTF), PDF files (Acrobat Reader®), htm and html files (web pages or URLs), XML, and text files. You can also specify a variable in the Statistica input spreadsheet that contains the text itself.

Selecting Input Documents

Input documents can be selected in a variety of ways. File names and directories (references to input documents) can be stored in a variable in an input spreadsheet, or you can "crawl" through directories and subdirectory structures to retrieve files of particular types. In addition, various methods are available for accessing Web pages and for "crawling" the Web (retrieving all Web pages linked to a particular document specified as the root). Web crawling can be performed to a user-defined depth, e.g., you can request to retrieve all Web pages that are referenced from a particular root URL, the pages that are referenced in those pages, and so on.
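Crawling to a user-defined depth can be pictured as a breadth-first traversal that stops expanding pages once the depth limit is reached. The following is a minimal sketch in Python, not the program's actual implementation. The names crawl, get_links, and LINKS are hypothetical, and a small in-memory link table stands in for real page retrieval.

```python
from collections import deque

def crawl(root, get_links, max_depth):
    """Breadth-first crawl from root, following links at most max_depth hops."""
    seen = {root}
    order = [root]
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for link in get_links(url):
            if link not in seen:  # skip pages already retrieved (handles cycles)
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order

# Hypothetical link table standing in for real Web pages.
LINKS = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": ["root", "c"],
    "c": ["d"],
}

def get_links(url):
    """Return the pages referenced from url (assumption: looked up in LINKS)."""
    return LINKS.get(url, [])

pages = crawl("root", get_links, max_depth=2)
```

With a depth of 0 only the root is returned. Each additional level adds the pages referenced from the pages found at the previous level.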

Stop Lists, Synonyms, and Phrases

Various options are available for specifying lists of words (terms) that are to be excluded from the indexing of the input documents, or pairs of terms that are to be treated as synonyms (i.e., counted as the same word). You can also specify that particular phrases (e.g., "Eiffel Tower") are to be treated as single terms and entries in the index. These lists can be edited and saved for repeated use, so the system can be customized to the terminology of different domains.
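The combined effect of a stop list, a synonym list, and a phrase list on the term counts can be illustrated with a short Python sketch. This is an illustration only, under the assumption of simple lowercase word tokens and two-word phrases. The lists and the function index_terms are hypothetical and do not reflect the program's file formats or internal logic.

```python
import re
from collections import Counter

# Hypothetical example lists; a real project would load these from saved files.
STOP_WORDS = {"the", "a", "an", "is", "in", "of"}
SYNONYMS = {"automobile": "car"}                 # variant -> term it is counted as
PHRASES = {("eiffel", "tower"): "eiffel_tower"}  # word pair -> single index entry

def index_terms(text):
    """Count terms after applying the phrase, stop, and synonym lists."""
    tokens = re.findall(r"[a-z]+", text.lower())
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])  # treat the phrase as one term
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    counts = Counter()
    for token in merged:
        if token in STOP_WORDS:
            continue                      # excluded from the index
        counts[SYNONYMS.get(token, token)] += 1
    return counts

counts = index_terms("The Eiffel Tower is in Paris. A car is an automobile.")
```

Phrases are merged before the stop list is applied so that a phrase containing a stop word would still be recognized. That ordering is a design choice of this sketch.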

Stemming, and Support for Different Languages

Stemming refers to the reduction of words to their roots so that, for example, different grammatical forms or conjugations of verbs are identified and indexed (counted) as the same word. For example, stemming ensures that both "travel" and "traveled" will be recognized by the program as the same word. Statistica Text and Document Mining includes stemming algorithms for most European languages, including English, French, German, Italian, and Spanish.
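As a rough illustration of the idea, the sketch below strips a few common English suffixes so that "travel" and "traveled" map to the same root. It is deliberately naive and is not the stemming algorithm used by the program. The suffix list and the function naive_stem are assumptions made for this example, and production stemmers apply far more elaborate, language-specific rules.

```python
# Hypothetical suffix list; longer suffixes are checked first.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip one common English suffix, keeping a root of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

roots = {naive_stem(w) for w in ["travel", "traveled", "traveling", "travels"]}
```

All four forms reduce to the single root "travel". A rule set this small fails on many words (for example, the doubled consonant in "travelled" is left in place), which is why real stemmers are language-specific.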

Indexing of Input Documents; Scalability of Statistica Text Mining and Document Retrieval

The indexing of the input documents is extremely fast and efficient, and is based on relational database components built into the program. The contents of this database can be saved for further updating in future sessions, or for "deployment," i.e., to score new input documents using only previously selected key terms.

Results, Summaries, and Transformations

The Text Mining Results dialog box contains numerous options for summarizing the frequency counts of different words and terms. You can also combine terms or phrases (to count them as a single term or phrase), or clear only some of the terms in the analyses.

Options are available for reviewing word/term frequencies or document frequencies, as well as transformations of those frequencies better suited for subsequent analyses (e.g., inverse document frequencies). The Results dialog box also contains options for performing singular value decomposition on the documents-by-terms frequency matrix (or transformations of frequencies) to extract dominant "dimensions" into which terms and documents can be mapped (see also latent semantic indexing).
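To make the idea of an inverse document frequency transformation concrete, the sketch below uses one common formulation: a term that occurs wf times in a document receives the weight (1 + log(wf)) * log(N / df), where N is the number of documents and df is the number of documents containing the term. Whether the program uses exactly this formula is an assumption of this example, and the function name idf_transform is hypothetical.

```python
import math

def idf_transform(counts):
    """Transform a documents-by-terms matrix of raw counts (list of rows).

    Assumed weighting: (1 + log(wf)) * log(N / df) for wf > 0, else 0.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    transformed = []
    for row in counts:
        transformed.append([
            (1 + math.log(wf)) * math.log(n_docs / df[j]) if wf > 0 else 0.0
            for j, wf in enumerate(row)
        ])
    return transformed

# Rows are documents, columns are terms (made-up counts for illustration).
weights = idf_transform([[2, 1, 0],
                         [1, 0, 0],
                         [1, 0, 3]])
```

A term that appears in every document (the first column here) receives a weight of zero, because it does not help to distinguish one document from another. Rare terms that occur often within a document receive the largest weights.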

The scores and coefficients for the extracted dimensions can also be saved for subsequent processing of new documents to map those documents into the same space. Because of the integrated architecture of the Statistica system, all results spreadsheets can be used as input data for subsequent analyses or graphs. Hence, it is easy to apply any of the large number of analytic algorithms available in Statistica to the results reported by the Text Mining and Document Retrieval module, for example, cluster analysis or any of the methods for predictive data mining, and thus to include textual information in those projects.
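The mapping of new documents into the space of the extracted dimensions can be written out with the standard truncated singular value decomposition. The notation below is an assumption made for illustration (the symbols A, U_k, Sigma_k, V_k, d, and k are not taken from the program), and the exact scaling that the program applies to saved scores and coefficients is not stated here.

```latex
% A: n x m documents-by-terms matrix of (possibly transformed) frequencies
% k: number of retained dimensions, chosen by the analyst
A \approx U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{n \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{m \times k}

% Existing documents: row i of U_k holds the scores of document i, and
U_k = A \, V_k \, \Sigma_k^{-1}

% New document with term-frequency row vector d (1 x m), transformed in the
% same way as the rows of A:
\hat{d} = d \, V_k \, \Sigma_k^{-1}
```

Because the same projection that reproduces the scores of the original documents is applied to d, the resulting vector is directly comparable to the rows of U_k. The new document must be indexed with the same term list and the same frequency transformation as the original documents.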

Options are also available to write the results computed by the program to existing input files or external databases. For example, you can score new text stored in an external database and then compute predicted values for it based on a previously trained model.