TextMiner

Provides powerful tools to process unstructured (textual) information, extract meaningful numeric indices from the text, and thus make the information contained in the text accessible to the various data mining (statistical and machine learning) algorithms available in the STATISTICA system. Information can be extracted to derive summaries for the words contained in the documents or to compute summaries for the documents based on the words contained in them. Hence, you can analyze words, clusters of words used in documents, etc., or you can analyze documents and determine similarities between them or how they are related to other variables of interest in the data mining project.

General

Element Name Description
Select database file for text mining Specify the name and path of the database file in which all information for the text mining project, including analysis specifications and results, is stored.
Create new project Creates the database for a new project. If this option is selected and the specified database file exists, the system deletes the file from disk and creates a new file with the same name. When this option is not selected, the user must specify what to do with the existing project database in 'Type of work with existing project'.
Type of work with existing project This option applies only if the 'Create new project' option is not selected. Use this option to update the existing project database with new text or documents, deploy new text or documents according to the results in the database, or reproduce results from the database.
Get text from file When this option is selected, the system reads file names, rather than the text itself, from the specified variable and extracts the text from those files on disk.
Stemming language Specify the language according to which the stemming algorithm is applied.
Detail of computed results reported Specify the level of detail of the results reported. If the Comprehensive level of detail is selected, summaries of words and documents will be generated; when All results is selected, singular value decomposition of the frequency matrix will be performed in addition to the basic results.
Statistic for occurrence There are various statistical summaries that can be computed for each word (within each document). These are mostly simple transformations of the original word frequencies, in order to achieve more meaningful indices with values and distributions (e.g., of the words across the documents) that are more suitable for subsequent analyses using other statistical or data mining techniques. Use the options in the Statistic for occurrence box to choose one of these common transformations. When you request the Summary of word occurrence in document, or perform singular value decomposition, the respective computations and summaries are computed and reported for the chosen transformation only (e.g., singular value decomposition can be performed for the raw Frequency counts, Inverse document frequency statistics, and so on).
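
The transformations mentioned above can be illustrated with a minimal sketch. It assumes commonly used textbook definitions (a binary indicator, a log-scaled frequency, and an inverse document frequency of the form (1 + ln(count)) * ln(N / df)); the exact formulas used by STATISTICA are not given in this section, and the function name and dictionary keys below are hypothetical.

```python
import math

def occurrence_statistics(count, docs_with_word, n_docs):
    """Common transformations of one word's raw count in one document.

    count          -- raw frequency of the word in the document
    docs_with_word -- number of documents that contain the word (df)
    n_docs         -- total number of documents (N)
    The formulas are textbook forms, assumed here for illustration only.
    """
    if count > 0:
        log_freq = 1.0 + math.log(count)
        # Words that occur in every document get an idf of zero.
        idf = log_freq * math.log(n_docs / docs_with_word)
    else:
        log_freq = 0.0
        idf = 0.0
    return {
        "frequency": count,             # raw count, untransformed
        "binary": 1 if count > 0 else 0,
        "log": log_freq,
        "idf": idf,
    }

stats = occurrence_statistics(count=1, docs_with_word=5, n_docs=10)
print(stats)
```

In this form, a word that appears in every document contributes nothing under the inverse document frequency transformation, while the raw frequency treats all words alike.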

Filters 1

Element Name Description
Maximum number of indexed words Specify an integer number for the approximate number of indexed words for the analysis. Note that a useful maximum number of indexed and selected (for the final results) terms will rarely exceed 1,000 words or so. Common words (such as the English 'the,' 'also,' etc.) will by default be excluded via the Exclude stop words list as selected on the Index tab. Rare or unusual words that do not occur in a minimum number of documents will also by default be excluded (see options Min % of files word occurs and Max % of files word occurs, on the Measure tab), as they are typically not useful (diagnostic) for characterizing the dimensions and structure of the documents and terms contained in them (e.g., if only one document in 10,000 documents contains the word 'stentorian,' that word is not particularly useful to characterize the contents and underlying dimensions that enable us to summarize the collection of documents in a meaningful way). The number of words or terms in the final index may not be exactly as specified here, because it is determined after all other filters and conditions are applied.
Minimum number of characters in a word Specify the minimum number of characters permissible in a word; words that are shorter than specified will not be selected (or indexed, unless the Keep unselected words in database for browsing option is selected on the Index tab) and will be excluded from the analysis.
Maximum number of characters in a word Specify the maximum number of characters permissible in a word; words that are longer than specified will not be selected (or indexed, unless the Keep unselected words in database for browsing option is selected on the Index tab), and will be excluded from the analysis.
Minimum number of characters in an indexed word Specify the minimum number of characters permissible in an indexed word after stemming; words that are shorter than specified will not be selected (or indexed, unless the Keep unselected words in database for browsing option is selected on the Index tab), and will be excluded from the analysis.
Min num of vowels Specify the minimum number of vowels permissible in a word; words with fewer vowels than specified will not be selected (or indexed, unless the Keep unselected words in database for browsing option is selected on the Index tab), and will be excluded from the analysis.
Max num of consec. consonants Specify the maximum number of consecutive consonants permissible in a word; words with more consecutive consonants than specified will not be selected (or indexed, unless the Keep unselected words in database for browsing option is selected on the Index tab), and will be excluded from the analysis.
Max num of consec. vowels Specify the maximum number of consecutive vowels permissible in a word; words with more consecutive vowels than specified will not be selected (or indexed, unless the Keep unselected words in database for browsing option is selected on the Index tab), and will be excluded from the analysis.
Max num of consec. same chars Specify the maximum number of consecutive identical characters permissible in a word; words with more consecutive identical characters than specified will not be selected (or indexed, unless the Keep unselected words in database for browsing option is selected on the Index tab), and will be excluded from the analysis.
Max num of consec. punctuations Specify the maximum number of consecutive punctuation characters permissible in a word; words with more consecutive punctuation characters than specified will not be selected (or indexed, unless the Keep unselected words in database for browsing option is selected on the Index tab), and will be excluded from the analysis. Note that this option interacts with the option Characters for word on the Characters tab. Specifically, what constitutes a 'punctuation' character here is determined by the punctuation characters specified there. By default, the only punctuation character is '-' (the dash); if this parameter is set to 1, words with two or more dashes in a row are not permissible.
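
The word-level filters in this table can be sketched as a single predicate. This is an illustration only: the default values, the function names, and the treatment of only a, e, i, o, u as vowels are assumptions, not the settings or logic of the TextMiner node.

```python
VOWELS = set("aeiou")  # assumption: English vowels only, 'y' counts as a consonant

def longest_run(word, predicate):
    """Length of the longest run of consecutive characters satisfying predicate."""
    best = current = 0
    for ch in word:
        current = current + 1 if predicate(ch) else 0
        best = max(best, current)
    return best

def longest_same_char_run(word):
    """Length of the longest run of one repeated character."""
    best = current = 0
    previous = None
    for ch in word:
        current = current + 1 if ch == previous else 1
        previous = ch
        best = max(best, current)
    return best

def passes_filters(word, min_len=2, max_len=25, min_vowels=1,
                   max_consonants=5, max_vowels=4, max_same=2,
                   max_punct=1, punctuation="-"):
    """Return True if the word satisfies every filter (hypothetical defaults)."""
    w = word.lower()
    is_vowel = lambda ch: ch in VOWELS
    is_consonant = lambda ch: ch.isalpha() and ch not in VOWELS
    is_punct = lambda ch: ch in punctuation
    return (min_len <= len(w) <= max_len
            and sum(1 for ch in w if is_vowel(ch)) >= min_vowels
            and longest_run(w, is_consonant) <= max_consonants
            and longest_run(w, is_vowel) <= max_vowels
            and longest_same_char_run(w) <= max_same
            and longest_run(w, is_punct) <= max_punct)

for candidate in ["stentorian", "well-known", "well--known", "aaargh", "rhythms"]:
    print(candidate, passes_filters(candidate))
```

With these assumed defaults, 'well-known' is kept because it contains a single dash, whereas 'well--known' is rejected for two consecutive punctuation characters.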

Filters 2

Element Name Description
Min percent of files word occurs Specify the minimum permissible document frequency (specify an integer percentage value) for the analysis. Words that occur in fewer than the indicated percentage of documents will be deemed non-diagnostic, and excluded from the analysis (e.g., if only a very small percentage of documents contain the word 'stentorian,' then that word is not particularly useful to characterize the contents and underlying dimensions that enable us to summarize the collection of documents in a meaningful way).
Max percent of files word occurs Specify the maximum permissible document frequency (specify an integer percentage value) for the analysis. Words that occur in more than the indicated percentage of documents will be deemed non-diagnostic, and excluded from the analysis (e.g., if a very large percentage of documents contain the word 'tree,' then that word may not be particularly useful to characterize the contents and underlying dimensions that enable us to summarize the collection of documents and differentiate between them in a meaningful way).
Characters allowed in word Specify the set of permissible characters that can be included in valid words. Words that contain characters not in this list will not be indexed and will be excluded from the analysis.
Characters allowed to begin word Specify the set of permissible characters that may begin valid words (may be the first letters in those words). Words that begin with characters not contained in this list will not be indexed and will be excluded from the analysis.
Characters allowed to end word Specify the set of permissible characters that may end valid words (may be the last letters in those words). Words that end with characters not contained in this list will not be indexed and will be excluded from the analysis.
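
The document-frequency limits and the character-set options above can be sketched as follows. The helper names, the example thresholds, and the default character sets are hypothetical; each document is represented simply as a set of lowercase words.

```python
import string

def document_frequency_percent(word, documents):
    """Percentage of documents (each a set of words) that contain the word."""
    hits = sum(1 for doc in documents if word in doc)
    return 100.0 * hits / len(documents)

def select_by_document_frequency(documents, min_pct, max_pct):
    """Keep words whose document frequency lies within [min_pct, max_pct]."""
    vocabulary = set().union(*documents)
    return {w for w in vocabulary
            if min_pct <= document_frequency_percent(w, documents) <= max_pct}

def has_valid_characters(word,
                         allowed=string.ascii_lowercase + "-",
                         begin=string.ascii_lowercase,
                         end=string.ascii_lowercase):
    """Check the allowed, beginning, and ending character sets (assumed defaults).

    Expects a lowercase word; uppercase letters are not in the default sets.
    """
    return (bool(word)
            and all(ch in allowed for ch in word)
            and word[0] in begin
            and word[-1] in end)

docs = [
    {"tree", "oak"},
    {"tree", "pine"},
    {"tree", "stentorian"},
    {"tree", "oak"},
]
# 'tree' occurs in 100% of documents (too common); 'pine' and 'stentorian'
# occur in 25% (too rare for a 30% minimum); only 'oak' (50%) remains.
print(select_by_document_frequency(docs, min_pct=30, max_pct=90))
```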

Index

Element Name Description
Keep unselected words in database Select this option to index unselected words; as mentioned above, it is important to distinguish between selected and unselected words vs. indexed and non-indexed words. Words or terms can be indexed in the (internal) database but not selected into the word list from which final results are computed (e.g., singular value decomposition). Unselected words may or may not be indexed, depending on the selection of this option.
Select only inclusion words Select this option to index (and include in the analysis) only those words specified in the list of inclusion words; that list can reside in a simple text file, selected via the Inclusion file option (see below).
Inclusion file Select a file including the words and terms that are to be indexed, selected, and included in the analyses. The file should be a simple text file, where each term or word is placed on a separate line. This option is only available if the Select only inclusion words option (see above) is selected.
Exclude stop words Select this option to exclude specific words from the index and, hence, from the analyses and final results.
Stop-word file Select a file that includes the list of stop words or terms that are to be ignored during indexing. This option is only available if the Exclude stop words option (see above) is selected. Stop words should be contained in a simple text file, one word or term per line. These files can further be edited via the Edit stop-word file option.
Combine synonyms Select this option to specify synonyms for indexing.
Synonym file Select a file including the synonym list for indexing. This option is only available when the Combine synonyms option (see above) is selected. To identify synonyms, each line of the text file should be structured like this: Root term: Synonym1, Synonym2, ..., Synonymk. For example: Meal: Breakfast, Lunch, Dinner, Supper
Include phrases Select this option to include phrases for indexing.
Phrase file Select a file including the phrases for indexing. These should be simple text files with a single phrase per line.
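
The synonym line format described above (Root term: Synonym1, Synonym2, ..., Synonymk) and the one-word-per-line stop-word list can be handled with a few lines of code. The sketch below is an assumption about how such files might be interpreted, not a description of the internal indexing logic; the function names are hypothetical, and matching is made case-insensitive for simplicity.

```python
def parse_synonym_lines(lines):
    """Map each synonym (lowercased) to its root term (lowercased)."""
    mapping = {}
    for line in lines:
        if ":" not in line:
            continue  # skip blank or malformed lines
        root, synonyms = line.split(":", 1)
        root = root.strip().lower()
        for synonym in synonyms.split(","):
            synonym = synonym.strip().lower()
            if synonym:
                mapping[synonym] = root
    return mapping

def index_words(words, stop_words=(), synonyms=None):
    """Drop stop words, then replace each synonym with its root term."""
    synonyms = synonyms or {}
    stop = {w.lower() for w in stop_words}
    indexed = []
    for word in words:
        word = word.lower()
        if word in stop:
            continue
        indexed.append(synonyms.get(word, word))
    return indexed

synonym_map = parse_synonym_lines(["Meal: Breakfast, Lunch, Dinner, Supper"])
print(index_words(["The", "lunch", "was", "late"],
                  stop_words=["the", "was"],
                  synonyms=synonym_map))
```

In practice the lines would be read from the selected synonym and stop-word files; lists are passed in here to keep the example self-contained.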

Delimiters

Element Name Description
Index words only between starting and ending phrases Select this option to activate the conditional processing of specific portions of the text in each document. After selecting this option, the Phrase to start indexing and Phrase to end indexing options (see below) will be enabled.
Phrase to start indexing This option is only available after the Index words only between starting and ending phrases option (see above) has been selected. In this box, specify the starting phrase, i.e., the processing of the text in each document begins after the place in the text where this phrase appears.
Phrase to end indexing This option is only available after the Index words only between starting and ending phrases option has been selected. In this box, specify the ending phrase, i.e., the processing of the text in each document will terminate at the text immediately preceding the phrase specified in this field.
Generates data source, if N for input less than Generates a data source for further analyses with other Data Miner nodes if the input data source has fewer than k observations, as specified in this edit field; note that parameter k (number of observations) will be evaluated against the number of observations in the input data source, not the number of valid or selected observations.
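
The starting and ending phrase options described above amount to extracting the text that lies between two markers. A minimal sketch follows; the function name is hypothetical, and the behavior when a phrase is missing (no text if the starting phrase is absent, text up to the end of the document if the ending phrase is absent) is an assumption, since this section does not specify it.

```python
def text_between(text, start_phrase, end_phrase):
    """Return the text after start_phrase and before end_phrase."""
    start = text.find(start_phrase)
    if start == -1:
        return ""  # assumption: nothing is indexed without a starting phrase
    begin = start + len(start_phrase)
    end = text.find(end_phrase, begin)
    if end == -1:
        end = len(text)  # assumption: index to the end of the document
    return text[begin:end]

document = "Header BEGIN body text END footer"
print(repr(text_between(document, "BEGIN", "END")))
```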