Workspace Node: Text Mining - Specifications - Filters Tab

In the Text Mining node dialog box, under the Specifications heading, select the Filters tab to specify the parameters that define which words are valid for indexing. The program creates an index of words and terms and selects a certain number of those terms for further analyses and reporting. Many more words than those selected can be indexed, and they remain available for later selection provided that the Keep unselected words in database for browsing option is selected on the Advanced tab. Otherwise, unselected words are discarded.

Use the options on this tab to prevent particular words from being indexed at all; words excluded in this way cannot later be reselected for further analyses. For performance reasons, it is desirable to keep the list of indexed words as small as possible, in particular when indexing very large document collections.

See also the Introductory Overview.

Element Name Description
Word length
Min Specify the minimum number of characters permissible in a word; words that are shorter than specified will not be indexed and will be excluded from the analysis.
Max Specify the maximum number of characters permissible in a word; words that are longer than specified will not be indexed and will be excluded from the analysis.
Min stem length Specify the minimum number of characters permissible in an indexed word after stemming; words that are shorter than specified will not be indexed and will be excluded from the analysis.
Min num of vowels Specify the minimum number of vowels permissible in a word; words with fewer vowels than specified will not be selected (or indexed, unless the Keep unselected words in database for browsing check box is selected on the Advanced tab), and will be excluded from the analysis.
Maximum number of consecutive
Consonants Specify the maximum number of consecutive consonants permissible in a word; words with more consecutive consonants than specified will not be indexed and, hence, will be excluded from the analysis.
Vowels Specify the maximum number of consecutive vowels permissible in a word; words with more consecutive vowels than specified will not be indexed and, hence, will be excluded from the analysis.
Duplicates Specify the maximum number of consecutive identical characters permissible in a word; words with more consecutive identical characters than specified will not be indexed and, hence, will be excluded from the analysis.
Punctuations Specify the maximum number of consecutive punctuation characters permissible in a word; words with more consecutive punctuation characters than specified will not be indexed and, hence, will be excluded from the analysis. Note that this option interacts with the Characters for word option on the Characters tab: what constitutes a "punctuation" character here is determined by the punctuation characters specified there. By default, the only punctuation character is "-" (the dash); if this parameter is set to 1, a word may contain single dashes but not two or more dashes in a row.

Options / C. See Common Options.
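
The filters described above can be pictured as a series of simple checks applied to each candidate word. The following Python sketch is illustrative only and is not the program's implementation: the function names, parameter names, and default values are hypothetical, the vowel set is assumed to be the English letters a, e, i, o, u, and the Min stem length check is omitted because it depends on the stemmer in use.

```python
VOWELS = set("aeiou")  # assumption: English vowels only; "y" counts as a consonant


def longest_run(word, predicate):
    """Return the length of the longest run of consecutive characters
    for which predicate(ch) is true."""
    best = current = 0
    for ch in word:
        current = current + 1 if predicate(ch) else 0
        best = max(best, current)
    return best


def longest_duplicate_run(word):
    """Return the length of the longest run of identical consecutive characters."""
    best = current = 0
    previous = None
    for ch in word:
        current = current + 1 if ch == previous else 1
        previous = ch
        best = max(best, current)
    return best


def passes_filters(word, min_len=2, max_len=25, min_vowels=1,
                   max_consonants=5, max_vowels=4, max_duplicates=2,
                   max_punctuation=1, punctuation="-"):
    """Return True if the word would be kept under these hypothetical settings."""
    w = word.lower()
    # Word length: Min and Max
    if not (min_len <= len(w) <= max_len):
        return False
    # Min num of vowels
    if sum(1 for ch in w if ch in VOWELS) < min_vowels:
        return False
    # Maximum number of consecutive consonants
    if longest_run(w, lambda ch: ch.isalpha() and ch not in VOWELS) > max_consonants:
        return False
    # Maximum number of consecutive vowels
    if longest_run(w, lambda ch: ch in VOWELS) > max_vowels:
        return False
    # Maximum number of consecutive identical characters (Duplicates)
    if longest_duplicate_run(w) > max_duplicates:
        return False
    # Maximum number of consecutive punctuation characters
    if longest_run(w, lambda ch: ch in punctuation) > max_punctuation:
        return False
    return True


if __name__ == "__main__":
    for candidate in ["well-known", "well--known", "rhythms", "strengths"]:
        print(candidate, passes_filters(candidate))
```

With these assumed settings, "well-known" is kept because it contains only a single dash at a time, while "well--known" is rejected because it contains two dashes in a row. This matches the behavior described for the Punctuations option when it is set to 1.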