Workspace Node: Text Mining - Results - Frequency Measure Tab
In the Text Mining node dialog box, under the Results heading, select the Frequency Measure tab to access the following options. See also the Introductory Overview.
- Frequency measure
- Various statistical summaries can be computed for each word (within each document). Most are simple transformations of the original word frequencies, intended to yield more meaningful indices whose values and distributions (e.g., of the words across the documents) are better suited to subsequent analyses using other statistical or data mining techniques.
Use the options in this group box to choose one of these common transformations (or to use raw word frequencies). When you request the Frequency matrix (from the Summary tab), or perform singular value decomposition (via the Concept extraction tab), the respective results are computed and reported for the chosen transformation only (e.g., singular value decomposition can be performed for the raw Frequency counts, Inverse document frequency statistics, and so on). For additional information, see also the Introductory Overview.
- Inverse document frequency [recommended]
- Select this option button to analyze and report inverse document frequencies. One issue that you may want to consider more carefully, and reflect in the indices used in further analyses, is the relative document frequency (df) of different words. For example, a term such as "guess" may occur frequently in all documents, while another term such as "software" may only occur in a few. The reason is that one might make "guesses" in various contexts, regardless of the specific topic, while "software" is a more semantically focused term that is only likely to occur in documents that deal with computer software. A common and very useful transformation that reflects both the specificity of words (document frequencies) and the overall frequency of their occurrences (word frequencies) is the so-called inverse document frequency (for the i'th word and j'th document):
idf(i,j) = 0, if wf_ij = 0
idf(i,j) = (1 + log(wf_ij)) * log(N/df_i), if wf_ij >= 1
In this formula (see also formula 15.5 in Manning and Schütze, 2002), wf_ij is the frequency of the i'th word in the j'th document, N is the total number of documents, and df_i is the document frequency for the i'th word (the number of documents that include this word). Hence, this formula both dampens the simple word frequencies via the log function and includes a weighting factor that evaluates to 0 if the word occurs in all documents (log(N/N) = log(1) = 0), and to its maximum value when a word occurs in only a single document (log(N/1) = log(N)). This transformation therefore creates indices that reflect both the relative frequencies of occurrence of words and their semantic specificities over the documents included in the analysis.
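The inverse document frequency transformation described above can be illustrated with a minimal Python sketch. It is not the program's own implementation; it assumes the weight is (1 + log(wf)) * log(N/df) for wf > 0 and 0 otherwise, uses the natural logarithm, and expects a non-empty documents-by-words table. The function name and the sample counts are hypothetical.

```python
import math

def inverse_document_frequency(counts):
    """Transform raw counts (rows = documents, columns = words).

    Assumed weighting: (1 + log(wf)) * log(N / df) for wf > 0, else 0.
    """
    n_docs = len(counts)
    n_words = len(counts[0])
    # df[i]: number of documents in which word i occurs at least once
    df = [sum(1 for doc in counts if doc[i] > 0) for i in range(n_words)]
    return [
        [
            (1 + math.log(wf)) * math.log(n_docs / df[i]) if wf > 0 else 0.0
            for i, wf in enumerate(doc)
        ]
        for doc in counts
    ]

# Hypothetical counts: column 0 is "guess", column 1 is "software".
counts = [
    [4, 0],
    [2, 0],
    [3, 5],
]
idf = inverse_document_frequency(counts)
```

In this made-up example "guess" occurs in every document, so its weighting factor log(N/N) is 0 and the whole column becomes 0, whereas "software" occurs in one document only and receives the maximum factor log(N).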
- Raw
- Select this option button (the default selection) to operate on the raw word frequencies collected in the term-document index.
- Binary
- Select this option button to analyze and report binary indicators instead of word frequencies. Specifically, this option simply records whether or not a term is used in a document; i.e.:
f(wf) = 1, for wf>0
where wf stands for the word frequency within each document. The resulting documents-by-words matrix will contain only 1's and 0's, to indicate the presence or absence of the respective word. Like the other transformations of simple word frequencies, this transformation will dampen the effect of the raw frequency counts on subsequent computations and analyses.
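As a brief illustration of the binary transformation, the following Python sketch maps every positive count to 1 and every zero count to 0. The function name and the sample counts are hypothetical and are not taken from the program.

```python
def binary_indicator(counts):
    """Replace each raw count by 1 (word present) or 0 (word absent)."""
    return [[1 if wf > 0 else 0 for wf in doc] for doc in counts]

# Hypothetical documents-by-words counts (rows = documents).
counts = [
    [4, 0],
    [2, 0],
    [3, 5],
]
binary = binary_indicator(counts)
```

Note that a word used five times and a word used once both map to 1, which is the sense in which the raw counts are dampened.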
- Logarithmic
- Select this option button to analyze and report logs of the raw word frequencies. A common transformation of the raw word frequency counts (wf) is to compute:
f(wf) = 1+log(wf), for wf>0
This transformation "dampens" the raw frequencies and thus their effect on the results of subsequent computations.
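The logarithmic transformation can be sketched in Python in the same way. Two points in the sketch are assumptions rather than documented behavior: the natural logarithm is used, and words with wf = 0 (for which the formula above is not defined) are left at 0. The function name is hypothetical.

```python
import math

def log_frequency(counts):
    """Apply f(wf) = 1 + log(wf) to positive counts.

    Assumption: counts of 0 stay 0, since the formula covers wf > 0 only.
    """
    return [
        [(1 + math.log(wf)) if wf > 0 else 0.0 for wf in doc]
        for doc in counts
    ]

# Hypothetical single document with counts 1, 0, and 100.
logged = log_frequency([[1, 0, 100]])
```

A count of 1 maps to exactly 1, while a count of 100 maps to 1 + log(100), roughly 5.6 with natural logs, so a hundredfold difference in raw frequency shrinks to less than a sixfold difference.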
Options / C. See Common Options.