Text Corpus Parsing
The TIBCO Data Science – Team Studio N-gram Dictionary Builder parses a text corpus into tokens and then builds all possible n-grams (sequences of consecutive tokens) from those tokens.
In the following example, the Dr. Seuss text is treated as two "documents", one per line.
one fish, two fish
red fish, blue fish
This would be parsed into the following n-grams:
Length | N-grams |
---|---|
Length 1 (Unigrams) | one, fish, two, red, blue |
Length 2 (Bigrams) | one fish, two fish, red fish, blue fish |
Length 3 (Trigrams) | one fish two, red fish blue |
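
As an illustration of how tokens become n-grams, the following is a minimal Python sketch, not the operator's actual implementation. The `tokenize` and `ngrams` helpers are hypothetical names, and the tokenization rule (lowercase the line, keep runs of letters, digits, and apostrophes) is an assumption; the operator's real tokenizer may behave differently. The sketch lists every run of consecutive tokens within a document, so its output can include n-grams (for example "fish two") that the listing above does not show.

```python
import re


def tokenize(line):
    # Assumption: lowercase the text and keep runs of letters, digits,
    # and apostrophes. The operator's own tokenizer may differ.
    return re.findall(r"[a-z0-9']+", line.lower())


def ngrams(tokens, n):
    # Slide a window of n consecutive tokens across the document.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Each string is one document (one line of the input file).
documents = ["one fish, two fish", "red fish, blue fish"]

for doc in documents:
    tokens = tokenize(doc)
    for n in (1, 2, 3):
        print(n, ngrams(tokens, n))
```

N-grams are built within a single document only, so no n-gram spans the boundary between the two lines.
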
The output of the N-gram Dictionary Builder operator would look like the following.
ngram | size_of_ngram | total_count_in_corpus | number_of_documents |
---|---|---|---|
one | 1 | 1 | 1 |
one fish | 2 | 1 | 1 |
fish | 1 | 4 | 2 |
... |
Keep in mind that each line in the file is treated as one document.
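
To make the three output columns concrete, here is a small Python sketch of how such a dictionary could be computed for an in-memory corpus. It is an illustration under stated assumptions, not the operator's code: `build_dictionary`, `tokenize`, and the `max_n` parameter are hypothetical names, and the tokenization rule is assumed. `total_count_in_corpus` counts every occurrence, while `number_of_documents` counts each line at most once per n-gram.

```python
import re
from collections import Counter


def tokenize(line):
    # Assumption: lowercase and keep runs of letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", line.lower())


def build_dictionary(lines, max_n=3):
    """Map each n-gram to (size_of_ngram, total_count_in_corpus, number_of_documents)."""
    sizes = {}
    total_counts = Counter()
    document_counts = Counter()
    for line in lines:  # one line is one document
        tokens = tokenize(line)
        seen_in_document = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = " ".join(tokens[i:i + n])
                sizes[gram] = n
                total_counts[gram] += 1      # every occurrence counts
                seen_in_document.add(gram)   # at most once per document
        document_counts.update(seen_in_document)
    return {g: (sizes[g], total_counts[g], document_counts[g]) for g in sizes}


corpus = ["one fish, two fish", "red fish, blue fish"]
rows = build_dictionary(corpus)
for gram in ("one", "one fish", "fish"):
    print(gram, rows[gram])
# one (1, 1, 1)
# one fish (2, 1, 1)
# fish (1, 4, 2)
```

The three printed rows match the sample output above: "fish" occurs four times in total but in only two documents, because each line contains it twice.
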
Although you cannot connect a transformation operator such as Summary Statistics (DB) directly to this operator, you can go to the location of the results on HDFS (specified in the Summary tab of the results pane), drag the file(s) onto your workflow, and use them as input to other TIBCO Data Science – Team Studio operators. If you use this method, keep in mind that the results might be stored in multiple part files.