Text Corpus Parsing
The Team Studio N-gram Dictionary Builder can parse a text corpus, split it into tokens, and then generate all possible n-grams (sequences of consecutive tokens).
In the following example, the text from Dr. Seuss is treated as two "documents", one per line.
one fish, two fish,
red fish, blue fish
This would be parsed into n-grams such as the following:
Length 1 (Unigrams) | one, fish, two, red, blue
Length 2 (Bigrams)  | one fish, two fish, red fish, blue fish
Length 3 (Trigrams) | one fish two, red fish blue
Keep in mind that each line in the file is treated as one document.
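The parsing described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the operator's actual implementation: the function names (tokenize, ngrams, build_dictionary), the lowercase-and-strip-punctuation tokenizer, and the rule that n-grams never cross a document boundary are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(line):
    """Lowercase a line and split it into word tokens, dropping punctuation.
    (Assumed tokenization rule; the operator's own rules may differ.)"""
    return re.findall(r"[a-z0-9']+", line.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_dictionary(documents, max_n=3):
    """Count n-grams of length 1..max_n across all documents.
    Each document is tokenized separately, so no n-gram spans two documents."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for doc in documents:
        tokens = tokenize(doc)
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts

# One string per line of the input file, that is, one string per document.
corpus = ["one fish, two fish,", "red fish, blue fish"]
dictionary = build_dictionary(corpus)
for n, counter in dictionary.items():
    print(n, sorted(counter))
```

Because this sketch enumerates every contiguous token sequence, its output also contains overlapping entries such as "fish two", which the table above does not list. Entries such as "fish red" never appear, because the two lines are processed as separate documents.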
You cannot connect a transformation operator such as Summary Statistics directly to this operator. Instead, go to the location of the results on HDFS (specified in the Summary tab of the results pane), drag the file(s) onto your workflow, and use them as input to other Team Studio operators. If you use this method, keep in mind that the results might be stored in multiple part files.
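If you inspect the results outside Team Studio, the multiple-part layout means that every part has to be read to see the full dictionary. The following is a minimal sketch under stated assumptions: the parts have already been copied from HDFS to a local directory, they follow the common Hadoop part-* naming convention, and they are UTF-8 text. The helper name read_result_parts is hypothetical and is not part of Team Studio.

```python
from pathlib import Path

def read_result_parts(directory, pattern="part-*"):
    """Yield the lines of every matching part file, in sorted file-name order.
    The "part-*" pattern is an assumed naming convention; adjust it as needed."""
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                for line in handle:
                    yield line.rstrip("\n")
```

For example, list(read_result_parts("ngram_results")) would collect all rows from a hypothetical local ngram_results directory, skipping marker files that do not match the pattern.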