Test Corpus Parsing

The Team Studio N-gram Dictionary Builder can parse a text corpus, split it into tokens, and then generate all possible n-grams (sequences of consecutive tokens) from those tokens.

In the following example, the text from Dr. Seuss is treated as two "documents", one per line.

one fish, two fish, 
red fish, blue fish

This would be parsed into the following n-grams:

Length 1 (Unigrams): one, fish, two, red, blue
Length 2 (Bigrams): one fish, two fish, red fish, blue fish
Length 3 (Trigrams): one fish two, red fish blue

The output of the N-gram Dictionary Builder operator would look like the following:

ngram     size_of_ngram  total_count_in_corpus  number_of_documents
one       1              1                      1
one fish  2              1                      1
fish      1              4                      2
...

Keep in mind that each line in the input file is treated as one document.
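
The counting behind the output columns can be illustrated with a minimal Python sketch. This is not the operator's implementation: the tokenization rule (lowercase, split on punctuation and whitespace), the sliding-window n-gram generation, and the names tokenize, ngrams, and build_dictionary are all assumptions made for illustration. Note that a plain sliding window also emits n-grams such as "fish two", which the listing above does not show, so the operator may apply rules that are not described here.

```python
import re
from collections import Counter


def tokenize(line):
    # Assumed rule: lowercase the line and keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", line.lower())


def ngrams(tokens, max_n):
    # Slide a window of each length 1..max_n over the token list.
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n]), n


def build_dictionary(documents, max_n=3):
    total_count = Counter()     # occurrences across the whole corpus
    document_count = Counter()  # number of documents containing the n-gram
    sizes = {}
    for line in documents:      # each line is one document
        grams = list(ngrams(tokenize(line), max_n))
        for gram, n in grams:
            total_count[gram] += 1
            sizes[gram] = n
        # Count each distinct n-gram once per document.
        for gram in {g for g, _ in grams}:
            document_count[gram] += 1
    return {g: (sizes[g], total_count[g], document_count[g]) for g in total_count}


corpus = ["one fish, two fish,", "red fish, blue fish"]
dictionary = build_dictionary(corpus)
for gram in ("one", "one fish", "fish"):
    size, total, docs = dictionary[gram]
    print(gram, size, total, docs)
```

For the three sample rows, the sketch yields the same values as the output shown above: "fish" occurs four times in total and appears in both documents, while "one" and "one fish" each occur once in a single document.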

Important: The data output of this operator is written to HDFS as a delimited file, but is not recognized by Team Studio as a tabular dataset. This is because it is a special n-gram dictionary type that is only recognized by the Text Featurizer operator.

Although you cannot connect a transformation operator such as Summary Statistics to this operator directly, you can go to the location of the results on HDFS (specified in the Summary tab of the results pane), drag the file(s) onto your workflow, and use them as input to other Team Studio operators. However, keep in mind that, if you use this method, the results might be stored as multiple part files.