Text Corpus Parsing

The TIBCO Data Science – Team Studio N-gram Dictionary Builder can parse a text corpus, create tokens, and then parse into all possible n-grams (combinations of sequential tokens).

In the following example, the Dr. Seuss text is treated as two "documents", one per line.

one fish, two fish, 
red fish, blue fish

This would be parsed into the following n-grams:

Length 1 (Unigrams): one, fish, two, red, blue
Length 2 (Bigrams): one fish, fish two, two fish, red fish, fish blue, blue fish
Length 3 (Trigrams): one fish two, fish two fish, red fish blue, fish blue fish
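
The parsing step can be approximated outside the product with a few lines of Python. This is a minimal sketch, not the operator's implementation: it assumes tokens are lowercased words with punctuation dropped, and that n-grams never cross a line (document) boundary. The helper names `tokenize` and `ngrams` are hypothetical, and the operator's actual tokenizer and options may differ.

```python
import re


def tokenize(line):
    # Assumption: lowercase the text and keep runs of letters, digits,
    # and apostrophes as tokens; punctuation is discarded.
    return re.findall(r"[a-z0-9']+", line.lower())


def ngrams(tokens, max_n=3):
    # Yield every contiguous run of 1..max_n tokens, joined by one space.
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


# Each line of the corpus is one document.
corpus = ["one fish, two fish,", "red fish, blue fish"]

for doc in corpus:
    # Print the distinct n-grams of each document, shortest first.
    distinct = sorted(set(ngrams(tokenize(doc))), key=lambda g: (g.count(" "), g))
    print(distinct)
```

Because the sketch enumerates every contiguous token run, each four-token line yields three distinct unigrams, three bigrams, and two trigrams.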

The output of the N-gram Dictionary Builder operator would look like the following.

ngram      size_of_ngram  total_count_in_corpus  number_of_documents
one        1              1                      1
one fish   2              1                      1
fish       1              4                      2
...

Keep in mind that each line in the input file is treated as one document, so number_of_documents counts the lines in which an n-gram appears at least once.
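
The two count columns can be illustrated with a short Python sketch. It is an approximation under stated assumptions, not the operator's code: tokens are lowercased words with punctuation dropped, each input line is one document, and `build_dictionary` is a hypothetical helper name.

```python
import re
from collections import Counter


def build_dictionary(lines, max_n=3):
    total_count = Counter()     # occurrences across the whole corpus
    document_count = Counter()  # number of lines containing the n-gram
    for line in lines:
        # Assumption: simple lowercase word tokens, punctuation dropped.
        tokens = re.findall(r"[a-z0-9']+", line.lower())
        grams = [
            " ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)
        ]
        total_count.update(grams)
        # A set counts each n-gram at most once per document.
        document_count.update(set(grams))
    # Map n-gram -> (size_of_ngram, total_count_in_corpus, number_of_documents).
    return {
        g: (g.count(" ") + 1, total_count[g], document_count[g])
        for g in total_count
    }


dictionary = build_dictionary(["one fish, two fish,", "red fish, blue fish"])
for gram in ("one", "one fish", "fish"):
    print(gram, *dictionary[gram])
```

For the three n-grams printed, the sketch gives the same counts as the sample rows above: "fish" occurs four times in total but in only two documents.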

Important: The data output of this operator is written to HDFS as a delimited file, but is not recognized by TIBCO Data Science – Team Studio as a tabular dataset. This is because it is a special n-gram dictionary type that is only recognized by the Text Featurizer operator.

Although you cannot connect a transformation operator such as Summary Statistics (DB) to this operator directly, you can go to the location of the results on HDFS (specified in the Summary tab of the results pane), drag the file(s) onto your workflow, and use them as input to other TIBCO Data Science – Team Studio operators. If you use this method, keep in mind that the results might be stored in multiple part files.