Text Corpus Parsing

The Team Studio N-gram Dictionary Builder parses a text corpus into tokens, and then builds all possible n-grams (sequences of consecutive tokens) from those tokens.

In the following example, the text from Dr. Seuss is treated as two "documents", one per line.

one fish, two fish, 
red fish, blue fish

This would be parsed into the following n-grams:

Length 1 (Unigrams)	one, fish, two, red, blue
Length 2 (Bigrams)	one fish, two fish, red fish, blue fish
Length 3 (Trigrams)	one fish two, red fish blue

The output of the N-gram Dictionary Builder operator would look like the following.

ngram	size_of_ngram	total_count_in_corpus	number_of_documents
one	1	1	1
one fish	2	1	1
fish	1	4	2
...
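
The counting behind the output columns can be sketched in a few lines of Python. This is only an illustration of the general technique, not the operator's implementation: the function names (tokenize, ngrams, build_dictionary), the tokenization rule (lowercase, split on whitespace, strip surrounding punctuation), and the maximum n-gram length of 3 are all assumptions made for the example.

```python
from collections import Counter


def tokenize(line):
    # Assumed rule: lowercase, split on whitespace, strip surrounding punctuation.
    words = (word.strip(".,;:!?\"'()") for word in line.lower().split())
    return [word for word in words if word]


def ngrams(tokens, n):
    # Every run of n consecutive tokens, joined with a single space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_dictionary(documents, max_n=3):
    # Maps each n-gram to (size_of_ngram, total_count_in_corpus, number_of_documents).
    total_count = Counter()
    document_count = Counter()
    sizes = {}
    for document in documents:
        tokens = tokenize(document)
        seen_in_document = set()
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                total_count[gram] += 1
                sizes[gram] = n
                seen_in_document.add(gram)
        # Each document contributes at most 1 to an n-gram's document count.
        document_count.update(seen_in_document)
    return {
        gram: (sizes[gram], total_count[gram], document_count[gram])
        for gram in total_count
    }


# One document per line, as in the Dr. Seuss example.
corpus = ["one fish, two fish, ", "red fish, blue fish"]
dictionary = build_dictionary(corpus)
for gram in ("one", "one fish", "fish"):
    size, total, docs = dictionary[gram]
    print(gram, size, total, docs, sep="\t")
```

For the three n-grams printed, the sketch gives the same counts as the sample rows above. Because it emits every run of consecutive tokens, it also produces n-grams such as "fish two" that the earlier list does not show, so the operator's own tokenization and filtering rules may differ from these assumptions.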

Keep in mind that each line in the input file is treated as one document.

Important: The data output of this operator is written to HDFS as a delimited file, but is not recognized by Team Studio as a tabular dataset. This is because it is a special n-gram dictionary type that is only recognized by the Text Featurizer operator.

Although you cannot connect a transformation operator such as Summary Statistics to this operator directly, you can go to the location of the results on HDFS (specified in the Summary tab of the results pane), drag the file(s) onto your workflow, and use them as input to other Team Studio operators. However, keep in mind that, if you use this method, the results might be split across multiple part files.
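
If you inspect the results outside of a workflow, the part files have to be read together to see the whole dictionary. The following is a minimal sketch under several assumptions that are not confirmed by this page: the part files have been copied to a local directory, their names begin with "part-", the delimiter is a tab, and each row holds the four columns shown in the sample output. The function name read_dictionary_parts is hypothetical.

```python
from pathlib import Path


def read_dictionary_parts(directory, delimiter="\t"):
    # Assumptions: files are named part-*, rows are tab-delimited, and each
    # row is ngram, size_of_ngram, total_count_in_corpus, number_of_documents.
    rows = []
    for part in sorted(Path(directory).glob("part-*")):
        with part.open(encoding="utf-8") as handle:
            for line in handle:
                fields = line.rstrip("\n").split(delimiter)
                # Skip blank lines, malformed rows, and any header row.
                if len(fields) != 4 or not all(f.isdigit() for f in fields[1:]):
                    continue
                ngram, size, total, docs = fields
                rows.append((ngram, int(size), int(total), int(docs)))
    return rows
```

Calling read_dictionary_parts on such a directory returns one list of tuples covering every part file, in file-name order.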

Related concepts

NLP Use Case

LDA Training and Model Evaluation Tips

Unsupervised Text Mining

Related tasks

Using the Results of Text Featurizer
