Text Extractor

Using the Text Extractor, users can select an HDFS input directory that contains a set of documents, and then parse that content to create a new data set that contains the parsed text.

Information at a Glance

Category: NLP
Data source type: HD
Sends output to other operators: Yes
Data processing tool: Spark

Input

No input needs to be directly connected. Select an input directory from the parameter configuration dialog box.
Bad or Missing Values
If the operator encounters an error while reading or parsing a document, it sets the read_or_parse_error column to true and writes the text of the error to the text_content column. Such an error can occur if the user does not have read permission on the selected directory (or on specific files within it), or if a file is corrupted.
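
This flag makes failed documents easy to isolate downstream. A minimal PySpark sketch, assuming the operator's output was stored as TSV with a header row at a hypothetical path:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("inspect-extractor-errors").getOrCreate()

    # Hypothetical output location; substitute the operator's actual output path.
    df = (spark.read
          .option("sep", "\t")
          .option("header", True)
          .csv("/data/output/text_extractor_results"))

    # Without an explicit schema, the boolean flag is read as a string.
    errors = df.filter(F.col("read_or_parse_error") == "true")
    errors.select("file_path", "text_content").show(truncate=False)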

Restrictions

Text Extractor accepts only the following file types.

  • .doc
  • .docx
  • .html
  • .log
  • .pdf
  • .ppt
  • .pptx
  • .rtf
  • .txt
  • .xml

Text Extractor parses only the text data; the structure of the original document is not preserved.

If the fonts in your document use a non-standard encoding and the document does not contain a /ToUnicode table for those fonts, the extracted text content might be garbled. Many different encodings and fonts exist, and it is not possible to predict all of them. Some files are produced without this important metadata: even though such a file displays and prints properly, it contains no information about the meaning of the font's letter shapes. In this case, you must recreate the file or use OCR.
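
If you need to triage PDFs for this problem before running the operator, a minimal sketch using the third-party pypdf library can report fonts that lack a /ToUnicode map. The library and the file path are assumptions for illustration; they are not part of this operator.

    from pypdf import PdfReader

    reader = PdfReader("mydoc.pdf")  # hypothetical input file
    for page_number, page in enumerate(reader.pages, start=1):
        resources = page.get("/Resources")
        if resources is None:
            continue
        fonts = resources.get_object().get("/Font")
        if fonts is None:
            continue
        for name, ref in fonts.get_object().items():
            # A font without /ToUnicode cannot be reliably mapped back to text.
            if "/ToUnicode" not in ref.get_object():
                print(f"page {page_number}: font {name} lacks /ToUnicode; "
                      "extracted text may be garbled")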

Configuration

Notes
    Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.

Data Source (HD)
    The Hadoop data source.
Input Directory
    The input directory that contains the files to parse. Wildcards and patterns are supported, as well as single-file selection.

    Tip: The input directory path can be entered manually, and the user can enter a regular expression as a path pattern (for example, /dir/user*/projectA*); see the sketch below.

    The operator parses only files with the selected extensions in the chosen directory and its tree of subdirectories; other files are skipped.
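
    If the pattern behaves like shell globbing (as the example /dir/user*/projectA* suggests), you can preview matches locally. A minimal sketch using Python's fnmatch; the candidate paths are hypothetical and would normally come from an HDFS listing:

        from fnmatch import fnmatch

        # Hypothetical candidate paths.
        paths = [
            "/dir/user1/projectA1",
            "/dir/user2/projectB",
            "/dir/admin/projectA3",
        ]

        pattern = "/dir/user*/projectA*"
        print([p for p in paths if fnmatch(p, pattern)])
        # ['/dir/user1/projectA1']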

    If no files with the selected extensions are found, the output is empty and the following error message is displayed in the addendum: "No files with selected extension were found in the input directory and subdirectories".
    Caution: Invalid Filenames
    Filenames that contain any of the characters {}[],| are not supported and cause the job to fail.
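
    A minimal pre-flight sketch that scans a local copy of the input directory for such filenames before submitting the job; the directory path is hypothetical:

        import os

        BAD_CHARS = set("{}[],|")

        # Walk the tree and report filenames the job cannot handle.
        for dirpath, _dirnames, filenames in os.walk("/data/input"):
            for filename in filenames:
                if BAD_CHARS & set(filename):
                    print("rename before running the job:",
                          os.path.join(dirpath, filename))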

File Formats to Parse
    Extensions of the files to parse, from the available options.

    Note: Filenames must explicitly include the extension. For example, a PDF file named mydoc is not read, but a PDF file named mydoc.pdf is read.
Maximum Number of Characters per File
    If a file has more characters than this limit, the file is not parsed: the read_or_parse_error column is set to true and an error is displayed in the text_content output column. The default limit is 10,000,000 characters.

    Caution: Parsing Large Files
    This limit exists to keep the Spark job from hanging when the directory contains huge files that a user might try to parse by mistake. To parse such large files, increase the limit; doing so may also require tuning the Spark memory settings.
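
    A minimal sketch that flags files likely to exceed the limit before parsing, using byte size as a rough proxy for character count; the path is hypothetical:

        from pathlib import Path

        CHAR_LIMIT = 10_000_000  # the operator's default limit

        # Byte size is only a proxy: with multi-byte encodings the true
        # character count can be lower than the byte count.
        root = Path("/data/input")  # hypothetical local copy of the input
        for path in root.rglob("*"):
            if path.is_file() and path.stat().st_size > CHAR_LIMIT:
                print("exceeds the default limit, will not be parsed:", path)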

Storage Format
    Select the format in which to store the results. The storage format is determined by your type of operator. Typical formats are Avro, CSV, TSV, or Parquet.

Compression
    Select the type of compression for the output.

    Available Parquet compression options:
      • GZIP
      • Deflate
      • Snappy
      • no compression

    Available Avro compression options:
      • Deflate
      • Snappy
      • no compression
Output Directory
    The location to store the output files.

Output Name
    The name of the output file that contains the results.

Overwrite Output
    Specifies whether to delete existing data at that path.
      • Yes - if the path exists, delete that file and save the results.
      • No - fail if the path already exists.
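
To illustrate how Storage Format, Compression, and Overwrite Output interact, here is a minimal PySpark sketch that writes a result table as Snappy-compressed Parquet and overwrites any existing output. It illustrates the underlying Spark behavior, not the operator's internal code; the output path is hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("write-example").getOrCreate()
    df = spark.createDataFrame(
        [(0, "/data/input/a.pdf", "pdf", "some text", False, False)],
        ["doc_index", "file_path", "file_extension", "text_content",
         "read_or_parse_error", "is_empty"],
    )

    # Storage Format = Parquet, Compression = Snappy, Overwrite Output = Yes.
    # mode("error") would correspond to Overwrite Output = No.
    (df.write
       .mode("overwrite")
       .option("compression", "snappy")
       .parquet("/data/output/text_extractor_results"))
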
Advanced Spark Settings
    Automatic Optimization:
      • Yes specifies using the default Spark optimization settings.
      • No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.

Output

Data Output

This operator outputs a tabular data set with the following six columns.

  • doc_index - a unique index created to identify the document.
  • file_path - the original file path.
  • file_extension - the extension of the file.
  • text_content - the text content parsed from the document.
  • read_or_parse_error - a boolean value that indicates whether an error occurred while reading or parsing this document.
    • true - an error occurred while reading or parsing; the error text appears in the text_content column.
    • false - no errors occurred while reading or parsing this document.
  • is_empty - a boolean value that is set to true if the file to be read is empty (or does not contain any alphanumeric characters).
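
Downstream NLP operators typically want only clean, non-empty documents. A minimal PySpark sketch that keeps those rows; the path is hypothetical and assumes the output was stored as Parquet:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("filter-clean-docs").getOrCreate()
    df = spark.read.parquet("/data/output/text_extractor_results")  # hypothetical

    # Drop rows with read/parse errors and rows with no alphanumeric content.
    clean = df.filter(~F.col("read_or_parse_error") & ~F.col("is_empty"))
    clean.select("doc_index", "file_path", "text_content").show(truncate=False)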