Open Document Files

Click the Browse documents button on the Text mining dialog box Quick tab or Advanced tab to display the Open document files dialog box. Use the options in this dialog box to select the document files to be analyzed (indexed). The basic algorithm implemented in Statistica Text Mining and Document Retrieval supports various input file formats, including MS Word® documents, rich text files (RTF), PDF, htm and html (downloaded web pages or URL addresses), XML, and text.

Click the Add file button to browse to the files to be included. Click the More button to display more options; you can select lists of files or URL (Web) addresses via wildcards or by crawling the web from a particular domain name or parent address.

Option Description
Document list The selected files or web (URL) addresses will be listed in the Document list box. You can delete an item from this list by clicking it and then clicking the Delete button or pressing the Delete key on your keyboard.
OK Click this button to select the files or URL addresses displayed in the Document list and to close the dialog box.
Cancel Click this button to close this dialog box without selecting any files or URL addresses.
Add file Click this button to display the standard file browser dialog box to select a list of document files; after closing that dialog box, the selected files will be transferred into the Document list box.
Delete Click this button or press the Delete key on your keyboard to remove the documents currently selected (highlighted) in the Document list.
More/Less Click this button to expand the Open document files dialog box and expose the crawler options and interface. The More button becomes a Less button, and you can use the additional options to select the document files to be analyzed (indexed) either by crawling a file directory structure on the hard disk or other storage device, or by crawling web pages (URL addresses) beginning at particular starting pages. Click the Less button (previously the More button) to close the lower portion of the dialog box and return to the reduced set of options described above.

Note: Overview of web- or file-crawling options. Use the options in the lower portion of the Open document files dialog box to select a list of root directories or web addresses, and to automatically retrieve all documents found there or further down the hierarchy of files or web addresses. For example, if you specify a subdirectory c:\MyDocuments and click the Start crawling button, all files of the type specified in the File filter field that are found in this directory will automatically be retrieved and displayed in the box in the expanded portion of the dialog box. Further, if the Level of depth is set to a value greater than 1, the program will also look for documents (of the requested type, consistent with the File filter) in the subdirectories inside c:\MyDocuments, down to the requested Level of depth in this directory structure.
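The file-crawling behavior described above can be illustrated with a short Python sketch. This is not the program's own implementation; the function name crawl_directory and its parameters are hypothetical and simply mirror the File filter and Level of depth options, under the assumption that level 1 means the root directory only.

```python
import fnmatch
import os


def crawl_directory(root, file_filter="*.txt", level_of_depth=1):
    """Return files under root that match file_filter.

    Assumption: level 1 is the root directory itself, level 2 adds its
    immediate subdirectories, and so on.
    """
    found = []

    def visit(folder, level):
        # Read the folder once and sort by name for a stable order.
        with os.scandir(folder) as it:
            entries = sorted(it, key=lambda entry: entry.name)
        for entry in entries:
            if entry.is_file() and fnmatch.fnmatch(
                entry.name.lower(), file_filter.lower()
            ):
                found.append(entry.path)
        # Descend only while the requested depth has not been reached.
        if level < level_of_depth:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    visit(entry.path, level + 1)

    visit(root, 1)
    return found


# Hypothetical usage:
# documents = crawl_directory(r"c:\MyDocuments", "*.pdf", level_of_depth=2)
```

With level_of_depth=1 the sketch returns only matching files directly inside the root; raising the value adds one further layer of subdirectories per level.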

Web-crawling works in much the same way. The program connects to the Destination web pages that have been added to the box in the expanded portion of the dialog box, and when you click the Start crawling button, it retrieves all web pages referenced in each starting web page. Put another way, the program follows every link found in the initial web page to retrieve all pages "below" it, down to the level specified in the Level of depth field.
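The link-following idea can be sketched in Python as a level-by-level traversal. This is only an illustration under stated assumptions: the names crawl_pages, extract_links, PAGES, and fake_fetch are hypothetical, the "web" is a small in-memory dictionary rather than a live site, and the sketch assumes that links found on the starting page count as level 1 and that the starting page itself is not listed in the results.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for all links found in html."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.links]


def crawl_pages(start_url, fetch, level_of_depth=1):
    """Follow links level by level, starting at start_url.

    fetch is any callable that returns the HTML text for a URL.
    """
    seen = {start_url}
    frontier = [start_url]
    retrieved = []
    for _ in range(level_of_depth):
        next_frontier = []
        for url in frontier:
            for link in extract_links(url, fetch(url)):
                if link not in seen:  # skip pages already visited
                    seen.add(link)
                    retrieved.append(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return retrieved


# Hypothetical in-memory pages standing in for a real web site.
PAGES = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.test/a.html": '<a href="/c.html">C</a>',
    "http://example.test/b.html": '<a href="/">home</a>',
    "http://example.test/c.html": "",
}


def fake_fetch(url):
    return PAGES.get(url, "")
```

Calling crawl_pages("http://example.test/", fake_fetch, 1) collects the two pages linked from the starting page; a depth of 2 also collects the page linked from a.html. The seen set keeps the link from b.html back to the home page from being followed again.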

Selecting files or URLs into the Document list: After the crawling operation has been completed, you can select files in the box in the expanded portion of the dialog box, using the standard Windows conventions (click on a file or URL to highlight/select it; to select more than one item hold down the Ctrl key and click on the items to add; to select a range of items, hold down the Shift key and click the first and last items in the range of items to select). Then click the Select button to transfer the selected files or web URLs to the Document list.

Retrieving very large collections of documents: If your work requires retrieving or crawling a very large number of documents, select the Data Mining tab. In the Text Mining group, click Web Crawling to create an input spreadsheet with the links to the specific documents. The options available via this user interface are more flexible for automatically retrieving large numbers of documents, and for producing an input spreadsheet with a text variable containing the links. You can then use the options in the Text Mining dialog box (on the Quick or Advanced tab) to specify the (text) variable containing the references to the input documents, and in that manner process practically unlimited numbers of documents in a single analysis (since Statistica input spreadsheets are not limited to a maximum size).
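To illustrate what such an input table looks like, the following Python sketch writes a list of document links to a CSV file with a single text column. This is an assumption-laden example, not the Web Crawling facility itself: the function name write_link_sheet and the column name DocumentLink are hypothetical, and the sketch assumes the resulting file would be imported into a spreadsheet as a separate step.

```python
import csv


def write_link_sheet(path, links, column="DocumentLink"):
    """Write one link per row under a single text column; return the row count."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])  # header row naming the text variable
        for link in links:
            writer.writerow([link])
    return len(links)


# Hypothetical usage:
# write_link_sheet("links.csv", ["http://statistica.io", r"c:\MyDocuments\report.pdf"])
```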

Option Description
Level of depth This parameter controls the depth to which the program will crawl to retrieve documents. If you specify the minimum value 1, only those documents found in the specified (Destination) directory or web address will be retrieved. If you specify a number greater than 1, the documents found in the subdirectories under the Destination folder, or the web pages referenced from the pages that the root Destination URL links to, will be retrieved as well, and so on down to the specified level.
Maximum results The program will retrieve documents of the type(s) specified in the File filter field; with this option you can limit the maximum number of documents that will be collected and displayed in the box in the expanded section of this dialog box (where they can be selected into the Document list).
Domain restricted When crawling web pages (URLs), you can select this check box to restrict the crawling operation to web pages belonging to the parent domain only. For example, if you type in http://statistica.io as the Destination URL, select the Domain restricted check box, and then click the Start crawling button, only pages that also belong to this domain are retrieved.
Destination In this box, specify the root folder or root URL for web crawling. After you specify the root, click the Add to crawl button to move this directory name or URL to the list box in the expanded portion of the dialog box, and to clear the Destination field. Hence, multiple root directories or web URLs can be specified and crawled.

... Click this button to display a standard Windows file browser, where you can browse to the file directory to transfer to the Destination field.

File filter In this box, specify a file filter for the crawling operation. Remember that the basic algorithm implemented in Statistica Text Mining and Document Retrieval supports various input file formats, including MS Word® documents, rich text files (RTF), PDF, htm and html (downloaded web pages or URL addresses), XML, and text.
Add to crawl Click this button to transfer the directory or base URL specified in the Destination field to the box in the extended section of this dialog box.
Start crawling Click this button to begin the crawl operation. The program will crawl the directory or web page (URL) structure beginning at the root Destinations that were added to the box in the expanded section of this dialog box. While the crawling operation is in progress, the Stop crawling button becomes active so that you can interrupt the crawl. Also, a small help message at the bottom of the dialog box updates to inform you of the progress of the operation and of its completion (the message will then read "Ready").
Stop crawling This option is only available (not dimmed) while the crawling operation is in progress. Click this button to stop the crawling operation.
Clear Click this button to clear the list of files displayed in the box in the extended section of this dialog box, e.g., to start another crawl operation.
Select Click this button to select the currently highlighted items (files, web URLs) in the box in the extended portion of the dialog box, and to transfer them into the Document list (to select them for the analyses).
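The effect of the Domain restricted and Maximum results options on a list of crawled URLs can be sketched in Python as follows. The function names same_domain and filter_results are hypothetical, and treating subdomains (for example, www.statistica.io) as part of the parent domain is an assumption made for this illustration; the program's exact matching rule is not documented here.

```python
from urllib.parse import urlparse


def same_domain(root_url, candidate_url):
    """Return True if candidate_url is on the root host or one of its subdomains.

    Counting subdomains as part of the parent domain is an assumption.
    """
    root_host = (urlparse(root_url).hostname or "").lower()
    host = (urlparse(candidate_url).hostname or "").lower()
    if not root_host:
        return False
    return host == root_host or host.endswith("." + root_host)


def filter_results(root_url, urls, domain_restricted=True, maximum_results=100):
    """Apply the domain restriction and cap the number of collected URLs."""
    kept = []
    for url in urls:
        if len(kept) >= maximum_results:
            break  # stop once the cap is reached
        if domain_restricted and not same_domain(root_url, url):
            continue
        kept.append(url)
    return kept
```

For instance, with http://statistica.io as the root, a link to http://example.com/ would be dropped when domain_restricted is True, while links under statistica.io are kept until maximum_results is reached.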