Web Crawling, Document Retrieval Overview

General overview. The options in this dialog box work in much the same way whether you search a directory structure on a storage device for document files or "crawl" the web by retrieving the pages linked from one or more (parent) web pages. In the left pane, specify the root directories or web URLs where the search is to begin: use the Add to crawl option to add each Destination (URL or folder) to the left pane. Next, select the appropriate File filter to identify the type(s) of files you want to retrieve.
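The directory case of this workflow (root folders plus a file filter) can be illustrated with a short Python sketch. This is only an analogy under stated assumptions, not how Statistica implements the dialog: the function name crawl_directories and the wildcard patterns are hypothetical, and the matching is made case-insensitive by choice.

```python
import fnmatch
import os


def crawl_directories(roots, patterns):
    """Return paths of files under the given roots that match any pattern.

    roots    -- list of starting folders (the "left pane" entries)
    patterns -- list of wildcard filters such as "*.txt" (the "File filter")
    Both names are illustrative; they are not Statistica identifiers.
    """
    lowered = [p.lower() for p in patterns]
    matches = []
    for root in roots:
        # os.walk visits the root folder and every subfolder beneath it.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                # Compare lowercased names so "*.pdf" also matches "B.PDF".
                if any(fnmatch.fnmatchcase(name.lower(), p) for p in lowered):
                    matches.append(os.path.join(dirpath, name))
    return matches


if __name__ == "__main__":
    # Example: list text and PDF files below the current folder.
    for path in crawl_directories(["."], ["*.txt", "*.pdf"]):
        print(path)
```

Each returned path corresponds to one row of the resulting list of document references.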

Ribbon bar: Select the Data Mining tab. In the Text Mining group, click Web Crawling to display the Web Crawling, Document Retrieval dialog box.

Classic menus: From the Data Mining menu, select Web Crawling, Document Retrieval to display the Web Crawling, Document Retrieval dialog box.

Use the options in this dialog box to produce a Statistica Spreadsheet (e.g., an input spreadsheet) containing links or references to files, documents, or web pages (URLs). These facilities let you automatically retrieve large lists of documents or web pages by "crawling" subdirectory structures or web links and selecting the files or pages that match particular filters. They are particularly useful for retrieving links to documents for subsequent analysis with the Text and Document Mining options.
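For the web case, the idea of collecting the links found on a parent page and keeping only those that match a filter can be sketched as follows. This is a minimal illustration using the Python standard library, not Statistica's own crawler; the names LinkCollector and extract_links, the sample HTML, and the example.com address are all hypothetical. The sketch parses an HTML string rather than downloading a page, so it runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for key, value in attrs:
            if key == "href" and value:
                # Resolve relative links against the parent page's URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url, extensions=None):
    """Return links found in html, optionally filtered by file extension."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    if extensions is None:
        return collector.links
    wanted = tuple(ext.lower() for ext in extensions)
    return [url for url in collector.links
            if urlparse(url).path.lower().endswith(wanted)]


if __name__ == "__main__":
    page = '<a href="report.pdf">Report</a> <a href="/notes/a.html">Notes</a>'
    for url in extract_links(page, "http://example.com/docs/index.html", [".pdf"]):
        print(url)
```

A full crawler would repeat this step for each retrieved page, up to some depth limit, and record every matching URL as one row of the output list.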