Web Crawling Document Retrieval Dialog

The options in the Web Crawling, Document Retrieval dialog box work in much the same way whether you search a directory structure on a storage device for document files or "crawl" the web by retrieving the pages linked from one or more (parent) Web pages.

Option Description
Level of depth File directory structures on a storage device are usually organized hierarchically, i.e., folders contain subfolders, which contain other subfolders, and so on. Web pages are organized in a similar way, i.e., Web pages usually contain links to other pages, which themselves link to other pages, and so on. This parameter controls the depth to which the program will "crawl" to retrieve documents. If you specify the minimum value 1, only those documents found in the specified (Destination) directory or at the specified Web address will be retrieved. If you specify a number greater than 1, the documents found in the subdirectories under the Destination folder, or the Web pages referenced from the page at the root Destination URL, will be retrieved as well, and so on for each additional level.
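The effect of the Level of depth parameter can be illustrated with a minimal sketch. It is not the program's own implementation: the function name crawl and the in-memory LINKS table are hypothetical stand-ins for real folders or Web pages, and the root is assumed to count as level 1.

```python
from collections import deque

def crawl(root, children, max_depth):
    """Breadth-first walk limited to max_depth levels (the root is level 1)."""
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    seen = {root}
    order = []
    queue = deque([(root, 1)])
    while queue:
        node, level = queue.popleft()
        order.append(node)
        if level == max_depth:
            continue  # deepest allowed level: do not follow its links
        for child in children.get(node, []):
            if child not in seen:  # skip items already scheduled (handles cycles)
                seen.add(child)
                queue.append((child, level + 1))
    return order

# Hypothetical link table: each key links to the items in its list.
LINKS = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": ["a", "d"],
    "c": ["root"],
}
```

With this table, a depth of 1 returns only the root, a depth of 2 adds the items linked directly from the root, and a depth of 3 adds the items linked from those.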
Max. items in crawling tree Using the options available in this dialog box, it is possible to retrieve millions of documents (e.g., by crawling the web). The tree control that is displayed in the left pane as the crawling proceeds (in response to clicking the Start button) can significantly slow down operations when several thousand folders and documents are retrieved. To avoid this performance problem, this parameter limits the number of items that will be created in the tree control (in the left pane). When that number is exceeded and the crawling process has not finished, the program will automatically create a spreadsheet and redirect all the output (file or Web references) to that spreadsheet instead.
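The overflow behavior can be pictured with the following sketch. The function route_results is hypothetical, and it assumes that items already placed in the tree stay there while only later items are redirected; the description above does not say whether earlier items are also copied to the spreadsheet.

```python
def route_results(items, max_tree_items):
    """Split crawl results between the tree and an overflow list."""
    tree, overflow = [], []
    for item in items:
        if len(tree) < max_tree_items:
            tree.append(item)      # still room in the tree control
        else:
            overflow.append(item)  # assumed: later items go to the spreadsheet
    return tree, overflow
```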
Domain restricted When crawling web pages (URLs), you can set this option to restrict the crawling operation to Web pages belonging to the parent domain(s) only. For example, if you type in http://statistica.io as the Destination URL, select the Domain restricted check box, and then Start crawling, only pages also belonging to this domain will be retrieved.
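A domain check of this kind can be sketched as follows. The function same_domain is hypothetical, and treating subdomains of the parent domain as belonging to it is an assumption; the program may apply a stricter or looser rule.

```python
from urllib.parse import urlparse

def same_domain(url, root_url):
    """Return True if url is on the root URL's host or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    root_host = (urlparse(root_url).hostname or "").lower()
    if not host or not root_host:
        return False  # no host name to compare (e.g., a relative link)
    # Assumption: subdomains such as docs.<root> count as the same domain.
    return host == root_host or host.endswith("." + root_host)
```

For example, with http://statistica.io as the root, a link to http://statistica.io/help/index.html passes the check, while a link to http://example.com/ does not.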
Done Click this button to close this dialog box.
Target (URL or Folder) In this field, specify the root folder or root URL for web crawling. After you specify the root, click the Add to crawl button (see description below) to move this directory name or URL to the left pane and clear the Target field. In this way, multiple root directories or web URLs can be specified and crawled.
Folder browser Click this button to display the Browse for Folder dialog box, where you can browse to the directory that you want to transfer to the Target field.
Option See Options Menu for descriptions of the commands on this menu.
File filter In this field, specify a file filter for the crawling operation. Only file(s) of the specified type(s) will be retrieved during the crawling operation.
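The idea of a file filter can be sketched as shown below. The filter syntax is not specified here, so the semicolon-separated wildcard patterns (e.g., *.pdf;*.doc), the case-insensitive matching, and the function name matches_filter are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase

def matches_filter(name, file_filter):
    """Return True if name matches any pattern in the filter string."""
    # Assumed syntax: wildcard patterns separated by semicolons.
    patterns = [p.strip().lower() for p in file_filter.split(";") if p.strip()]
    if not patterns:
        return True  # an empty filter accepts every file
    return any(fnmatchcase(name.lower(), p) for p in patterns)
```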
Add to crawl Click this button to transfer the directory or base URL specified in the Target field to the left pane.
Start Click this button to begin the Crawl operation. The program will crawl the directory or web page (URL) structure beginning at the root Destinations that were added to the left pane of this dialog box. A small help message at the bottom of the dialog box will update to inform you of the progress of the operation and when it is concluded (the message will read "Ready"). If the crawling process creates a directory structure (in the left pane) with more items than specified via the Max. items in crawling tree option (see above), it will automatically create a spreadsheet and redirect the output (links) to the first (text) variable in that spreadsheet.
Stop While the crawling operation is in progress, the Stop button is active; click it to interrupt the crawling.
Delete Click this button to delete highlighted (selected) items in the left pane; alternatively you can press the Delete key on your keyboard.
Clear Click this button to clear the left pane completely.
Start & put the result directly to a spreadsheet Click this button to automatically create and "grow" a spreadsheet with two variables - the first text variable will contain the file references or web links (URLs) and the second variable will contain the root directory or URL where the respective document or web page (URL) was found. The spreadsheet will automatically grow (cases will be added) as the crawling is in progress. Note that during the crawling this spreadsheet is locked and cannot be accessed by another procedure.
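The two-variable layout of the resulting spreadsheet can be sketched with comma-separated text. The function rows_to_csv and the column names reference and root are hypothetical; the program creates its own spreadsheet rather than a CSV file.

```python
import csv
import io

def rows_to_csv(found):
    """Write (reference, root) pairs as two-column CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["reference", "root"])  # hypothetical variable names
    for reference, root in found:
        writer.writerow([reference, root])  # one case (row) per document found
    return buf.getvalue()
```

Each call to writer.writerow corresponds to one case being added to the spreadsheet as the crawl proceeds.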
Start & put the result directly to local folder Click this button to start the crawling process, and to place the retrieved documents and subdirectory structure into the location specified in the Content folder field. Thus, this option will retrieve and copy the actual documents or web pages from the locations where they were found during the crawling process.
Document list (Right Pane) After the crawling process, or even during it, if the output (selected folders and documents) is displayed as a directory tree in the left pane of this dialog box, you can select items and transfer them to the Document list. You can also add local files to this list by clicking the Add file button. From the links and file references in the Document list, you can make a spreadsheet or load their contents into a local directory structure (as specified in the Content folder field).
>> To transfer items (file references or web URLs) from the left pane of this dialog box to the right pane (the Document list), select (highlight) the desired items and then click the >> button.
To select items in the left pane (the crawling tree), use the standard Windows conventions:
  • To select a continuous range of items, hold down the Shift key and click the first and last items in the range.
  • To select several items that are not in a continuous range, hold down the Ctrl key and click each item to add.
Add file Click this button to display a standard file browser, which is used to add file references to the Document list.
Delete Click this button to delete highlighted (selected) items in the right pane; alternatively you can press the Delete key on your keyboard.
Create a spreadsheet from the document list Click this button to create a spreadsheet with the file references or web URLs from the files shown in the Document list. The spreadsheet that will be created will have two variables - the first will contain the actual complete file or web reference (URL); the second will contain the root directory or web address (URL) where the respective documents were found.
Load web contents from the list to local folder Use this option to save the actual web pages shown in the Document list to a local directory. Use the options in the Folder Option group box (see description below) to specify the local storage location where you want to save those documents.
Folder Option Use the options in this group box to specify the location on the local storage device where you want to place the documents that are retrieved using the Start & put the result directly to local folder or Load web contents from the list to local folder options.
Save only filtered file results when crawling: When using the Start & put the result directly to local folder option, you can either transfer all items that were found during the crawling operations to the local storage device (to the directory specified in the Content folder field), or you can save only those files consistent with the File filter that was specified for the crawling operation.

Content folder: In this field, specify the (local) storage directory and folder where you want to place the actual documents that were retrieved during the crawling operation.

Browse: Click this button to display the Browse for Folder dialog box, used to select a specific location (folder) for the Content folder option.