Issues and Considerations for Retrieving HTM/HTML Web Pages (Web Crawling)
Statistica includes several options for retrieving the contents of web pages, for example, for subsequent text mining. See the documentation for the Web Crawling, Document Retrieval dialog box (accessible from the Data Mining tab, Text Mining group) or the Open Document Files dialog box (accessible from the Text Mining dialog box). Web pages are usually HTM or HTML documents displayed inside your internet browser, but these documents can be extremely rich and complex, as quickly becomes evident when perusing the web. Hence, a number of issues should be kept in mind when automatically retrieving such documents (using the respective Statistica options) for later viewing or analysis (via Statistica Text Mining and Web Crawling).
Issue | Description |
---|---|
When a Web page can't be found | During web crawling, if a link cannot be resolved (a broken link), the server will often return an HTML page (document) reporting the error. The crawler still considers this page a valid page and adds it to the list of retrieved/identified documents (e.g., to use as input for text mining, or to save to disk for later processing). In general, Web pages (documents) that report an error are not automatically filtered out in any way; if they would contaminate your analysis, you must screen them yourself (one such check is sketched in the first example following this table). |
The Web pages (*.html; *.htm) filter vs. the manual *.html;*.htm filter | There is a difference between selecting Web pages (*.html; *.htm) as the (file) filter for crawling and manually typing *.html;*.htm into the File Filter edit box (e.g., in the Open Document Files dialog box accessed from the Text Mining dialog box - Quick tab or Advanced tab). When you select the Web pages (*.html; *.htm) filter, every link that the server returns as an HTML MIME (Multipurpose Internet Mail Extensions) type is retrieved/identified by the crawler. This includes pages of type .asp, .cgi, etc., that return (or can return) an actual HTML page. If you manually type *.html;*.htm as the (file) filter, the program only retrieves/identifies pages (documents) with actual .html or .htm extensions. You can try both options with, for example, www.msn.com. As you might expect, the manual *.html;*.htm filter returns/identifies far fewer pages/documents than the Web pages (*.html; *.htm) filter. Note also that selecting the Web pages (*.html; *.htm) filter generally requires more time because the crawler needs to verify each page during the process (the second example following this table illustrates the distinction). |
Saving Web content (retrieved via crawling) to disk files for later viewing | If you choose to save the results of the web-crawling operation to disk (e.g., for later viewing of the web pages that were retrieved), be sure to clear the Save only filtered file results when crawling check box (in the Document Retrieval dialog box) so that the program saves all the components of the respective pages to disk. By clearing this check box, you will later be able to see the retrieved pages as they were intended to be displayed (by the server). If the Save only filtered file results when crawling check box is selected, the resulting pages that are saved to disk may be incomplete and not viewable. Selecting this check box is useful when you only want to save specific file types to disk (e.g., from an intranet site); the third example following this table shows the embedded resources that a bare HTML file leaves out. |
Adjusting the Level of depth parameter to retrieve real content | Some web pages contain only a few links or scripts (or just one link) to load other pages. You may have to adjust the Level of depth option (in the Document Retrieval and Open Document Files dialog boxes of Text Miner) and request a greater depth to reach the real content of the respective pages. For example, try crawling http://statistica.io with Level of depth set to 1: at the time of this writing, this URL contains only a single link to the real content (the last example following this table illustrates a depth-limited crawl). |
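
Because error pages are not filtered out automatically, you may want to screen the retrieved URLs yourself before feeding them to a text mining project. The following is a minimal sketch in standard-library Python (illustrative only, not a Statistica API) that keeps only URLs the server answers with HTTP 200. Note that a server that reports its errors on a page served with status 200 would still slip through and would need a content-based check instead.

```python
import urllib.request

def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Keep a page only if the server answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # broken link, DNS failure, timeout, or HTTP error
        return False

# Hypothetical crawl results; the second URL is assumed to be a broken link.
urls = ["http://statistica.io", "http://statistica.io/no-such-page"]
reachable = [u for u in urls if is_reachable(u)]
```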
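The distinction between the two file filters can be illustrated with a short sketch. The function names below are illustrative, not Statistica internals: one filter matches literal file extensions, the other asks the server what MIME type it actually serves, so dynamic pages (.asp, .cgi, etc.) that return HTML also pass.

```python
import urllib.request
from urllib.parse import urlparse

def matches_extension_filter(url: str) -> bool:
    """Mimics typing *.html;*.htm manually: only literal extensions pass."""
    return urlparse(url).path.lower().endswith((".html", ".htm"))

def matches_mime_filter(url: str, timeout: float = 10.0) -> bool:
    """Mimics the Web pages (*.html; *.htm) filter: asks the server for the
    Content-Type, so .asp, .cgi, etc. pages that serve HTML also pass."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.headers.get_content_type() == "text/html"
    except OSError:  # unreachable or server error
        return False
```

The extra HEAD request per link is also why the MIME-based filter generally takes longer, as noted in the table above.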
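To see concretely what is lost when only the filtered files are saved, the sketch below (standard-library Python, illustrative only) lists the embedded resources, such as images, stylesheets, and scripts, that an HTML page references. A complete save must store each of these alongside the .html file for the page to render as intended.

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class AssetCollector(HTMLParser):
    """Collects the URLs of resources a page embeds (images, styles, scripts)."""
    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            self.assets.append(attrs["src"])
        elif tag == "link" and "href" in attrs:
            self.assets.append(attrs["href"])
        elif tag == "script" and "src" in attrs:
            self.assets.append(attrs["src"])

page_url = "http://statistica.io"  # example URL from the text
with urllib.request.urlopen(page_url) as response:
    html = response.read().decode("utf-8", errors="replace")

collector = AssetCollector()
collector.feed(html)
# A complete save must also fetch each of these; the .html file alone omits them.
asset_urls = [urljoin(page_url, a) for a in collector.assets]
print(asset_urls)
```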
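Finally, the effect of the Level of depth parameter can be illustrated with a small breadth-first crawler. In the sketch below (standard-library Python; the convention that depth 1 means only the start page is an assumption chosen to match the example in the table, not Statistica's documented internals), crawling http://statistica.io at depth 1 yields just the jump page, while depth 2 follows its link to the real content.

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url: str, max_depth: int) -> dict:
    """Breadth-first crawl returning {url: html} for pages within max_depth.
    Depth 1 is the start page itself (assumed convention; see lead-in)."""
    pages = {}
    frontier = [(start_url, 1)]
    while frontier:
        url, depth = frontier.pop(0)
        if url in pages or depth > max_depth:
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                pages[url] = response.read().decode("utf-8", errors="replace")
        except OSError:  # skip unreachable pages
            continue
        extractor = LinkExtractor()
        extractor.feed(pages[url])
        frontier.extend((urljoin(url, link), depth + 1)
                        for link in extractor.links)
    return pages

shallow = crawl("http://statistica.io", max_depth=1)  # only the jump page
deeper = crawl("http://statistica.io", max_depth=2)   # follows its link(s)
```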