A Web Mining Architectural Model of a Distributed Crawler for Internet Searches Using the PageRank Algorithm
ABSTRACT
The World Wide Web is growing rapidly, and its data is stored in a distributed manner, creating the need for a search-engine-based architectural model that lets people search the Web. Broad Web search engines, as well as many specialized search tools, rely on Web crawlers to acquire large collections of pages for indexing and analysis. The crawler is an important module of a Web search engine, and its quality directly affects the searching quality of the engine. Such a Web crawler may interact with millions of hosts over a period of weeks or months, so robustness, flexibility, and manageability are of major importance. Given a set of seed URLs, the crawler retrieves the corresponding Web pages, parses the HTML files, adds newly discovered URLs to its queue, and returns to the first phase of this cycle; while parsing the HTML files to extract new URLs, it can also retrieve other information from them. In this paper, we describe the design of a Web crawler that uses the PageRank algorithm for distributed searches and can be run on a network of workstations. The crawler scales to several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the Web mining architecture of the system and describe efficient techniques for achieving high performance.
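The crawl cycle summarized above (fetch a page, parse its HTML, enqueue newly discovered URLs, repeat) combined with PageRank scoring over the resulting link graph can be sketched as follows. This is a minimal single-machine sketch, not the paper's actual implementation: the function names (crawl, pagerank), the seed URL, the page limit, the damping factor d=0.85, and the iteration count are all illustrative assumptions, and a production crawler would additionally need politeness delays, robots.txt handling, and distribution across workstations.

```python
# Illustrative sketch of the crawl cycle plus iterative PageRank.
# Uses only the Python standard library; all parameters are assumptions.
import urllib.request
from urllib.parse import urljoin, urldefrag
from html.parser import HTMLParser
from collections import deque

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags while parsing an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    """Breadth-first crawl: fetch a URL, parse it, enqueue new URLs, repeat.
    Returns the link graph as {url: set of outgoing urls}."""
    queue, seen, graph = deque(seeds), set(seeds), {}
    while queue and len(graph) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip unreachable pages; robustness matters at scale
        parser = LinkExtractor()
        parser.feed(html)
        out = set()
        for href in parser.links:
            target, _ = urldefrag(urljoin(url, href))  # absolutize, drop #fragment
            out.add(target)
            if target not in seen:
                seen.add(target)
                queue.append(target)
        graph[url] = out
    return graph

def pagerank(graph, d=0.85, iterations=20):
    """Iterative PageRank: PR(p) = (1-d)/N + d * sum over q linking to p
    of PR(q)/outdegree(q)."""
    nodes = set(graph) | {v for out in graph.values() for v in out}
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        new = {p: (1.0 - d) / n for p in nodes}
        for q, out in graph.items():
            if out:
                share = d * rank[q] / len(out)
                for p in out:
                    new[p] += share
        rank = new
    return rank

if __name__ == "__main__":
    g = crawl(["https://example.com/"], max_pages=10)  # hypothetical seed
    for url, score in sorted(pagerank(g).items(), key=lambda kv: -kv[1])[:5]:
        print(f"{score:.4f}  {url}")
```

In a distributed deployment of the kind the paper targets, the queue and the link graph would be partitioned across workstations rather than held in local data structures as they are in this sketch.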