Article Details

An Analysis on the Anatomy of a Large Scale Hypertextual Web Search Engine and a Web Crawler Application

Ms. Rubina Khan, in International Journal of Information Technology and Management | IT & Management

ABSTRACT:

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/.

To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of Web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the Web, very little academic research has been done on them. Furthermore, due to rapid advances in technology and Web proliferation, creating a Web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale Web search engine, the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses the question of how to build a practical large-scale system which can exploit the additional information present in hypertext. We also look at the problem of how to deal effectively with uncontrolled hypertext collections where anyone can publish anything they want.