Institute Name (Publisher)

Indian Statistical Institute

Document Type

Doctoral Thesis

Degree Name

Doctor of Philosophy

Subject Name

Computer Science


Machine Intelligence Unit (MIU-Kolkata)


Pal, Sankar Kumar (MIU-Kolkata; ISI)

Abstract (Summary of the Work)

The World Wide Web [12] (usually referred to as the Web, WWW or W3) is an enormous collection of data available over the Internet, which is a vast network of computers. It was created in 1990 by Tim Berners-Lee while he worked at CERN, Switzerland, and was made available over the Internet in 1991. The World Wide Web Consortium [136] authoritatively defines the Web as the universe of network-accessible information, the embodiment of human knowledge. The Web consists of objects, also called documents or pages in a generic sense, that are identified using a Uniform Resource Identifier (URI), or what has more popularly come to be known as the Uniform Resource Locator (URL), and these objects are connected to each other by means of hyperlinks. This interlinked nature of the Web distinguishes it from text corpora and other such collections.

The Web is a rapidly changing and expanding resource, and over the years it has seen phenomenal growth in both size and diversity. Starting from a single web site in 1990, it is now made up of millions of web sites. The indexable Web alone currently consists of billions of heterogeneous documents: in 2005, Yahoo! [145] announced that it had indexed over 20 billion documents [146], and Google [52] countered by claiming to have indexed at least three times as many [53]. These are all underestimates, as they take into account only what has been crawled, and the true size of the Web is unknown.

A wide variety of data sources contribute to the richness of the Web.
We list a few of these to indicate the roots of the Web’s heterogeneity:
• Content created for the Web and published by the authors: This is usually textual (in either plain text or HTML format), image, audio, or video content, and is generally created and distributed by professionals.
• Content created for other purposes and now made available on the Web: Examples include music, movies, and printed books.
• Public mailing lists and discussion forums: A lot of knowledge, as well as entertaining articles, is shared in this form.
• User-generated content: This recent phenomenon involves content generated by end-users, as opposed to professionals, and is changing the very face of the Web. It includes images, audio and video submitted by users of web sites and blogs.
• The deep web: This refers to content that is generated on the fly and is generally not indexed by search engines. Such content may be generated in response to some user input (explicit input may be the user’s identity or query terms, while implicit input could be the geographic location of the user’s IP address or the browsing context), or may depend on other factors (examples include time and changes in other parts of the Web).
• Activity logs: While the earlier data sources are explicitly created by authors and visitors of web sites, their activity itself, when recorded and stored, constitutes another type of web data.


Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.


Included in

Mathematics Commons