Web Connectivity Analysis

Several algorithms exist for extracting information from the pattern of links (connectivity) between web pages.

The links connecting documents in the web are in principle all equivalent: the web itself does not express a preference for one link or one document above another. Yet the connectivity, or pattern of linkages between pages, does contain a lot of implicit information about the relative importance of links. The author of a web document will normally only include links to other documents that are relevant to the general subject of the page and of sufficient quality. Thus, locating one document relevant to your goals may be sufficient to guide you to further information on that issue. High-quality documents, which contain clear, accurate and useful information, are likely to have many links pointing to them, while low-quality documents will get few or no links. Thus, although no explicit preference function is attached to a link, there is a preference implicit in the total number of links pointing to a document. This preference is produced collectively, by the group of all web authors.
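The simplest reading of this implicit preference is a count of incoming links per page. The following is a minimal sketch of that idea, not a method taken from the cited papers; the toy link table `links` and the helper `inlink_counts` are hypothetical names introduced only for illustration.

```python
from collections import Counter

# Hypothetical toy web: each page maps to the pages it links to.
links = {
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["d"],
    "d": ["c"],
}


def inlink_counts(links):
    """Count how many distinct pages link to each page."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # a page linking to itself is not a vote
                counts[target] += 1
    return dict(counts)


counts = inlink_counts(links)
# Pages ordered from most to least linked-to (ties broken by name).
ranking = sorted(counts, key=lambda page: (-counts[page], page))
```

A raw count treats every incoming link as equally valuable, whatever the quality of the page it comes from.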

Several mathematical techniques exist for extracting this information. Recently, two algorithms have been developed for this purpose: PageRank (Brin & Page 1998) and HITS (Kleinberg 1998). Both use a bootstrapping approach: they determine the quality or "authority" of a web page on the basis of the number and quality of the pages that link to it. Since the definition is recursive (a page has high quality if many high-quality pages point to it), the algorithm needs several iterations to determine the overall quality of a page. Mathematically, this is equivalent to computing the principal eigenvector of a matrix that represents the linking pattern in the selected part of the web. PageRank uses the linking matrix directly, whereas HITS uses the product of the matrix and its transpose. The latter method distinguishes two types of pages: authorities, which are pointed to by many good "hubs" (indexes or lists of web pages), and hubs, which point to many good authorities. In combination with a keyword search, which restricts the pages for which the quality is computed to a specific problem "neighborhood", these methods seem to return considerably better answers to a query.
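The recursive definition can be made concrete with a short iteration. The sketch below is an illustration under stated assumptions, not the implementation of either cited system: the function names `pagerank` and `hits`, the toy graph `toy`, the damping value 0.85, the even redistribution of rank from pages without outgoing links, and the fixed iteration count are all choices made here for the example.

```python
def all_pages(links):
    """Every page that appears as a source or as a target."""
    return sorted(set(links) | {t for ts in links.values() for t in ts})


def pagerank(links, damping=0.85, iterations=50):
    """PageRank-style scores by repeated redistribution of rank."""
    pages = all_pages(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption: rank held by pages without out-links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        for source in pages:
            targets = links.get(source, [])
            for target in targets:
                # Each page divides its current rank among its out-links.
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


def hits(links, iterations=50):
    """HITS-style hub and authority scores, normalised to unit length."""
    pages = all_pages(links)
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for source in pages:
            for target in links.get(source, []):
                auth[target] += hub[source]
        # Hub: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical toy graph: two index pages pointing at three documents.
toy = {"hub1": ["x", "y"], "hub2": ["x", "y", "z"]}
pr = pagerank(toy)
hub, auth = hits(toy)
```

In this toy graph, `x` and `y` receive links from both index pages and therefore score above `z` under both methods, while `hub2`, which points to all three documents, obtains the higher hub score.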

The disadvantage of these methods is that they are static: they merely use the (rather sparse) linking pattern that already exists; they do not allow the web to adapt to the way it is used, as the learning web algorithms propose. However, connectivity analysis and the learning web can complement each other, since the use of connectivity matrices does not require these matrices to have only binary values (either there is a link or there is not). The learning web and other techniques will produce less sparse matrices with numerical values, which can be analysed in the same way and are likely to produce more fine-grained and reliable results.
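To illustrate how the same analysis carries over to non-binary matrices, the sketch below runs a PageRank-style iteration on link strengths instead of 0/1 entries. It is only an assumption-laden example: `weighted_rank`, the damping value, and the two small matrices `binary` and `learned` are hypothetical, and the weights are invented rather than produced by any learning web algorithm.

```python
def weighted_rank(weights, damping=0.85, iterations=50):
    """PageRank-style scores where each link carries a non-negative weight."""
    pages = sorted(set(weights) | {t for row in weights.values() for t in row})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        totals = {p: sum(weights.get(p, {}).values()) for p in pages}
        # Assumption: pages with no outgoing weight spread their rank evenly.
        dangling = sum(rank[p] for p in pages if totals[p] == 0)
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        for source in pages:
            if totals[source] == 0:
                continue
            for target, w in weights[source].items():
                # Rank is shared in proportion to link strength.
                new_rank[target] += damping * rank[source] * w / totals[source]
        rank = new_rank
    return rank


# Binary matrix: page "a" links to "b" and "c" without distinction.
binary = {"a": {"b": 1, "c": 1}, "b": {"a": 1}, "c": {"a": 1}}
# Invented numerical strengths: the link a -> b is much stronger than a -> c.
learned = {"a": {"b": 0.9, "c": 0.1}, "b": {"a": 1.0}, "c": {"a": 1.0}}

binary_rank = weighted_rank(binary)
learned_rank = weighted_rank(learned)
```

With the binary matrix, `b` and `c` are indistinguishable; with the numerical strengths, `b` ranks above `c`, which is the kind of finer-grained result the paragraph above anticipates.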

Various Links on Web Connectivity Analysis

The Clever Project: research at IBM Almaden based on Kleinberg's HITS method; see also: Jon Kleinberg's Homepage with several papers, including: Authoritative sources in a hyperlinked environment, in: Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
The PageRank algorithm is being used in the Google search engine, and is sketched in: Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web (Manuscript in progress), The Anatomy of a Large-Scale Hypertextual Web Search Engine, by S. Brin & L. Page, and in: Efficient Crawling Through URL Ordering, by J. Cho, H. Garcia-Molina & L. Page.
The Web Archeology project at Digital Research
Xerox PARC UIR Webology: "information ecology" research by Pitkow, Pirolli and others, including the papers: Silk from a Sow's Ear: Extracting Usable Structures from the Web and Life, Death, and Lawfulness on the Electronic Frontier
WebQuery: Searching and Visualizing the Web Through Connectivity: a paper by J. Carriere and R. Kazman
Web Structure Analysis: a collection of links
Project Aristotle(sm): Automated Categorization of Web Resources: various links
Cybermetrics: a list of papers applying bibliometric (citation) methods to the web.
Information Retrieval and Information Extraction on the web: a very rich list of publications and other resources
Graph structure in the web: a paper by A. Broder et al., analysing the structure appearing from a huge crawl through hundreds of millions of pages
Quiver: proposes search engines based on the Spectral Filtering algorithms developed by Kleinberg

Author
F. Heylighen

Date
May 31, 2000 (modified)
Mar 24, 1999 (created)
