|
|
Additional Factors Influencing PageRank |
|
|
|
It has been widely discussed if additional
criteria beyond the link structure of the web have been
implemented in the PageRank algorithm since the scientific
work on PageRank has been published by Lawrence Page and
Sergey Brin. Lawrence Page himself outlines the following
potential influencing factors in his patent specifications for
PageRank:
- Visibility of a link
- Position of a link within a document
- Distance between web pages
- Importance of a linking page
- Up-to-dateness of a linking page
First of all,
the implementation of additional criteria in PageRank would
result in a better approximation of human usage regarding the
Random Surfer Model. Considering the visibility of a link and
its position within a document implies that a user does not
click on links completely at haphazard, but rather follows
links which are highly and immediately visible regardless of
their anchor text. The other criteria would give Google more
flexibility in determing in how far an inbound link of a page
should be considered important, than the methods which have
been described so far.
Whether or not the above
mentioned factors are actually implemented in PageRank can not
be proved empirically and shall not be discussed here. It
shall rather be illustrated in which way additional
influencing factors can be implemented in the PageRank
algorithm and which options the Google search engine thereby
gets in terms of influencing PageRank values. |
|
|
Modification of the PageRank Algorithm
|
|
|
|
To implement additional factors in PageRank, the
original PageRank algorithm has again to be modified. Since we
have to assume that PageRank calculations are still based on
numerous iterations and for the purpose of short computation
times, we have to consider to keep the number of database
queries during the iterations as small as possible. Therefore,
the following modification of the PageRank algorithm shall be
assumed:
PR(A) = (1-d) + d (PR(T1)×L(T1,A) + ... +
PR(Tn)×L(Tn,A))
Here, L(Ti,A) represents the
evaluation of a link which points from page Ti to page A.
L(Ti,A) withal replaces the PageRank weighting of page Ti by
the number of outbound links on page Ti which was given by
1/C(Ti). L(Ti,A) may consist of several factors, each of them
having to be determined only once and then being merged to one
value before the iterative PageRank calculation begins. So,
the number of database queries during the iterations stays the
same, although, admittedly, a much larger database has to be
queried at each step in comparison to the computation by use
of the original algorithm, since now there is an evaluation of
each link instead of an evaluation of pages (by the number of
their outbound links). |
|
|
Different Evaluation of Links within a
Document
|
|
|
|
Two of the criteria for the evaluation of links
mentioned by Lawrence Page in his PageRank patent
specifications are the visibilty of a link and its position
within a document. Regarding the Random Surfer Model, those
criteria reflect the probability for the random surfer
clicking on a link on a specific web page. In the original
PageRank algorithm, this probability is given by the term
(1/C(Ti)), whereby the probability is equal for each link on
one page.
Assigning different probabilities to each
link on a page can, for instance, be realized as follows:
We take a look at a web
consisting of three pages A, B anc C, where each of these
pages has outbound links to both of the other pages. Links are
weighted by two evaluation criteria X and Y. X represents the
visibility of a link. X equals 1 if a link is not particularly
emphasized, and 2 if the link is, for instance, bold or
italic. Y represents the position of a link within a document.
Y equals 1 if the link is on the lower half of the page, and 3
if the link is on the upper half of the page. If we assume a
multiplicative correlation between X and Y, the links in our
example are evaluated as follows:
X(A,B) × Y(A,B) = 1
× 3 = 3
X(A,C) × Y(A,C) = 1 × 1 = 1
X(B,A) × Y(B,A) = 2
× 3 = 6
X(B,C) × Y(B,C) = 2 × 1 = 2
X(C,A) × Y(C,A) = 2
× 3 = 6
X(C,B) × Y(C,B) = 2 × 1 = 2
For the purpose
of determinig the single factors L, the evaluated links must
not simply be weighted by the number of outbound links on one
page, but in fact by the total of evaluated links on the page.
Thereby, we get the following weighting quotients Z(Ti) for
the single pages Ti:
Z(A) = X(A,B) × Y(A,B) + X(A,C) ×
Y(A,C) = 4
Z(B) = X(B,A) × Y(B,A) + X(B,C) × Y(B,C) =
8
Z(C) = X(C,A) × Y(C,A) + X(C,B) × Y(C,B) = 8
The
evaluation factors L(T1,T2) for a link which is pointing from
page T1 to page T2 are hence given by
L(T1,T2) =
X(T1,T2) × Y(T1,T2) / Z(T1)
Their values regarding our
example are as follows:
L(A,B) = 0.75
L(A,C) =
0.25
L(B,A) = 0.75
L(B,C) = 0.25
L(C,A) =
0.75
L(C,B) = 0.25
At a damping factor d of 0.5, we
get the following equations for the calculation of PageRank
values:
PR(A) = 0.5 + 0.5 (0.75 PR(B) + 0.75
PR(C))
PR(B) = 0.5 + 0.5 (0.75 PR(A) + 0.25 PR(C))
PR(C)
= 0.5 + 0.5 (0.25 PR(A) + 0.25 PR(B))
Solving these
equations gives us the follwing PageRank values for our
example:
PR(A) = 819/693
PR(B) = 721/693
PR(C) =
539/693
First of all, we see that page A has the
highest PageRank of all three pages. This is caused by page A
receiving the relatively higher evaluated link from page B as
well as from page C.
Furthermore, we see that even by
the evaluation of single links, the sum of the PageRank values
of all pages equals 3 (2079/693) and thereby the total number
of pages. So, the PageRank values computed by our modified
PageRank algorithm can be used for the general ranking of web
pages by Google without any normalisation being needed.
|
|
|
Different Evaluation of Links by Page Specific
Criteria |
|
|
|
Besides the unequal evaluation of links within a
document, Lawrence Page mentions the possibility of evaluating
links according to criteria which are based upon the linking
page. At first glance, this does not seem necessary since it
is the main principle of PageRank to rank pages the higher,
the more high ranking pages link to them. But, at the time of
their scientific work on PageRank, Page and Brin have already
recognized that their algorithm is vulnerable to artificial
inflation of PageRank.
An artificial influence on
PageRank might be exerted by webmasters who generate a
multitude of web pages whose links distribute PageRank in a
way that single pages within that system receive a special
importance. Those pages can have a high PageRank without being
linked to from other pages with high PageRank. So, not only
the concept of PageRank is undermined, but also the search
engine's index is spammed with an innumerable amount of web
pages which were solely created to influence PageRank.
In his patent specifications for PageRank, Lawrence
Page presents the evaluation of links by the distance between
pages as a means to avoid the artificial inflation of
PageRank, because the bigger the distance between two pages,
the less likely has one webmaster control over both. A
criterium for the distance between two pages may be if they
are on the same domain or not. In this way, internal links
would be weighted less than external links. In the end, any
general measure of the distance between links can be used to
determine such a weighting. This comprehends if pages are on
the same server or not and also the geographical distance
between servers.
As another indicator for the
importance of a document, Lawrence Page mentions the
up-to-dateness of the documents which link to it. This
argument considers that the information on a page is less
likely outdated, the more pages which have been modified
recently link to it. In contrast, the original PageRank
concept, just like any method of measuring link popularity,
favours older documents which gained their inbound links in
the course of their existence and have at a higher probability
been modified less recently than new documents. Basically,
recently modified documents may be given a higher evaluation
by weighting the factor (1-d). In this way, both those
recently modified documents and the pages they link to receive
a higher PageRank. But, if a page has been modified recently,
is not necessarily an indicator for the importance of the
information presented on it. So, as suggested by Lawrence
Page, it is advisable not to favour recently modified pages
but only their outbound links.
Finally, Page mentions
the importance of the web location of a page as an indicator
of the importance of its outbound links. As an example for an
important web location he names the root page of a domain,
but, in the end, Google could exert influence on PageRank
absolutely arbitrarily.
To implement the evaluation of
the linking page into PageRank, the evaluation factor of the
modified algorithm must consist of several single factors. For
a link that points from page Ti to page A, it can be given as
follows:
L(Ti,A) = K(Ti,A) × K1(Ti) × ... × Km(Ti)
where K(Ti,A) is the above presented weighting of a
single link within a page by its visibility or position.
Additionally, an evaluation of page Ti by m criteria which are
represented by the factors Kj(Ti) takes place.
To
implement the evaluation of the linking pages, not only the
algorithm but also the proceedings of PageRank calculation
have to be modified. This shall be illustrated by an example.
We take a look at a web consisting of three pages A,
B and C, whereby page A links to the pages B and C, page B
links to page C and page C links to page A. The outbound links
of one page are evaluated equally, so there is no weighting by
visibilty or position. But now, the pages are evaluated by one
criterium. In this way, an inbound link from page C shall be
considered four times as important as an inbound link from one
of the other pages. After weighting by the number of pages, we
get the following evaluation factors:
K(A) =
0.5
K(B) = 0.5
K(C) = 2
At a damping factor d of
0.5, the equations for the computation of the PageRank values
are given by
PR(A) = 0.5 + 0.5 × 2 PR(C)
PR(B) =
0.5 + 0.5 × 0.5 × 0.5 PR(A)
PR(C) = 0.5 + 0.5 (0.5 PR(B) +
0.5 × 0.5 PR(A))
Solving the equations gives us the
follwing PageRank values:
PR(A) = 4/3
PR(B) =
2/3
PR(C) = 5/6
At the current modifications of the
PageRank algorithm, the accumulated PageRank of all pages no
longer equals the number of pages. The reason therefore is
that the weighting of the page evaluation by the number of
pages was not appropriate. To determine the proper weighting,
the web's linking structure would have to be anticipated,
which is not possible in case of the actual WWW. Therefore,
the PageRank calculated by an evaluation of linking pages has
to be normalized if there shall not be any unfounded effects
on the general ranking of pages by Google. Within the
iterative calculation, a normalization would have to take
place after each iteration to minimize unintentional
distortions.
In the case of a small web, the
evaluation of pages often causes severe distortions. In the
case of the actual WWW, these distortions should normally
equalise by the number of pages. Indeed, it is to be expected
that the evaluation of the distance between pages will cause
distortions on PageRank, since pages with many inbound links
surely tend to be linked to from different geographical
regions. But such effects can be anticipated by experience
from previous calculation periods, so that a normalisation
would only have to be marginal.
In either case,
implementing additional factors in PageRank is possible.
Indeed, the computation of PageRank values would take more
time. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
PageRank and Google are trademarks of Google Inc.,
Mountain View CA, USA.
PageRank is protected by US Patent
6,285,999.
The content of this document may be
reproduced on the web provided that a copyright notice is
included and that there is a straight HTML hyperlink to the
corresponding page at pr.efactory.de in direct
context. |
|
|
|
|
|