Hi Ted,

I am really interested in learning more about the symmetry in power
law/fractal graphs in general and how matrix math (SVD, etc.) allows us to
detect/exploit it. Would you have any recommendations for books or papers
that I could read that explore this question? I remember doing a standard
Linear Algebra course way back in college, but it never discussed its links
to graphs/symmetry...

Thanks and best regards,
Dev


On Wed, Apr 16, 2008 at 2:33 AM, Ted Dunning wrote:


Power law algorithms are ideal for this kind of parallelized problem.

The basic idea is that hub-and-authority style algorithms are intimately
related to eigenvector or singular value decompositions (depending on
whether the links are symmetric). This also means that there is a close
relationship to the asymptotic behavior of random walks on the graph.

If you represent the linkage in the web by a matrix that has columns
representing the source page and rows representing the target page, with a
1 wherever the source page has a link pointing to the target page, then a
single page can be represented by a vector with a single non-zero element
equal to 1. Multiplying that vector by the linkage matrix gives you a
vector with 1 in the positions corresponding to the pages the original page
linked to. If you multiply again, you get all the pages that you can reach
in two steps from the original page.

Mathematically, if we call the original vector x and the linkage matrix A,
the pages that x links to are just Ax. The pages that are two steps from x
are A(Ax) = A^2 x.
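
As a concrete toy example (the three-page graph and the numpy code below
are just something I made up for illustration):

    import numpy as np

    # Columns are source pages, rows are target pages: page 0 links to
    # pages 1 and 2, page 1 links to page 2, page 2 links to page 0.
    A = np.array([[0, 0, 1],
                  [1, 0, 0],
                  [1, 1, 0]])

    x = np.array([1, 0, 0])   # the vector for page 0
    print(A @ x)              # pages one step away: [0 1 1]
    print(A @ (A @ x))        # pages two steps away, i.e. A^2 x: [1 0 1]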

The eigenvector decomposition of A is just a way of writing A as a product
of three matrices:

A = U S U'

U' is the transpose of U, and U has the special property that U'U = I (it
is called orthonormal because of this).

S is a diagonal matrix.

There is lots of deep mathematical machinery and beautiful symmetry
available here, but for now we can just take this as given.
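
For a symmetric linkage matrix (links going both ways), numpy will hand you
exactly this factorization; a minimal sketch on a made-up matrix:

    import numpy as np

    # A small symmetric (undirected) linkage matrix, made up for illustration.
    A = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [1, 0, 0]], dtype=float)

    s, U = np.linalg.eigh(A)                 # A = U S U' with S = diag(s)
    S = np.diag(s)
    print(np.allclose(U @ S @ U.T, A))       # True: A = U S U'
    print(np.allclose(U.T @ U, np.eye(3)))   # True: U'U = I (orthonormal)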

The pages n steps from x are

x_n = A^n x = (U S U')^n x = (U S U')^(n-2) (U S U') (U S U') x
    = (U S U')^(n-2) (U S (U'U) S U') x = (U S U')^(n-2) (U S^2 U') x
    = U S^n U' x

This is really cool because S^n can be computed by just taking each
diagonal element and raising it to a power.
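
In code this looks like the following (same toy matrix as above; the only
point is that S^n is an element-wise power of the diagonal):

    import numpy as np

    A = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [1, 0, 0]], dtype=float)
    s, U = np.linalg.eigh(A)

    n = 5
    x = np.array([1.0, 0.0, 0.0])
    x_n = U @ np.diag(s ** n) @ U.T @ x      # U S^n U' x
    print(np.allclose(x_n, np.linalg.matrix_power(A, n) @ x))   # True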

Eigenvector decompositions have other, really deep connections. For
instance, if you take the elements of S (call the i-th one s_i) then

sum_i s_i^n

is the trace of A^n, which counts the closed walks of length n in the graph
(walks of n steps that end on the page where they started).
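
A quick numerical check of that identity on the same toy matrix:

    import numpy as np

    A = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [1, 0, 0]], dtype=float)
    s = np.linalg.eigvalsh(A)

    n = 4
    print(np.isclose(np.sum(s ** n),
                     np.trace(np.linalg.matrix_power(A, n))))   # True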

Connected (or nearly connected) clusters of pages can also be derived from
the eigenvector decomposition. This is the basis of so-called spectral
clustering. For some very impressive examples of spectral clustering see

http://citeseer.ist.psu.edu/ng01spectral.html
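
Roughly, that style of spectral clustering embeds the nodes using the top
eigenvectors of a normalized affinity matrix and then runs k-means on the
embedding. Here is a toy sketch of that idea (the affinity matrix and the
use of scikit-learn's KMeans are my own illustration, not the exact
algorithm from the paper):

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clusters(W, k):
        # Normalize: L = D^(-1/2) W D^(-1/2), with D the diagonal of row sums.
        d = W.sum(axis=1)
        L = W / np.sqrt(np.outer(d, d))
        # Embed each node using the top-k eigenvectors, row-normalized.
        vals, vecs = np.linalg.eigh(L)
        X = vecs[:, -k:]
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        return KMeans(n_clusters=k, n_init=10).fit_predict(X)

    # Two tight groups of three pages, weakly linked to each other.
    W = np.array([[0, 1, 1, 0,   0, 0],
                  [1, 0, 1, 0,   0, 0],
                  [1, 1, 0, 0.1, 0, 0],
                  [0, 0, 0.1, 0, 1, 1],
                  [0, 0, 0,   1, 0, 1],
                  [0, 0, 0,   1, 1, 0]], dtype=float)
    print(spectral_clusters(W, 2))   # e.g. [0 0 0 1 1 1]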

So eigenvectors are cool. But how can we compute them?

First, note that if A^n = U S^n U' and if some of the s_i are bigger than
others, the big ones will quickly dominate the others. That is, pretty
quickly, A^n \approx u_1 s_1^n u_1'. This means that we can compute an
approximation of u_1 by just doing A^n x where x is some random vector.
Moreover, we can compute u_2 by starting with a different random vector and
iterating the same way, but with an additional step where we forbid the
result from going towards u_1. With just a few additional wrinkles, this
gives us what is called the Lanczos algorithm. Golub and Van Loan's
excellent book Matrix Computations gives a lot of information on these
algorithms.
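
Here is a bare-bones sketch of that power-iteration idea with one deflation
step (this is a toy version, not the real Lanczos algorithm, and the test
matrix is made up):

    import numpy as np

    def power_iteration(A, n_iters=100, avoid=None, seed=0):
        # Approximate a dominant eigenvector of symmetric A by repeatedly
        # multiplying a random vector by A and renormalizing.
        x = np.random.default_rng(seed).standard_normal(A.shape[0])
        for _ in range(n_iters):
            if avoid is not None:
                x = x - (avoid @ x) * avoid   # forbid drift toward a found vector
            x = A @ x
            x = x / np.linalg.norm(x)
        return x

    A = np.array([[2.0, 1.0, 0.0],
                  [1.0, 2.0, 1.0],
                  [0.0, 1.0, 2.0]])
    u1 = power_iteration(A)
    u2 = power_iteration(A, avoid=u1, seed=1)
    print(u1 @ A @ u1, u2 @ A @ u2)   # approx. the two largest eigenvalues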

The cool thing here is that our random vector can represent a single page
and we can approximate the final result by following links. Following links
is just a (human-readable) way of saying sparse matrix multiplication. If
we do this multiplication against lots of different random starting points,
we can quickly build parallel algorithms to compute things like PageRank.
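
To make that concrete, here is a toy, single-machine imitation of the
map/reduce style of computation (plain Python, not actual Hadoop code; the
little graph and the damping factor are made up). Each page "emits" its
score split across its outgoing links, and the emissions are summed per
target page, which is one sparse matrix-vector step of a PageRank-like
iteration.

    from collections import defaultdict

    # Adjacency list: source page -> pages it links to (a made-up toy graph).
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    rank = {page: 1.0 / len(links) for page in links}
    damping = 0.85

    for _ in range(20):
        # "Map": each page emits rank / out-degree to every page it links to.
        emitted = defaultdict(float)
        for src, targets in links.items():
            for tgt in targets:
                emitted[tgt] += rank[src] / len(targets)
        # "Reduce": sum the emissions per page and apply the damping factor.
        rank = {p: (1 - damping) / len(links) + damping * emitted[p]
                for p in links}

    print(rank)   # approaches the stationary distribution of the random surfer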

I hope this helps. I know it is a big byte to take at one time.






On 4/15/08 9:31 AM, "Chaman Singh Verma" wrote:

Hello,

After googling for many days, I couldn't get an answer from the many
published reports on the ranking algorithm used by Google. Since Google
uses GFS for fault tolerance, what communication libraries might they be
using to solve such a large matrix? I presume that standard Message Passing
Interface (MPI) may not be suitable for this purpose (fault tolerance
issues).

Suppose we implement the ranking algorithm on top of Hadoop; what would be
the best way / best distributed algorithm / library, etc.?

With regards.

Chaman Singh Verma
Poona, India



