When Math Meets Practicality: Google's PageRank with Linear Algebra Tutorial: Part 1

PUBLISHED ON DEC 20, 2017 — DEVELOPMENT, MATHEMATICS, TUTORIAL

We often learn math and are not provided any practical explanation as to how it can be useful. When I learned about eigendecomposition, the math was interesting but I couldn’t see how it was useful. Why did I need to learn this aside from intellectual curiosity? I began to research applications of eigendecomposition.

It turns out that one of the most ubiquitous pieces of tech in the world, Google, uses linear algebra for its ranking system in something called PageRank. Interested in the details, I found a few articles about how it worked.

This is an article on Google's PageRank algorithm with three purposes:

 1. To serve as a basic tutorial on how Google's PageRank algorithm works.
 2. To demonstrate a practical application of linear algebra (specifically eigendecomposition). Therefore, there will be a section with a decent amount of math.
 3. To provide a playground for you to play with the data yourself.

Part 1 will discuss the simplest example of PageRank. In the next article, we will go into more challenging examples and discuss the algorithm in more detail.

Table of Contents:

Part 1: Exploring a simple world wide web

 1. A Multi-Billion Dollar Idea
 2. Considerations in Building an Indexer
 3. Representing the Web
 4. Taking the most Explanatory Eigenvector
 5. Getting the PageRank Vector
 6. Using the Eigenvector for PageRank

Part 2: Advanced cases

 1. Dangling Nodes
 2. Damping Factor
 3. The Weaknesses of PageRank
 4. Play with a larger network
 5. Additional Topics not Addressed
 6. Additional Sources

Note: I am also doing this to learn for myself. I am by no means an expert on Google's PageRank. If you feel you can meaningfully contribute to the article, please email me at andor@henosisknot.com and I will update the page, making note of your contribution.

A Multi-Billion Dollar Idea

What is PageRank?

Every time you search Google, Google has a score for each page that determines the order in which pages show up for your query. This order can be manipulated, but it ultimately relies on a “rank” assigned to each page on the web, spanning millions of values. Ranking the web was not a trivial task. Larry Page and Sergey Brin came up with the idea around 1996, and you can see the original paper here on Stanford’s site.

As you know, Larry Page and Sergey Brin’s modest proposal grew into the most utilized search engine on the web and laid the foundation for the second largest company in the world (now reorganized under Alphabet). Many would argue the PageRank algorithm is the core of Google. Since its beginnings in 1996, the algorithm has most likely gone through many iterations to improve the relevancy of a search.

Over twenty years ago, Page and Brin had to brainstorm how to build this search engine, and a few considerations were really important in the design of the algorithm. Below are the basic considerations.

Note: I found it confusing at the beginning that this doesn’t rely on the actual query you search on Google. PageRank is a general index over all pages on the web.

We want to build something to index. What do we need to consider?

Before we start rebuilding Google, we need to think about a few things:

The size of the web is very large

When you are talking about something like the web, you need to think big. Like billions of pages big. According to Google, there are at least 4.62 billion pages on the web. Keeping track of all that can be a daunting process. Anything that indexes and ranks the web needs to be able to scale well.

The point of PageRank is to represent, as a probability distribution, the likelihood that a person randomly crawling through the web will arrive at a given page.

Scale and speed should be considered in both the building of the index and the retrieval from it. Google is extremely proud of its speeds. Below the search bar you can see the query time for each search, and it is often just a fraction of a second.

Ranking needs to be relevant or people won’t use it

We also need to consider how relevant the ranking system is, i.e., what prioritizes website A over website B in the ranking? Should the most visited websites be shown first? What about the most cited websites? If a website contains a lot of outlinks (links toward other pages), should it be prioritized ahead of another site or should it be penalized?

Cheaters will try to exploit the system

Whatever system you create, you need to make sure that nobody can completely “cheat” the system to optimize their website. We need to make sure whatever we do does not allow abuse of the system.

A WebGraph: Representing the Web

One of the first problems that we need to consider is how to represent the web. Most implementations of modeling the web use a directed graph in which the edges are links and the nodes are web pages.

A simple universe

To begin with, let’s pretend that the web is extremely simple and small. Let’s say that the entire web has only 4 web pages: Page A, Page B, Page C, and Page D. This is our web universe; let’s call it U1. Sound good so far?

While we’re building this universe U1, let’s also say that each webpage has at least one outlink (this is important) and that a page cannot link to itself.

Let’s draw this out and see what it looks like:

U1: A Simple Web Universe

[Figure: the four pages of U1 and the links each one points to.]

Good start. We have a basic graph representation of U1. As you can see, Page A links to Page B and Page C, while Pages B, C, and D all link to A. Now let’s represent this in matrix form, counting one point per link into a page. Remember, an outlink is when a page references another page, so each link gets 1 point.
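Written out, with rows as the page being linked to and columns as the page doing the linking (both ordered A, B, C, D), the count matrix for U1 looks like this:

$$ \begin{bmatrix} 0 & 1 & 1 & 1\\ 1 & 0 & 0 & 0\\ 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 \end{bmatrix} $$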

If you look at the matrix, we now have a basic point system for more relevant and less relevant web pages. Simply put, pages with more points toward them seem to be more popular, and pages with fewer points are less popular. More points means greater rank; fewer points means lower rank. In this case, B, C, and D are all pointing to A, so A has the greatest centrality of all the nodes.

Let’s tally up the amount of inlinks really quickly:

Inlink Tally

Page Count
A 3
B 1
C 1
D 0
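If you want to follow along in code, here is a minimal Python sketch of this tally. The `outlinks` dictionary is just my own representation of U1, not anything from the original paper:

```python
# A dictionary mapping each page in U1 to the pages it links out to.
outlinks = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A"],
    "D": ["A"],
}

# Tally how many pages link *to* each page (its inlinks).
inlink_counts = {page: 0 for page in outlinks}
for page, targets in outlinks.items():
    for target in targets:
        inlink_counts[target] += 1

print(inlink_counts)  # {'A': 3, 'B': 1, 'C': 1, 'D': 0}
```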

Page and Brin’s idea was that more relevant pages would have higher centrality on the web. In this example, Page A has the most links pointing toward it, so it should receive the highest score.

At first glance this seems like a legitimate method to figure out which pages to rank before others, but there are actually a lot of problems with it. For example:

Problem 1: Unfair influence. A page with a lot of links has more influence on the graph than a page with fewer links. What does that mean? I can influence the ranking system by posting a whole bunch of outlinks on my page.

Problem 2: Impractical scoring. This score isn’t actually a very useful score. All it’s doing is counting the links to each other. There should be a more useful way to represent which page is more or less important.

Let’s Make it into a Markov Chain!

For a start, let’s make this graph into something called a Markov Chain. A decent explanation of a Markov Chain can be found here. If you aren’t aware of what they are, here’s the gist:

Markov Chains represent probabilities between state transitions. In this case, it represents going from one page to another page.

Consider there are 4 states: A, B, C, and D. At state A the allowable transitions are to B, C, and D. At any state, the probability of moving to another state must be in [0,1]. The sum of all the probabilities will equal 1.

In the case of the web: let’s say I’m on Page A. There is some probability I will go to Page B, some probability of going to Page C, and some probability of going to Page D. In fact, because it is a Markov matrix, from the perspective of Page A, the probabilities of moving to another page look like this:

$ P_{A \rightarrow B} + P_{A \rightarrow C} + P_{A \rightarrow D} = 1 $

Meaning that the probabilities of going to each possible next page sum to 1. Because Page A links only to pages B and C, I have a 50% chance of going to B, a 50% chance of going to C, and a 0% chance of going to D.
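Concretely, for Page A in U1:

$$ P_{A \rightarrow B} = \tfrac{1}{2}, \quad P_{A \rightarrow C} = \tfrac{1}{2}, \quad P_{A \rightarrow D} = 0 $$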

Here’s a better notation of the probability distribution:

$$ \forall s_{i} \in S: \quad \sum_{j=0}^{n-1} P_{i,j} = 1 $$ Where:
n = the total number of states in the network
s = state (or node)
S = the state space
$P_{i,j}$ = the probability of going from state i to state j

All this says is that, for any page, the probabilities of moving to the other pages sum to 1.

Let’s assume each outlink has an equal weight of directing to another webpage. To ensure that the values total 1, we will need to divide each directed link by the total number of links on the page. So, if a page has 3 outlinks, divide by 3. If it has 4, divide by 4.

Here’s a new formula: $1 = \frac{S_B}{L_A} + \frac{S_C}{L_A} + \frac{S_D}{L_A}$

Where $L_A$ is the total number of outlinks on A and $S_i$ is the count of links from A to page $i$. This formula gives us our basic column probability vector. We call this column stochastic, which is just a fancy way of saying that the matrix entries cannot be negative and each column must sum to 1.

The simplified formula for this can be viewed as such:

$$ \forall s_{i} \in S: \quad \sum_{j=0}^{n-1} \frac{L_{i,j}}{L_{s_i}} = 1 $$
Where:
n = the total number of states
s = state (or node)
S = the state space
$ L_{i,j} $ = the count of links from state i to state j
$ L_{s_i} $ = the total number of links going out of state i

Now let’s look at the resulting matrix, with rows and columns ordered A, B, C, D.
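This is the same column-stochastic matrix that reappears in the eigendecomposition section below:

$$ \begin{bmatrix} 0 & 1 & 1 & 1\\ .5 & 0 & 0 & 0\\ .5 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 \end{bmatrix} $$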

Add the columns. See how the columns add up to 1? We are on a roll.

Note: We can also describe each column as an outlink probability vector for the page that column represents. Also take note of the fact that over a lot of pages, you’d expect the matrix to be pretty sparse. This has its advantages later on when calculating the eigenvector for a large-scale matrix.
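As a quick sketch (reusing the hypothetical `outlinks` dictionary from the tally above), here is one way to build this column-stochastic matrix with numpy and check that the columns sum to 1:

```python
import numpy as np

pages = ["A", "B", "C", "D"]
outlinks = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "D": ["A"]}

# Column j holds the outlink probabilities of pages[j]:
# each link gets weight 1 / (number of outlinks on the source page).
n = len(pages)
A = np.zeros((n, n))
for j, src in enumerate(pages):
    for dst in outlinks[src]:
        A[pages.index(dst), j] = 1.0 / len(outlinks[src])

print(A)
print(A.sum(axis=0))  # every column sums to 1 (column stochastic)
```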

Simplified PageRank Algorithm

In its most basic form, here’s the simplified PageRank algorithm that we are going to be working toward.

 1. Generate an NxN matrix for the network, with columns stochastically initialized at random (or equal) values.
 2. Replace x by Ax until it converges (this should generally be fast).

I’ll explain more in a later section, but the basic premise is to keep crawling the web and iterating the rank until the ranking stabilizes. Ideally, a new web crawl will not make any impact on the ranking.
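Here is a minimal sketch of those two steps in Python. The size `N`, the random seed, and the fixed iteration count are arbitrary choices of mine, just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5

# 1. Generate an N x N column-stochastic matrix with random values:
#    non-negative entries, every column summing to 1.
A = rng.random((N, N))
A = A / A.sum(axis=0)

# 2. Replace x by Ax until it converges.
x = np.full(N, 1.0 / N)   # start with every page equally likely
for _ in range(100):
    x = A @ x

print(np.round(x, 4))     # the (approximate) PageRank vector for this network
```

For a genuinely large web you would not store the matrix densely like this; more on that in the power method section below.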

Now we need to get into some Linear Algebra concepts. If you don’t want to pay attention to any of this, feel free to skip down to the section detailing the Power Iteration.

The Basic Premise: Taking The Most Explanatory Eigenvector

At the most fundamental level, the Google algorithm relies on taking the eigenvector associated with the dominant eigenvalue and using that to calculate the PageRank score.

Let’s look at this problem again, starting from the matrix equation below:

$$ \begin{bmatrix} 0 & 1 & 1 & 1\\ .5 & 0 & 0 & 0\\ .5 & 0 & 0 & 0\\0& 0 & 0 & 0\end{bmatrix} \cdot \begin{bmatrix} P_A \\ P_B \\ P_C \\ P_D \end{bmatrix} = \begin{bmatrix} P_A \\ P_B \\ P_C \\ P_D \end{bmatrix} $$

And here it is set up as a linear set of equations:

$$
P_A = P_B + P_C + P_D \\ P_B = .5P_A \\ P_C = .5P_A \\ P_D = 0 $$

What does this system look like? Something pretty similar to the eigenvector equation!

Eigenvector Equation $$ A \vec v = \lambda \vec v $$

Where $\lambda$ is a scalar value such that scaling the eigenvector $\vec v$ by $\lambda$ gives the same result as $A \vec v$.

On a math level: Why is this powerful?

Eigenvectors and eigenvalues are super powerful because they give the directions along which the linear transformation A acts purely by scaling. They are, in a sense, the natural basis of a linear transformation.

If you’re familiar with SVD: unlike singular vectors, eigenvectors need not be orthonormal, although for symmetric matrices the Spectral Theorem gives an orthonormal eigenbasis (not explained here). Eigenvectors and eigenvalues can be used to find stable states of linear transformations, and they have a wide variety of applications apart from Google, such as image compression, stability analysis, solving differential equations, and facial recognition.

Note: For this matrix all the eigenvalues turn out to be real, but in general a matrix (even a stochastic one) can have complex eigenvalues; the dominant eigenvalue of a stochastic matrix, however, is always 1.

Using Eigenvalues with PageRank

When we do eigendecomposition, we get two things:

   1. A set of eigenvectors
   2. A set of eigenvalues

Each eigenvalue corresponds to one of the eigenvectors. The larger the eigenvalue (in magnitude), the more explanatory the corresponding eigenvector is of the transformation!

Let’s get that PageRank Vector!

Ok. So for now we are going to get the most explanatory eigenvector. To do that, we need to do eigendecomposition. Stay with the math here if you can, but if not then just move on to the next section.

Getting the Eigenvalues $$ |A - \lambda \textit{I} | = \begin{bmatrix} 0 - \lambda & 1 & 1 & 1\\ .5 & 0 - \lambda & 0 & 0\\ .5 & 0 & 0 - \lambda & 0\\0& 0 & 0 & 0 - \lambda\end{bmatrix} = 0 $$

Let’s take the determinant now, expanding along the first row. The minors below are shown without their cofactor signs; we’ll apply the alternating signs when we combine the terms.

Let’s start with the first column $$ (0-\lambda) * Det(\begin{bmatrix} 0 - \lambda & 0 & 0\\ 0 & 0 - \lambda & 0\\ 0 & 0 & 0 - \lambda\end{bmatrix}) = \lambda^4 $$

Second column $$ 1 * Det(\begin{bmatrix} .5 & 0 & 0\\ .5 & 0 - \lambda & 0\\ 0 & 0 & 0 - \lambda\end{bmatrix}) = .5 \lambda^2 $$

Third column

$$ 1 * Det(\begin{bmatrix} .5 & 0 - \lambda & 0\\ .5 & 0 & 0\\ 0 & 0 & 0 - \lambda \end{bmatrix}) = -.5 \lambda^2 $$

Fourth column $$ 1 * Det(\begin{bmatrix} .5 & 0 - \lambda & 0\\ .5 & 0 & 0 - \lambda\\ 0 & 0 & 0 \end{bmatrix}) = 0 $$

Result (applying the alternating cofactor signs $+,-,+,-$ to the four terms):

$$ \lambda^4 - .5 \lambda^2 + (-.5\lambda^2) - 0 = \lambda^4 - \lambda^2 = 0 $$

Solve:

$$ \lambda^2(\lambda^2 - 1) = \lambda^2(\lambda + 1)(\lambda - 1) = 0 $$

$$ \lambda \in \{1, -1, 0\} $$

There we go! We now have our eigenvalues for the problem. Now we need to get the corresponding eigenvector.

Getting the Eigenvectors

Let’s start by getting the eigenvector corresponding to $\lambda = 1$.

$$ A \vec v = \lambda \vec v $$

This was our matrix: $$ |A - \lambda \textit{I} | = \begin{bmatrix} 0 - \lambda & 1 & 1 & 1\\ .5 & 0 - \lambda & 0 & 0\\ .5 & 0 & 0 - \lambda & 0\\0& 0 & 0 & 0 - \lambda\end{bmatrix} = 0 $$

Fill in $\lambda = 1$ and let’s call the resulting matrix B: $$ \textbf{B} = A - \textit{I} = \begin{bmatrix} - 1 & 1 & 1 & 1\\ .5 & - 1 & 0 & 0\\ .5 & 0 & - 1 & 0\\0& 0 & 0 & - 1 \end{bmatrix} $$

$$ \textbf{B} \cdot \vec v = 0 $$ Where $\vec v$ is the length-4 vector $[v_1,v_2,v_3,v_4]$

Let’s once again write this out as a system of equations.

$$ -v_1 + v_2 + v_3 + v_4 = 0 \\
.5v_1 - v_2 = 0 \\
.5v_1 - v_3 = 0 \\
-v_4 = 0 \\
$$

Let’s use simple substitution here to solve things.

From the fourth row we know:

$v_4 = 0$

From the second and third rows, $v_2 = .5v_1$ and $v_3 = .5v_1$, so $v_2 = v_3$.

This simplifies the first row:

$$-v_1 + 2v_2 = 0 \\
v_1 = 2v_2 $$

Picking $v_2 = 1$ gives $v_1 = 2$ and $v_3 = 1$.

We now know the ratio $[2,1,1,0]$, so an eigenvector is anything in the subspace spanned by $\vec e_1 = [2,1,1,0]$. To make it part of an orthonormal basis, we also want to normalize it into a unit vector.

Here’s how to convert it into a unit vector (divide by the vector’s length):

$$ \hat v = \frac{\vec v}{\|\vec v\|} = \frac{\vec v}{\sqrt{2^2 + 1^2 + 1^2}} = \frac{\vec v}{\sqrt{6}} \approx [.8164, .4082, .4082, 0] $$

Awesome! This is the eigenvector associated with eigenvalue 1. Conveniently, it is the eigenvector we will use.

Note: Markov matrices always have a maximum eigenvalue of 1. If you’re interested, read more on the Perron–Frobenius theorem.
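If you’d rather let numpy do the work, here’s a sketch (my own code, not part of the original write-up) that checks the hand computation:

```python
import numpy as np

A = np.array([
    [0.0, 1.0, 1.0, 1.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(np.round(eigenvalues, 4))   # 1, -1 and 0 (twice), in some order

# Grab the eigenvector for the dominant eigenvalue (the one closest to 1).
dominant = np.argmax(eigenvalues.real)
v = eigenvectors[:, dominant].real
v = v / np.linalg.norm(v)         # scale to unit length
if v[0] < 0:                      # eigenvectors are only defined up to sign
    v = -v
print(np.round(v, 4))             # approximately [0.816, 0.408, 0.408, 0]
```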

Using the Eigenvector for PageRank

We’ve taken the eigenvector for the dominant eigenvalue. What does it show?

Well, let’s align the eigenvector components with the pages:

Page  Corresponding Eigenvector Component
A     0.8164
B     0.4082
C     0.4082
D     0

Now, because most of the links point toward A, we expected A to have the highest score, followed by B and C with smaller (but equal) scores. D has no links to it, so it gets the lowest score.

Let’s order them and rank them by most to least:

The Page Rank for U1

Page  Rank
A     1
B     2
C     2
D     4

Finally we have a rank! Let’s look more closely at the rank though…

Uh oh! Looks like Pages B and C are equivalent. That’s expected because they each have one link toward them, but it presents a problem. If we always rank Page B over Page C, that wouldn’t be fair to Page C. Yet we have equal eigenvector components. What do we do?

Well…that’s where Google gets fancy and uses things like hypertext analysis to figure out which to present above the other. For now, let’s just remember that this is a potential issue in providing a valid and fair PageRank and that PageRanking is a difficult task to do correctly.
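For completeness, here is a small sketch of turning the scores into ranks; note that a naive `argsort` breaks the B/C tie arbitrarily, which is exactly the fairness issue just described:

```python
import numpy as np

pages = ["A", "B", "C", "D"]
scores = np.array([0.8164, 0.4082, 0.4082, 0.0])  # the eigenvector components

order = np.argsort(-scores)   # indices from highest score to lowest
for rank, idx in enumerate(order, start=1):
    print(rank, pages[idx], scores[idx])
# 1 A 0.8164
# 2 B 0.4082   <- B and C are tied; argsort just has to pick one to list first
# 3 C 0.4082
# 4 D 0.0
```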

Calculating the eigenvector over a large dataset: Power Method

Phew. That was a lot of work for a very small web universe! How would we find the dominant eigenvector for something more complicated, e.g., 10 billion pages?

Well, that’s where the Power Method comes in.

What is the Power Method?

The power method is an iterative method that lets you calculate the dominant eigenvector over very large datasets. It’s cheap (each iteration costs time proportional to the number of non-zero entries of the matrix, rather than the roughly $O(n^3)$ of a full eigendecomposition), it only has to keep a single vector around, and while it takes some iterations to converge it is probably the best way to handle something like the web.

Where does the power method shine? When matrices are sparse. Guess what? When you index the web and have billions of pages, you get a really sparse matrix!

Let’s start off with the objective of the power method. Ultimately we want a stable ranking of pages.

$\vec{p}^{(t)}$ = the PageRank vector at iteration $t$
$A$ = the column-stochastic Markov matrix

Want: a $\vec{p}$ where $A \vec{p}^{(t)} = \vec{p}^{(t+1)} = \vec{p}^{(t)}$, i.e. $A \vec{p} = \vec{p}$

At that point the vector $\vec{p}$ has converged and we have our PageRank. If you want to think about it another way: one more update doesn’t change the result.

Pause for a second.

Imagine for a second you are Larry Page and Sergey Brin. You are thinking about how to take this big giant web and converge everything to a stable ranking. It’s too big to do in one go, and moreover it is rapidly changing. You decide to pick an initial guess of the ranking and set everything to that guess. It probably isn’t right, but since we don’t know the probabilities beforehand it’s the best we can do to start.

We then build the stochastic matrix as we did before, and multiply it by the current PageRank vector, giving us a new PageRank vector.

What’s cool about this? For one, the operation is super cheap: the number of operations is proportional to the number of non-zero entries of A. That’s scalable, which is great.

Of course, as we iterate there will be some error left in the estimate, but that error should generally decrease over time. I won’t go into details in this article about the sensitivity of the power method, but its convergence is generally insensitive to changes in parameters and does not depend on the size of the matrix.
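Here’s a minimal sketch of that loop using numpy and a scipy sparse matrix (my own choices of tooling, function name, and demo network), so each multiplication only touches the non-zero entries. One honest caveat: on our tiny U1 matrix this plain loop bounces between two vectors forever because of the $-1$ eigenvalue we computed earlier; the damping factor covered in Part 2 is what guarantees convergence on a real web, so treat this as the skeleton of the method rather than something to run on U1 as-is.

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank_power_iteration(A, tol=1e-9, max_iter=1000):
    """Repeatedly replace p with A @ p until p stops changing.

    A is assumed to be column stochastic. Stored sparsely, each iteration
    costs time proportional to the number of non-zero entries of A.
    """
    n = A.shape[0]
    p = np.full(n, 1.0 / n)      # initial guess: all pages equally likely
    for _ in range(max_iter):
        p_next = A @ p           # one rank update
        if np.linalg.norm(p_next - p, 1) < tol:
            return p_next        # converged
        p = p_next
    return p                     # hit the iteration cap without converging

# Demo on a random 5-page column-stochastic network (dense data wrapped in a
# sparse container just to show the call; a real web graph would be huge and sparse).
rng = np.random.default_rng(0)
dense = rng.random((5, 5))
A = csr_matrix(dense / dense.sum(axis=0))
print(np.round(pagerank_power_iteration(A), 4))
```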

Proof of convergence

This section can be omitted if you want. It explains why we can expect convergence using the power iteration.

I’m going to walk through one of the cooler proofs I came across. It’s easy to understand, and if you’re familiar with diagonalization of matrices this will seem simple.

Let $P$ be the matrix whose columns are the eigenvectors of A (assuming A is diagonalizable). Then A can be represented as:

$$ \textbf{A} = \textbf{P} \textbf{D} \textbf{P}^{-1} $$

Let’s represent the power iteration through diagonalization. Note that the interior factors cancel out as we multiply them together, because each $\textbf{P}^{-1}\textbf{P}$ is the identity:

$$ \textbf{A}^{k} \vec{p} = (\textbf{P} \textbf{D} \textbf{P}^{-1})^k \vec{p} = (\textbf{P} \textbf{D} \textbf{P}^{-1})(\textbf{P} \textbf{D} \textbf{P}^{-1}) \cdots (\textbf{P} \textbf{D} \textbf{P}^{-1}) \vec{p} = \textbf{P} \textbf{D}^{k} \textbf{P}^{-1} \vec{p} $$

Where D is a diagonal matrix consisting of the eigenvalues of A, and P is an invertible matrix.

Now remember that we only care about the dominant eigenvalue. That’s because, as we iterate, the dominant eigenvector ends up preserving most of the information of the network: as $k$ goes to $\infty$, the other eigenvalues raised to the $k$-th power shrink toward zero!
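Concretely, since $\textbf{D}$ is diagonal, raising it to the $k$-th power just raises each eigenvalue to the $k$-th power:

$$ \textbf{D}^{k} = \begin{bmatrix} \lambda_1^{k} & & \\ & \ddots & \\ & & \lambda_n^{k} \end{bmatrix}, \qquad \lambda_i^{k} \rightarrow 0 \text{ as } k \rightarrow \infty \text{ whenever } |\lambda_i| < 1 $$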

The result: repeatedly multiplying our initial guess by the Markov matrix eventually converges to the dominant eigenvector. Now we can rank a very large network.

At this point we have the ability to generate a PageRank on a simple network without much issue. This will break down as the network gets more complex. If you want, you can call it a day right here and say you know the basics of how PageRank works. If you want to learn more, go to Part 2 [to be released next week], where we will explore concepts like dangling nodes and closed networks.