Using PageRank

PageRank measures the importance of each vertex in a graph. PageRank was started by Google's founders, who used the theory that the most important pages on the Internet are the pages with the most links leading to them. PageRank also looks at the importance of a page leading to the target page. So, if a given web page has incoming links from higher rank pages, it will be ranked higher.

Getting ready

We are going to use Wikipedia page link data to calculate page rank. Wikipedia publishes its data in the form of a database dump. We are going to use link data from http://haselgrove.id.au/wikipedia.htm, which has the data in two files:

  • links-simple-sorted.txt
  • titles-sorted.txt

I have put both of them on Amazon S3 at s3n://com.infoobjects.wiki/links ...

Get Spark Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.