PageRankCluster

This application is designed to run on an HDFS cluster and output a list of web pages ranked by their computed PageRank scores. The crawl data comes from the Common Crawl repository hosted on AWS. The project uses Apache Spark's GraphX API. The sample input files are taken from the hyperlink graph provided by Web Data Commons at http://webdatacommons.org/hyperlinkgraph/
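
To illustrate the kind of ranking the application produces, here is a minimal sketch of PageRank over an edge list. It is plain Python and is not the project's GraphX code. It assumes each input line holds a whitespace-separated source ID and target ID, which is an assumption about the input layout. The names `parse_edges`, `pagerank`, and `ranked_pages`, the damping factor of 0.85, and the iteration count are all hypothetical choices made for this example. GraphX's built-in PageRank may normalize scores differently.

```python
from collections import defaultdict


def parse_edges(lines):
    """Parse 'source target' pairs, one per line; skip blank and '#' lines."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        edges.append((src, dst))
    return edges


def pagerank(edges, damping=0.85, iterations=50):
    """Return {node: score} using power iteration; scores sum to 1."""
    out_links = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        out_links[src].append(dst)
        nodes.update((src, dst))

    n = len(nodes)
    if n == 0:
        return {}

    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(ranks[node] for node in nodes if node not in out_links)
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {node: base for node in nodes}
        # Each page passes an equal share of its rank to every page it links to.
        for src, targets in out_links.items():
            share = damping * ranks[src] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        ranks = new_ranks
    return ranks


def ranked_pages(edges, **kwargs):
    """Return (node, score) pairs sorted from highest to lowest score."""
    ranks = pagerank(edges, **kwargs)
    return sorted(ranks.items(), key=lambda item: (-item[1], item[0]))
```

For example, `ranked_pages(parse_edges(open("edges.txt")))` would return the pages from a hypothetical local file `edges.txt` ordered by score. On a cluster, the same computation would be delegated to GraphX rather than run in a single process.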