PageRankCluster

This application is designed to run on an HDFS cluster and output a list of web pages ranked by their computed PageRank scores. The crawl data is meant to come from the Common Crawl repository on AWS. The project uses Apache Spark's GraphX API. The sample input files are taken from the hyperlink graph published by Web Data Commons at http://webdatacommons.org/hyperlinkgraph/
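
To make the ranking step concrete, here is a minimal single-machine sketch of PageRank by power iteration in plain Python. It is not this project's code and does not use Spark or GraphX; it only illustrates the kind of computation described above. The function names `pagerank` and `ranked_pages`, the damping factor of 0.85, the iteration count, and the tiny edge list are all assumptions chosen for illustration. Scores here are normalized to sum to 1, which may differ from the scaling a GraphX job reports.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Return {node: score} for a list of (source, target) link pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}  # start from a uniform distribution

    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(ranks[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {node: base for node in nodes}
        for src, targets in out_links.items():
            if not targets:
                continue
            share = damping * ranks[src] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        ranks = new_ranks
    return ranks


def ranked_pages(edges, **kwargs):
    """Return (node, score) pairs sorted from highest to lowest score."""
    ranks = pagerank(edges, **kwargs)
    return sorted(ranks.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy graph: each pair is (linking page id, linked page id).
    sample_edges = [(1, 3), (2, 3), (3, 1)]
    for node, score in ranked_pages(sample_edges):
        print(node, round(score, 4))
```

On a cluster the same idea would be expressed over a distributed graph rather than in-memory dictionaries; the sketch is only meant to show what "ranked in order of PageRank" computes.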