
DhanushkiMapitigama/DataEngineering_Group7

 
 


DataEngineering_Group7

Data Uploading

To upload data to HDFS, run the following command, changing the date in both the download URL and the HDFS target path as needed:

wget -qO- https://files.pushshift.io/reddit/comments/RC_2011-08.zst | zstd --long=31 -d --stdout - | docker exec -i hadoop-namenode hdfs dfs -put - /reddit/RC_2011-08.json

This downloads the archive from the pushshift.io website, decompresses it, and uploads it to HDFS as a single stream, so no intermediate file is written to local disk. The data is stored in the /reddit directory on HDFS.
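To upload several months, the same pipeline can be wrapped in a small function. The following is a minimal sketch, assuming bash and the same container name, URL pattern, and /reddit target as the command above. The function names dump_name and upload_month are hypothetical helpers for illustration and are not part of this repository.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the streaming upload shown above in reusable functions.
# pipefail makes the pipeline return non-zero if any stage (download,
# decompression, or HDFS put) fails.
set -euo pipefail

# Build the dump name for a year and month, e.g. "RC_2011-08".
# The 10# prefix forces base 10 so that "08" is not parsed as octal.
# (dump_name is a hypothetical helper.)
dump_name() {
  printf 'RC_%04d-%02d' "$((10#$1))" "$((10#$2))"
}

# Stream one monthly dump into HDFS without storing it locally.
# (upload_month is a hypothetical helper; the pipeline itself is the
# one from the command above.)
upload_month() {
  local name
  name="$(dump_name "$1" "$2")"
  wget -qO- "https://files.pushshift.io/reddit/comments/${name}.zst" \
    | zstd --long=31 -d --stdout - \
    | docker exec -i hadoop-namenode hdfs dfs -put - "/reddit/${name}.json"
}
```

With these functions defined, a loop such as `for m in 8 9 10; do upload_month 2011 "$m"; done` would upload three consecutive months one after another.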

