A Spark Java program that extracts HTML content from WARC archives for the estnltk project (https://github.com/peepkungas/estnltk-openstack-spark).
Contains two different Spark jobs:
- A standalone batch processing job: org.ut.scicloud.estnltk.SparkWarcReader
- A Spark Streaming job: org.ut.scicloud.estnltk.SparkWarcStreamingReader
Output is in the Hadoop TextFile format and consists of the following key/value pairs:
- Text key: protocol + "::" + hostname + "::" + urlpath + "::" + parameters + "::" + date ("yyyymmddhhmmss")
- Text value: the raw HTML string
To make the output compatible with text-based Python Spark streaming, all line breaks are removed from the HTML so that each record is a single-line string.
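As an illustration, a record in this format could be built roughly as in the sketch below. This is a minimal sketch only: the helper and its parameter names are hypothetical, and the actual extraction logic in SparkWarcReader may differ.

import org.apache.hadoop.io.Text;

public class KeyValueSketch {
    // Hypothetical helper: builds a key/value pair in the format described above.
    static Text[] toKeyValue(String protocol, String hostname, String urlpath,
                             String parameters, String date, String rawHtml) {
        Text key = new Text(protocol + "::" + hostname + "::" + urlpath
                + "::" + parameters + "::" + date);
        // Collapse all line breaks so the HTML becomes a single-line string.
        Text value = new Text(rawHtml.replaceAll("[\\r\\n]+", " "));
        return new Text[] { key, value };
    }
}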
Command to test the streaming job locally:
spark-submit --master local[4] \
  --jars lib/warcutils-1.2.jar,lib/jwat-archive-common-1.0.5.jar,lib/jwat-common-1.0.5.jar,lib/jwat-warc-1.0.5.jar \
  --class org.ut.scicloud.estnltk.SparkWarcStreamingReader \
  warcToSpark.jar input_folder output_folder
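The standalone batch job can presumably be launched the same way, swapping in its class name (the jar list and the input/output arguments are assumed to be identical):

spark-submit --master local[4] \
  --jars lib/warcutils-1.2.jar,lib/jwat-archive-common-1.0.5.jar,lib/jwat-common-1.0.5.jar,lib/jwat-warc-1.0.5.jar \
  --class org.ut.scicloud.estnltk.SparkWarcReader \
  warcToSpark.jar input_folder output_folder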
NB! When working with HDFS, input files have to be moved into the input folder, not copied: Spark Streaming's file source picks up files as soon as they appear in the monitored directory, so an in-progress copy may be read while still incomplete, whereas an HDFS move (rename) is atomic.
To process files that already exist in HDFS, first copy them to a temporary folder and then move them into the input folder of the streaming job:
hadoop dfs -copyFromLocal /mnt/warcs/warc1.warc inputToCopy
hadoop dfs -mv inputToCopy/* input
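For context, the streaming job presumably watches the input folder through Spark Streaming's file source, along the lines of the sketch below. This is an assumption, not the project's actual code: the WarcInputFormat/WarcRecord types are assumed to come from the bundled warcutils and jwat jars, and the 30-second batch interval and the print() action are placeholders.

import org.apache.hadoop.io.LongWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.jwat.warc.WarcRecord;
import nl.surfsara.warcutils.WarcInputFormat;

public class StreamingSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("WarcStreamingSketch");
        // Poll the input directory for newly appearing files every 30 seconds.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(30));

        // The file source only picks up files that appear atomically in the
        // monitored directory, which is why input files must be moved, not copied.
        JavaPairInputDStream<LongWritable, WarcRecord> warcs = ssc.fileStream(
                args[0], LongWritable.class, WarcRecord.class, WarcInputFormat.class);

        warcs.print();
        ssc.start();
        ssc.awaitTermination();
    }
}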