# WarcToSpark
Spark Java Program to extract HTML contents from WARC archives for the estnltk project (https://github.com/peepkungas/estnltk-openstack-spark).
Contains two different Spark jobs:
1. A standalone batch processing job: org.ut.scicloud.estnltk.SparkWarcReader (see the example run after this list)
2. A Spark Streaming job: org.ut.scicloud.estnltk.SparkWarcStreamingReader
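
For reference, a local batch run would presumably look much like the streaming command further down. This is only a sketch: it assumes SparkWarcReader takes the same input_folder/output_folder arguments and needs the same helper jars; adjust as required.

```
spark-submit --master local[4] \
--jars lib/warcutils-1.2.jar,lib/jwat-archive-common-1.0.5.jar,lib/jwat-common-1.0.5.jar,lib/jwat-warc-1.0.5.jar \
--class org.ut.scicloud.estnltk.SparkWarcReader \
warcToSpark.jar input_folder output_folder
```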
Output is in the Hadoop SequenceFile format and contains the following values:
- Text key: protocol + "::" + hostname + "::" + urlpath + "::" + parameters + "::" + date ("yyyymmddhhmmss")
- Text val: raw HTML string
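
A minimal sketch of reading that output back into Spark from Java, assuming the key and value were both written as org.apache.hadoop.io.Text and that output_folder stands in for the path produced by one of the jobs above:

```java
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ReadWarcOutput {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ReadWarcOutput").setMaster("local[4]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Key: protocol::hostname::urlpath::parameters::date, value: raw HTML.
        JavaPairRDD<Text, Text> records =
                sc.sequenceFile("output_folder", Text.class, Text.class);

        // Hadoop Writables are reused between records, so convert them to plain
        // Strings before caching or collecting.
        JavaPairRDD<String, String> pages =
                records.mapToPair(r -> new Tuple2<>(r._1().toString(), r._2().toString()));

        for (Tuple2<String, String> page : pages.take(5)) {
            System.out.println(page._1());
        }
        sc.close();
    }
}
```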
To make the output compatible with Python-based Spark Streaming, a custom key/value separator has to be used, and it must not appear in any of the HTML documents. \
The concrete separator is not stated in this file; defining a custom Key & Value separator that does not conflict with any actual string in the extracted HTML remains a TODO. A sketch of how such a separator could be applied follows below.
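
Since the actual separator is not given here, the following is only a minimal sketch of how such a separator could be applied and parsed; the SEPARATOR value is a hypothetical placeholder, not the project's real one:

```java
import java.util.regex.Pattern;

public final class KeyValueSeparator {
    // Hypothetical placeholder: the real separator used by the project is not
    // specified in this README.
    public static final String SEPARATOR = "<<::>>";

    // Join key and HTML into a single text record that a Python-based Spark
    // Streaming job can later split back into (key, value).
    public static String join(String key, String rawHtml) {
        return key + SEPARATOR + rawHtml;
    }

    // Split only on the first occurrence of the separator, in case the HTML
    // happens to contain a similar-looking substring further on.
    public static String[] split(String record) {
        return record.split(Pattern.quote(SEPARATOR), 2);
    }
}
```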
Command to test the streaming job locally:

```
spark-submit --master local[4] \
--jars lib/warcutils-1.2.jar,\
lib/jwat-archive-common-1.0.5.jar,\
lib/jwat-common-1.0.5.jar,\
lib/jwat-warc-1.0.5.jar \
--class org.ut.scicloud.estnltk.SparkWarcStreamingReader \
warcToSpark.jar input_folder output_folder
```
NB! When working with HDFS, input files have to be moved, not copied! Spark Streaming's file source only picks up files that appear in the input folder atomically, and a move (rename) is atomic while a copy is not. \
Suggested behaviour is to first copy the input files to a temporary folder and then move them to the input folder of the streaming job:
```
hadoop dfs -copyFromLocal /mnt/warcs/warc1.warc inputToCopy
hadoop dfs -mv inputToCopy/* input
```