forked from commoncrawl/commoncrawl-examples
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README-Amazon-AMI
68 lines (38 loc) · 1.78 KB
/
README-Amazon-AMI
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
Common Crawl Quick Start Amazon AMI
-----------------------------------
Welcome to the Common Crawl Quick Start Amazon AMI!
The Common Crawl corpus is a copy of billions of web documents and their
metadata, stored as an Amazon S3 Public Dataset and available for analysis.
Here are the steps you need to follow to run your first job against the
Common Crawl corpus:
1. Find your Amazon Access Credentials (Amazon Access ID & Amazon Secret Key)
and save as two lines in this file:
/home/ec2-user/.awssecret
For example:
JLASKHJFLKDHJLFKSJDF
DFHSDJHhhoiaGKHDFa6sd42rwuhfapgfuAGSDAjh
Change the permissions of this file to read/write only by 'ec2-user':
chmod 600 /home/ec2-user/.awssecret
Now you can use Tim Kay's AWS Command Line tool. Try this:
aws ls -1 aws-publicdatasets/common-crawl/parse-output/segment/1341690167474/metadata-
If you are planning on using the local Hadoop cluster, you should also consider
setting these properties in /etc/hadoop/hadoop-site.xml:
fs.s3n.awsAccessKeyId
fs.s3n.awsSecretAccessKey
2. Move to the 'commoncrawl-examples' directory. Make sure it is up-to-date:
cd ~/commoncrawl-examples; git pull
3. Compile the latest example code:
ant
4. Run an example! Decide whether you want to run an example on the small local
Hadoop instance on on Amazon Elastic MapReduce.
Run this command to see your options:
bin/ccRunExample
then go ahead and run an example:
bin/ccRunExample LocalHadoop ExampleMetadataDomainPageCount
then look at the code:
nano src/java/org/commoncrawl/examples/ExampleMetadataDomainPageCount.java
Note: You need to have your own Amazon S3 bucket to run Amazon Elastic
MapReduce jobs.
-----------------------------------
You can read all of this again in $HOME/commoncrawl-examples/README-Amazon-AMI.
Have fun!