Add support for filecrushing on Elastic MapReduce #2

alexanderdean · 2013-04-07T10:26:20Z

Work-in-progress PR - do not pull yet

Hi @edwardcapriolo - this is an open pull request to add support for using filecrush on EMR.

There are three main things to fix:

1. Instantiating the right type of FileSystem
2. Fix the location of tmpDir - I think we should be referencing "${hadoop.tmp.dir}" rather than raw new Path("tmp/crush-" + UUID.randomUUID());
3. Replacing the fs.makeQualified(dir).toUri().getPath() pattern with something that doesn't strip important S3 bucket information

#1 is done, see PR. #2 is doable. #3 is a bit harder - I am working through this for EMR, but might need some help from you to make sure my changes don't break filecrush on standard HDFS.
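To illustrate the third point, here is a minimal sketch of why taking only the path component loses the bucket. It uses plain java.net.URI (the type that Hadoop's Path.toUri() returns) so that it runs without Hadoop on the classpath. The class name, method names and the example-bucket location are made up for illustration and are not part of filecrush.

```java
import java.net.URI;

// Hypothetical demo class, not filecrush code.
public class QualifiedPathDemo {

    // Mirrors the lossy pattern: keeps only the path component of the URI,
    // dropping the scheme and the authority (for S3, the bucket name).
    static String pathOnly(String location) {
        return URI.create(location).getPath();
    }

    // Returns the authority component: the bucket for an s3:// URI,
    // or host:port for an hdfs:// URI.
    static String authority(String location) {
        return URI.create(location).getAuthority();
    }

    // Keeps the whole URI so that scheme and authority survive.
    static String fullLocation(String location) {
        return URI.create(location).toString();
    }

    public static void main(String[] args) {
        String s3 = "s3://example-bucket/logs/2013-04-07";   // illustrative location
        String hdfs = "hdfs://namenode:8020/logs/2013-04-07"; // illustrative location

        System.out.println(pathOnly(s3));      // /logs/2013-04-07 (bucket is gone)
        System.out.println(authority(s3));     // example-bucket
        System.out.println(fullLocation(s3));  // s3://example-bucket/logs/2013-04-07

        // On HDFS the same path-only string is usually harmless when the
        // namenode is the default filesystem, which may be why the pattern
        // went unnoticed there.
        System.out.println(pathOnly(hdfs));    // /logs/2013-04-07
    }
}
```

On the Hadoop side, one possible direction (an assumption, not necessarily what this PR does) is to keep the qualified Path itself, or its full URI string, and to resolve the filesystem from the path with Path.getFileSystem(Configuration) rather than relying on the default filesystem.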

Hoping this is the start of a collaboration! We're really excited about filecrush here at Snowplow.

edwardcapriolo · 2013-04-07T18:34:58Z

It all looks good so far. Just let me know when you want me to merge.

alexanderdean · 2014-12-11T14:01:39Z

We ended up not using this library. :-) You can merge as-is if you like, or close. I'll delete our fork in a few days.

alexanderdean added 4 commits April 7, 2013 01:55

Changed setFilesystem to work with Amazon EMR/S3 paths as well

d5a85f4

Fixed --input-format and --output-format CLI options

5173736

Version bump

f638179

Fixed the filesystem lookups in the other files

78d2bd7

alexanderdean mentioned this pull request May 9, 2013

Consolidate small files prior to running ETL job snowplow/snowplow#207

Closed
