Add support for filecrushing on Elastic MapReduce #2

Open - wants to merge 4 commits into base: master

alexanderdean

Work-in-progress PR - do not pull yet

Hi @edwardcapriolo - this is an open pull request to add support for using filecrush on EMR.

There are three main things to fix:

  1. Instantiating the right type of FileSystem
  2. Fixing the location of tmpDir - I think we should reference "${hadoop.tmp.dir}" rather than a raw new Path("tmp/crush-" + UUID.randomUUID());
  3. Replacing the fs.makeQualified(dir).toUri().getPath() pattern with something that doesn't strip important S3 bucket information

#1 is done, see PR. #2 is doable. #3 is a bit harder - I am working through this for EMR, but might need some help from you to make sure my changes don't break filecrush on standard HDFS.
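
To illustrate the third point: Hadoop's Path.toUri() returns a java.net.URI, so the bucket loss can be reproduced with the JDK alone. The sketch below is not filecrush code; the class name, the helper names and the bucket/host names are made up for illustration. It only shows that getPath() returns the path component and drops the scheme and authority, which for an S3-style URI is where the bucket lives.

```java
import java.net.URI;

public class S3PathDemo {

    // Mirrors the toUri().getPath() pattern: keeps only the path component,
    // discarding the scheme and the authority (the bucket, for S3-style URIs).
    static String pathOnly(String location) {
        return URI.create(location).getPath();
    }

    // Keeps the whole URI, so scheme and bucket survive.
    static String fullLocation(String location) {
        return URI.create(location).toString();
    }

    public static void main(String[] args) {
        // Hypothetical locations, chosen only for this example.
        String onS3 = "s3n://my-bucket/logs/2013";
        String onHdfs = "hdfs://namenode:8020/logs/2013";

        System.out.println(pathOnly(onS3));      // /logs/2013 (bucket is gone)
        System.out.println(pathOnly(onHdfs));    // /logs/2013 (same string as the S3 case)
        System.out.println(fullLocation(onS3));  // s3n://my-bucket/logs/2013
    }
}
```

On a single HDFS cluster the bare path is usually still resolvable against the default filesystem, which is probably why the pattern works there; on S3 the bare path no longer says which bucket it came from.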

Hoping this is the start of a collaboration! We're really excited about filecrush here at Snowplow.

@edwardcapriolo
Owner

It all looks good so far. Just let me know when you want me to merge.

@alexanderdean
Author

We ended up not using this library in the end. :-) You can merge as-is if you like, or close. I'll delete our fork in a few days.
