Use the CombineFileInputFormat to avoid too many mappers #47
base: develop
Conversation
Do you ever take advantage of the consolidate functions on your pails? I personally never ran into an issue with too many small files, because I always make sure my master pails are consolidated before I run my Hadoop jobs on them.
I tried to use it, but I did not have access to the hardcoded /tmp directory. I see there is another PR to fix that problem, though. Can you explain a bit more how that works? Does the data remain partitioned as it is in the master directory? Is the master directory replaced?
The data remains partitioned as designed. files in each sub pail with the
// This is straight up copy of the hadoop file, so that we can use extend from it without having to
Whoa, copying and pasting big files like this is hugely frowned upon. It's not a good way to handle API changes in Hadoop: if they fix a bug, this code would never see it.
You can see my comment about this on a previous change. I think it would be much better to upgrade the Hadoop library, but I'm not sure how else to fix it. I'm also not sure what that upgrade would mean for other users of this library.
Change the SequenceFilePailInputFormat to use the CombineFileInputFormat. This should reduce the number of input splits for Pail sources. In my tests, several thousand splits were reduced to one.
There is an issue with this change: it will not work with Hadoop 2.0.5-alpha, which is the version of Hadoop that I have deployed. The reason is that the implementation of CombineFileInputFormat in that version does not call listStatus(JobConf conf) from the mapred package to get the list of files; instead it calls listStatus(JobContext conf) from the mapreduce package.
I fixed this by pulling in CombineFileInputFormat to avoid version conflicts.
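To illustrate why this reduces mapper count so dramatically, here is a simplified, self-contained sketch of the packing that CombineFileInputFormat performs: many small files are greedily grouped into splits capped at a maximum split size, instead of one split per file. (The real Hadoop class also honors node and rack locality and min-split settings; the file sizes and the `pack` helper below are hypothetical, for illustration only.)

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombinePackingSketch {
    // Greedily pack file sizes (in bytes) into splits no larger than
    // maxSplitSize. A plain FileInputFormat would instead produce at
    // least one split (and so one mapper) per file.
    static List<List<Long>> pack(long[] fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            if (currentBytes + size > maxSplitSize && !current.isEmpty()) {
                splits.add(current);        // close the full split
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) splits.add(current);
        return splits;
    }

    public static void main(String[] args) {
        // 1000 small files of 1 KB each: one-split-per-file would mean
        // 1000 mappers, most of which would finish almost instantly.
        long[] sizes = new long[1000];
        Arrays.fill(sizes, 1024L);
        // With a 128 MB combined split size, all ~1 MB of input fits in
        // a single split, so the job runs a single mapper.
        List<List<Long>> splits = pack(sizes, 128L * 1024 * 1024);
        System.out.println("splits: " + splits.size()); // prints "splits: 1"
    }
}
```

This mirrors the "several thousand splits reduced to one" result described above: when the total size of the small files is below the configured maximum split size, they all collapse into a single split.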