Use the CombineFileInputFormat to avoid too many mappers #47
base: develop
Conversation
Do you ever take advantage of the consolidate functions on your pails? I personally never ran into an issue with too many small files, because I always make sure my master pails are consolidated before I run my Hadoop jobs on them.
I tried to use it, but I did not have access to the hardcoded /tmp directory. I see there is another PR to fix that problem, though. Can you explain a bit more how that works? Does the data remain partitioned as it is in the master directory? Is the master directory replaced?
The data remains partitioned as designed. files in each sub pail with the
// This is straight up copy of the hadoop file, so that we can use extend from it without having to
Whoa, copying and pasting big files like this is hugely frowned upon. It's not a good way to handle API changes in Hadoop: if they fix a bug, this code would never see it.
You can see my comment about this on a previous change. I think it would be much better to upgrade the Hadoop library, but I'm not sure how else to fix it. I'm also not sure what that upgrade would mean for other users of this library.
Change the SequenceFilePailInputFormat to use the CombineFileInputFormat. This should reduce the number of input splits for Pail sources. In my tests, several thousand splits were reduced to one.
There is an issue with this change: it will not work with Hadoop 2.0.5-alpha, which is the version of Hadoop that I have deployed. The reason is that the implementation of CombineFileInputFormat in that version does not call listStatus(JobConf conf) from the mapred package to get the list of files; instead it calls listStatus(JobContext conf) from the mapreduce package.
I fixed this by pulling in CombineFileInputFormat to avoid version conflicts.
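To illustrate why this reduces mapper count so dramatically, here is a simplified, self-contained sketch of the packing that CombineFileInputFormat performs: many small files are greedily grouped into splits capped at a maximum split size, instead of one split per file. (The real Hadoop class also honors node and rack locality and min-split settings; the file sizes and the `pack` helper below are hypothetical, for illustration only.)

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombinePackingSketch {
    // Greedily pack file sizes (in bytes) into splits no larger than
    // maxSplitSize. A plain FileInputFormat would instead produce at
    // least one split (and so one mapper) per file.
    static List<List<Long>> pack(long[] fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            if (currentBytes + size > maxSplitSize && !current.isEmpty()) {
                splits.add(current);        // close the full split
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) splits.add(current);
        return splits;
    }

    public static void main(String[] args) {
        // 1000 small files of 1 KB each: one-split-per-file would mean
        // 1000 mappers, most of which would finish almost instantly.
        long[] sizes = new long[1000];
        Arrays.fill(sizes, 1024L);
        // With a 128 MB combined split size, all ~1 MB of input fits in
        // a single split, so the job runs a single mapper.
        List<List<Long>> splits = pack(sizes, 128L * 1024 * 1024);
        System.out.println("splits: " + splits.size()); // prints "splits: 1"
    }
}
```

This mirrors the "several thousand splits reduced to one" result described above: when the total size of the small files is below the configured maximum split size, they all collapse into a single split.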