One scalability problem we saw when processing huge data sets is that we need a large number of reduce splits, which makes the memory overhead of the shuffle writers the bottleneck, as stated in #685. The more reduce splits we have, the more stream objects the shuffle writers hold, and the memory consumed by the internal buffers of those stream objects, together with their file descriptors, becomes the bottleneck. Also, as stated in https://spark-project.atlassian.net/browse/SPARK-751, a large number of small blocks makes performance much worse.
The essential reason we need to break the data into so many pieces is that, for reduce-side combining, all the data of one partition has to be put into a hash map, and this map is held in memory for the whole process. In this patch, we compress the hash map used for combining. The compression ratio for our production data is around 30x, so enabling compression in reduce-side combining can significantly reduce the memory footprint and thus the number of reduce splits needed.
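To illustrate the idea (a rough Python sketch, not the code in this patch), a combine map can buffer a few values per key and store the rest as compressed chunks, decompressing only when the groups are read back. The class name `CompressedCombineMap`, the `flush_threshold` parameter, and the choice of `pickle` plus `zlib` are all assumptions made for illustration.

```python
import pickle
import zlib
from collections import defaultdict


class CompressedCombineMap:
    """Hypothetical reduce-side combine map that keeps grouped values compressed."""

    def __init__(self, flush_threshold=1024):
        # Number of values buffered per key before they are compressed (assumed knob).
        self.flush_threshold = flush_threshold
        self._pending = defaultdict(list)   # key -> recent, uncompressed values
        self._chunks = defaultdict(list)    # key -> list of compressed byte chunks

    def insert(self, key, value):
        buf = self._pending[key]
        buf.append(value)
        if len(buf) >= self.flush_threshold:
            self._flush(key)

    def _flush(self, key):
        # Serialize and compress the buffered values, then drop the raw buffer.
        buf = self._pending.pop(key, [])
        if buf:
            self._chunks[key].append(zlib.compress(pickle.dumps(buf)))

    def _flush_all(self):
        for key in list(self._pending):
            self._flush(key)

    def compressed_size(self):
        """Total bytes held in compressed chunks after flushing all buffers."""
        self._flush_all()
        return sum(len(c) for chunks in self._chunks.values() for c in chunks)

    def items(self):
        """Yield (key, values) pairs, decompressing one key at a time."""
        self._flush_all()
        for key, chunks in self._chunks.items():
            values = []
            for chunk in chunks:
                values.extend(pickle.loads(zlib.decompress(chunk)))
            yield key, values


if __name__ == "__main__":
    m = CompressedCombineMap(flush_threshold=256)
    for i in range(10_000):
        m.insert(i % 10, "some repetitive record payload")
    raw = sum(len(v) for _, vs in m.items() for v in vs)
    print("raw bytes:", raw, "compressed bytes:", m.compressed_size())
```

The ratio this prints depends entirely on how repetitive the values are; it says nothing about the 30x figure reported above, which comes from production data.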
We tested the overhead of compression:
540M of data on a 4-node cluster with 1GB of RAM per node; the test job is a simple groupBy followed by (s => s + " "). It took 399s without compression and 414s with compression, so the overhead is only about 15s, or roughly 3.8%.
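The relative overhead follows directly from the two timings above; a small check, with variable names that are only illustrative:

```python
baseline_s = 399      # run time without compression, in seconds
compressed_s = 414    # run time with compression, in seconds

# Extra time spent, relative to the uncompressed run.
overhead = (compressed_s - baseline_s) / baseline_s
print(f"{overhead:.1%}")  # prints 3.8%
```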