Is your feature request related to a problem? Please describe.
multi-get_json_object currently sorts the paths before chunking them up. This works fairly well in reducing thread divergence in the kernels, but it is not perfect.
The current kernel processes one input row per warp. Each warp can then handle up to 32 different threads/JSON paths, but because of memory limitations we often cannot use all 32. #2247 tracks reducing that memory usage, and if we reduce it enough we can significantly improve performance for some jobs. But in my tests, once large numbers of paths run in a single warp, performance with sorting can be worse than the previous approach of hashing the paths.
I think we can fix all of this with some simple changes. Effectively, the added computation cost of putting a path in a warp is determined by where that path diverges from the other paths in the warp. Divergence can happen at each level of processing a path, so in general we want to cluster the paths by prefix, which is why sorting works. There appears to be little if any added cost in running a diverging thread in the same warp vs. another warp, but there can be a big win if we don't arbitrarily split two paths that are very close to each other.
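To illustrate the weakness of sort-then-chunk (paths and chunk size here are made up for illustration): a fixed-size chunk boundary can land in the middle of a prefix cluster, splitting paths that share a prefix across warps even when a different grouping would keep them together.

```python
# Made-up JSON paths; sorting clusters shared prefixes together...
paths = sorted(["$.a.b", "$.a.c", "$.b.a", "$.b.b", "$.b.c", "$.c.a"])

# ...but chunking into fixed-size groups (stand-in for the per-warp
# path budget) can still cut a "$.b" cluster in half.
chunk_size = 4
chunks = [paths[i:i + chunk_size] for i in range(0, len(paths), chunk_size)]
# The "$.b.*" paths end up split across two chunks/warps.
```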
The idea would be to create a tree of paths. We can then find a sub-tree with enough leaf nodes to fit within our per-warp budget. This is essentially a bin-packing problem, so we can be a bit greedy: start with the largest sub-trees that fit, then fill the remaining budget with smaller branches.
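A rough sketch of the tree-based packing described above, assuming paths are pre-tokenized into tuples and that `budget` stands in for however many paths a warp can afford (the limit #2247 is about). This is an illustration of the idea, not the actual kernel-side implementation:

```python
LEAF = None  # sentinel key marking a complete path ending at a trie node

def build_trie(paths):
    """Build a nested-dict trie over path tokens."""
    root = {}
    for p in paths:
        node = root
        for tok in p:
            node = node.setdefault(tok, {})
        node[LEAF] = p  # remember the full path at its end node
    return root

def leaves(node):
    """All complete paths contained in this sub-tree."""
    out = []
    for key, child in node.items():
        out.append(child) if key is LEAF else out.extend(leaves(child))
    return out

def maximal_units(node, budget):
    """Split the trie into maximal sub-trees whose leaf count fits the budget.
    Keeping each unit whole keeps prefix-sharing paths in the same warp."""
    paths = leaves(node)
    if len(paths) <= budget:
        return [paths]
    units = []
    for key, child in node.items():
        if key is LEAF:
            units.append([child])
        else:
            units.extend(maximal_units(child, budget))
    return units

def pack_warps(paths, budget):
    """Greedy first-fit-decreasing: largest prefix clusters first,
    then fill remaining warp budget with smaller branches."""
    units = sorted(maximal_units(build_trie(paths), budget),
                   key=len, reverse=True)
    warps = []
    for unit in units:
        for warp in warps:
            if len(warp) + len(unit) <= budget:
                warp.extend(unit)
                break
        else:
            warps.append(list(unit))
    return warps
```

With a budget of 4 and the paths `("a","b","c")`, `("a","b","d")`, `("a","e")`, `("x","y")`, `("x","z")`, `("q",)`, the three `a`-prefixed paths stay together in one warp (with `("q",)` filling the leftover slot) and the two `x`-prefixed paths land together in another, instead of a sorted chunk boundary splitting a cluster.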