
Split output batches of joins that do not respect batch size #12969

Merged
merged 11 commits on Oct 18, 2024

Conversation

alihan-synnada (Contributor)

Which issue does this PR close?

Closes #12633

Rationale for this change

A join operation chain can create a RecordBatch whose size is thousands or even millions of rows.

What changes are included in this PR?

Adds a new config called enforce_batch_size_in_joins that is disabled by default. Enabling the config restricts the maximum output batch size of join operators to batch_size. #12634 is similar but it splits the join indices and then builds the output batches, which causes performance issues. This PR splits the output batches after the join is processed.

Improves adjust_indices_by_join_type performance by optimizing PrimitiveArray concatenation using MutableArrayData
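The splitting approach described above can be sketched in plain Rust. This is a std-only analogy, not the actual DataFusion implementation: `split_batch` is a hypothetical name, and a `Vec` of rows stands in for Arrow's `RecordBatch` (whose `slice` method is zero-copy, so the real splitter is cheaper than this copy-based sketch suggests).

```rust
// Std-only sketch: cap each output chunk at `batch_size` rows, so a join that
// produced one huge batch instead yields several batches of bounded size.
fn split_batch<T: Clone>(batch: &[T], batch_size: usize) -> Vec<Vec<T>> {
    batch
        .chunks(batch_size)      // at most `batch_size` rows per chunk
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    let rows: Vec<u32> = (0..10).collect();
    let parts = split_batch(&rows, 4);
    // 10 rows with batch_size = 4 -> chunks of 4, 4, 2
    assert_eq!(parts.len(), 3);
    assert_eq!(parts[0], vec![0, 1, 2, 3]);
    assert_eq!(parts[2], vec![8, 9]);
    println!("split into {} parts", parts.len());
}
```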

Are these changes tested?

Includes unit tests for BatchSplitter

Are there any user-facing changes?

Users can optionally enable enforce_batch_size_in_joins in cases where joins cause out-of-memory errors. No breaking changes.
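Usage would presumably look like the following; the exact key prefix is an assumption based on DataFusion's usual `datafusion.execution.*` config namespace.

```sql
-- Hypothetical: enable bounded join output batches for this session
SET datafusion.execution.enforce_batch_size_in_joins = true;
```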

@github-actions github-actions bot added documentation Improvements or additions to documentation physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) common Related to common crate execution Related to the execution crate labels Oct 16, 2024
@alamb (Contributor) commented Oct 16, 2024

fyi @mhilton

@alamb (Contributor) commented Oct 16, 2024

I will review this later today

@ozankabak (Contributor) left a comment:

I reviewed this carefully and it looks good to me.

This doesn't cause any performance issues but solves the "growing batch sizes" problem in plans with cascaded joins

.iter()
.chain(right_unmatched_indices.iter())
.collect();
let mut new_right_indices = MutableArrayData::new(
Contributor:

I think we could use Vec<u64> and Vec<u32> instead of MutableArrayData

Contributor (Author):

I tried the following code but it's about 200% slower. Is there a more optimized way to use Vec in this context?

// the new left indices: left_indices + null array
let mut new_left_indices = Vec::with_capacity(left_size + unmatched_size);
new_left_indices.extend(left_indices.values().iter().map(|v| Some(*v)));
new_left_indices.append(&mut vec![None; unmatched_size]);
let new_left_indices = UInt64Array::from(new_left_indices);

// the new right indices: right_indices + right_unmatched_indices
let mut new_right_indices = Vec::with_capacity(right_size + unmatched_size);
new_right_indices.extend(right_indices.values().iter());
new_right_indices.extend(right_unmatched_indices.values().iter());
let new_right_indices = UInt32Array::from(new_right_indices);

Contributor:

Hm you're right, sorry I missed that it was producing nulls, in that case this is not very efficient indeed.

Seems though we can just use https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveBuilder.html then instead?
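The builder approach being suggested can be illustrated with a std-only analogy. `U64Builder` and its methods are hypothetical stand-ins for Arrow's `PrimitiveBuilder` API: values are appended in bulk (a memcpy) and validity is tracked separately, which avoids the per-element `Option<T>` handling that made the `Vec<Option<u64>>` version slow.

```rust
// Hypothetical std-only analogy of an Arrow primitive builder:
// a values buffer plus a separate validity buffer (real Arrow
// packs validity into a bitmap).
struct U64Builder {
    values: Vec<u64>,
    validity: Vec<bool>,
}

impl U64Builder {
    fn with_capacity(cap: usize) -> Self {
        Self {
            values: Vec::with_capacity(cap),
            validity: Vec::with_capacity(cap),
        }
    }

    // Bulk append of valid values: one extend_from_slice, no per-element Option.
    fn append_slice(&mut self, vals: &[u64]) {
        self.values.extend_from_slice(vals);
        self.validity.extend(std::iter::repeat(true).take(vals.len()));
    }

    // Nulls only touch the validity buffer (placeholder zeros in values).
    fn append_nulls(&mut self, n: usize) {
        self.values.extend(std::iter::repeat(0).take(n));
        self.validity.extend(std::iter::repeat(false).take(n));
    }
}

fn main() {
    // New left indices: left_indices followed by nulls, as in the join adjustment.
    let left_indices = [3u64, 1, 4];
    let mut b = U64Builder::with_capacity(5);
    b.append_slice(&left_indices);
    b.append_nulls(2);
    assert_eq!(b.values, vec![3, 1, 4, 0, 0]);
    assert_eq!(b.validity, vec![true, true, true, false, false]);
}
```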

Contributor (Author):

Thanks for the suggestion. I ended up using into_builder(), it's incredibly fast now.

Contributor:

Good news: It runs much faster with into_builder.
Bad news: It seems like we actually exercise the internal_err (i.e. ref count > 2 to the underlying buffer) in some cases. We will debug and finalize (hopefully) tomorrow.

Contributor:

Side note: I think we might be even better off in future to pass the (owned) primitive builders to the different methods instead of trying to convert the arrays to builders again, this probably saves some copies in other places as well.

@mhilton (Contributor) commented Oct 17, 2024

Unfortunately this doesn't address the actual problem with creating giant batches, which is that they require a lot of memory, and that memory isn't accounted for in any MemoryPool. Wiring a MemoryReservation into BatchSplitter would probably be enough to address this though.

@ozankabak (Contributor):

> Wiring a MemoryReservation into BatchSplitter would probably be enough to address this though.

Thanks for bringing this to our attention. We will add this in a quick follow-on PR.
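A minimal std-only sketch of what such accounting could look like. `MemoryPool` here is a hypothetical stand-in, not DataFusion's actual `MemoryReservation` API: the splitter would try to reserve the bytes of each output chunk before materializing it, and free them once the chunk is consumed.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical shared pool: reservations fail once the byte limit is reached,
// instead of allocating unaccounted memory.
struct MemoryPool {
    limit: usize,
    used: AtomicUsize,
}

impl MemoryPool {
    fn new(limit: usize) -> Self {
        Self { limit, used: AtomicUsize::new(0) }
    }

    fn try_reserve(&self, bytes: usize) -> Result<(), String> {
        let prev = self.used.fetch_add(bytes, Ordering::SeqCst);
        if prev + bytes > self.limit {
            // Roll back the optimistic reservation and report failure.
            self.used.fetch_sub(bytes, Ordering::SeqCst);
            Err(format!("cannot reserve {bytes} bytes"))
        } else {
            Ok(())
        }
    }

    fn free(&self, bytes: usize) {
        self.used.fetch_sub(bytes, Ordering::SeqCst);
    }
}

fn main() {
    let pool = MemoryPool::new(1024);
    assert!(pool.try_reserve(800).is_ok());
    assert!(pool.try_reserve(400).is_err()); // would exceed the limit
    pool.free(800);
    assert!(pool.try_reserve(400).is_ok()); // fits after freeing
}
```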

@andygrove mentioned this pull request Oct 17, 2024
@alamb (Contributor) left a comment:

This is interesting -- thank you @alihan-synnada and @ozankabak

I wonder if you have considered updating the join algorithms themselves to incrementally produce output (rather than generating one large RecordBatch and then slicing it up)?

We found in the GroupBy that the slicing requires non-trivial time -- see #9562 and the POC by @Rachelint in #11943

datafusion/common/src/config.rs (resolved review thread)
indices_cache,
right_side_ordered,
state: NestedLoopJoinStreamState::WaitBuildSide,
batch_transformer: BatchSplitter::new(batch_size),
Contributor:

this is a clever idea to parameterize the joins stream on the transformer 👍

datafusion/physical-plan/src/joins/utils.rs (resolved review thread)
null_equals_null: self.null_equals_null,
state: SHJStreamState::PullRight,
reservation,
batch_transformer: BatchSplitter::new(batch_size),
Contributor:

Given that the overhead of a BatchTransformer is likely small (one function call per output batch), I suggest trying a dyn trait object here and in the other joins instead (e.g. batch_transformer: Box<dyn BatchTransformer>).

I suspect it would not make any noticeable performance difference
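The suggested trait-object pattern can be sketched with hypothetical std-only types (the real trait and join-stream field in DataFusion look different): the stream holds a `Box<dyn BatchTransformer>` field instead of being generic over the transformer, trading one dynamic dispatch per output batch for less monomorphization.

```rust
// Hypothetical trait: a Vec<u32> stands in for a RecordBatch.
trait BatchTransformer {
    fn transform(&mut self, batch: Vec<u32>) -> Vec<Vec<u32>>;
}

// Pass-through transformer for joins that don't need splitting.
struct NoopTransformer;
impl BatchTransformer for NoopTransformer {
    fn transform(&mut self, batch: Vec<u32>) -> Vec<Vec<u32>> {
        vec![batch]
    }
}

// Splits oversized batches into chunks of at most `batch_size` rows.
struct Splitter {
    batch_size: usize,
}
impl BatchTransformer for Splitter {
    fn transform(&mut self, batch: Vec<u32>) -> Vec<Vec<u32>> {
        batch.chunks(self.batch_size).map(|c| c.to_vec()).collect()
    }
}

fn main() {
    // The join stream would hold this field instead of a generic parameter,
    // choosing the implementation at construction time from the config.
    let mut t: Box<dyn BatchTransformer> = Box::new(Splitter { batch_size: 2 });
    assert_eq!(t.transform((0..5).collect()).len(), 3); // chunks of 2, 2, 1

    let mut n: Box<dyn BatchTransformer> = Box::new(NoopTransformer);
    assert_eq!(n.transform(vec![1, 2, 3]).len(), 1);
}
```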

0,
"expected left indices to have no nulls"
);
builder.append_slice(left_indices.values());
Contributor:

I wonder if at this point the left array Arc can be dropped such that the right side can be converted to builder without issue.

@ozankabak (Contributor):

I will go ahead and merge this soon unless someone catches a critical issue.

Here are the things to work on in the near future:

  1. Add MemoryReservation to batch splitting to add rigorous memory accountability.
  2. Exploring if we can pass builders into methods instead of passing arrays around and converting them into builders when possible.
  3. Considering the batch_transformer: Box<dyn BatchTransformer> pattern if it doesn't regress performance

@ozankabak ozankabak merged commit 87e931c into apache:main Oct 18, 2024
25 checks passed
@alamb (Contributor) commented Oct 18, 2024

> I will go ahead and merge this soon unless someone catches a critical issue.

Thank you -- I agree this is better than what was on main

> Here are the things to work on in the near future:
>
>   1. Add MemoryReservation to batch splitting to add rigorous memory accountability.
>   2. Exploring if we can pass builders into methods instead of passing arrays around and converting them into builders when possible.
>   3. Considering the batch_transformer: Box<dyn BatchTransformer> pattern if it doesn't regress performance

I filed #13003 to track adding memory accounting

Successfully merging this pull request may close these issues.

NestedLoopJoinExec can create excessively large record batches
5 participants