Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make into_arrow truly zero-copy, rewrite DataFusion operators #451

Merged
merged 3 commits into from
Jul 12, 2024

Conversation

a10y
Copy link
Contributor

@a10y a10y commented Jul 11, 2024

Before we would go through the ArrowBuffer::from(&[u8]) constructor, which would actually do a full data copy.

Now we're just doing pointer tricks. Shaved about 30ms (~7%) off of the tpc-h benchmark.

Before we would go through the ArrowBuffer::from(&[u8]) constructor, which
would actually do a full data copy.

Now we're just doing pointer tricks. Shaved about 30ms (~7%) off of the tpc-h
benchmark.
data.into(),
nulls,
)),
PType::I32 => Arc::new(unsafe {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how much this actually saves us, once I get to ~parity with arrow numbers I'll turn the validations back on and see what difference it makes

@a10y
Copy link
Contributor Author

a10y commented Jul 12, 2024

I'm using this as a more general improve-the-benchmarks PR, found a few more allocations and rewriting the vortex memory scan exec, should have something ready to ship later this morning

@a10y a10y marked this pull request as ready for review July 12, 2024 17:03
@a10y
Copy link
Contributor Author

a10y commented Jul 12, 2024

Rewrote the datafusion operators to operate over ChunkedArray rather than very large single arrays. All operators still operate as a single partition, but now they emit streams with one element per chunk.

This plus eliminating all of the extra allocations brings us down to parity with DataFusion's builtin Arrow MemTable for TPC-H query 1.

image

Comment on lines +384 to +390
f.debug_struct("VortexScanExec")
.field("array_length", &self.array.len())
.field("array_dtype", &self.array.dtype())
.field("scan_projection", &self.scan_projection)
.field("plan_properties", &self.plan_properties)
.finish_non_exhaustive()
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with custom impl because Array logs the entire buffer contents as part of its Debug impl

Stdlib has a gated unstable feature to log a field using the contents of a closure, so in the future you'd be able to do something like

        f.debug_struct("VortexScanExec")
            .field_with("array_length", |mut f| {
                 f.debug_struct("Array")
                     .field("length", &self.array.len())
                     .field("dtype", self.array.dtype())
                     .finish()
             })

https://doc.rust-lang.org/std/fmt/struct.DebugStruct.html#method.field_with

@a10y a10y changed the title Make into_arrow truly zero-copy for Primitive, VarBin Make into_arrow truly zero-copy, rewrite DataFusion operators Jul 12, 2024
@a10y
Copy link
Contributor Author

a10y commented Jul 12, 2024

The spread between vortex-pushdown and vortex-nopushdown is all of the takeing, particularly for VarBin, which requires rebuilding the whole string table. If we had VarBinView supported in datafusion, this operation would be considerably faster.

image

@a10y a10y merged commit f32dc4d into develop Jul 12, 2024
2 checks passed
@a10y a10y deleted the aduffy/zero-copy branch July 12, 2024 20:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants