Add csv loading benchmarks. #13544
base: main
Conversation
Hi @dhegberg, thank you for your contribution; it's well-written and nicely formatted. We have a dedicated path for operator-specific benchmarks: https://github.com/apache/datafusion/tree/main/datafusion/physical-plan/benches. It seems to me that the measured functionality falls under the scope of those.
I don't have a strong opinion on the location of the benchmarks, so I'm happy to follow recommendations. For my future reference, how do you differentiate this functionality vs. the parquet-related testing in the top-level benchmarks?
I don't want to misguide you. Perhaps @alamb can direct you better for that.
Thank you very much @dhegberg -- I agree with @berkaysynnada that this PR is nicely coded and well commented 🙏 and that it might be better to add it as a more focused "unit test"
A curious result is that the first iteration is consistently 6-7 ms vs. ~3 ms on later iterations. Is a new SessionContext not sufficient to remove any cache in loading?
My suspicion is that the first run pulls the data from storage (e.g. SSD) into the kernel page cache, and then subsequent runs are all in memory (no I/O).
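If that is the cause, one way to control for it is an unmeasured warm-up pass. Here is a minimal sketch, assuming a hand-rolled timing loop; the `bench_csv_load` function, its signature, and the recording logic are hypothetical illustrations, not the PR's actual code:

```rust
use std::time::{Duration, Instant};

use datafusion::error::Result;
use datafusion::prelude::{CsvReadOptions, SessionContext};

/// Hypothetical sketch: run one unmeasured warm-up iteration so the kernel
/// page cache is already populated before any timing is recorded, isolating
/// the in-memory cost from the one-time disk I/O.
async fn bench_csv_load(path: &str, iterations: usize) -> Result<Vec<Duration>> {
    let mut timings = Vec::with_capacity(iterations);
    for iter in 0..=iterations {
        // A fresh SessionContext clears DataFusion-level state,
        // but not the operating system's page cache.
        let ctx = SessionContext::new();
        let start = Instant::now();
        ctx.register_csv("t", path, CsvReadOptions::new()).await?;
        ctx.sql("SELECT * FROM t").await?.collect().await?;
        if iter > 0 {
            // Iteration 0 is the warm-up pass; skip it.
            timings.push(start.elapsed());
        }
    }
    Ok(timings)
}
```

(To measure the cold path instead, the page cache can be dropped between runs on Linux with `sync; echo 3 | sudo tee /proc/sys/vm/drop_caches`.)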
For my future reference, how do you differentiate this functionality vs. the parquet-related testing in the top-level benchmarks?
I don't want to misguide you. Perhaps @alamb can direct you better for that.
In terms of what is in benchmarks: I think it was meant to hold "end to end" benchmarks in the style of tpch or clickbench — a known dataset plus some queries — so we can use the benchmarking framework to drive those queries faster and faster (as well as run the queries independently using datafusion-cli or datafusion-python).
I would recommend moving this benchmark to https://github.com/apache/datafusion/tree/main/datafusion/core/benches, perhaps as csv.rs or datasource.rs.
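For reference, here is a minimal sketch of what a criterion benchmark in that directory could look like; the file contents, data path, and query are illustrative assumptions, not the PR's code:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use datafusion::prelude::{CsvReadOptions, SessionContext};
use tokio::runtime::Runtime;

fn load_csv(c: &mut Criterion) {
    // criterion drives the iteration count; a shared runtime avoids
    // rebuilding the tokio executor on every iteration.
    let rt = Runtime::new().unwrap();
    c.bench_function("load_csv", |b| {
        b.iter(|| {
            rt.block_on(async {
                let ctx = SessionContext::new();
                ctx.register_csv("t", "benches/data/example.csv", CsvReadOptions::new())
                    .await
                    .unwrap();
                ctx.sql("SELECT * FROM t").await.unwrap().collect().await.unwrap();
            })
        })
    });
}

criterion_group!(benches, load_csv);
criterion_main!(benches);
```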
```rust
impl RunOpt {
    pub async fn run(self) -> Result<()> {
        let test_file = self.data.build()?;
        let mut rundata = BenchmarkRun::new();
```
One thing I would like to request is that we split the data generation from the query.
Given this setup, rerunning the benchmarks will likely be dominated by the time it takes to regenerate the input, which will be quite slow.
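A minimal sketch of that split, following the shape of the diff above; the `iterations` and `output_path` fields and the `run_query` helper are assumptions for illustration, with the `BenchmarkRun` calls following the benchmark utilities' usual shape:

```rust
impl RunOpt {
    pub async fn run(self) -> Result<()> {
        // Build (or locate) the input file once, outside the timed loop,
        // so rerunning the benchmark does not regenerate the data.
        let test_file = self.data.build()?;
        let mut rundata = BenchmarkRun::new();
        for _ in 0..self.iterations {
            // Only the load/query itself is timed on each iteration.
            let (elapsed, row_count) = self.run_query(&test_file).await?;
            rundata.write_iter(elapsed, row_count);
        }
        rundata.maybe_write_json(self.output_path.as_ref())?;
        Ok(())
    }
}
```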
Which issue does this PR close?
Related to #12904
Rationale for this change
Requested in comments on https://github.com/apache/datafusion/pull/13228.
Direct testing of loading CSV files was identified as a gap in the benchmarking suite.
What changes are included in this PR?
Basic benchmarks for loading CSV files.
Are these changes tested?
Tested via `./bench.sh run csv`. Logged output results file: `csv.json`.
A curious result is that the first iteration is consistently 6-7 ms vs. ~3 ms on later iterations. Is a new `SessionContext` not sufficient to remove any cache in loading?
Are there any user-facing changes?
No