
Add csv loading benchmarks. #13544

Open · wants to merge 4 commits into main
Conversation

dhegberg
Contributor

Which issue does this PR close?

Related to #12904

Rationale for this change

Requested in comments for https://github.com/apache/datafusion/pull/13228

Direct testing of CSV file loading was identified as a gap in the benchmarking suite.

What changes are included in this PR?

Basic benchmarks for loading CSV files.
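As an illustration of the shape of such a load benchmark, here is a stdlib-only sketch: generate a dataset, then time repeated loads. The file name, row count, and timing loop are illustrative, not the PR's actual code (which drives DataFusion's CSV reader).

```rust
use std::fs;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // Illustrative stand-in for the benchmark's generated test dataset.
    let path = std::env::temp_dir().join("csv_load_sketch.csv");
    let mut csv = String::from("id,value\n");
    for i in 0..10_000 {
        csv.push_str(&format!("{i},{}\n", i * 2));
    }
    fs::write(&path, &csv)?;

    // Time several load iterations, mirroring the bench.sh output format.
    for iteration in 0..5 {
        let start = Instant::now();
        let data = fs::read_to_string(&path)?;
        let rows = data.lines().count() - 1; // exclude the header row
        assert_eq!(rows, 10_000);
        let ms = start.elapsed().as_secs_f64() * 1_000.0;
        println!("Iteration {iteration} finished in {ms} ms.");
    }
    fs::remove_file(&path)
}
```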

Are these changes tested?

Tested via `./bench.sh run csv`

Logged output:

```
Running csv load benchmarks.
Generated test dataset with 10240283 rows
Executing 'CSV Load Speed Test.'
Iteration 0 finished in 7.079167 ms.
Iteration 1 finished in 3.3643750000000003 ms.
Iteration 2 finished in 3.2645 ms.
Iteration 3 finished in 3.311208 ms.
Iteration 4 finished in 3.319 ms.
Done
```

results file:
csv.json

A curious result: the first iteration is consistently 6-7 ms vs. ~3 ms on subsequent iterations. Is a new SessionContext not sufficient to clear any caching involved in loading?

Are there any user-facing changes?

No

@github-actions bot added the `core` (Core DataFusion crate) label on Nov 24, 2024
@berkaysynnada
Contributor

Hi @dhegberg, thank you for your contribution; it's well written and nicely formatted. We have a dedicated path for operator-specific benchmarks: https://github.com/apache/datafusion/tree/main/datafusion/physical-plan/benches. It seems to me that the measured functionality falls under the scope of CsvExec. Do you think it would be better to include these benchmarks there?

@dhegberg
Contributor Author

I don't have a strong opinion on the location of the benchmarks, so I'm happy to follow recommendations.

For my future reference, how do you differentiate this functionality vs. the parquet related testing in the top level benchmarks?

@berkaysynnada
Contributor

> I don't have a strong opinion on the location of the benchmarks, so I'm happy to follow recommendations.
>
> For my future reference, how do you differentiate this functionality vs. the parquet related testing in the top level benchmarks?

I don't want to misguide you. Perhaps @alamb can direct you better for that.

@alamb
Contributor

Thank you very much @dhegberg -- I agree with @berkaysynnada that this PR is nicely coded and well commented 🙏 and that it might be better to add it as a more focused "unit test"

> Curious result is the first iteration is consistently 6-7 ms vs ~3ms on future iterations. Is a new SessionContext not sufficient to remove any cache in loading?

My suspicion is that the first run pulls the data from storage (e.g. SSD) into the kernel page cache, and then subsequent runs are all in memory (no I/O).
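One way to test that suspicion is to warm the cache explicitly with an untimed read before the timed iterations, so every measured run sees the warm case. A stdlib-only sketch; the function name and chunk size are illustrative, not part of the PR:

```rust
use std::fs::{self, File};
use std::io::Read;
use std::path::Path;

/// Read the whole file once so later timed reads hit the OS page cache
/// instead of storage. Returns the number of bytes read.
fn warm_page_cache(path: &Path) -> std::io::Result<u64> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; 1 << 20]; // read in 1 MiB chunks
    let mut total = 0u64;
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        total += n as u64;
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("warm_cache_sketch.csv");
    fs::write(&path, "id,value\n1,2\n")?;
    // Warm once, untimed; timed iterations would follow here.
    let bytes = warm_page_cache(&path)?;
    assert_eq!(bytes, fs::metadata(&path)?.len());
    fs::remove_file(&path)
}
```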

> For my future reference, how do you differentiate this functionality vs. the parquet related testing in the top level benchmarks?

> I don't want to misguide you. Perhaps @alamb can direct you better for that.

In terms of what is in benchmarks I think it was meant to be "end to end" benchmarks in the style of tpch or clickbench: a known dataset, some queries, and then we can use the benchmarking framework to drive those queries faster and faster (as well as run the queries independently using datafusion-cli or datafusion-python)

I would recommend moving this benchmark to https://github.com/apache/datafusion/tree/main/datafusion/core/benches

perhaps csv.rs or datasource.rs

```rust
impl RunOpt {
    pub async fn run(self) -> Result<()> {
        let test_file = self.data.build()?;
        let mut rundata = BenchmarkRun::new();
```

One thing I would like to request is that we split the data generation from the query.

Given this setup, rerunning the benchmarks will likely be dominated by the time it takes to regenerate the input, which will be quite slow.
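One way to make that split is to generate the dataset only when it is absent, so benchmark reruns reuse the existing file. A stdlib-only sketch; `ensure_dataset`, the file name, and the CSV shape are hypothetical, not the PR's API:

```rust
use std::fs;
use std::path::Path;

/// Generate the test CSV only if it does not already exist, so reruns
/// of the benchmark skip the slow data-generation step.
/// Returns true if the file was freshly generated.
fn ensure_dataset(path: &Path, rows: usize) -> std::io::Result<bool> {
    if path.exists() {
        return Ok(false); // reuse data from a previous run
    }
    let mut csv = String::from("id,value\n");
    for i in 0..rows {
        csv.push_str(&format!("{i},{}\n", i * 3));
    }
    fs::write(path, csv)?;
    Ok(true)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("split_gen_sketch.csv");
    let _ = fs::remove_file(&path);
    assert!(ensure_dataset(&path, 100)?); // first run generates
    assert!(!ensure_dataset(&path, 100)?); // rerun reuses the file
    fs::remove_file(&path)
}
```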
