
Add csv loading benchmarks. #13544

Open · wants to merge 4 commits into main
Conversation

dhegberg
Contributor

Which issue does this PR close?

Related to #12904

Rationale for this change

Requested in comments for https://github.com/apache/datafusion/pull/13228

Direct testing of CSV file loading was identified as a gap in the benchmarking suite.

What changes are included in this PR?

Basic benchmarks for loading CSV files.
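As an illustration of the shape of such a load benchmark, here is a stdlib-only sketch: generate a dataset, then time repeated loads. The file name, row count, and timing loop are illustrative, not the PR's actual code (which drives DataFusion's CSV reader).

```rust
use std::fs;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // Illustrative stand-in for the benchmark's generated test dataset.
    let path = std::env::temp_dir().join("csv_load_sketch.csv");
    let mut csv = String::from("id,value\n");
    for i in 0..10_000 {
        csv.push_str(&format!("{i},{}\n", i * 2));
    }
    fs::write(&path, &csv)?;

    // Time several load iterations, mirroring the bench.sh output format.
    for iteration in 0..5 {
        let start = Instant::now();
        let data = fs::read_to_string(&path)?;
        let rows = data.lines().count() - 1; // exclude the header row
        assert_eq!(rows, 10_000);
        let ms = start.elapsed().as_secs_f64() * 1_000.0;
        println!("Iteration {iteration} finished in {ms} ms.");
    }
    fs::remove_file(&path)
}
```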

Are these changes tested?

Tested via `./bench.sh run csv`

Logged output:

```
Running csv load benchmarks.
Generated test dataset with 10240283 rows
Executing 'CSV Load Speed Test.'
Iteration 0 finished in 7.079167 ms.
Iteration 1 finished in 3.3643750000000003 ms.
Iteration 2 finished in 3.2645 ms.
Iteration 3 finished in 3.311208 ms.
Iteration 4 finished in 3.319 ms.
Done
```

results file:
csv.json

A curious result: the first iteration is consistently 6-7 ms vs. ~3 ms on subsequent iterations. Is a new SessionContext not sufficient to clear any caching involved in loading?

Are there any user-facing changes?

No

@github-actions bot added the `core` (Core DataFusion crate) label on Nov 24, 2024
@berkaysynnada
Contributor

Hi @dhegberg, thank you for your contribution; it's well written and nicely formatted. We have a dedicated path for operator-specific benchmarks: https://github.com/apache/datafusion/tree/main/datafusion/physical-plan/benches. It seems to me that the measured functionality falls under the scope of CsvExec. Do you think it would be better to include these benchmarks there?

@dhegberg
Contributor Author

I don't have a strong opinion on the location of the benchmarks, so I'm happy to follow recommendations.

For my future reference, how do you differentiate this functionality vs. the parquet related testing in the top level benchmarks?

@berkaysynnada
Contributor

> I don't have a strong opinion on the location of the benchmarks, so I'm happy to follow recommendations.
>
> For my future reference, how do you differentiate this functionality vs. the parquet related testing in the top level benchmarks?

I don't want to misguide you. Perhaps @alamb can direct you better for that.

@alamb
Contributor

Thank you very much @dhegberg -- I agree with @berkaysynnada that this PR is nicely coded and well commented 🙏 and that it might be better to add it as a more focused "unit test"

> Curious result is the first iteration is consistently 6-7 ms vs ~3ms on future iterations. Is a new SessionContext not sufficient to remove any cache in loading?

My suspicion is that the first run pulls the data from storage (e.g. SSD) into the kernel page cache, and then subsequent runs are all in memory (no I/O).
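One way to test that suspicion is to warm the cache explicitly with an untimed read before the timed iterations, so every measured run sees the warm case. A stdlib-only sketch; the function name and chunk size are illustrative, not part of the PR:

```rust
use std::fs::{self, File};
use std::io::Read;
use std::path::Path;

/// Read the whole file once so later timed reads hit the OS page cache
/// instead of storage. Returns the number of bytes read.
fn warm_page_cache(path: &Path) -> std::io::Result<u64> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; 1 << 20]; // read in 1 MiB chunks
    let mut total = 0u64;
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        total += n as u64;
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("warm_cache_sketch.csv");
    fs::write(&path, "id,value\n1,2\n")?;
    // Warm once, untimed; timed iterations would follow here.
    let bytes = warm_page_cache(&path)?;
    assert_eq!(bytes, fs::metadata(&path)?.len());
    fs::remove_file(&path)
}
```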

> For my future reference, how do you differentiate this functionality vs. the parquet related testing in the top level benchmarks?

> I don't want to misguide you. Perhaps @alamb can direct you better for that.

In terms of what is in benchmarks I think it was meant to be "end to end" benchmarks in the style of tpch or clickbench: a known dataset, some queries, and then we can use the benchmarking framework to drive those queries faster and faster (as well as run the queries independently using datafusion-cli or datafusion-python)

I would recommend moving this benchmark to https://github.com/apache/datafusion/tree/main/datafusion/core/benches

perhaps csv.rs or datasource.rs

```rust
impl RunOpt {
    pub async fn run(self) -> Result<()> {
        let test_file = self.data.build()?;
        let mut rundata = BenchmarkRun::new();
```

One thing I would like to request is that we split the data generation from the query.

Given this setup, rerunning the benchmarks will likely be dominated by the time it takes to regenerate the input, which will be quite slow.
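One way to make that split is to generate the dataset only when it is absent, so benchmark reruns reuse the existing file. A stdlib-only sketch; `ensure_dataset`, the file name, and the CSV shape are hypothetical, not the PR's API:

```rust
use std::fs;
use std::path::Path;

/// Generate the test CSV only if it does not already exist, so reruns
/// of the benchmark skip the slow data-generation step.
/// Returns true if the file was freshly generated.
fn ensure_dataset(path: &Path, rows: usize) -> std::io::Result<bool> {
    if path.exists() {
        return Ok(false); // reuse data from a previous run
    }
    let mut csv = String::from("id,value\n");
    for i in 0..rows {
        csv.push_str(&format!("{i},{}\n", i * 3));
    }
    fs::write(path, csv)?;
    Ok(true)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("split_gen_sketch.csv");
    let _ = fs::remove_file(&path);
    assert!(ensure_dataset(&path, 100)?); // first run generates
    assert!(!ensure_dataset(&path, 100)?); // rerun reuses the file
    fs::remove_file(&path)
}
```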
