Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-44010: [C++] Add arrow::RecordBatch::MakeStatisticsArray() #44252

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

kou
Copy link
Member

@kou kou commented Sep 30, 2024

Rationale for this change

Statistics schema for Arrow C data interface (GH-43553) is complex because it uses nested types (struct, map and union). So reusable implementation to make statistics array is useful.

What changes are included in this PR?

arrow::RecordBatch::MakeStatisticsArray() is a convenient function that converts arrow::ArrayStatistics in a arrow::RecordBatch to arrow::Array for the Arrow C data interface.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

Copy link

⚠️ GitHub issue #44010 has been automatically assigned in GitHub to PR creator.

cpp/src/arrow/compare.cc Outdated Show resolved Hide resolved
cpp/src/arrow/record_batch.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Sep 30, 2024
@kou kou force-pushed the cpp-record-batch-make-statistics-array branch from 903e3f4 to 92afc83 Compare September 30, 2024 08:17
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 30, 2024
@kou kou force-pushed the cpp-record-batch-make-statistics-array branch from 92afc83 to b194430 Compare September 30, 2024 09:03
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Sep 30, 2024
@kou
Copy link
Member Author

kou commented Oct 2, 2024

@pitrou @ianmcook What do you think about this?

Statistics schema https://github.com/apache/arrow/pull/43553/files#diff-f3758fb6986ea8d24bb2e13c2feb625b68bbd6b93b3fbafd3e2a03dcdc7ba263R86-R95 is compact but it may be complex to build. Because it uses many nested types.

@kou kou force-pushed the cpp-record-batch-make-statistics-array branch from 5a00c48 to 12b1a97 Compare October 30, 2024 06:32
const std::shared_ptr<DataType>& operator()(const int64_t&) { return int64(); }
const std::shared_ptr<DataType>& operator()(const uint64_t&) { return uint64(); }
const std::shared_ptr<DataType>& operator()(const double&) { return float64(); }
const std::shared_ptr<DataType>& operator()(const std::string&) { return utf8(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may forgot a bit but we don't distinct "bytes" and "utf8" in stats?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we didn't discuss it...
Let's discuss it in #44579.

We can assume "utf8" here for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add a // TODO(GH-44579) here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I should have added it...
I've added it.

@@ -80,6 +80,24 @@ struct ArrowArray {
void* private_data;
};

# define ARROW_STATISTICS_KEY_AVERAGE_BYTE_WIDTH_EXACT "ARROW:average_byte_width:exact"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know constexpr std::string_view is better or this is better

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't use constexpr because this header may be used by C programs.

statistics.nth_statistics = 0;
statistics.start_new_column = true;
statistics.nth_column = std::nullopt;
statistics.key = ARROW_STATISTICS_KEY_ROW_COUNT_EXACT;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So RowCount is also handled as a stats 🤔?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Statistics array will be passed to consumer before consumer receives a record batch.
So this may be useful for consumer.

But DuckDB doesn't have row count in its BaseStatistics...: https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/base_statistics.hpp#L38-L146
This may not be useful...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep this for now to demonstrate table/record batch level statistics.

cpp/src/arrow/record_batch.cc Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 30, 2024
@kou kou force-pushed the cpp-record-batch-make-statistics-array branch from 8e4d618 to 9c529d1 Compare October 31, 2024 23:48
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 31, 2024
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take a careful pass tonight

const std::shared_ptr<DataType>& operator()(const int64_t&) { return int64(); }
const std::shared_ptr<DataType>& operator()(const uint64_t&) { return uint64(); }
const std::shared_ptr<DataType>& operator()(const double&) { return float64(); }
const std::shared_ptr<DataType>& operator()(const std::string&) { return utf8(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add a // TODO(GH-44579) here?

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General LGTM but I'm not an expert on C ABI and data layer


// Statistics schema doesn't define static dense union type for
// values. Each statistics schema have a dense union type that has
// needled value types. The following block collects these types.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So actually this is logically a "set" prepared for items?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

If there are the same types, the first type is only used.

};
using OnStatistics =
std::function<Status(const EnumeratedStatistics& enumerated_statistics)>;
Status EnumerateStatistics(const RecordBatch& record_batch, OnStatistics on_statistics) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So actually this is for a two-phase building, one pass for types, and one-pass for data?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

I think that it's one of complexities.
So I sent https://lists.apache.org/thread/0c9jftkspvj7yw1lpo73s3vtp6vfjqv8 to the mailing list. But nobody agreed it. So this complexity will be acceptable...

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Nov 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants