Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: multi column stats #160

Merged
merged 13 commits into from
Apr 16, 2024
Merged

feat: multi column stats #160

merged 13 commits into from
Apr 16, 2024

Conversation

AlSchlo
Copy link
Contributor

@AlSchlo AlSchlo commented Apr 15, 2024

This PR adds support for multi column statistics & minor bug-fixes.

let mut rows = Vec::<Vec<Option<Value>>>::new();
for (col, col_type) in cols.iter().zip(col_types.iter()) {
for (idx, val) in Self::to_typed_column(col, col_type).iter().enumerate() {
rows[idx].push(val.clone());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will cause out-of-bound access.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, i'll add some tests

Comment on lines 435 to 437
let (fully_null_rows, non_fully_null_rows): (Vec<_>, Vec<_>) = rows
.into_iter()
.partition(|row| row.iter().all(|v| v.is_none()));
Copy link
Contributor

@Gun9niR Gun9niR Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This operation might be very costly, as all the rows are copied once more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right... well the previous operation had a "flatten" and a "collect". that also copied stuff no?

Comment on lines 267 to 273
pub fn extend_with_multi_column_stats<
I: IntoIterator<Item = Result<RecordBatch, ArrowError>>,
>(
&mut self,
batch_iter_builder: impl Fn() -> anyhow::Result<RecordBatchIterator<I>>,
combinations: Vec<Vec<usize>>,
) -> anyhow::Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you plan to use this function? Looks like each invocation is two more scans.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

each invocation is len(combinations) x 2 scans, right now, for simplicity
it can be made x2 scans only

actually, i will fully refactor all this before merging anything

@@ -103,7 +105,7 @@ pub struct PerColumnStats<M: MostCommonValues, D: Distribution> {

// distribution _does not_ include the values in mcvs
// distribution _does not_ include nulls
pub distr: D,
pub distr: Option<D>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we separate PerColumnStats and CombColumnStats? Just in case they diverge further.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, probably. also has a better naming

@AlSchlo AlSchlo changed the title Multi column stats feat: multi column stats Apr 16, 2024
@AlSchlo AlSchlo merged commit e282624 into main Apr 16, 2024
1 check passed
@AlSchlo AlSchlo deleted the multi-column-stats branch April 16, 2024 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants