Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add subcommand stats to support data stream analysis. #134

Merged
merged 3 commits into from
Aug 7, 2024

Conversation

linlin-s
Copy link
Contributor

Issue #, if available:

#42
This PR is the updated version of PR#104. Since a lot of refactoring has been done, creating a new PR would be easier to review.

Description of changes:

This PR adds option stats to analyze the input binary Ion data stream, allowing users to:

  • Get the maximum depth of the input data stream.
  • Get insights into the top-level values, including the minimum, maximum and average size, size distributions, and total number of the top-level values.
  • Gain insights into the symbol table, including the ability to request the number of local symbol tables and the number of symbols in the provided Ion stream.

This PR also removes count since stats already includes this information.

Test:

A unit test has been added at the end of stats.rs
Input: ion --unstable stats test.10n
Output:
image


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

.expect("There is an error occurred while writing the symtab_count.");
writeln!(
writer,
"The maximum depth of the input data stream is {}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"The maximum depth of the input data stream is {}",
"The maximum container depth is {}",

let histogram = plot::Histogram::new(&out.size_vec, options);
writeln!(
writer,
"The 'samples' field represents the total number of top-level value of input data stream. The unit of min, max ,avg size is bytes.\n\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"The 'samples' field represents the total number of top-level value of input data stream. The unit of min, max ,avg size is bytes.\n\
"The 'samples' field represents the total number of top-level value of input data stream. The unit of min, max, avg size is bytes.\n\

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than explain what 'samples' means here, let's just rename 'samples' to "top-level values".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

samples cannot be renamed because all the statistical results are calculated and written in stats when we initialize Histogram. Here is the implementation of displaying Stats. According to the implementation, samples is written while formatting the results of stats.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. We could always format the Histogram ourselves, but this is probably fine for now.

.expect("There is an error occurred while writing the symbols_count.");
writeln!(
writer,
"The number of local symbol tables is {} ",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Local symbol tables

Suggested change
"The number of local symbol tables is {} ",
"Local symbol tables: {} ",

histogram
)
.expect("There is an error occurred while plotting the size distribution of input data stream.");
writeln!(writer, "The number of symbols is {} ", out.symbols_count)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
writeln!(writer, "The number of symbols is {} ", out.symbols_count)
writeln!(writer, "Symbols: {} ", out.symbols_count)

.expect("There is an error occurred while writing the symtab_count.");
writeln!(
writer,
"The maximum depth of the input data stream is {}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"The maximum depth of the input data stream is {}",
"Maximum container depth: {}",

.expect_list()
.unwrap()
.iter()
.map(|v| top_level_max_depth(v.unwrap(), depth + 1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the number of times we've seen the unbounded recursion thing pop up, let's do this without recursion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, using iteration method would be a better option here.

* Using iteration method to calculate the maximum depth of container.
for field in struct_value {
stack.push((field.unwrap().value(), depth + 1));
}
} else if current_value.ion_type() == IonType::List {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch should include s-expressions too, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for calling it out, s-expressions should be processed too. In my opinion, there should be another branch to process s-expressions specifically. Please correct me if I'm wrong, but as I understand it, the current way we read the value out of LazyValue doesn't allow us to merge both lists and s-expressions into one branch. We can only call current_value.read().unwrap().expect_list() and current_value.read().unwrap().expect_sexp() separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that's fine.

@linlin-s linlin-s merged commit 7d1dfb6 into master Aug 7, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants