Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spec: Document Snapshot Summary Optional Fields for Standardization #11660

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

HonahX
Copy link
Contributor

@HonahX HonahX commented Nov 26, 2024

This PR introduces a new section, "Snapshot Summary", in the table spec under Snapshots to document optional fields in the snapshot summary, including metrics, partition-level summaries, and other fields such as Write-Audit-Publish (WAP)-related fields and ReplacePartitions indicators. The goal is to establish a clear standard for these fields, ensuring consistent naming and usage across implementations while reducing ambiguity and improving compatibility.

Proposal Here: https://docs.google.com/document/d/1Gt1ZOXVXK60IGdlmt4QlyRzaZ1iCVyYUBfMJCsiz14I/edit?usp=sharing

Marked as Draft because this is subject to change based on discussion on dev list

@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Nov 26, 2024
@HonahX HonahX linked an issue Nov 26, 2024 that may be closed by this pull request
6 tasks
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated
| **`added-equality-deletes`** | Number of equality delete records added in the current snapshot | Yes |
| **`removed-equality-deletes`** | Number of equality delete records removed in the current snapshot | Yes |
| **`total-equality-deletes`** | Total number of equality delete records in the current snapshot | No |
| **`deleted-duplicate-files`** | Number of duplicate files deleted in the current snapshot | No |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear where duplicate files are coming from

Copy link
Contributor Author

@HonahX HonahX Dec 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comes from a check in ManifestFilterManager:

if (deletedFiles.contains(file)) {
LOG.warn(
"Deleting a duplicate path from manifest {}: {}",
manifest.path(),
file.location());
duplicateDeleteCount += 1;

"Duplicate" means there are 2 manifest entry pointing to the same data file in one Manifest.

How about

Number of duplicate files deleted, where duplicates are files recorded more than once in the manifest

format/spec.md Outdated Show resolved Hide resolved
Copy link
Contributor Author

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RussellSpitzer Thanks for reviewing! I moved the whole section to Appendix G and update field descriptions accordingly

format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated
| **`added-equality-deletes`** | Number of equality delete records added in the current snapshot | Yes |
| **`removed-equality-deletes`** | Number of equality delete records removed in the current snapshot | Yes |
| **`total-equality-deletes`** | Total number of equality delete records in the current snapshot | No |
| **`deleted-duplicate-files`** | Number of duplicate files deleted in the current snapshot | No |
Copy link
Contributor Author

@HonahX HonahX Dec 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comes from a check in ManifestFilterManager:

if (deletedFiles.contains(file)) {
LOG.warn(
"Deleting a duplicate path from manifest {}: {}",
manifest.path(),
file.location());
duplicateDeleteCount += 1;

"Duplicate" means there are 2 manifest entry pointing to the same data file in one Manifest.

How about

Number of duplicate files deleted, where duplicates are files recorded more than once in the manifest

format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
@HonahX HonahX marked this pull request as ready for review January 3, 2025 01:06
format/spec.md Outdated
| **`wap.id`** | "12345678" | The Write-Audit-Publish id of a staged snapshot |
| **`published-wap-id`** | "12345678" | The Write-Audit-Publish id of a snapshot already been published |
| **`source-snapshot-id`** | "12345678" | The id of the snapshot picked to be cherry-picked |
| **`replace-partitions`** | `true` | Whether the operation is a `ReplacePartitions` |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a helpful description here :)

Copy link
Contributor Author

@HonahX HonahX Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this! Let me add more context for ReplacePartitions here.

That said, I’m on the fence about including this field. Most users likely won’t encounter it, as ReplacePartitions is generally not recommended. I decided to include it to make the appendix more comprehensive and to "reserve" the field name to prevent potential conflicts.

Link for reference: ReplacePartitions.java

format/spec.md Outdated
| **`deleted-duplicate-files`** | Number of duplicate files deleted, where duplicates are files recorded more than once in the manifest | No |
| **`changed-partition-count`** | Number of partitions with files added or removed in the snapshot | No |

### Partition-Level Summary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we should document this, the string representation we use for the path is not really well documented and in general we probably don't want folks including partition level summaries. Additionally it references a Table property (write.summary.partition-limit) which I really would like to avoid. I think we should stick to the table level stats above and remove this section and the associated column.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I also found it challenging to document partition-level summaries while writing the appendix, as they require explanations for a lot of context. Referencing write.summary.partition-limit directly in the table spec also feels out of place. Additionally, most users won’t encounter partition-level summaries because the default value for write.summary.partition-limit is 0.

Even so, I think it’s still helpful to document this information somewhere for those who might enable or work with partition-level summaries. One option could be to move the description to the Table Configuration section and include a note under write.summary.partition-limit to explain what partition-level summaries mean. This approach ties the information to the property users would need to modify if they choose to enable this feature. The downside is that snapshot summaries would end up being documented in two places, but this might be a reasonable tradeoff to ensure clarity and usability.

Do you think this sound like a better idea?

format/spec.md Outdated
Comment on lines 1636 to 1641
## Appendix G: Optional Snapshot Summary Fields

### Metrics
Snapshot summary can include metrics fields to track numeric stats of the snapshot. The value of these fields should be numeric strings (e.g., `"120"`).
Some of them are also used to represent partition-level metrics, in [Partition-Level Summary](#partition-level-summary).
Metrics must be accurate if written, as engines may rely on them for optimization.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: wydt about moving this description up one level, between "Appendix G" and "Metrics" since this is describing generally what fields belong to the Snapshot Summary. It should apply to all subheaders ( "Metrics", "Partition-Level", "Other")

Suggested change
## Appendix G: Optional Snapshot Summary Fields
### Metrics
Snapshot summary can include metrics fields to track numeric stats of the snapshot. The value of these fields should be numeric strings (e.g., `"120"`).
Some of them are also used to represent partition-level metrics, in [Partition-Level Summary](#partition-level-summary).
Metrics must be accurate if written, as engines may rely on them for optimization.
## Appendix G: Optional Snapshot Summary Fields
Snapshot summary can include metrics fields to track numeric stats of the snapshot, see [Metrics](#metrics).
Some of them are also used to represent partition-level metrics, in [Partition-Level Summary](#partition-level-summary).
The value of these fields should be of string type (e.g., `"120"`).
### Metrics
Metrics must be accurate if written, as engines may rely on them for optimization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! I've moved the the description up one level and make it mention all the sub-sections. I removed the "partition-level summary" based on Russell's suggestion above.

format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for adding this new appendix section

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Specification Issues that may introduce spec changes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Document Snapshot Summary Optional Fields for Standardization
3 participants