-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spec: Document Snapshot Summary Optional Fields for Standardization #11660
base: main
Are you sure you want to change the base?
Conversation
format/spec.md
Outdated
| **`added-equality-deletes`** | Number of equality delete records added in the current snapshot | Yes | | ||
| **`removed-equality-deletes`** | Number of equality delete records removed in the current snapshot | Yes | | ||
| **`total-equality-deletes`** | Total number of equality delete records in the current snapshot | No | | ||
| **`deleted-duplicate-files`** | Number of duplicate files deleted in the current snapshot | No | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unclear where duplicate files are coming from
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comes from a check in ManifestFilterManager:
iceberg/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java
Lines 492 to 497 in 90be5d7
if (deletedFiles.contains(file)) { | |
LOG.warn( | |
"Deleting a duplicate path from manifest {}: {}", | |
manifest.path(), | |
file.location()); | |
duplicateDeleteCount += 1; |
"Duplicate" means there are 2 manifest entry pointing to the same data file in one Manifest.
How about
Number of duplicate files deleted, where duplicates are files recorded more than once in the manifest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer Thanks for reviewing! I moved the whole section to Appendix G and update field descriptions accordingly
format/spec.md
Outdated
| **`added-equality-deletes`** | Number of equality delete records added in the current snapshot | Yes | | ||
| **`removed-equality-deletes`** | Number of equality delete records removed in the current snapshot | Yes | | ||
| **`total-equality-deletes`** | Total number of equality delete records in the current snapshot | No | | ||
| **`deleted-duplicate-files`** | Number of duplicate files deleted in the current snapshot | No | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comes from a check in ManifestFilterManager:
iceberg/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java
Lines 492 to 497 in 90be5d7
if (deletedFiles.contains(file)) { | |
LOG.warn( | |
"Deleting a duplicate path from manifest {}: {}", | |
manifest.path(), | |
file.location()); | |
duplicateDeleteCount += 1; |
"Duplicate" means there are 2 manifest entry pointing to the same data file in one Manifest.
How about
Number of duplicate files deleted, where duplicates are files recorded more than once in the manifest
format/spec.md
Outdated
| **`wap.id`** | "12345678" | The Write-Audit-Publish id of a staged snapshot | | ||
| **`published-wap-id`** | "12345678" | The Write-Audit-Publish id of a snapshot already been published | | ||
| **`source-snapshot-id`** | "12345678" | The id of the snapshot picked to be cherry-picked | | ||
| **`replace-partitions`** | `true` | Whether the operation is a `ReplacePartitions` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a helpful description here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this! Let me add more context for ReplacePartitions
here.
That said, I’m on the fence about including this field. Most users likely won’t encounter it, as ReplacePartitions
is generally not recommended. I decided to include it to make the appendix more comprehensive and to "reserve" the field name to prevent potential conflicts.
Link for reference: ReplacePartitions.java
format/spec.md
Outdated
| **`deleted-duplicate-files`** | Number of duplicate files deleted, where duplicates are files recorded more than once in the manifest | No | | ||
| **`changed-partition-count`** | Number of partitions with files added or removed in the snapshot | No | | ||
|
||
### Partition-Level Summary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we should document this, the string representation we use for the path is not really well documented and in general we probably don't want folks including partition level summaries. Additionally it references a Table property (write.summary.partition-limit) which I really would like to avoid. I think we should stick to the table level stats above and remove this section and the associated column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I also found it challenging to document partition-level summaries while writing the appendix, as they require explanations for a lot of context. Referencing write.summary.partition-limit
directly in the table spec also feels out of place. Additionally, most users won’t encounter partition-level summaries because the default value for write.summary.partition-limit
is 0
.
Even so, I think it’s still helpful to document this information somewhere for those who might enable or work with partition-level summaries. One option could be to move the description to the Table Configuration section and include a note under write.summary.partition-limit to explain what partition-level summaries mean. This approach ties the information to the property users would need to modify if they choose to enable this feature. The downside is that snapshot summaries would end up being documented in two places, but this might be a reasonable tradeoff to ensure clarity and usability.
Do you think this sound like a better idea?
format/spec.md
Outdated
## Appendix G: Optional Snapshot Summary Fields | ||
|
||
### Metrics | ||
Snapshot summary can include metrics fields to track numeric stats of the snapshot. The value of these fields should be numeric strings (e.g., `"120"`). | ||
Some of them are also used to represent partition-level metrics, in [Partition-Level Summary](#partition-level-summary). | ||
Metrics must be accurate if written, as engines may rely on them for optimization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: wydt about moving this description up one level, between "Appendix G" and "Metrics" since this is describing generally what fields belong to the Snapshot Summary. It should apply to all subheaders ( "Metrics", "Partition-Level", "Other")
## Appendix G: Optional Snapshot Summary Fields | |
### Metrics | |
Snapshot summary can include metrics fields to track numeric stats of the snapshot. The value of these fields should be numeric strings (e.g., `"120"`). | |
Some of them are also used to represent partition-level metrics, in [Partition-Level Summary](#partition-level-summary). | |
Metrics must be accurate if written, as engines may rely on them for optimization. | |
## Appendix G: Optional Snapshot Summary Fields | |
Snapshot summary can include metrics fields to track numeric stats of the snapshot, see [Metrics](#metrics). | |
Some of them are also used to represent partition-level metrics, in [Partition-Level Summary](#partition-level-summary). | |
The value of these fields should be of string type (e.g., `"120"`). | |
### Metrics | |
Metrics must be accurate if written, as engines may rely on them for optimization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! I've moved the the description up one level and make it mention all the sub-sections. I removed the "partition-level summary" based on Russell's suggestion above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for adding this new appendix section
This PR introduces a new section, "Snapshot Summary", in the table spec under Snapshots to document optional fields in the snapshot summary, including metrics, partition-level summaries, and other fields such as Write-Audit-Publish (WAP)-related fields and ReplacePartitions indicators. The goal is to establish a clear standard for these fields, ensuring consistent naming and usage across implementations while reducing ambiguity and improving compatibility.
Proposal Here: https://docs.google.com/document/d/1Gt1ZOXVXK60IGdlmt4QlyRzaZ1iCVyYUBfMJCsiz14I/edit?usp=sharing
Marked as Draft because this is subject to change based on discussion on dev list