Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GOBBLIN-2165] Support partial commit semantic from CopyDataPublisher #4066

Merged

Conversation

Will-Lo
Copy link
Contributor

@Will-Lo Will-Lo commented Oct 7, 2024

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
    When Gobblin is using partial commit semantics, it is possible for files to be missing on the destination after the job is completed (in line with the semantic where any number of files may be missing, as long as one file is successfully published it can still be considered successful).

Currently, Gobblin will raise a FileNotFoundException when this file is being published if its performs a metadata sync, which prevents the job from running through its postpublish steps. We still want the postpublish steps to go through as they may contain important steps for the job to be considered at least partially successful (permission sync, table registration).

This PR allows one to add the configurations

job.commit.policy: successful // or job.commit.policy=partial
job.commit.partial.fail.task.fails.job.commit=true

And have a job publish all of its successful files but still mark the job as failed.
It is important that certain table copies (e.g. iceberg distcp) should not be supporting partial copy semantics in the same way as it would break the table manifest. Also care should be taken with hive copies so that subsequent copies will still copy the missing files (turn off fast partition skip predicate).

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Copy link
Contributor

@phet phet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the commit message is confusing - This PR allows one to add the configurations... and requires really close reading to realize it's not really about those two partial success params, but more about preventing internal corruption when they're enabled.

it does however raise a good point about how the params would be problematic for iceberg-distcp. would we want the IcebergDatasetFinder to fail-fast if it finds those params set?

if (wus.getWorkingState() == WorkingState.COMMITTED) {
preserveFileAttrInPublisher(copyableFile);
Copy link
Contributor

@phet phet Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad the fix was a mere reordering. still, that is so subtle, it definitely deserves a comment (e.g. about what would transpire if preserving attrs prior to verifying the associate file was COMMITTED)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment

@Will-Lo Will-Lo force-pushed the support-partial-semantics-distcp-publish branch from f4e2a09 to 433b42f Compare October 8, 2024 00:20
@Will-Lo Will-Lo merged commit 2afab69 into apache:master Oct 8, 2024
5 of 6 checks passed
@Will-Lo Will-Lo deleted the support-partial-semantics-distcp-publish branch October 8, 2024 01:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants