Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Determine in advance whether there are changed fields, otherwise the latest schama information will not be obtained. #4521

Open
2 tasks done
GangYang-HX opened this issue Nov 13, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@GangYang-HX
Copy link
Contributor

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Optimize the logic of org.apache.paimon.flink.sink.cdc.UpdatedDataFieldsProcessFunctionBase#extractSchemaChanges: prioritize whether updatedDataFields is empty to avoid accessing the latest schema information every time

Solution

prioritize whether updatedDataFields is empty to avoid accessing the latest schema information every time

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@GangYang-HX GangYang-HX added the enhancement New feature or request label Nov 13, 2024
@GangYang-HX
Copy link
Contributor Author

Background
image
image
Here, each thread maintains a set of field information separately. If different threads process the same field one after another, the shema change check will be triggered. In this case, the latest schame information will be frequently obtained, resulting in a decrease in the overall throughput of the task, which will cause subsequent exceptions such as checkpoint failure.

Solution
Maintain a state cache for the latest shema information to avoid direct access to the file system.

@GangYang-HX
Copy link
Contributor Author

For example, the Paimon table has 1500 fields, the Parallelism of the Write operator is 500, and the task is restarted. In extreme cases, it will trigger 1500500 calls to the latest schema information. If each call takes 20ms, the total time is: 1500500*30ms=6.25h. This will greatly affect the throughput of the task.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant