[CELEBORN-1743] Resolve the metrics data interruption and the job failure caused by locked resources. #2956
Conversation
How about using threadLocal? |
To store the contents of the hashmap, or something else? |
Make the |
With threadLocal, only a small code change is needed. |
Looks good. It can replace the lock. And change the |
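For readers following the threadLocal suggestion, here is a minimal sketch of the general idea, not the code in this PR. The class `ThreadLocalMetricsSketch`, its methods, and the output format are all hypothetical; it only assumes that each exporting thread can build its output in its own buffer, so no lock is needed around that buffer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a metrics source; this is NOT Celeborn's AbstractSource.
class ThreadLocalMetricsSketch {
    // Registered gauges; ConcurrentHashMap can be iterated without an external lock.
    private final Map<String, Long> gauges = new ConcurrentHashMap<>();

    // Each exporting thread gets its own buffer, so the buffer itself needs no lock.
    private final ThreadLocal<StringBuilder> buffer =
        ThreadLocal.withInitial(StringBuilder::new);

    void setGauge(String name, long value) {
        gauges.put(name, value);
    }

    String getMetrics() {
        StringBuilder sb = buffer.get();
        sb.setLength(0); // reset the per-thread buffer before reuse
        for (Map.Entry<String, Long> e : gauges.entrySet()) {
            sb.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```

Because the buffer is per thread, two concurrent exports never wait on each other; the trade-off is one buffer retained per exporting thread.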
If we want the order of the metrics when exported to be the same as when they were added, we must globally sort all the metrics.
Do we need to globally sort the metrics? It seems the issue mentioned in the ticket is caused only by the lock? |
There are a significant number of application metrics, and I want to minimize them as they approach capacity. This issue is unrelated to the current Jira task, so I removed the sorting code and will create another pull request to address it without sorting. |
@zaynt4606 Thank you for your code contribution. Will there be any further code changes? Can I cherry-pick this change for my company now? |
The code will change in a follow-up. To solve the Jira problem you can cherry-pick this PR (you don't need to cherry-pick the follow-up). |
@zaynt4606 Thank you very much, your response and code bug fixes are so fast, thumbs up to you 👍 |
@zaynt4606 After I cherry-pick the code to our company, do I just need to replace the client of the worker node and restart it to solve the problem? Do I need to replace the client of the master node and restart it? |
What changes were proposed in this pull request?
Remove the ConcurrentLinkedQueue and the lock in AbstractSource, which might cause the metrics data interruption and job failure.
Why are the changes needed?
Current problem: Jira CELEBORN-1743.
The lock in [CELEBORN-1453] might block the thread.
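To illustrate how a lock around metrics export can block another thread, here is a small self-contained sketch. It is not Celeborn code: `LockContentionSketch` and `secondExporterGotLock` are hypothetical names, and the "slow export" is simulated with a latch. It assumes only that one thread holds a shared `ReentrantLock` while a second thread tries to acquire it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demonstration; this is NOT the locking code from CELEBORN-1453.
class LockContentionSketch {
    // Returns whether a second thread obtained the lock while a slow holder kept it.
    static boolean secondExporterGotLock() {
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread slowExporter = new Thread(() -> {
            lock.lock();
            try {
                held.countDown(); // signal that the lock is now held
                release.await();  // simulate a slow export while holding the lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        });
        slowExporter.start();

        try {
            held.await();
            // The second caller waits up to 100 ms and then gives up.
            boolean acquired = lock.tryLock(100, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.unlock();
            }
            release.countDown();
            slowExporter.join();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch the second caller times out while the first thread holds the lock; with a plain `lock()` call it would instead wait for as long as the holder takes. Removing the shared lock avoids both outcomes.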
Does this PR introduce any user-facing change?
No
How was this patch tested?
Manual test
Same result as with CELEBORN-1453.