Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CELEBORN-1743] Resolve the metrics data interruption and the job failure caused by locked resources. #2956

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

zaynt4606
Copy link
Contributor

@zaynt4606 zaynt4606 commented Nov 27, 2024

What changes were proposed in this pull request?

Remove the ConcurrentLinkedQueue and lock in AbstractSource which might cause the metrics data interruption and job fail.

Why are the changes needed?

Current problems:jira CELEBORN-1743
the lock in [CELEBORN-1453] might block the thread.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual test
same result with CELEBORN-1453
image

@zaynt4606 zaynt4606 changed the title [CELEBORN-1743] fix metrics data interruption and job fail due to the bolck of the lock [CELEBORN-1743] Resolve the metrics data interruption and the job failure caused by locked resources. Nov 27, 2024
@turboFei
Copy link
Member

How about using threadLocal?

@zaynt4606
Copy link
Contributor Author

zaynt4606 commented Nov 27, 2024

How about using threadLocal?

To store the contents of the hashmap or else ?
These hashMaps have just been moved together from under the code, not created.

@turboFei
Copy link
Member

How about using threadLocal?

To store the contents of the hashmap or else ? These hashMaps have just been moved together from under the code, not created.

Make the innerMetrics be ThreadLocal.

@zaynt4606
Copy link
Contributor Author

How about using threadLocal?

To store the contents of the hashmap or else ? These hashMaps have just been moved together from under the code, not created.

Make the innerMetrics be ThreadLocal.

innerMetrics appears to only be used to limit capacity in the code and I think it can be removed directly. The original code retrieves data from the hashmap and puts it into innerMetrics, which affects the intended order of the queue.

@turboFei
Copy link
Member

turboFei commented Nov 27, 2024

With threadLocal, just need little code change.

turboFei@a208931

@zaynt4606
Copy link
Contributor Author

With threadLocal, just need little code change.

turboFei@7384ca6

Looks good. It can replace the lock. And change the innerMetrics.remove() to return can avoids ineffective repeated queue updates when capacity is exceeded. Ordering can also be added to this part of the code.
Does innerMetrics need to be preserved?
cc @FMX @RexXiong

Copy link
Contributor

@RexXiong RexXiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want the order of the metrics when export to be the same as when they were added, we must globally sort all the metrics.

@turboFei
Copy link
Member

Do we need to global sort the metrics?

Seems the issue mentioned in the ticket is just caused by lock?

@zaynt4606
Copy link
Contributor Author

zaynt4606 commented Nov 28, 2024

Do we need to global sort the metrics?

Seems the issue mentioned in the ticket is just caused by lock?

There are a significant number of application metrics, and I want to minimize them as they approach capacity.

This issue is unrelated to the current Jira task, so I removed the sorting code and will create another pull request to address it without sorting.

@zaynt4606
Copy link
Contributor Author

With threadLocal, just need little code change.
turboFei@7384ca6

Looks good. It can replace the lock. And change the innerMetrics.remove() to return can avoids ineffective repeated queue updates when capacity is exceeded. Ordering can also be added to this part of the code. Does innerMetrics need to be preserved? cc @FMX @RexXiong

updateInnerMetrics will be called by recordTimer <- doStopTimer <- FetchHandler: workerSource.stopTimer
not only used in getMetrics so threadLocal can't be used here.

@zhisheng17
Copy link

@zaynt4606 Thank you for your code contribution. will there be any code changes? Can I cherry pick this submission to my company now?

@zaynt4606
Copy link
Contributor Author

zaynt4606 commented Dec 2, 2024

@zaynt4606 Thank you for your code contribution. will there be any code changes? Can I cherry pick this submission to my company now?

The code will change in followUp . To solve the jira problem you can cherry pick this pr (dont need to cherry pick the followUp).

@zhisheng17
Copy link

@zaynt4606 Thank you very much, your response and code bug fixes are so fast, thumbs up to you 👍

@zhisheng17
Copy link

@zaynt4606 After I cherry-pick the code to our company, do I just need to replace the client of the worker node and restart it to solve the problem? Do I need to replace the client of the master node and restart it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants