[CELEBORN-1743] Resolve the metrics data interruption and the job failure caused by locked resources. #2956

zaynt4606 · 2024-11-27T04:20:54Z

What changes were proposed in this pull request?

Remove the ConcurrentLinkedQueue and the lock in AbstractSource, which might cause metrics data interruption and job failure.

Why are the changes needed?

Current problem: see JIRA CELEBORN-1743.
The lock introduced in [CELEBORN-1453] might block the thread.
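
To make the concern concrete, here is a minimal Java sketch (not Celeborn's actual code) of recording samples without a shared lock. The class and method names (`LockFreeTimerSketch`, `record`, `snapshot`) and the metric name used in the usage note are hypothetical. The point is only that a hot-path call, such as stopping a timer, never waits on a lock held by the metrics export path.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: names are illustrative and do not mirror AbstractSource.
final class LockFreeTimerSketch {
    // One adder per metric name; neither structure needs an external lock.
    private final ConcurrentHashMap<String, LongAdder> totals = new ConcurrentHashMap<>();

    // Called on the hot path (for example, when a timer stops).
    void record(String name, long nanos) {
        totals.computeIfAbsent(name, k -> new LongAdder()).add(nanos);
    }

    // Called by the export path; takes a weakly consistent view and never
    // holds a lock that record() would have to wait for.
    Map<String, Long> snapshot() {
        Map<String, Long> out = new TreeMap<>();
        totals.forEach((name, adder) -> out.put(name, adder.sum()));
        return out;
    }
}
```

For example, calling `record("ExampleTimer", 5L)` three times and then `snapshot()` yields a map whose `"ExampleTimer"` entry is 15.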

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual test.
Same result as with CELEBORN-1453.

turboFei · 2024-11-27T06:26:52Z

How about using ThreadLocal?

zaynt4606 · 2024-11-27T06:33:38Z

> How about using ThreadLocal?

To store the contents of the hash map, or something else?
These hash maps were just moved together from lower down in the code, not newly created.

turboFei · 2024-11-27T06:34:20Z

> > How about using ThreadLocal?
>
> To store the contents of the hash map, or something else? These hash maps were just moved together from lower down in the code, not newly created.

Make innerMetrics a ThreadLocal.

zaynt4606 · 2024-11-27T06:41:00Z

> > > How about using ThreadLocal?
> >
> > To store the contents of the hash map, or something else? These hash maps were just moved together from lower down in the code, not newly created.
>
> Make innerMetrics a ThreadLocal.

innerMetrics appears to be used only to limit capacity in the code, so I think it can be removed directly. The original code retrieves data from the hash map and puts it into innerMetrics, which affects the intended order of the queue.

turboFei · 2024-11-27T06:59:28Z

With ThreadLocal, only a small code change is needed.

turboFei@a208931
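
The linked commit is not reproduced here. As a rough sketch of the general idea only, assuming the buffer is written and drained by the same thread, a per-thread buffer could look like the following. The class `ThreadLocalBufferSketch` and its methods are hypothetical. Only the field name `innerMetrics` is borrowed from the discussion.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a per-thread buffer; not the code from the linked commit.
final class ThreadLocalBufferSketch {
    // Each thread lazily gets its own deque, so no lock is needed to guard it.
    private final ThreadLocal<ArrayDeque<String>> innerMetrics =
        ThreadLocal.withInitial(ArrayDeque::new);

    // Appends to the calling thread's buffer only.
    void add(String line) {
        innerMetrics.get().add(line);
    }

    // Drains only the calling thread's buffer, in insertion order.
    List<String> drain() {
        ArrayDeque<String> q = innerMetrics.get();
        List<String> out = new ArrayList<>(q);
        q.clear();
        return out;
    }
}
```

The trade-off is that entries added by one thread are invisible to every other thread, so this only works when producer and consumer are the same thread.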

zaynt4606 · 2024-11-27T07:26:49Z

> With ThreadLocal, only a small code change is needed.
>
> turboFei@7384ca6

Looks good. It can replace the lock. Changing innerMetrics.remove() to a return also avoids ineffective repeated queue updates when capacity is exceeded. Ordering can also be added to this part of the code.
Does innerMetrics need to be preserved?
cc @FMX @RexXiong
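
As an illustration of the remove-versus-return point, the sketch below contrasts the two behaviours on a capped buffer. It is not taken from either commit. `CappedBufferSketch` and its method names are hypothetical, the capacity is assumed to be positive, and the class is deliberately not thread-safe.

```java
import java.util.ArrayDeque;

// Hypothetical sketch; not thread-safe, capacity is assumed to be >= 1.
final class CappedBufferSketch {
    private final int capacity;
    private final ArrayDeque<String> buffer = new ArrayDeque<>();

    CappedBufferSketch(int capacity) {
        this.capacity = capacity;
    }

    // Evict-then-add: once full, every call mutates the buffer twice.
    void addEvictingOldest(String line) {
        if (buffer.size() >= capacity) {
            buffer.remove(); // drops the head, i.e. the oldest entry
        }
        buffer.add(line);
    }

    // Early return: once full, further calls leave the buffer untouched.
    boolean addOrSkip(String line) {
        if (buffer.size() >= capacity) {
            return false;
        }
        buffer.add(line);
        return true;
    }

    // Oldest entry still held, or null when the buffer is empty.
    String oldest() {
        return buffer.peek();
    }

    int size() {
        return buffer.size();
    }
}
```

With the early return, a full buffer keeps its existing entries and does no further work per call. With eviction, the contents keep churning while the size stays at the cap.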

RexXiong

If we want the order of the metrics when exported to be the same as when they were added, we must globally sort all the metrics.

turboFei · 2024-11-28T08:47:26Z

Do we need to globally sort the metrics?

It seems the issue mentioned in the ticket is caused only by the lock?

zaynt4606 · 2024-11-28T09:16:59Z

> Do we need to globally sort the metrics?
>
> It seems the issue mentioned in the ticket is caused only by the lock?

There are a significant number of application metrics, and I want to minimize them as they approach capacity.

This issue is unrelated to the current Jira task, so I removed the sorting code and will create another pull request to address it without sorting.

zaynt4606 · 2024-11-28T09:25:21Z

> > With ThreadLocal, only a small code change is needed.
> > turboFei@7384ca6
>
> Looks good. It can replace the lock. Changing innerMetrics.remove() to a return also avoids ineffective repeated queue updates when capacity is exceeded. Ordering can also be added to this part of the code. Does innerMetrics need to be preserved? cc @FMX @RexXiong

updateInnerMetrics is also called through the chain recordTimer <- doStopTimer <- FetchHandler: workerSource.stopTimer.
It is not only used in getMetrics, so ThreadLocal can't be used here.
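
The limitation described above can be reproduced in isolation. In this hypothetical demo (none of the names come from Celeborn), one thread plays the handler that records a sample and the calling thread plays the exporter. Because a `ThreadLocal` gives each thread its own copy, the exporter sees nothing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo: a handler thread records, a different thread exports.
final class ThreadLocalVisibilityDemo {
    private static final ThreadLocal<List<String>> BUFFER =
        ThreadLocal.withInitial(ArrayList::new);

    // Records on a separate "handler" thread, then returns how many entries
    // the calling ("export") thread can see in its own copy of the buffer.
    static int entriesVisibleToExporter() {
        Thread handler = new Thread(() -> BUFFER.get().add("timer sample"));
        handler.start();
        try {
            handler.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return BUFFER.get().size();
    }
}
```

`ThreadLocalVisibilityDemo.entriesVisibleToExporter()` returns 0 even though the handler thread added an entry, which is why a structure written from handler threads and read from the export thread cannot simply be made thread-local.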

zhisheng17 · 2024-12-02T06:46:33Z

@zaynt4606 Thank you for your code contribution. Will there be any more code changes? Can I cherry-pick this change for my company now?

zaynt4606 · 2024-12-02T06:54:44Z

> @zaynt4606 Thank you for your code contribution. Will there be any more code changes? Can I cherry-pick this change for my company now?

The code will change in the follow-up. To solve the JIRA problem, you can cherry-pick this PR (you don't need to cherry-pick the follow-up).

zhisheng17 · 2024-12-02T07:33:13Z

@zaynt4606 Thank you very much, your response and code bug fixes are so fast, thumbs up to you 👍

zhisheng17 · 2024-12-02T07:39:36Z

@zaynt4606 After I cherry-pick the code to our company, do I just need to replace the client of the worker node and restart it to solve the problem? Do I need to replace the client of the master node and restart it?

clb1734 bug fix

3bf75a2

zaynt4606 force-pushed the clb1743 branch from a5d1e50 to 3bf75a2 Compare November 27, 2024 05:41

zaynt4606 changed the title ~~[CELEBORN-1743] fix metrics data interruption and job fail due to the bolck of the lock~~ [CELEBORN-1743] Resolve the metrics data interruption and the job failure caused by locked resources. Nov 27, 2024

RexXiong reviewed Nov 27, 2024

zaynt4606 added 2 commits November 28, 2024 10:52

remove sort

41f1834

remove useless sort

f5d21ac

zaynt4606 added 2 commits November 29, 2024 09:40

fill log

1696f16

format

52d7cd8

zaynt4606 mentioned this pull request Nov 29, 2024

[CELEBORN-1743] [FOLLOWUP] Make the application metrics queue later when the capacity is full without sorting #2964

Open
