
[BugFix] Avoid hdfs fs manager interrupting the thread when exception occurs #48403

Merged
merged 1 commit into StarRocks:main from fix_no_existed_hdfs on Jul 22, 2024

Conversation

xiangguangyxg
Contributor

@xiangguangyxg xiangguangyxg commented Jul 16, 2024

Why I'm doing:

When an exception occurs in the hdfs fs manager, the current thread is left interrupted, causing an InterruptedException or ClosedByInterruptException in subsequent code execution.

This may cause the following problem:
A user backs up data to an HDFS repository and then releases the HDFS cluster. When the FE starts, it accesses the released HDFS cluster. If the access fails, the current thread is interrupted, an InterruptedIOException or ClosedByInterruptException is thrown later, and the startup fails.
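For illustration only, here is a minimal standalone Java sketch of this failure mode (plain JDK code, not StarRocks code): once the interrupt flag is left set, an unrelated blocking call on the same thread fails immediately.

```java
public class LeakedInterruptFlagDemo {
    public static void main(String[] args) {
        // Stand-in for the failed HDFS access described above: the failure path
        // leaves the calling thread's interrupt flag set and never clears it.
        Thread.currentThread().interrupt();

        try {
            // Any later blocking call on the same thread now fails immediately,
            // even though it has nothing to do with the original HDFS failure.
            Thread.sleep(100);
        } catch (InterruptedException e) {
            System.out.println("Unrelated blocking call failed: " + e);
        }
    }
}
```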

What I'm doing:

Avoid the hdfs fs manager interrupting the thread by clearing the interrupted flag when an exception occurs (a simplified sketch is shown below).
Fixes #issue
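A simplified, self-contained sketch of this pattern; the helper names below are hypothetical stand-ins, and the real change is in HdfsFsManager, as shown in the review diff hunks further down.

```java
import java.io.IOException;

// Sketch only: getFileSystem/openFileSystem are hypothetical stand-ins, not the StarRocks API.
public class InterruptClearingSketch {
    static Object getFileSystem(String path) throws IOException {
        try {
            return openFileSystem(path); // pretend this is the real HDFS access
        } catch (Exception e) {
            // Thread.interrupted() returns AND clears the current thread's interrupt
            // status, so a failed HDFS call does not poison later blocking calls.
            Thread.interrupted();
            throw new IOException("getFileSystem failed for " + path, e);
        }
    }

    // Simulates an HDFS client failure that also interrupts the calling thread.
    static Object openFileSystem(String path) throws IOException {
        Thread.currentThread().interrupt();
        throw new IOException("cannot reach " + path);
    }

    public static void main(String[] args) {
        try {
            getFileSystem("hdfs://released-cluster/repo");
        } catch (IOException expected) {
            System.out.println("caught: " + expected.getMessage());
        }
        // The caller's thread is no longer marked interrupted.
        System.out.println("interrupt flag still set? " + Thread.currentThread().isInterrupted());
    }
}
```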

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

@@ -708,6 +710,7 @@ private HdfsFs getFileSystemByCloudConfiguration(CloudConfiguration cloudConfigu
             }
             return fileSystem;
         } catch (Exception e) {
+            Thread.interrupted();
Contributor


Can we try to unify the exception handling for all FileSystems?

Contributor Author


later

@@ -518,6 +518,7 @@ public HdfsFs getDistributedFileSystem(String scheme, String path, Map<String, S
             }
             return fileSystem;
         } catch (Exception e) {
+            Thread.interrupted();
Contributor


Is this the right way to handle the interruption event? Do we need to do something like a retry? Should we always clear the interrupted state even when no interruption actually happened? If we really need to swallow the interrupt exception, I think we should catch that exception type explicitly and comment the reason.

Contributor Author

@xiangguangyxg xiangguangyxg Jul 16, 2024


If an exception is caught, the interrupted flag it set should usually be reset so that it does not affect the caller thread; otherwise catching the exception is meaningless, and it would be better to just rethrow it and let the caller handle it.
The retry is done in the underlying code, so there is no need to retry here.
If the interrupted flag is left set, any subsequent blocking code on this thread will throw an exception (sleep, wait, read, write ...). Most of these blocking calls are already encapsulated, so it is impossible to catch the exceptions and retry in every piece of blocking code.

Contributor

@kevincai kevincai left a comment


I still doubt whether this is really caused by the thread interruption. It sounds as if, once a thread is interrupted due to a socket timeout, it cannot be reused for further IO operations unless the interruption is cleared manually. I have not heard of this in Java IO programming.

A second thought: even if it is true, we had better move the remote IO operation to a separate thread pool or thread while replaying the image, so that no matter what happens to that thread, it can be destroyed quickly and will not affect the main thread at all.
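For illustration, a rough sketch of this alternative under assumed names (IsolatedHdfsAccess and checkRepository are hypothetical; this is not what the PR implements): run the remote access on a dedicated worker thread so that an interrupt or a hung call only affects that worker.

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the suggested alternative: isolate remote HDFS IO on its own thread
// so the image-replaying thread is never interrupted or blocked by it.
public class IsolatedHdfsAccess {
    private static final ExecutorService HDFS_IO_POOL =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "hdfs-io-worker");
                t.setDaemon(true);
                return t;
            });

    public static String checkRepository(String path) throws IOException {
        Future<String> result = HDFS_IO_POOL.submit(() -> {
            // Placeholder for the real remote access; may block, fail, or be interrupted.
            return "ok: " + path;
        });
        try {
            return result.get(30, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // interrupts only the worker thread, not the caller
            throw new IOException("HDFS access timed out for " + path, e);
        } catch (ExecutionException e) {
            throw new IOException("HDFS access failed for " + path, e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's own interrupt
            throw new IOException("caller interrupted while waiting for HDFS access", e);
        }
    }
}
```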

@kevincai
Contributor

I took a second look at the JDK source code around ClosedByInterruptException. It means the thread was blocked on a blocking IO operation and was interrupted while waiting for it; in that case the underlying IO may still be in flight, may have failed, or whatever. It may not be safe to simply reset the interruption and allow IO operations to be redone on the same thread.

We should pursue a different fix rather than just resetting the interruption flag.
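For reference, this JDK behavior can be reproduced without HDFS. Below is a small standalone demo using a local FileChannel, which goes through the same AbstractInterruptibleChannel machinery: a pre-set interrupt flag causes the channel to be closed and the read to fail.

```java
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Demo of the JDK behavior discussed above, using a local FileChannel:
// if the calling thread's interrupt flag is set, AbstractInterruptibleChannel
// closes the channel and the read fails with ClosedByInterruptException.
public class ClosedByInterruptDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("demo", ".bin");
        Files.write(tmp, new byte[]{1, 2, 3, 4});

        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            Thread.currentThread().interrupt(); // simulate a leaked interrupt flag
            ch.read(ByteBuffer.allocate(4));    // fails even though the file is readable
        } catch (ClosedByInterruptException e) {
            System.out.println("read failed, channel was closed: " + e);
        } finally {
            Thread.interrupted(); // clear the flag so cleanup below is not affected
            Files.deleteIfExists(tmp);
        }
    }
}
```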

@xiangguangyxg
Contributor Author

xiangguangyxg commented Jul 17, 2024

> I took a second look at the JDK source code around ClosedByInterruptException. It means the thread was blocked on a blocking IO operation and was interrupted while waiting for it; in that case the underlying IO may still be in flight, may have failed, or whatever. It may not be safe to simply reset the interruption and allow IO operations to be redone on the same thread.
>
> We should pursue a different fix rather than just resetting the interruption flag.

> I still doubt whether this is really caused by the thread interruption. It sounds as if, once a thread is interrupted due to a socket timeout, it cannot be reused for further IO operations unless the interruption is cleared manually. I have not heard of this in Java IO programming.
>
> A second thought: even if it is true, we had better move the remote IO operation to a separate thread pool or thread while replaying the image, so that no matter what happens to that thread, it can be destroyed quickly and will not affect the main thread at all.

The interrupted flag should only be handled in the related blocking operations and should have no influence on unrelated blocking operations. I haven't figured out why it leaked.

@xiangguangyxg
Contributor Author

> I took a second look at the JDK source code around ClosedByInterruptException. It means the thread was blocked on a blocking IO operation and was interrupted while waiting for it; in that case the underlying IO may still be in flight, may have failed, or whatever. It may not be safe to simply reset the interruption and allow IO operations to be redone on the same thread.
>
> We should pursue a different fix rather than just resetting the interruption flag.

An uncleared interrupted flag is always a hidden danger, and any thread that calls the hdfs fs manager may be affected.

@kevincai
Contributor

There is a long discussion about this AbstractInterruptibleChannel behavior at this link: https://news.ycombinator.com/item?id=35125962

This issue should be fixed, but maybe in a different way. I am concerned about the unknown impacts of clearing the thread interruption.

@xiangguangyxg
Contributor Author

> There is a long discussion about this AbstractInterruptibleChannel behavior at this link: https://news.ycombinator.com/item?id=35125962
>
> This issue should be fixed, but maybe in a different way. I am concerned about the unknown impacts of clearing the thread interruption.

Any suggestions on how to fix the issue?

@kevincai
Contributor

> > There is a long discussion about this AbstractInterruptibleChannel behavior at this link: https://news.ycombinator.com/item?id=35125962
> > This issue should be fixed, but maybe in a different way. I am concerned about the unknown impacts of clearing the thread interruption.
>
> Any suggestions on how to fix the issue?

Still searching and studying, trying to get a full understanding of the AbstractInterruptibleChannel behavior.

@xiangguangyxg xiangguangyxg force-pushed the fix_no_existed_hdfs branch 2 times, most recently from d74231f to 04f369d on July 19, 2024 03:28

sonarcloud bot commented Jul 19, 2024


[FE Incremental Coverage Report]

fail : 0 / 52 (00.00%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 com/starrocks/fs/hdfs/HdfsFsManager.java 0 52 00.00% [1204, 1205, 1206, 1207, 1254, 1255, 1256, 1257, 1271, 1272, 1273, 1274, 1304, 1305, 1306, 1308, 1321, 1322, 1323, 1324, 1342, 1343, 1344, 1345, 1358, 1359, 1360, 1362, 1374, 1375, 1376, 1378, 1406, 1407, 1408, 1409, 1426, 1427, 1428, 1429, 1451, 1452, 1453, 1454, 1471, 1472, 1473, 1475, 1489, 1490, 1491, 1493]


[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

@wanpengfei-git wanpengfei-git merged commit 3fcafac into StarRocks:main Jul 22, 2024
46 of 47 checks passed
@github-actions github-actions bot removed the 3.0 label Jul 22, 2024

@Mergifyio backport branch-2.5

@github-actions github-actions bot removed the 2.5 label Jul 22, 2024
Contributor

mergify bot commented Jul 22, 2024

backport branch-3.3

✅ Backports have been created

Contributor

mergify bot commented Jul 22, 2024

backport branch-3.2

✅ Backports have been created

Contributor

mergify bot commented Jul 22, 2024

backport branch-3.1

✅ Backports have been created

Contributor

mergify bot commented Jul 22, 2024

backport branch-3.0

✅ Backports have been created

Contributor

mergify bot commented Jul 22, 2024

backport branch-2.5

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Jul 22, 2024
… occurs (#48403)

Signed-off-by: xiangguangyxg <[email protected]>
(cherry picked from commit 3fcafac)
mergify bot pushed a commit that referenced this pull request Jul 22, 2024
… occurs (#48403)

Signed-off-by: xiangguangyxg <[email protected]>
(cherry picked from commit 3fcafac)
mergify bot pushed a commit that referenced this pull request Jul 22, 2024
… occurs (#48403)

Signed-off-by: xiangguangyxg <[email protected]>
(cherry picked from commit 3fcafac)

# Conflicts:
#	fe/fe-core/src/main/java/com/starrocks/fs/hdfs/HdfsFsManager.java
mergify bot pushed a commit that referenced this pull request Jul 22, 2024
… occurs (#48403)

Signed-off-by: xiangguangyxg <[email protected]>
(cherry picked from commit 3fcafac)

# Conflicts:
#	fe/fe-core/src/main/java/com/starrocks/fs/hdfs/HdfsFsManager.java
mergify bot pushed a commit that referenced this pull request Jul 22, 2024
… occurs (#48403)

Signed-off-by: xiangguangyxg <[email protected]>
(cherry picked from commit 3fcafac)

# Conflicts:
#	fe/fe-core/src/main/java/com/starrocks/fs/hdfs/HdfsFsManager.java
wanpengfei-git pushed a commit that referenced this pull request Jul 23, 2024
wanpengfei-git pushed a commit that referenced this pull request Jul 23, 2024
wanpengfei-git pushed a commit that referenced this pull request Jul 23, 2024
… occurs (backport #48403) (#48697)

Signed-off-by: xiangguangyxg <[email protected]>
Co-authored-by: xiangguangyxg <[email protected]>
wanpengfei-git pushed a commit that referenced this pull request Jul 23, 2024
… occurs (backport #48403) (#48698)

Signed-off-by: xiangguangyxg <[email protected]>
Co-authored-by: xiangguangyxg <[email protected]>
wanpengfei-git pushed a commit that referenced this pull request Jul 23, 2024
… occurs (backport #48403) (#48699)

Signed-off-by: xiangguangyxg <[email protected]>
Co-authored-by: xiangguangyxg <[email protected]>
@xiangguangyxg xiangguangyxg deleted the fix_no_existed_hdfs branch July 23, 2024 02:35
dujijun007 pushed a commit to dujijun007/starrocks that referenced this pull request Jul 29, 2024