
[BugFix] Avoid hdfs fs manager interrupting the thread when exception occurs #48403

Merged
merged 1 commit into StarRocks:main from fix_no_existed_hdfs on Jul 22, 2024

Conversation

xiangguangyxg
Contributor

@xiangguangyxg xiangguangyxg commented Jul 16, 2024

Why I'm doing:

When an exception occurs in the hdfs fs manager, the current thread is left interrupted, causing an InterruptedException or ClosedByInterruptException in subsequent code execution.

This may cause the following problem:
A user backs up data to an HDFS repository and then releases the HDFS cluster. When the FE starts, it accesses the released HDFS cluster. If the access fails, the current thread is interrupted, an InterruptedIOException or ClosedByInterruptException is thrown later, and the startup fails.
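For illustration only, here is a minimal standalone Java sketch of this failure mode (plain JDK code, not StarRocks code): once the interrupt flag is left set, an unrelated blocking call on the same thread fails immediately.

```java
public class LeakedInterruptFlagDemo {
    public static void main(String[] args) {
        // Stand-in for the failed HDFS access described above: the failure path
        // leaves the calling thread's interrupt flag set and never clears it.
        Thread.currentThread().interrupt();

        try {
            // Any later blocking call on the same thread now fails immediately,
            // even though it has nothing to do with the original HDFS failure.
            Thread.sleep(100);
        } catch (InterruptedException e) {
            System.out.println("Unrelated blocking call failed: " + e);
        }
    }
}
```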

What I'm doing:

Avoid the hdfs fs manager interrupting the thread by clearing the interrupted flag when an exception occurs (a simplified sketch is shown below).
Fixes #issue
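A simplified, self-contained sketch of this pattern; the helper names below are hypothetical stand-ins, and the real change is in HdfsFsManager, as shown in the review diff hunks further down.

```java
import java.io.IOException;

// Sketch only: getFileSystem/openFileSystem are hypothetical stand-ins, not the StarRocks API.
public class InterruptClearingSketch {
    static Object getFileSystem(String path) throws IOException {
        try {
            return openFileSystem(path); // pretend this is the real HDFS access
        } catch (Exception e) {
            // Thread.interrupted() returns AND clears the current thread's interrupt
            // status, so a failed HDFS call does not poison later blocking calls.
            Thread.interrupted();
            throw new IOException("getFileSystem failed for " + path, e);
        }
    }

    // Simulates an HDFS client failure that also interrupts the calling thread.
    static Object openFileSystem(String path) throws IOException {
        Thread.currentThread().interrupt();
        throw new IOException("cannot reach " + path);
    }

    public static void main(String[] args) {
        try {
            getFileSystem("hdfs://released-cluster/repo");
        } catch (IOException expected) {
            System.out.println("caught: " + expected.getMessage());
        }
        // The caller's thread is no longer marked interrupted.
        System.out.println("interrupt flag still set? " + Thread.currentThread().isInterrupted());
    }
}
```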

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

@@ -708,6 +710,7 @@ private HdfsFs getFileSystemByCloudConfiguration(CloudConfiguration cloudConfigu
             }
             return fileSystem;
         } catch (Exception e) {
+            Thread.interrupted();
Contributor


Can we try to unify the exception handling for all FileSystems?

Contributor Author


later

@@ -518,6 +518,7 @@ public HdfsFs getDistributedFileSystem(String scheme, String path, Map<String, S
             }
             return fileSystem;
         } catch (Exception e) {
+            Thread.interrupted();
Contributor


Is this the right way to handle the interruption event? Do we need to do something like a retry? Should we always clear the interrupted state even when no interruption actually happened? If we really need to swallow the interrupt exception, I think we should catch that exception type explicitly and comment the reason.

Contributor Author

@xiangguangyxg xiangguangyxg Jul 16, 2024


If an exception is caught, the interrupted flag it set should usually be reset so that it does not affect the caller thread; otherwise catching the exception is meaningless, and it would be better to just rethrow it and let the caller handle it.
The retry is done in the underlying code, so there is no need to retry here.
If the interrupted flag is left set, any subsequent blocking code on this thread will throw an exception (sleep, wait, read, write ...). Most of these blocking calls are already encapsulated, so it is impossible to catch the exceptions and retry in every piece of blocking code.

Contributor

@kevincai kevincai left a comment


I still doubt whether this is really caused by the thread interruption. It sounds as if, once a thread is interrupted due to a socket timeout, it cannot be reused for further IO operations unless the interruption is cleared manually. I have not heard of this in Java IO programming.

A second thought: even if it is true, we had better move the remote IO operation to a separate thread pool or thread while replaying the image, so that no matter what happens to that thread, it can be destroyed quickly and will not affect the main thread at all.
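For illustration, a rough sketch of this alternative under assumed names (IsolatedHdfsAccess and checkRepository are hypothetical; this is not what the PR implements): run the remote access on a dedicated worker thread so that an interrupt or a hung call only affects that worker.

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the suggested alternative: isolate remote HDFS IO on its own thread
// so the image-replaying thread is never interrupted or blocked by it.
public class IsolatedHdfsAccess {
    private static final ExecutorService HDFS_IO_POOL =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "hdfs-io-worker");
                t.setDaemon(true);
                return t;
            });

    public static String checkRepository(String path) throws IOException {
        Future<String> result = HDFS_IO_POOL.submit(() -> {
            // Placeholder for the real remote access; may block, fail, or be interrupted.
            return "ok: " + path;
        });
        try {
            return result.get(30, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // interrupts only the worker thread, not the caller
            throw new IOException("HDFS access timed out for " + path, e);
        } catch (ExecutionException e) {
            throw new IOException("HDFS access failed for " + path, e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's own interrupt
            throw new IOException("caller interrupted while waiting for HDFS access", e);
        }
    }
}
```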

@kevincai
Contributor

I took a second look at the JDK source code around ClosedByInterruptException. It means the thread was blocked on a blocking IO operation and was interrupted while waiting for it; in that case the underlying IO may still be in flight, may have failed, or whatever. It may not be safe to simply reset the interruption and allow IO operations to be redone on the same thread.

We should pursue a different fix rather than just resetting the interruption flag.
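For reference, this JDK behavior can be reproduced without HDFS. Below is a small standalone demo using a local FileChannel, which goes through the same AbstractInterruptibleChannel machinery: a pre-set interrupt flag causes the channel to be closed and the read to fail.

```java
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Demo of the JDK behavior discussed above, using a local FileChannel:
// if the calling thread's interrupt flag is set, AbstractInterruptibleChannel
// closes the channel and the read fails with ClosedByInterruptException.
public class ClosedByInterruptDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("demo", ".bin");
        Files.write(tmp, new byte[]{1, 2, 3, 4});

        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            Thread.currentThread().interrupt(); // simulate a leaked interrupt flag
            ch.read(ByteBuffer.allocate(4));    // fails even though the file is readable
        } catch (ClosedByInterruptException e) {
            System.out.println("read failed, channel was closed: " + e);
        } finally {
            Thread.interrupted(); // clear the flag so cleanup below is not affected
            Files.deleteIfExists(tmp);
        }
    }
}
```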

@xiangguangyxg
Contributor Author

xiangguangyxg commented Jul 17, 2024

> I took a second look at the JDK source code around ClosedByInterruptException. It means the thread was blocked on a blocking IO operation and was interrupted while waiting for it; in that case the underlying IO may still be in flight, may have failed, or whatever. It may not be safe to simply reset the interruption and allow IO operations to be redone on the same thread.
>
> We should pursue a different fix rather than just resetting the interruption flag.

> I still doubt whether this is really caused by the thread interruption. It sounds as if, once a thread is interrupted due to a socket timeout, it cannot be reused for further IO operations unless the interruption is cleared manually. I have not heard of this in Java IO programming.
>
> A second thought: even if it is true, we had better move the remote IO operation to a separate thread pool or thread while replaying the image, so that no matter what happens to that thread, it can be destroyed quickly and will not affect the main thread at all.

The interrupted flag should only be handled in the related blocking operations and should have no influence on unrelated blocking operations. I haven't figured out why it leaked.

@xiangguangyxg
Contributor Author

> I took a second look at the JDK source code around ClosedByInterruptException. It means the thread was blocked on a blocking IO operation and was interrupted while waiting for it; in that case the underlying IO may still be in flight, may have failed, or whatever. It may not be safe to simply reset the interruption and allow IO operations to be redone on the same thread.
>
> We should pursue a different fix rather than just resetting the interruption flag.

An uncleared interrupted flag is always a hidden danger, and any thread that calls the hdfs fs manager may be affected.

@kevincai
Contributor

There is a long discussion about this AbstractInterruptibleChannel behavior at this link: https://news.ycombinator.com/item?id=35125962

This issue should be fixed, but maybe in a different way. I am concerned about the unknown impacts of clearing the thread interruption.

@xiangguangyxg
Contributor Author

> There is a long discussion about this AbstractInterruptibleChannel behavior at this link: https://news.ycombinator.com/item?id=35125962
>
> This issue should be fixed, but maybe in a different way. I am concerned about the unknown impacts of clearing the thread interruption.

Any suggestions on how to fix the issue?

@kevincai
Contributor

> > There is a long discussion about this AbstractInterruptibleChannel behavior at this link: https://news.ycombinator.com/item?id=35125962
> > This issue should be fixed, but maybe in a different way. I am concerned about the unknown impacts of clearing the thread interruption.
>
> Any suggestions on how to fix the issue?

Still searching and studying, trying to get a full understanding of the AbstractInterruptibleChannel behavior.

@xiangguangyxg xiangguangyxg force-pushed the fix_no_existed_hdfs branch 2 times, most recently from d74231f to 04f369d on July 19, 2024 03:28

sonarcloud bot commented Jul 19, 2024


[FE Incremental Coverage Report]

fail : 0 / 52 (00.00%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 com/starrocks/fs/hdfs/HdfsFsManager.java 0 52 00.00% [1204, 1205, 1206, 1207, 1254, 1255, 1256, 1257, 1271, 1272, 1273, 1274, 1304, 1305, 1306, 1308, 1321, 1322, 1323, 1324, 1342, 1343, 1344, 1345, 1358, 1359, 1360, 1362, 1374, 1375, 1376, 1378, 1406, 1407, 1408, 1409, 1426, 1427, 1428, 1429, 1451, 1452, 1453, 1454, 1471, 1472, 1473, 1475, 1489, 1490, 1491, 1493]


[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

@wanpengfei-git wanpengfei-git merged commit 3fcafac into StarRocks:main Jul 22, 2024
46 of 47 checks passed
@github-actions github-actions bot removed the 3.0 label Jul 22, 2024

@Mergifyio backport branch-2.5

@github-actions github-actions bot removed the 2.5 label Jul 22, 2024
Contributor

mergify bot commented Jul 22, 2024

backport branch-3.3

✅ Backports have been created

Contributor

mergify bot commented Jul 22, 2024

backport branch-3.2

✅ Backports have been created

Contributor

mergify bot commented Jul 22, 2024

backport branch-3.1

✅ Backports have been created

Contributor

mergify bot commented Jul 22, 2024

backport branch-3.0

✅ Backports have been created

Contributor

mergify bot commented Jul 22, 2024

backport branch-2.5

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Jul 22, 2024
… occurs (#48403)

Signed-off-by: xiangguangyxg <[email protected]>
(cherry picked from commit 3fcafac)
mergify bot pushed a commit that referenced this pull request Jul 22, 2024
… occurs (#48403)

Signed-off-by: xiangguangyxg <[email protected]>
(cherry picked from commit 3fcafac)
mergify bot pushed a commit that referenced this pull request Jul 22, 2024
… occurs (#48403)

Signed-off-by: xiangguangyxg <[email protected]>
(cherry picked from commit 3fcafac)

# Conflicts:
#	fe/fe-core/src/main/java/com/starrocks/fs/hdfs/HdfsFsManager.java
mergify bot pushed a commit that referenced this pull request Jul 22, 2024
… occurs (#48403)

Signed-off-by: xiangguangyxg <[email protected]>
(cherry picked from commit 3fcafac)

# Conflicts:
#	fe/fe-core/src/main/java/com/starrocks/fs/hdfs/HdfsFsManager.java
mergify bot pushed a commit that referenced this pull request Jul 22, 2024
… occurs (#48403)

Signed-off-by: xiangguangyxg <[email protected]>
(cherry picked from commit 3fcafac)

# Conflicts:
#	fe/fe-core/src/main/java/com/starrocks/fs/hdfs/HdfsFsManager.java
wanpengfei-git pushed a commit that referenced this pull request Jul 23, 2024
wanpengfei-git pushed a commit that referenced this pull request Jul 23, 2024
wanpengfei-git pushed a commit that referenced this pull request Jul 23, 2024
… occurs (backport #48403) (#48697)

Signed-off-by: xiangguangyxg <[email protected]>
Co-authored-by: xiangguangyxg <[email protected]>
wanpengfei-git pushed a commit that referenced this pull request Jul 23, 2024
… occurs (backport #48403) (#48698)

Signed-off-by: xiangguangyxg <[email protected]>
Co-authored-by: xiangguangyxg <[email protected]>
wanpengfei-git pushed a commit that referenced this pull request Jul 23, 2024
… occurs (backport #48403) (#48699)

Signed-off-by: xiangguangyxg <[email protected]>
Co-authored-by: xiangguangyxg <[email protected]>
@xiangguangyxg xiangguangyxg deleted the fix_no_existed_hdfs branch July 23, 2024 02:35
dujijun007 pushed a commit to dujijun007/starrocks that referenced this pull request Jul 29, 2024