[rabit] Improved connection handling. #9531

Merged
merged 11 commits into from
Aug 30, 2023

@trivialfis trivialfis commented Aug 29, 2023

  • Enable timeout with connect.
  • Report connection error from the system.
  • Handle retry for both tracker connection and peer connection.
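For readers unfamiliar with the pattern, the three items above can be sketched with plain POSIX sockets: a non-blocking `connect` bounded by `poll`, the `errno` value handed back to the caller, and a small retry loop. This is only an illustration under stated assumptions; `ConnectWithTimeout` and `Retry` are hypothetical names and are not the code in this PR.

```cpp
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cerrno>
#include <cstdint>
#include <functional>

// Hypothetical helper, not taken from this PR. Returns 0 and stores the connected
// socket in *out_fd on success; otherwise returns the errno value from the system.
inline int ConnectWithTimeout(char const* ipv4, std::uint16_t port, int timeout_ms,
                              int* out_fd) {
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1) {
    return EINVAL;  // not a valid dotted-quad IPv4 address
  }

  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return errno;
  }
  // Non-blocking mode makes connect() return immediately so poll() can bound the wait.
  int flags = fcntl(fd, F_GETFL, 0);
  if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
    int err = errno;
    close(fd);
    return err;
  }

  int err = 0;
  if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
    if (errno != EINPROGRESS) {
      err = errno;  // immediate failure reported by the system
    } else {
      pollfd pfd{};
      pfd.fd = fd;
      pfd.events = POLLOUT;
      int rc = poll(&pfd, 1, timeout_ms);
      if (rc == 0) {
        err = ETIMEDOUT;  // nothing happened within the timeout
      } else if (rc < 0) {
        err = errno;
      } else {
        // The socket became writable; SO_ERROR holds the outcome of the connect.
        socklen_t len = sizeof(err);
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) != 0) {
          err = errno;
        }
      }
    }
  }
  if (err != 0) {
    close(fd);
    return err;
  }
  *out_fd = fd;  // note: the socket is still in non-blocking mode
  return 0;
}

// Calls `attempt` up to `max_attempts` times. Returns 0 on the first success,
// otherwise the error code from the last attempt.
inline int Retry(int max_attempts, std::function<int()> const& attempt) {
  int err = EINVAL;  // returned unchanged when max_attempts <= 0
  for (int i = 0; i < max_attempts; ++i) {
    err = attempt();
    if (err == 0) {
      break;
    }
  }
  return err;
}
```

The same `Retry` wrapper could serve both a tracker connection and a peer connection, and the returned code can be turned into a readable message with `std::strerror`.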

/**
 * @brief An error type that's easier to handle than throwing dmlc exception. We can
 *        record and propagate the system error code.
 */
struct Result {
Member Author

At some point we will need to propagate errors to Python and the other language bindings, so that error handling can be delegated to higher-level frameworks like Dask. For now, a functional style of error handling, where a function returns its error as a value, is easier to work with than exceptions.
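To make the functional style concrete, here is a minimal sketch of a result value that records a message together with a system error code and is returned instead of thrown. It is an assumption-laden illustration: `SketchResult`, `Success`, and `Fail` are hypothetical names, and this is not the `Result` struct added by this PR.

```cpp
#include <string>
#include <system_error>
#include <utility>

// Hypothetical sketch of a result-as-value error type; not the struct from this PR.
class SketchResult {
 public:
  // A default-constructed result means success.
  SketchResult() = default;
  // A failure carries a message and, optionally, the system error code.
  SketchResult(std::string msg, std::error_code ec)
      : ok_{false}, message_{std::move(msg)}, errc_{ec} {}

  bool OK() const { return ok_; }
  std::error_code Code() const { return errc_; }

  // Human-readable description that includes the system error text when present.
  std::string Report() const {
    if (ok_) {
      return "success";
    }
    std::string out = message_;
    if (errc_) {
      out += " (system error: " + errc_.message() + ")";
    }
    return out;
  }

 private:
  bool ok_{true};
  std::string message_;
  std::error_code errc_{};
};

inline SketchResult Success() { return SketchResult{}; }

inline SketchResult Fail(std::string msg, std::error_code ec = std::error_code{}) {
  return SketchResult{std::move(msg), ec};
}
```

A caller checks `OK()` and forwards the value upward unchanged, so a language binding at the top of the call stack can decide how to surface the recorded code without any exception crossing the C API boundary.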

@trivialfis
Member Author

cc @rongou.

@trivialfis trivialfis merged commit ccfc90e into dmlc:master Aug 30, 2023
25 checks passed
@trivialfis trivialfis deleted the rabit-tracker-connect-timeout branch August 30, 2023 05:00