Block on Proof Courier Service Connection Attempt #1203

Open · wants to merge 5 commits into main
Conversation

@ffranr (Contributor) commented Nov 19, 2024

This PR improves the reliability of the proof courier service and the robustness of the proof transfer process:

  • Blocking courier service connection: connection attempts now block until the connection is established, preventing failures caused by using a connection before it is ready and simplifying debugging (see the sketch below).
  • Proof transfer resilience:
    • Re-attempts proof transfers once backoff attempts are exhausted, so a stalled transfer no longer waits for the next service restart.
    • Refines logging and error messages to aid debugging.

These updates strengthen the courier service's reliability and make proof transfers more fault-tolerant.
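
For illustration, here is a minimal sketch of a blocking connection attempt with a per-attempt timeout, assuming a gRPC courier endpoint; the function name and transport settings are stand-ins, not the PR's actual code:

```go
package sketch

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialCourier blocks until the courier connection is fully established
// or the timeout elapses, so callers never see a half-ready connection.
func dialCourier(addr string, timeout time.Duration) (*grpc.ClientConn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// grpc.WithBlock makes DialContext wait for the connection to
	// become ready instead of returning immediately.
	conn, err := grpc.DialContext(
		ctx, addr,
		grpc.WithBlock(),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		return nil, fmt.Errorf("courier dial failed: %w", err)
	}

	return conn, nil
}
```

With a blocking dial, a bad courier endpoint surfaces as an immediate, attributable error rather than a later failure on first use.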

This commit ensures that the proof transfer ChainPorter state is
re-executed once proof transfer backoff attempts have been
exhausted. In the absence of this commit, the next opportunity for
re-attempting proof transfer would be when tapd restarts (pending
parcels are processed on startup).
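
A hedged sketch of that re-execution idea, with hypothetical helper names and delay (not the actual ChainPorter state machine):

```go
package sketch

import (
	"context"
	"log"
	"time"
)

// resendProof re-enters the transfer step after the backoff procedure
// gives up, instead of leaving the parcel pending until restart.
// deliverWithBackoff and retryDelay are hypothetical stand-ins.
func resendProof(ctx context.Context,
	deliverWithBackoff func(context.Context) error) error {

	const retryDelay = 30 * time.Second // hypothetical pause

	for {
		err := deliverWithBackoff(ctx)
		if err == nil {
			return nil
		}

		log.Printf("proof transfer backoff attempts exhausted, "+
			"re-attempting: %v", err)

		select {
		case <-time.After(retryDelay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```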
@ffranr self-assigned this Nov 19, 2024
@coveralls commented Nov 19, 2024

Pull Request Test Coverage Report for Build 11953640765

Details:

  • 0 of 60 (0.0%) changed or added relevant lines in 5 files are covered.
  • 15 unchanged lines in 6 files lost coverage.
  • Overall coverage decreased (-0.005%) to 41.231%

Changes missing coverage (covered lines / changed or added lines / %):

  tapcfg/config.go               0 / 1    0.0%
  itest/tapd_harness.go          0 / 2    0.0%
  itest/test_harness.go          0 / 2    0.0%
  tapfreighter/chain_porter.go   0 / 23   0.0%
  proof/courier.go               0 / 32   0.0%

Files with coverage reduction (new missed lines / %):

  itest/test_harness.go          1   0.0%
  proof/courier.go               1   9.33%
  tapfreighter/chain_porter.go   1   0.0%
  tapchannel/aux_leaf_signer.go  2   35.92%
  commitment/tap.go              2   84.43%
  universe/interface.go          8   53.68%

Totals:

  Change from base Build 11953454712: -0.005%
  Covered lines: 25553
  Relevant lines: 61975

💛 - Coveralls

@guggero (Member) left a comment

Changes to the logic make sense, thanks for the added robustness!
Some questions around some of the chosen values, since the integration tests take quite a while longer now.

Resolved review threads (outdated): proof/courier.go, tapcfg/config.go, itest/tapd_harness.go (two threads).
The change ensures that the courier service connection attempt is
blocking rather than synchronous. This prevents proof transfers from
failing due to attempts to use connections before they are fully
established, simplifying debugging.

Both the connection and transfer steps are part of the backoff
procedure, so failures in either step will trigger re-attempts.
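
Conceptually, the backoff procedure now wraps both steps; a sketch with illustrative names, attempt count, and intervals (not the actual backoff handler):

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// execWithBackoff retries fn with a doubling wait between attempts.
// Because fn wraps both the connection attempt and the proof transfer,
// a failure in either step triggers the next re-attempt.
func execWithBackoff(ctx context.Context,
	fn func(context.Context) error) error {

	const maxAttempts = 5
	wait := time.Second

	var lastErr error
	for i := 0; i < maxAttempts; i++ {
		if lastErr = fn(ctx); lastErr == nil {
			return nil
		}

		select {
		case <-time.After(wait):
			wait *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}

	return fmt.Errorf("backoff attempts exhausted: %w", lastErr)
}
```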
This commit adds a default value for the proof courier service request timeout introduced in the previous commit.

Set the request timeout for the tapd harness universe courier service to a value that lets tests pass consistently.
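
As a rough illustration of the default-plus-override pattern these two commits describe (the constant name and values below are assumptions, not the actual tapd defaults):

```go
package sketch

import "time"

// DefaultServiceRequestTimeout is an illustrative default for the
// courier service request timeout; the real default may differ.
const DefaultServiceRequestTimeout = 5 * time.Second

// CourierCfg stands in for the config struct carrying the timeout.
type CourierCfg struct {
	ServiceRequestTimeout time.Duration
}

// applyDefaults fills in the timeout when left unset; a test harness
// can set a larger value before the courier is constructed.
func applyDefaults(cfg *CourierCfg) {
	if cfg.ServiceRequestTimeout == 0 {
		cfg.ServiceRequestTimeout = DefaultServiceRequestTimeout
	}
}
```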
@guggero (Member) left a comment

Nice, LGTM 🎉

Current CI failures are known flakes or fixed in another PR (LiT itest).

@Roasbeef (Member) commented

IIRC, the issue last time was that an address contained an invalid courier addr, so the proof could never be delivered, meaning the send was never completed.

Does this change resolve that? IIUC now, we'll just block when trying to connect, but after the send has already been confirmed on chain?

@ffranr (Contributor, Author) commented Nov 22, 2024

> IIRC, the issue last time was that an address contained an invalid courier addr, so the proof could never be delivered, meaning the send was never completed.
>
> Does this change resolve that? IIUC now, we'll just block when trying to connect, but after the send has already been confirmed on chain?

@Roasbeef The PR does not address the invalid courier address problem. It ensures that we don't try to use a proof courier service connection before it's ready, which avoids unnecessary backoff attempts and log messages.

@jharveyb (Collaborator) left a comment

Looks mostly good! Just one note.

Also, for the third commit, the comment should say async, not sync?

Review thread on proof/courier.go (resolved):
// ServiceRequestTimeout defines the maximum duration we'll wait for
// a courier service to handle our outgoing request during a connection
// attempt, or when delivering or retrieving a proof.
ServiceRequestTimeout time.Duration `long:"servicerequesttimeout" description:"The maximum duration we'll wait for a courier service to handle our outgoing request during a connection attempt, or when delivering or retrieving a proof."`
@jharveyb (Collaborator) commented:

Shouldn't the hashmail courier / every courier implementation have a similar timeout? And then use it on initialization.

For hashmail I think that would be in NewHashMailBox, you could add the same child ctx object as you're doing for the Universe courier.

@ffranr (Contributor, Author) replied:

Yes, you're right! Nice catch. Hashmail is also blocking now, so it would be good to have that timeout.
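
A hedged sketch of the suggestion, reusing the blocking-dial pattern from the earlier sketch; the function and its signature are illustrative stand-ins, not the real NewHashMailBox:

```go
package sketch

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialHashMail derives a child context bounded by the service request
// timeout and uses it for the blocking dial, mirroring the universe
// courier. All names here are illustrative.
func dialHashMail(ctx context.Context, addr string,
	requestTimeout time.Duration) (*grpc.ClientConn, error) {

	dialCtx, cancel := context.WithTimeout(ctx, requestTimeout)
	defer cancel()

	// With grpc.WithBlock, the dial fails fast once the child
	// context's timeout elapses.
	return grpc.DialContext(
		dialCtx, addr,
		grpc.WithBlock(),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
}
```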

@lightninglabs-deploy commented

@ffranr, remember to re-request review from reviewers when ready

Labels: none yet
Project status: 🆕 New
6 participants