Add "req-res protocol reliability" spec #18

Merged

merged 39 commits on Jul 15, 2024

Changes from 6 commits (39 commits total)
1383a91
init spec
weboko May 27, 2024
3edb500
add abstract and motivation
weboko Jun 3, 2024
7371dab
add draft suggestions
weboko Jun 3, 2024
8582646
add node health from status-go#4628
weboko Jun 24, 2024
d0827cf
finish draft
weboko Jun 26, 2024
3249e7d
up considerations:
weboko Jun 27, 2024
057c4f2
-m
weboko Jul 1, 2024
35fd44f
remove wonted word
weboko Jul 1, 2024
e5ba255
typo
weboko Jul 1, 2024
f2802e7
typo
weboko Jul 1, 2024
174dffb
typo
weboko Jul 1, 2024
d5abdb3
address small comments
weboko Jul 1, 2024
ba103d2
improve abstract and motivation, add definitions
weboko Jul 2, 2024
9a4bb3d
remove relay
weboko Jul 2, 2024
c23dfe9
update node health section
weboko Jul 2, 2024
ef1c08a
define health section better
weboko Jul 2, 2024
988a022
improve peer and connection managment section
weboko Jul 2, 2024
9eb63db
improve peer and connection managment section
weboko Jul 2, 2024
594a1ec
improve service node pool sub section
weboko Jul 2, 2024
6a14926
finalazi connection managment section
weboko Jul 3, 2024
1f433b9
up security section
weboko Jul 3, 2024
7e030f8
remove ambiguity
weboko Jul 3, 2024
7ba0897
update light push section
weboko Jul 7, 2024
f2b4a77
update light push section
weboko Jul 7, 2024
67462e7
define failure of service node
weboko Jul 9, 2024
b264454
initial update to Filter
weboko Jul 9, 2024
092b18e
add failure definition
weboko Jul 9, 2024
59e225d
add regular pings to Filter
weboko Jul 9, 2024
fb3e69b
remove dupe with peer managment
weboko Jul 9, 2024
dc4d40f
finish filter section and other comments
weboko Jul 9, 2024
6296d12
service nit
weboko Jul 9, 2024
d8dcb53
add service node motivation
weboko Jul 9, 2024
642a539
add suggestion to health section
weboko Jul 9, 2024
dca926a
add nits
weboko Jul 9, 2024
0669283
improve failure definitions
weboko Jul 9, 2024
687d7e0
butch fix
weboko Jul 13, 2024
6436cdc
Merge branch 'master' of github.com:waku-org/specs into weboko/reliab…
weboko Jul 14, 2024
6a80762
update index and move to application
weboko Jul 14, 2024
5ad85b7
up
weboko Jul 14, 2024
62 changes: 62 additions & 0 deletions informational/req-res-reliability.md
@@ -0,0 +1,62 @@
---
title: REQ-RES-RELIABILITY
name: Request-response protocols reliability
category: Best Current Practice
tags: [informational]
editor: Oleksandr Kozlov <[email protected]>
contributors:
- Prem Chaitanya Prathi <[email protected]>
- Danish Arora <[email protected]>
---

## Abstract
This RFC describes a set of instructions used across different [WAKU2](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/10/waku2.md?plain=1#L3) implementations to improve reliability of request-response protocols such as [WAKU2-LIGHTPUSH](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/19/lightpush.md?plain=1#L3C11-L3C26) and [WAKU2-FILTER](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md?plain=1#L3).

## Motivation

The specifications of the mentioned protocols do not cover some real-world use cases that are often encountered in unreliable network environments, such as: recovering from an offline state, decreasing the rate of missed messages, and increasing the probability that messages are broadcast within the network.

## Suggestions

### Node health

Node health is a useful metric for a node to define and implement in order to determine the quality of service it can provide. The following levels are suggested (a minimal sketch of the mapping follows the list):
- unhealthy: no peer connections are available, regardless of protocol;


I am rethinking the health of a light client; maybe peer connections are not the right measure.
Rather, whether we have active subscriptions for all filters and whether we have peers we can send messages to via lightpush. I will be working this week on how we define health in the case of a light client. Maybe we can update this based on the work that goes into it.

Contributor Author

@weboko weboko Jul 2, 2024


let's work together on that this week then; if it takes longer we can always follow up with a PR, especially considering it is something that we already have

relevant comment from @jm-clius - #18 (comment)


@chaitanyaprem chaitanyaprem Jul 3, 2024


Sure, I have written down the approach for light client topic health here.
Please go through and leave your comments.

- minimally healthy:
  - Relay has fewer than 4 peers connected;
  - Filter and LightPush each have one peer connection available;
- sufficiently healthy:
  - Relay has at least 4 peers connected;
  - Filter has more than 1 connection and LightPush has at least 2 connections available;
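
A minimal, non-normative sketch (TypeScript) of how an implementation might map per-protocol connection counts onto these levels; the `HealthStatus` enum and `ConnectionCounts` shape are illustrative only and not part of any existing Waku API:

```typescript
// Illustrative health computation following the levels above.
enum HealthStatus {
  Unhealthy = "unhealthy",
  MinimallyHealthy = "minimally healthy",
  SufficientlyHealthy = "sufficiently healthy",
}

interface ConnectionCounts {
  relay: number;      // peers connected over Relay
  filter: number;     // service nodes with an active Filter subscription
  lightPush: number;  // service nodes usable for LightPush
}

function getHealthStatus({ relay, filter, lightPush }: ConnectionCounts): HealthStatus {
  if (relay === 0 && filter === 0 && lightPush === 0) {
    return HealthStatus.Unhealthy; // no peer connections regardless of protocol
  }
  if (relay >= 4 && filter > 1 && lightPush >= 2) {
    return HealthStatus.SufficientlyHealthy;
  }
  // Anything in between is treated as minimally healthy in this sketch.
  return HealthStatus.MinimallyHealthy;
}
```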

### Peers and connection management

- Each request-response protocol should retain a pool of reliable peers. If a protocol fails to use a peer more than once, the connection to that peer should be dropped and a new peer should be added to the pool instead (see the sketch at the end of this section).
Contributor


Clarify that this is client-side for req-resp protocols. "Peers" then become "service nodes" (req-resp is not really p2p).
"protocol failed to use" seems a bit vague to me. We're addressing service node reliability specifically. In other words "failure" would more precisely mean that the client node failed to either get a timely response from the service node or could determine that the provided service was not satisfactory, e.g. published message was altered, filter node missed some messages, etc. We can leave precise definition of "bad service" to the individual protocol specs.
"more than once" is too prescriptive. Use precise language to say client node can remove service nodes from the pool after n failures (per definition above). We can then continue to say that we recommend a very strict approach of removing service nodes after even 1 failure. (Do we really recommend this, though? If we fail to connect to a service node, should we completely remove it from the pool even if it might just be a temporary connection failure?)

Last point, specs should be as clear as possible re the agents involved in each interaction. In this case, for example, it's not really the "protocols" retaining the pool of peers, but the requesting node maintaining a pool of peers for each req-resp protocol.

Contributor Author


addressed here - 594a1ec

We can leave precise definition of "bad service" to the individual protocol specs.

do you see it being a follow-up to this PR, or, perhaps, should I add it as a separate section that more precisely defines "fails to serve protocol request" (as I defined it rn)?

We can then continue to say that we recommend a very strict approach of removing service nodes after even 1 failure. (Do we really recommend this, though? If we fail to connect to a service node, should we completely remove it from the pool even if it might just be a temporary connection failure?)

I think I made this change before we made the actual code change, so that's why it was "more than once"; I am changing it to "fails to serve protocol request from a light node 3 times". I am not sure if it is the best framing though


- During discovery of new peers, it is better to filter out unwanted peers based on their ENR / multiaddress. For example, in some cases `circuit-relay` addresses are not needed when we try to find and connect to peers directly.

- When a peer is discovered a second time, its connection information in the Peer Store must be kept up to date.
Contributor


What do you mean here? We discover an updated address for the same service node? In this case, I would rephrase to be explicit about this case (service node addresses should be kept up to date).

Contributor Author


added a shorter explanation but also kept the previous sentence to give additional context. wdyt @jm-clius ?
6a14926
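
A minimal sketch (TypeScript, illustrative names only) of a per-protocol pool of service nodes that drops a peer after a configurable number of failures and replenishes itself from discovery; `discoverPeer`, `targetSize` and `maxFailures` are hypothetical parameters, not an existing Waku API:

```typescript
// Illustrative per-protocol pool of service nodes.
class ServiceNodePool {
  private readonly peers = new Set<string>();
  private readonly failures = new Map<string, number>();

  constructor(
    private readonly targetSize: number,
    private readonly maxFailures: number,
    private readonly discoverPeer: () => Promise<string>,
  ) {}

  // Top the pool up with freshly discovered service nodes.
  async maintain(): Promise<void> {
    while (this.peers.size < this.targetSize) {
      this.peers.add(await this.discoverPeer());
    }
  }

  // Called when a request to a service node fails (timeout, bad service, ...).
  async reportFailure(peerId: string): Promise<void> {
    const count = (this.failures.get(peerId) ?? 0) + 1;
    this.failures.set(peerId, count);
    if (count >= this.maxFailures) {
      // Drop the unreliable service node and replace it with a new one.
      this.peers.delete(peerId);
      this.failures.delete(peerId);
      await this.maintain();
    }
  }

  // A successful request resets the failure counter.
  reportSuccess(peerId: string): void {
    this.failures.delete(peerId);
  }
}
```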


### Light Push

- While sending a message with Light Push, it is advised to use more than 1 peer in order to increase the chances of the message being delivered.

- If sending the message fails for all of the peers, the node should try to re-send the message after some interval and continue doing so until an OK response is received (see the sketch below).
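
A minimal sketch (TypeScript) of both points: send through several service nodes at once and retry until at least one OK response is received; `sendViaPeer`, the peer list and the retry interval are hypothetical stand-ins for implementation-specific APIs:

```typescript
// Illustrative redundant Light Push send with retry until an OK response.
async function reliableLightPush(
  payload: Uint8Array,
  peers: string[],
  sendViaPeer: (peer: string, payload: Uint8Array) => Promise<boolean>, // true on OK response
  retryIntervalMs = 5_000,
): Promise<void> {
  for (;;) {
    // Send via more than one peer to increase the chances of delivery.
    const results = await Promise.allSettled(peers.map((p) => sendViaPeer(p, payload)));
    const anyOk = results.some((r) => r.status === "fulfilled" && r.value);
    if (anyOk) return;

    // All peers failed: wait and retry until an OK response is received.
    await new Promise((resolve) => setTimeout(resolve, retryIntervalMs));
  }
}
```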

### Filter

- To decrease the chances of missing messages, a node can initiate more than one Filter subscription to the same content topic and filter out duplicates (as sketched below). This will increase bandwidth consumption, and the cost depends on the amount of traffic exchanged under the content topic in use.
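
A minimal sketch (TypeScript) of deduplicating messages received over redundant Filter subscriptions; `hashMessage` and the callback wiring are hypothetical:

```typescript
// Illustrative dedup layer over redundant subscriptions to the same content topic.
// hashMessage is assumed to return a stable identifier, e.g. a message hash.
function createDeduplicator<T>(
  hashMessage: (msg: T) => string,
  onUnique: (msg: T) => void,
): (msg: T) => void {
  const seen = new Set<string>(); // in practice this set should be bounded/evicted
  return (msg: T): void => {
    const id = hashMessage(msg);
    if (seen.has(id)) return; // duplicate delivered by another subscription
    seen.add(id);
    onUnique(msg);
  };
}
```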

- In case a node goes offline while having an active subscription, it is important to ping the service node again right after the node comes back online. If the ping fails, a re-subscribe request should be sent to a new peer (a sketch follows the discussion below).


Interesting approach, but I have seen that it is practically not possible, because detecting that a node went offline takes ~20-30 seconds, during which time the peer ping might end up failing and causing subscriptions to get removed.
Also, note that Filter-ping doesn't have enough information to know which subscriptions are currently active with a peer; it only indicates that there is a subscription with it.
Currently in status-go/go-waku it is implemented so that as soon as offline status is identified, all subscriptions are cleared off. When the node comes back online, new subscriptions are created.

At the same time, I am curious to know: if we have taken the approach mentioned here in js-waku, what was the efficacy with node connectivity, and did we not run into any issues?

Contributor Author


detecting a node went offline takes time ~20-30seconds

that is not the case for browser, we can identify that it is offline almost instantly

When node comes back online new subscriptions are created.

that is interesting and this is what can kinda happen, as the described approach in the worst case will re-initiate all the subscriptions

did we not run into any issues

it is still experimental, but in my local repros I observe the issue where a subscription is dropped by a service node, which happens somewhere after 10-15 mins.
Considering that, we might just not do a ping if a node was offline for more than 10 minutes, but the problem here is that it is not controllable behavior, meaning there is no CLI flag specified and it is not described in the Filter RFC, so it is totally up to an implementation.


to that, I would define how to use store to recover.
Probably something like that (terminology can be improved):

  • timeout: the underlying protocol timeout. 30s for TCP, 1s/0s for websocket ("instant"). This caps how fast one can detect a disconnection
  • interval: time between filter pings
  • t0: successful ping
  • t1 (t0+interval): failed ping
  • t2: successful ping

At t2:

  • recreate subscription
  • trigger store v3 query to get msg id of missed messages.
    • either use last received message with timestamp < t0 as cursor (likely best practice)
    • or do a time based query: start: t0, end: t2
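
A rough, non-normative sketch (TypeScript) of the recovery step at t2 described above; `recreateSubscription` and `queryStoreSince` are hypothetical stand-ins for implementation-specific subscription and Store query APIs:

```typescript
// Illustrative recovery at t2 (first successful ping after a failed one).
async function recoverAfterFailedPing(
  recreateSubscription: () => Promise<void>,
  queryStoreSince: (query: { cursor?: string; startTime?: number; endTime?: number }) => Promise<void>,
  lastMessageBeforeT0: { hash: string } | undefined, // newest message received before t0
  t0: number,
  t2: number,
): Promise<void> {
  await recreateSubscription();
  if (lastMessageBeforeT0) {
    // Likely best practice: use the last received message before t0 as a cursor.
    await queryStoreSince({ cursor: lastMessageBeforeT0.hash });
  } else {
    // Fallback: time-based query covering the window of the outage.
    await queryStoreSince({ startTime: t0, endTime: t2 });
  }
}
```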


I am curious to know how long it takes for websocket to detect a timeout, because iirc websockets use TCP as the underlying transport, same as what is used in other waku implementations as well.
One possibility could be that websocket has some sort of very regular heartbeat which makes it detect quickly.


it is still experimental but on my local repros I observe the issue when subscription is dropped by a service node, that is happening somewhere after 10-15 mins.

The filter timeout has been set in nwaku to be 5 minutes, so it should get dropped after 5 minutes +/- some buffer time.

Contributor Author


@fryorcraken I would drop it out of this version of RFC
#18 (comment)

this sounds like a more complex feature that should be useful but we don't have any precedent

what if I or someone from the team implements it in js-waku as part of improving Filter and after it we can add it to this RFC?
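
A minimal sketch (TypeScript) of the re-subscribe behaviour suggested in the bullet above: ping the service node as soon as the client is back online, and re-subscribe to a new peer if the ping fails. `pingPeer`, `pickNewPeer` and `subscribe` are hypothetical helpers; in a browser this could be wired to the `online` event:

```typescript
// Illustrative online-recovery handler for a light node.
async function onBackOnline(
  currentPeer: string,
  contentTopics: string[],
  pingPeer: (peer: string) => Promise<boolean>,
  pickNewPeer: () => Promise<string>,
  subscribe: (peer: string, contentTopics: string[]) => Promise<void>,
): Promise<void> {
  // Ping right after connectivity is restored to check the subscription is still alive.
  const alive = await pingPeer(currentPeer);
  if (alive) return;

  // Ping failed: move the subscription to a fresh service node.
  const newPeer = await pickNewPeer();
  await subscribe(newPeer, contentTopics);
}
```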


- While registering Filter subscriptions, it is advised to batch requests for multiple content topics into one request in order to reduce the number of queries sent to a node (see the sketch at the end of this section).

Contributor


Another possible mechanism we should describe is regular "refresh" of subscriptions to ensure active subscriptions are still fully synchronised. Think of it as a much more occasional, more expensive "ping". Described here: https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscribe
We do not have to recommend using this mechanism, as it's quite expensive and I'm not sure we'll gain much by implementing it. However, it is a mechanism affecting reliability and I think it's useful to have the entire toolbox briefly described.

Contributor Author


I think it is partially covered with:

Right after such replace light node must create new subscription to newly connected service node as described in [Filter specification](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md).

but there is still room for improvement if the following is true: can a subscription on a service node degrade over time / be dropped despite regular pings? If so, is it fixable by just re-creating the subscription with an init query?

so, like service node is fine, just code responsible for Filter needs a bit of turn off and on

do you think it is viable, @jm-clius ?

- When creating a new subscription, it can be beneficial to use only new peers with which no subscriptions are present yet, and to avoid peers with which Filter has already failed (see the sketch below).
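
A minimal sketch (TypeScript) combining the batching and peer-selection points above: batch all content topics into a single subscribe request, and prefer peers that do not already hold a subscription and have not previously failed for Filter; all helpers are illustrative:

```typescript
// Illustrative peer selection and batched subscribe for Filter.
async function subscribeBatched(
  contentTopics: string[],
  knownPeers: string[],
  peersWithSubscriptions: Set<string>,
  failedFilterPeers: Set<string>,
  subscribe: (peer: string, contentTopics: string[]) => Promise<void>,
): Promise<string> {
  // Prefer a peer with no existing subscription and no prior Filter failures.
  const candidate =
    knownPeers.find((p) => !peersWithSubscriptions.has(p) && !failedFilterPeers.has(p)) ??
    knownPeers.find((p) => !failedFilterPeers.has(p));
  if (!candidate) throw new Error("no suitable Filter peer available");

  // One request covering all content topics instead of one request per topic.
  await subscribe(candidate, contentTopics);
  return candidate;
}
```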

## Security/Privacy Considerations

None of the mentioned recommendations incur privacy or security tradeoffs, and some of them increase k-anonymity (e.g. having unique peers for Filter subscriptions).

## Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).