
Add "req-res protocol reliability" spec #18

Merged
merged 39 commits from weboko/reliability into master
Jul 15, 2024

Conversation

@weboko weboko (Contributor) commented May 27, 2024

This PR adds an initial version of the reliability spec for request-response protocols.

Here you can find a summary of ongoing and already finished efforts from status-go and js-waku.

@weboko weboko marked this pull request as ready for review June 27, 2024 08:43
@jm-clius jm-clius (Contributor) left a comment


All in all a great start!

I've added some comments below. In general:

  • we have to define from which POV we're defining reliability and be explicit about it throughout - in this case the POV of the requester within req-resp protocols. We can use "client", "requester", or both, as long as we define these terms at the beginning and use them consistently throughout.
  • we should be clearer about the tradeoffs affecting decisions such as the number of connections to service nodes, failure thresholds, etc.
  • one section that's missing is a possible interaction between lightpush and filter. For example, if I have an active filter subscription to a service node different from the service node I use for lightpush, couldn't I simply verify that I receive my own published messages, thereby reducing the need for redundant publishing? There are some assumptions of course, such as overlapping subscriptions, but I think this is a valid reliability strategy.


- To decrease the chance of missing messages, a node can initiate more than one subscription through the Filter protocol to the same content topic and filter out duplicates. This will increase bandwidth consumption, and its cost depends on the information exchanged under the content topic in use.

- If a node goes offline while having an active subscription, it is important to ping again right after the node comes back online. If the ping fails, a re-subscribe request should be fired to a new peer.
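
A minimal TypeScript sketch of the re-ping / re-subscribe behaviour described in the bullet above; the `FilterClient` interface and helper names are assumptions for illustration, not a concrete js-waku API:

```typescript
// Hypothetical types and helpers; not part of any concrete Waku library API.
type PeerId = string;

interface FilterClient {
  ping(peer: PeerId): Promise<boolean>;                     // true if the subscription is still alive
  subscribe(peer: PeerId, contentTopics: string[]): Promise<void>;
}

// Called when the light node detects it is back online
// (e.g. via the browser `online` event).
async function onBackOnline(
  filter: FilterClient,
  currentPeer: PeerId,
  selectNewPeer: () => Promise<PeerId>,
  contentTopics: string[],
): Promise<PeerId> {
  const alive = await filter.ping(currentPeer);
  if (alive) {
    return currentPeer; // subscription survived the offline period
  }
  // Ping failed: move the subscription to a fresh service node.
  const newPeer = await selectNewPeer();
  await filter.subscribe(newPeer, contentTopics);
  return newPeer;
}
```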


Interesting approach, but I have seen that it is practically not possible, because detecting that a node went offline takes ~20-30 seconds, during which time the peer ping might fail and cause subscriptions to be removed.
Also, note that a Filter ping doesn't carry enough information to know which subscriptions are currently active with a peer; it only indicates that there is a subscription with it.
Currently in status-go/go-waku, as soon as offline status is identified, all subscriptions are cleared. When the node comes back online, new subscriptions are created.

At the same time, I am curious to know: if we have taken the approach mentioned here in js-waku, what was the efficacy with node connectivity, and did we run into any issues?

Contributor Author

detecting that a node went offline takes ~20-30 seconds

that is not the case for the browser; we can identify that it is offline almost instantly

When the node comes back online, new subscriptions are created.

that is interesting, and this is roughly what can happen, as the described approach in the worst case re-initiates all the subscriptions

did we run into any issues

it is still experimental, but in my local repros I observe the issue when a subscription is dropped by a service node, which happens somewhere after 10-15 minutes.
I am considering that we might simply skip the ping if a node was offline for more than 10 minutes, but the problem is that this behavior is not controllable: there is no CLI flag for it and it is not described in the Filter RFC, so it is entirely up to the implementation.


to that, I would define how to use Store to recover.
Probably something like this (terminology can be improved):

  • timeout: the underlying protocol timeout; 30s for TCP, 1s/0s for websocket ("instant"). This caps how fast one can detect a disconnection
  • interval: time between Filter pings
  • t0: successful ping
  • t1 (t0 + interval): failed ping
  • t2: successful ping

At t2:

  • recreate the subscription
  • trigger a Store v3 query to get the message IDs of missed messages:
    • either use the last received message with timestamp < t0 as cursor (likely best practice)
    • or do a time-based query: start: t0, end: t2
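
A rough sketch of the recovery step at t2 under the assumptions above; the `StoreClient` interface and its method names are illustrative, not the actual Store v3 API (re-creating the subscription itself would follow the earlier sketch):

```typescript
// Illustrative types only; field and method names are assumptions.
interface StoredMessage {
  hash: string;
  timestampNs: bigint;
}

interface StoreClient {
  queryByTime(opts: { startNs: bigint; endNs: bigint }): Promise<StoredMessage[]>;
  queryFromCursor(cursorHash: string): Promise<StoredMessage[]>;
}

// Recover messages possibly missed between the last successful ping (t0)
// and the ping that succeeded again (t2).
async function recoverAfterReconnect(
  store: StoreClient,
  lastReceivedBeforeT0: StoredMessage | undefined,
  t0Ns: bigint,
  t2Ns: bigint,
): Promise<StoredMessage[]> {
  if (lastReceivedBeforeT0) {
    // Preferred: cursor-based query from the last message received before t0.
    return store.queryFromCursor(lastReceivedBeforeT0.hash);
  }
  // Fallback: time-based query covering the whole offline window.
  return store.queryByTime({ startNs: t0Ns, endNs: t2Ns });
}
```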


I am curious to know how long it takes for a websocket to detect a timeout, because IIRC websockets use TCP as the underlying transport, the same as the other Waku implementations.
One possibility is that the websocket has some sort of very regular heartbeat which lets it detect disconnections quickly.


it is still experimental, but in my local repros I observe the issue when a subscription is dropped by a service node, which happens somewhere after 10-15 minutes.

The Filter timeout has been set in nwaku to 5 minutes, so the subscription should get dropped after 5 minutes +/- some buffer time.

Contributor Author

@fryorcraken I would drop it from this version of the RFC
#18 (comment)

this sounds like a more complex feature that should be useful, but we don't have any precedent for it

what if I or someone from the team implements it in js-waku as part of improving Filter, and after that we add it to this RFC?

### Node health

As a useful metric to define and implement for determining the quality of service provided by a node:
- unhealthy - no peer connections are available, regardless of protocol;


I am rethinking the health of a light client; maybe peer connections are not the right measure.
Rather, whether we have active subscriptions for all filters, and whether we have peers we can send messages to via LightPush. I will be working this week on how we define health for a light client. Maybe we can update this based on the work that goes into it.
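
A minimal sketch of the light-client health definition proposed here (active Filter subscriptions plus usable LightPush peers); the level names and thresholds are assumptions for illustration:

```typescript
// Level names and thresholds below are illustrative assumptions.
type LightNodeHealth = "unhealthy" | "minimally-healthy" | "sufficiently-healthy";

interface LightNodeState {
  activeFilterSubscriptions: number; // subscriptions confirmed alive (e.g. via ping)
  totalFilterSubscriptions: number;  // subscriptions the node wants to keep
  lightpushPeers: number;            // peers currently usable for LightPush
}

function assessHealth(state: LightNodeState): LightNodeHealth {
  const canSend = state.lightpushPeers > 0;
  const canReceive = state.activeFilterSubscriptions > 0;
  if (!canSend && !canReceive) return "unhealthy";
  const allSubscriptionsUp =
    state.activeFilterSubscriptions === state.totalFilterSubscriptions;
  return canSend && canReceive && allSubscriptionsUp
    ? "sufficiently-healthy"
    : "minimally-healthy";
}
```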

@weboko weboko (Contributor Author) Jul 2, 2024

let's work together on that this week then; if it takes longer, we can always follow up with a PR, especially considering it is something we already have

relevant comment from @jm-clius - #18 (comment)

@chaitanyaprem chaitanyaprem Jul 3, 2024

Sure, I have written down the approach for light-client topic health here.
Please go through it and leave your comments.


## Abstract
This RFC describes a set of instructions used across different [WAKU2](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/10/waku2.md) implementations to improve reliability when a light node uses request-response protocols:
- [WAKU2-LIGHTPUSH](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/19/lightpush.md) - used for sending messages;


Please link to files in this repo using relative paths so they render correctly on the website, on GitHub, and locally.

Contributor Author

as I mentioned in #18 (comment), some files do not exist in this repo

do we plan on moving them at any point? cc @jm-clius

until that is done, I see the only way is to use persistent GitHub links to the vacp2p repo

Contributor

Yes, the links in the GH specs point to the GitHub repos, and I'd suggest using the persistent links. AFAIK, the website renderers replace the links to relative web resources. cc @jimstir to double check.


- While sending a message with LightPush, it is advised to use more than one peer in order to increase the chances of delivering the message.

- If sending the message fails for all of the peers, the node should try to re-send it after some interval and continue doing so until an OK response is received.
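
A minimal sketch of the redundant send plus retry-until-OK behaviour described in these bullets; the `LightpushClient` interface and the retry interval are assumptions, and the unbounded loop mirrors the quoted text (a deadline, as discussed below, would bound it in practice):

```typescript
// Hypothetical LightPush client; names are illustrative, not a js-waku API.
type PeerId = string;

interface LightpushClient {
  send(peer: PeerId, payload: Uint8Array): Promise<{ ok: boolean }>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Send through several peers; if every attempt fails, wait and retry
// until at least one peer returns an OK response.
async function sendWithRedundancy(
  client: LightpushClient,
  peers: PeerId[],
  payload: Uint8Array,
  retryIntervalMs = 3_000,
): Promise<void> {
  for (;;) {
    const results = await Promise.allSettled(
      peers.map((p) => client.send(p, payload)),
    );
    const anyOk = results.some(
      (r) => r.status === "fulfilled" && r.value.ok,
    );
    if (anyOk) return;
    await sleep(retryIntervalMs); // all peers failed; retry after an interval
  }
}
```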


I think we should take the opportunity to define what is an application concern (the user of the Waku library) vs a Waku concern (reliability in Waku).

How much should Waku re-send? At which point should it be handed over to the application, and by extension the end user?

One proposal was a temporal one: Waku owns the message to be sent for ~15s (under the 20s grace period) and attempts during this period to send the message and get peer-to-peer confirmation.

Also, we need to define "sending message failed", similarly to my previous comment: are we saying no response/error response from the service node?
Or are we saying we were not able (via Store or Filter) to confirm the message was handed over to the network?
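
A small sketch of the temporal hand-over proposal above: Waku keeps retrying within a bounded window and then returns ownership to the application. The ~15 s window comes from the comment; the retry interval and function names are assumptions:

```typescript
// Sketch only: Waku keeps retrying for a bounded window (~15 s here),
// then hands the message back to the application.
async function sendWithDeadline(
  trySendOnce: () => Promise<boolean>, // one redundant send attempt, true on OK
  deadlineMs = 15_000,
  retryIntervalMs = 3_000,
): Promise<boolean> {
  const startedAt = Date.now();
  while (Date.now() - startedAt < deadlineMs) {
    if (await trySendOnce()) return true; // confirmed within the window
    await new Promise((resolve) => setTimeout(resolve, retryIntervalMs));
  }
  return false; // ownership passes back to the application / end user
}
```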

Contributor Author

Added more definitions; this should be resolved.



- While registering Filter subscriptions, it is advised to batch requests for multiple content topics into one in order to reduce the number of queries sent to a node.

- When creating a new subscription, it can be beneficial to use only new peers to which no subscriptions are yet present, and to avoid peers with which Filter has already failed.
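
A sketch of batched subscription plus the peer-selection rule from these bullets; the `FilterClient` interface is an assumption for illustration:

```typescript
// Illustrative only; method names are assumptions, not a concrete Filter API.
type PeerId = string;

interface FilterClient {
  subscribe(peer: PeerId, contentTopics: string[]): Promise<void>;
}

// One batched subscribe request per peer instead of one request per topic,
// choosing a peer that has no subscription yet and has not failed before.
async function subscribeBatched(
  filter: FilterClient,
  candidatePeers: PeerId[],
  usedPeers: Set<PeerId>,
  failedPeers: Set<PeerId>,
  contentTopics: string[],
): Promise<PeerId> {
  const peer = candidatePeers.find(
    (p) => !usedPeers.has(p) && !failedPeers.has(p),
  );
  if (!peer) throw new Error("no suitable Filter peer available");
  await filter.subscribe(peer, contentTopics); // single batched request
  usedPeers.add(peer);
  return peer;
}
```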


I think overall we need to have some sort of local ordering (scoring?) of peers to decide which peers should be used for a given purpose, based on various criteria.

Contributor Author

yeah, this is a kind of scoring, or some kind of pre-arranged exchange of a service node's capabilities

we don't have any of these, so maybe let's start a discussion on whether we really want such a thing?
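
For illustration, a toy version of the local peer ordering discussed above; the scoring weights are arbitrary assumptions:

```typescript
// A toy local peer-scoring table: bump on success, penalize on failure,
// and pick the highest-scoring peer for the next request.
type PeerId = string;

class PeerScores {
  private scores = new Map<PeerId, number>();

  recordSuccess(peer: PeerId): void {
    this.scores.set(peer, (this.scores.get(peer) ?? 0) + 1);
  }

  recordFailure(peer: PeerId): void {
    this.scores.set(peer, (this.scores.get(peer) ?? 0) - 2);
  }

  best(candidates: PeerId[]): PeerId | undefined {
    return [...candidates].sort(
      (a, b) => (this.scores.get(b) ?? 0) - (this.scores.get(a) ?? 0),
    )[0];
  }
}
```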

@jm-clius jm-clius (Contributor) left a comment

Thank you! This is a good summary of the available reliability mechanisms and the tradeoffs at play. I have added mostly minor comments, but this can be merged as a raw specification.

One thing that is still missing is that the protocol part of a spec should preferably be phrased using the keywords MUST, SHOULD, RECOMMENDED, etc., as described here: https://github.com/waku-org/specs/blob/master/template.md#wire-format-specification--syntax

I think this can be done in a follow-up or whenever this specification is revised.

### Definitions
- Service node - provides services to other nodes, such as relaying messages sent via LightPush to the network or serving messages from the network through Filter; usually serves responses;
- Light node - connects to and uses one or more service nodes via the LightPush and/or Filter protocols; usually sends requests;
- Service node failure - can mean various things depending on the protocol in use:
Contributor

Wouldn't this just be a "service failure"? To me, a node failure implies some issue with the node itself (such as bad config, unreachability, etc.). A service failure can include both node and protocol failures.

A light node can initiate two subscriptions to the same content topic but to different service nodes.
When receiving messages through two subscriptions, duplicates must be dropped by using [deterministic message hashing](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/14/message.md#deterministic-message-hashing).
Note that such an approach increases bandwidth consumption proportionally to the number of extra subscriptions established and should be used with caution.
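
A minimal dedup sketch for the redundant-subscription approach above, keyed by the deterministic message hash (its computation, per 14/WAKU2-MESSAGE, is assumed to happen elsewhere):

```typescript
// Drop duplicates received via redundant subscriptions, keyed by the
// deterministic message hash computed elsewhere.
const seenHashes = new Set<string>();

function deduplicate(messageHash: string, deliver: () => void): void {
  if (seenHashes.has(messageHash)) return; // already delivered via another subscription
  seenHashes.add(messageHash);
  deliver();
}
```

A real implementation would also bound the size of the seen-hash set (e.g. with an LRU), which is omitted here for brevity.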

Contributor

Another possible mechanism we should describe is regular "refresh" of subscriptions to ensure active subscriptions are still fully synchronised. Think of it as a much more occasional, more expensive "ping". Described here: https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscribe
We do not have to recommend using this mechanism, as it's quite expensive and I'm not sure we'll gain much by implementing it. However, it is a mechanism affecting reliability and I think it's useful to have the entire toolbox briefly described.
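
For completeness, a tiny sketch of such an occasional subscription refresh; the interval and the `resubscribe` callback are assumptions:

```typescript
// Occasional full refresh of a Filter subscription, as a heavier
// complement to pings; interval and callback are illustrative.
function scheduleSubscriptionRefresh(
  resubscribe: () => Promise<void>, // re-sends the full subscribe request
  intervalMs = 10 * 60 * 1_000,
): () => void {
  const timer = setInterval(() => {
    void resubscribe().catch((err) => console.error("refresh failed", err));
  }, intervalMs);
  return () => clearInterval(timer); // call to stop refreshing
}
```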

Contributor Author

I think it is partially covered with:

Right after such a replacement, the light node must create a new subscription to the newly connected service node as described in the [Filter specification](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md).

but there is still room for improvement if the following is true: can a subscription on a service node degrade over time / be dropped despite regular pings? If so, is it fixable by just re-creating the subscription with an initial subscribe request?

So the service node is fine; the code responsible for Filter just needs to be turned off and on again.

Do you think this is viable, @jm-clius?

@danisharora099 danisharora099 (Contributor) left a comment

Great work here! LGTM:)

Comment on lines 50 to 53
These preferences might include:
- [Libp2p multiaddresses](https://github.com/libp2p/specs/blob/master/addressing/README.md) of a service node;
- Waku or libp2p protocols that a service node implements;
- Waku shards that a service node is part of;
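
For illustration, one possible shape for such a preferences record (field names are assumptions):

```typescript
// Illustrative shape for the preferences listed above; field names are assumptions.
interface ServiceNodePreferences {
  multiaddrs: string[]; // libp2p multiaddresses of the service node
  protocols: string[];  // Waku / libp2p protocol identifiers it implements
  shards: number[];     // Waku shards the service node is part of
}
```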
Contributor

Nice direction!

we need to somehow expose this information to be available, ideally, before connection is established

It's tricky to do this because the protocol handshake & information exchange happen only once there is a connection with the service node. We already wait for information like shards & cluster and supported protocols (from the strict portion) until a connection is established with the service node.

@chaitanyaprem chaitanyaprem left a comment

LGTM.
Just added one comment regarding Store queries to get messages missed while the node is offline.

@weboko weboko merged commit 6e6be39 into master Jul 15, 2024
0 of 2 checks passed
@weboko weboko deleted the weboko/reliability branch July 15, 2024 11:18