Add "req-res protocol reliability" spec #18

Merged
merged 39 commits into from
Jul 15, 2024
Commits
1383a91
init spec
weboko May 27, 2024
3edb500
add abstract and motivation
weboko Jun 3, 2024
7371dab
add draft suggestions
weboko Jun 3, 2024
8582646
add node health from status-go#4628
weboko Jun 24, 2024
d0827cf
finish draft
weboko Jun 26, 2024
3249e7d
up considerations:
weboko Jun 27, 2024
057c4f2
-m
weboko Jul 1, 2024
35fd44f
remove wonted word
weboko Jul 1, 2024
e5ba255
typo
weboko Jul 1, 2024
f2802e7
typo
weboko Jul 1, 2024
174dffb
typo
weboko Jul 1, 2024
d5abdb3
address small comments
weboko Jul 1, 2024
ba103d2
improve abstract and motivation, add definitions
weboko Jul 2, 2024
9a4bb3d
remove relay
weboko Jul 2, 2024
c23dfe9
update node health section
weboko Jul 2, 2024
ef1c08a
define health section better
weboko Jul 2, 2024
988a022
improve peer and connection managment section
weboko Jul 2, 2024
9eb63db
improve peer and connection managment section
weboko Jul 2, 2024
594a1ec
improve service node pool sub section
weboko Jul 2, 2024
6a14926
finalazi connection managment section
weboko Jul 3, 2024
1f433b9
up security section
weboko Jul 3, 2024
7e030f8
remove ambiguity
weboko Jul 3, 2024
7ba0897
update light push section
weboko Jul 7, 2024
f2b4a77
update light push section
weboko Jul 7, 2024
67462e7
define failure of service node
weboko Jul 9, 2024
b264454
initial update to Filter
weboko Jul 9, 2024
092b18e
add failure definition
weboko Jul 9, 2024
59e225d
add regular pings to Filter
weboko Jul 9, 2024
fb3e69b
remove dupe with peer managment
weboko Jul 9, 2024
dc4d40f
finish filter section and other comments
weboko Jul 9, 2024
6296d12
service nit
weboko Jul 9, 2024
d8dcb53
add service node motivation
weboko Jul 9, 2024
642a539
add suggestion to health section
weboko Jul 9, 2024
dca926a
add nits
weboko Jul 9, 2024
0669283
improve failure definitions
weboko Jul 9, 2024
687d7e0
butch fix
weboko Jul 13, 2024
6436cdc
Merge branch 'master' of github.com:waku-org/specs into weboko/reliab…
weboko Jul 14, 2024
6a80762
update index and move to application
weboko Jul 14, 2024
5ad85b7
up
weboko Jul 14, 2024
139 changes: 139 additions & 0 deletions informational/req-res-reliability.md
@@ -0,0 +1,139 @@
---
title: REQ-RES-RELIABILITY
name: Request-response protocols reliability
category: Best Current Practice
tags: [informational]
editor: Oleksandr Kozlov <[email protected]>
contributors:
- Prem Chaitanya Prathi <[email protected]>
- Danish Arora <[email protected]>
---

## Abstract
This RFC describes a set of instructions used across different [WAKU2](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/10/waku2.md) implementations to improve reliability when a light node uses request-response protocols:
- [WAKU2-LIGHTPUSH](../standards/core/lightpush.md) - is used for sending messages;
- [WAKU2-FILTER](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md) - is used for receiving messages;

### Definitions
- Service node - provides services to other nodes, such as relaying messages sent by LightPush to the network or serving messages from the network through Filter; usually serves responses;
- Light node - connects to and uses one or more service nodes via LightPush and/or Filter protocols, usually sends requests;
- Service node failure - can mean various things depending on the protocol in use:
Contributor

Wouldn't this just be a "Service failure" - to me a node failure implies some issue with the node itself (such as bad config, unreachability, etc.). Service failure can include both node and protocol failures.

  - generic protocol failure - request is timed out or failed without error codes;
  - LightPush specific failure - refer to [error codes](../standards/core/lightpush.md#examples-of-possible-error-codes) and consider request a failure when it is clear that service node cannot serve any future request, for example when service node does not have any peers to relay and returns `NO_PEERS_TO_RELAY`;
  - Filter specific failure - we consider service node failing when it cannot serve [subscribe](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscribe) or [ping](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscriber_ping) request with OK status;

## Motivation

Specifications of the mentioned protocols do not cover some real-world use cases that are often observed in unreliable network environments from the perspective of light nodes that consume LightPush and/or Filter protocols.
Such use cases include: recovering from an offline state, decreasing the rate of missed messages, increasing the probability that messages are broadcast within the network, and handling unreliability of the service node in use.

## Suggestions

### Node health

Node health is a metric meant to determine the connectivity state of a light node and its present ability to reliably send and receive messages from the network.
We consider this reliability to be dependent on the number of simultaneous connections to responsive service nodes.
Unfortunately, the more connections a light node establishes, the more bandwidth it consumes.
To address this we suggest the following states:
- unhealthy - no connections to service nodes are available regardless of protocol;
- minimally healthy:
  - Filter has one service node connection;
  - LightPush protocol has one service node connection;
- sufficiently healthy:
  - Filter has at least 2 connections available to service nodes;
  - LightPush has at least 2 connections available to service nodes;
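
The states above can be sketched as a small pure function. This is only an illustration: `ConnectionCounts` and `nodeHealth` are hypothetical names, and the handling of mixed cases (for example one Filter connection and zero LightPush connections) is an assumption, since the list above does not define them. Here the protocol with fewer connections bounds the overall state.

```typescript
// Health states named after the list above.
type NodeHealth = "unhealthy" | "minimally healthy" | "sufficiently healthy";

// Hypothetical input: number of responsive service node connections per protocol.
interface ConnectionCounts {
  filter: number;
  lightPush: number;
}

function nodeHealth(counts: ConnectionCounts): NodeHealth {
  // Assumption: the protocol with fewer connections decides the overall state,
  // so a node with Filter peers but no LightPush peer is reported as unhealthy.
  const weakest = Math.min(counts.filter, counts.lightPush);
  if (weakest >= 2) {
    return "sufficiently healthy";
  }
  if (weakest >= 1) {
    return "minimally healthy";
  }
  return "unhealthy";
}
```

An implementation that tracks health per protocol instead of as a single value could apply the same thresholds to each count separately.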

### Peers and connection management

#### Pool of reliable service nodes
Light nodes should maintain a pool of reliable service nodes for each protocol.
If a service node [fails](./req-res-reliability.md#definitions) to serve a protocol request,
the light node should drop the connection to it, then connect to a new service node and add it to the pool instead.
Contributor

This seems like a blanket rule here (any failure = replace service node), but just below we qualify which types of failures should be considered serious enough to look for another service node. Perhaps this clause can rather be something like "In case a service node fails to serve...light node MAY drop the connection... We RECOMMEND that service nodes be replaced when the failure conditions below are met."


We advise replacing a LightPush service node right after the first failure when:
- the connection to it is lost or the request timed out;
- its response contains one of the [error codes](../standards/core/lightpush.md#examples-of-possible-error-codes): `UNSUPPORTED_PUBSUB_TOPIC`, `INTERNAL_SERVER_ERROR` or `NO_PEERS_TO_RELAY`;
- the request failed without an error message being returned;

For Filter we recommend replacing a service node when:
- a [request for subscription](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscribe) fails, so the subscription cannot be initiated;
- [ping](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscriber_ping) failed 2 times in a row;
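
A minimal sketch of the two replacement rules above, written as pure predicates. The `LightPushOutcome` shape and both function names are hypothetical and not taken from any Waku implementation; only the three error code strings come from the text.

```typescript
// Error codes that the text above treats as grounds for immediate replacement.
const FATAL_LIGHTPUSH_ERRORS: string[] = [
  "UNSUPPORTED_PUBSUB_TOPIC",
  "INTERNAL_SERVER_ERROR",
  "NO_PEERS_TO_RELAY",
];

// Hypothetical summary of one LightPush request as seen by the light node.
interface LightPushOutcome {
  ok: boolean;
  connectionLost?: boolean;
  timedOut?: boolean;
  errorCode?: string; // left undefined when the request failed without an error
}

function shouldReplaceLightPushNode(outcome: LightPushOutcome): boolean {
  if (outcome.ok) {
    return false;
  }
  if (outcome.connectionLost || outcome.timedOut) {
    return true;
  }
  if (outcome.errorCode === undefined) {
    return true; // failed, but no error message was returned
  }
  return FATAL_LIGHTPUSH_ERRORS.indexOf(outcome.errorCode) !== -1;
}

function shouldReplaceFilterNode(
  subscribeFailed: boolean,
  consecutivePingFailures: number
): boolean {
  return subscribeFailed || consecutivePingFailures >= 2;
}
```

Other error codes, such as `TOO_MANY_REQUESTS`, return `false` here, which leaves room for retrying against the same service node.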

#### Selection of discovered service nodes
During discovery, a light node should filter out service nodes based on preferences before establishing a connection.
These preferences might include:
- [Libp2p multiaddresses](https://github.com/libp2p/specs/blob/master/addressing/README.md) of a service node;
- Waku or libp2p protocols that a service node implements;
- Waku shards that a service node is part of;

Good concept.

However, I would have 2 types of filter:

  • strict
  • preference

And focus only required for now

strict:

  • right cluster
  • shards of interest
  • waku protocols of interest
  • supported transport protocols

preference:

  • ordered by latency (lowest preferred)
  • ordered by protocol version
  • ordered by recorded QoS (node with positive local reputation preferred)
  • ordered by transport protocols (node with direct connection [> circuit-relay], better protocol [WebTransport > WebSocket] preferred)

Contributor Author

added for strict

for preference - we need to somehow expose this information to be available, ideally, before connection is established, for some it is there already but - I believe it is tricky for others like latency and reputation.

also, this is useful to have for incentivisation, but then there should be some guarantee (validators?)
sound like a new feature :)

Contributor

Nice direction!

we need to somehow expose this information to be available, ideally, before connection is established

It's tricky to do this because protocol handshake & information exchange happens once there is a connection with the service node. We already wait for information like shards & cluster, supported protocols (from the strict portion) until a connection is established with the service node.


More details about discovery can be found at [WAKU2 Discovery domain](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/10/waku2.md#discovery-domain) or [RELAY-SHARDING Discovery](https://github.com/waku-org/specs/blob/master/standards/core/relay-sharding.md#discovery).

Examples of filtering:
- When a light node discovers service nodes that implement the needed Waku protocols, it should prioritize those that implement the most recent version of the protocol;

At this point in time, we should aim for single version protocol support in light clients IMO.

We can look into multiple version support as the network grows and we have less control on it.

Contributor Author

I think it is still worth mentioning as it might be not only version but needed protocol depending on the strategies used: Store + LightPush or Filter + LightPush or all

- A light node must connect only to service nodes that participate in the needed shard and cluster;
- A light node must use only service nodes that implement the needed transport protocols;
- When [Circuit V2](https://github.com/libp2p/specs/blob/master/relay/circuit-v2.md) multiaddresses are discovered by a light node, it should prefer other service nodes that can be connected to directly, if possible;
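
The filtering examples above could be sketched as follows. The `DiscoveredNode` and `Requirements` shapes and all function names are hypothetical; the sketch assumes this information is already known for each candidate and that multiaddresses are available in their string form. Circuit relay addresses are recognised by their `p2p-circuit` component. Ordering by protocol version is not covered.

```typescript
// Hypothetical description of a discovered service node.
interface DiscoveredNode {
  peerId: string;
  multiaddrs: string[];
  protocols: string[];
  clusterId: number;
  shards: number[];
}

// Hypothetical description of what the light node needs.
interface Requirements {
  clusterId: number;
  shard: number;
  protocols: string[];
  transports: string[]; // multiaddr component names, e.g. "wss" or "webtransport"
}

// True when the multiaddr string contains the named component.
function hasComponent(addr: string, name: string): boolean {
  return addr.split("/").indexOf(name) !== -1;
}

// Strict checks: cluster, shard, protocols and transport must all match.
function meetsStrictRequirements(node: DiscoveredNode, req: Requirements): boolean {
  const supportsTransport = node.multiaddrs.some((addr) =>
    req.transports.some((transport) => hasComponent(addr, transport))
  );
  return (
    node.clusterId === req.clusterId &&
    node.shards.indexOf(req.shard) !== -1 &&
    req.protocols.every((p) => node.protocols.indexOf(p) !== -1) &&
    supportsTransport
  );
}

// A node is directly reachable if at least one address is not a relay address.
function isDirectlyReachable(node: DiscoveredNode): boolean {
  return node.multiaddrs.some((addr) => !hasComponent(addr, "p2p-circuit"));
}

// Drops ineligible nodes, then lists directly reachable nodes before relayed ones.
function selectCandidates(nodes: DiscoveredNode[], req: Requirements): DiscoveredNode[] {
  const eligible = nodes.filter((node) => meetsStrictRequirements(node, req));
  const direct = eligible.filter((node) => isDirectlyReachable(node));
  const relayedOnly = eligible.filter((node) => !isDirectlyReachable(node));
  return direct.concat(relayedOnly);
}
```

The strict checks remove candidates entirely, while the relay check only changes the order, so a node reachable only through a relay is still used when nothing better is available.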

#### Continuous discovery
Light nodes must keep information about service nodes up to date.
For example, when a service node is discovered a second time,
its connection information in the Peer Store must be refreshed.

Information that is important to be up to date:
- [ENR](../standards/core/enr.md) information;
- [Libp2p multiaddresses](https://github.com/libp2p/specs/blob/master/addressing/README.md);
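
One way to keep this information fresh is to update the stored record whenever a rediscovered record is newer. The sketch below is an assumption-heavy illustration: `PeerRecord`, `PeerStore` and `upsertPeer` are hypothetical names, and the ENR sequence number is used here as the freshness signal, which this document does not prescribe.

```typescript
// Hypothetical record kept for each known service node.
interface PeerRecord {
  peerId: string;
  enrSeq: number; // ENR sequence number; a higher value means a newer record
  multiaddrs: string[];
}

// Hypothetical in-memory peer store keyed by peer id.
type PeerStore = Record<string, PeerRecord>;

// Inserts or refreshes a record; returns true when the store was changed.
function upsertPeer(store: PeerStore, discovered: PeerRecord): boolean {
  const known = store[discovered.peerId];
  if (known !== undefined && known.enrSeq >= discovered.enrSeq) {
    return false; // the stored record is already at least as fresh
  }
  store[discovered.peerId] = {
    peerId: discovered.peerId,
    enrSeq: discovered.enrSeq,
    multiaddrs: discovered.multiaddrs.slice(), // copy to avoid sharing the array
  };
  return true;
}
```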

### LightPush

#### Sending with redundancy
To improve the chances of message delivery, a light node can attempt to send the same message via LightPush to 2 or more service nodes at the same time.
While doing so, it is important to note that bandwidth consumption increases proportionally to the number of additional service nodes used.
Our advice is to use 2 service nodes at a time.

#### Retry on failure
When a light node sends a message, it must await the LightPush response from the service node and check it for [possible error codes](../standards/core/lightpush.md#examples-of-possible-error-codes).
If the request failed without an error code, or the response contains errors that can be temporary for the service node (e.g. `TOO_MANY_REQUESTS`),
the light node should try to re-send the message after some interval and continue doing so until an OK response is received or the send is canceled.
The interval can be arbitrary, but we recommend starting with 1 second and increasing it on each failure during a LightPush send.
It is important to note that, [per another recommendation](./req-res-reliability.md#pool-of-reliable-service-nodes), the light node should replace a failing service node with another one within the pool of service nodes used by LightPush.
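
The retry interval could be computed as in the sketch below. The text only asks for an interval that starts at 1 second and grows on each failure; doubling and the 30 second cap are assumptions made for this example, as are the names `shouldRetry` and `retryDelayMs`.

```typescript
// Error codes treated as temporary; only TOO_MANY_REQUESTS is named in the text.
const TEMPORARY_LIGHTPUSH_ERRORS: string[] = ["TOO_MANY_REQUESTS"];

// A failed send is retried when it carries no error code or a temporary one.
function shouldRetry(errorCode: string | undefined): boolean {
  return errorCode === undefined || TEMPORARY_LIGHTPUSH_ERRORS.indexOf(errorCode) !== -1;
}

// Delay before the next attempt, given how many attempts have failed so far.
// Assumption: the delay doubles on each failure and is capped at maxMs.
function retryDelayMs(failedAttempts: number, baseMs = 1000, maxMs = 30000): number {
  const exponent = Math.max(0, failedAttempts - 1);
  return Math.min(baseMs * Math.pow(2, exponent), maxMs);
}
```

With these defaults the first retries wait 1, 2, 4 and 8 seconds. The cap keeps a long outage from pushing the delay to impractical values.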

#### Retry missing messages
A light node can verify that the network currently in use has seen messages that were sent via LightPush earlier.
To do that, the light node should use the [Store protocol](../standards/core/store.md) or the [Filter protocol](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md) with a service node different from the one used for LightPush.

By using the Store protocol, a light node can query any service node that implements Store and check whether the messages sent in the past period were seen.
Due to [Store message eligibility](https://github.com/waku-org/specs/blob/master/standards/core/store.md#waku-message-store-eligibility), only some messages are stored, so there is a limit as to which messages can be verified by Store queries.
Our advice is to run periodic Store queries once every 30 seconds.

By using an active Filter [subscription](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#filter-push), a light node can verify that a message sent through LightPush was seen by another service node in the network.
The Filter protocol has no such limitation on the type of messages received through a subscription,
but an active subscription does not reveal messages exchanged in the network while the light node was offline.

If some messages could not be verified by either of the previous methods, they should be re-sent via LightPush using a different service node.
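
The bookkeeping for this check can be sketched as a comparison between the hashes of sent messages and the hashes observed through Store queries or Filter pushes. `SentMessage` and `findUnverified` are hypothetical names. The grace period is also an assumption: it avoids re-sending messages that were sent less than one verification period ago and may simply not have been observed yet.

```typescript
// Hypothetical record of a message that was sent via LightPush.
interface SentMessage {
  hash: string; // message hash, assumed to be computed elsewhere
  sentAtMs: number;
}

// Returns the sent messages that were not observed and are old enough to re-send.
function findUnverified(
  sent: SentMessage[],
  seenHashes: string[],
  nowMs: number,
  graceMs = 30000
): SentMessage[] {
  const seen: Record<string, boolean> = {};
  for (const hash of seenHashes) {
    seen[hash] = true;
  }
  return sent.filter(
    (message) => seen[message.hash] !== true && nowMs - message.sentAtMs >= graceMs
  );
}
```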

### Filter

#### Regular pings
To ensure that a subscription is maintained by a service node and not closed, a light node should send recurring [pings](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscriber_ping).
Our advice is for a light node to send ping requests once per minute.
If the light node does not receive an OK response, or the ping times out, 2 times, the service node should be replaced as part of maintaining the [pool of reliable service nodes](./req-res-reliability.md#pool-of-reliable-service-nodes).
Right after such a replacement, the light node must create a new subscription to the newly connected service node as described in the [Filter specification](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md).
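
A minimal sketch of the failure counting behind this rule, assuming one tracker per service node. `PingFailureTracker` is a hypothetical name, and reading "2 times" as two consecutive failures is an interpretation, consistent with "2 times in a row" in the pool section. Scheduling the once-per-minute ping is left to the caller.

```typescript
// Counts consecutive failed pings for a single service node.
class PingFailureTracker {
  private consecutiveFailures = 0;
  private readonly threshold: number;

  constructor(threshold = 2) {
    this.threshold = threshold;
  }

  // Records one ping result; returns true when the service node should be replaced.
  record(pingOk: boolean): boolean {
    if (pingOk) {
      this.consecutiveFailures = 0; // any OK response resets the count
      return false;
    }
    this.consecutiveFailures += 1;
    return this.consecutiveFailures >= this.threshold;
  }
}
```

A threshold of 1 gives the stricter behaviour described under offline recoverability, where the first failed ping after reconnecting already triggers replacement.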

#### Redundant subscriptions for message loss mitigation
To mitigate the possibility of messages not being delivered by a service node, we advise considering multiple Filter subscriptions.
A light node can initiate two subscriptions to the same content topic but to different service nodes.
While receiving messages through two subscriptions, duplicates must be dropped by using [deterministic message hashing](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/14/message.md#deterministic-message-hashing).
Note that such an approach increases bandwidth consumption proportionally to the number of extra subscriptions established and should be used with caution.
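
Dropping duplicates can be sketched as a lookup keyed by message hash. `MessageDeduplicator` is a hypothetical name, and the hash is assumed to be computed elsewhere following the deterministic message hashing linked above. This sketch never forgets a hash, so a real implementation would likely need to evict old entries.

```typescript
// Remembers message hashes so that each message is delivered to the application once.
class MessageDeduplicator {
  private seen: Record<string, boolean> = {};

  // Returns true the first time a hash is observed and false for duplicates.
  accept(messageHash: string): boolean {
    if (this.seen[messageHash] === true) {
      return false;
    }
    this.seen[messageHash] = true;
    return true;
  }
}
```

Messages arriving from both subscriptions would pass through the same `MessageDeduplicator` instance before being handed to the application.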

Contributor

Another possible mechanism we should describe is regular "refresh" of subscriptions to ensure active subscriptions are still fully synchronised. Think of it as a much more occasional, more expensive "ping". Described here: https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscribe
We do not have to recommend using this mechanism, as it's quite expensive and I'm not sure we'll gain much by implementing it. However, it is a mechanism affecting reliability and I think it's useful to have the entire toolbox briefly described.

Contributor Author

I think it partially covered with:

Right after such replace light node must create new subscription to newly connected service node as described in [Filter specification](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md).

but there is still room for improvement if following is true: can a subscription on a service node degrade over time / be dropped with regular pings? if so, is it fixable by just re-creating subscription by init query.

so, like service node is fine, just code responsible for Filter needs a bit of turn off and on

do you think it is viable, @jm-clius ?

#### Offline recoverability
Network state should be monitored by the light node, and if it goes offline, [regular pings](./req-res-reliability.md#regular-pings) must be stopped.
When the network connection returns, the light node should send a [Filter ping](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscriber_ping) to the service nodes in use.
If those pings fail, the light node must replace the service nodes following the advice in [pool of reliable service nodes](./req-res-reliability.md#pool-of-reliable-service-nodes) without waiting for multiple failures.

## Security/Privacy Considerations

See [WAKU2-ADVERSARIAL-MODELS](https://github.com/waku-org/specs/blob/master/informational/adversarial-models.md).

## Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).