Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add "req-res protocol reliability" spec #18
Add "req-res protocol reliability" spec #18
Changes from 6 commits
1383a91
3edb500
7371dab
8582646
d0827cf
3249e7d
057c4f2
35fd44f
e5ba255
f2802e7
174dffb
d5abdb3
ba103d2
9a4bb3d
c23dfe9
ef1c08a
988a022
9eb63db
594a1ec
6a14926
1f433b9
7e030f8
7ba0897
f2b4a77
67462e7
b264454
092b18e
59e225d
fb3e69b
dc4d40f
6296d12
d8dcb53
642a539
dca926a
0669283
687d7e0
6436cdc
6a80762
5ad85b7
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am rethinking about health of a lightClient, maybe peer connections is not the right way.
Rather whether we have subscriptions for all filters active and if we have peers that we can send messages via lightpush. I will be working on it this week as to how do we define health in case of lightClient. Maybe we can update this based on the work that goes into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's work together on that this week then, if it will take longer we can always follow up with a PR, especially considering it is something that we already have
relevant comment from @jm-clius - #18 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, i have written down the approach for lightClients topic health here.
Please go through and leave your comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarify that this is client-side for req-resp protocols. "Peers" then become "service nodes" (req-resp is not really p2p).
"protocol failed to use" seems a bit vague to me. We're addressing service node reliability specifically. In other words "failure" would more precisely mean that the client node failed to either get a timely response from the service node or could determine that the provided service was not satisfactory, e.g. published message was altered, filter node missed some messages, etc. We can leave precise definition of "bad service" to the individual protocol specs.
"more than once" is too prescriptive. Use precise language to say client node can remove service nodes from the pool after
n
failures (per definition above). We can then continue to say that we recommend a very strict approach of removing service nodes after even1
failure. (Do we really recommend this, though? If we fail to connect to a service node, should we completely remove it from the pool even if it might just be a temporary connection failure?)Last point, specs should be as clear as possible re the agents involved in each interaction. In this case, for example, it's not really the "protocols" retaining the pool of peers, but the requesting node maintaining a pool of peers for each req-resp protocol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed here - 594a1ec
do you see it being a follow up to this PR or, perhaps, I can add it as separate section that defines more precisely
fails to serve protocol request
(as I defined it rn)I think this change I did before we made actual code change so that's why it was
more than once
and I am changing it tofails to serve protocol request from a light node 3 times
. I am not sure if it is best framing thoughThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean here? We discover an updated address for the same service node? In this case, I would rephrase to be explicit about this case (service node addresses should be kept up to date).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, https://github.com/waku-org/specs/pull/18/files#r1658347246 seems relevant here instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added shorter explanation but also keeping prev sentence to give additional context. wdyt @jm-clius ?
6a14926
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting approach, but i have seen that it is practically not possible because detecting a node went offline takes time ~20-30seconds in which time the peer ping might end up failing and causing subscriptions to get removed.
Also, note that Filter-ping doesn't have enough information to know which are the subscriptions currently active with a peer..it only indicates that there is a subscription with it.
Currently in status-go/go-waku the way it is implemented is as soon as offline status is identified all subscriptions are cleared off. When node comes back online new subscriptions are created.
At the same time, i am curious to know if we have taken the apporach mentioned here in js-waku what was the efficacy with node connectivity and did we not run into any issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is not the case for browser, we can identify that it is offline almost instantly
that is interesting and this is what can kinda happen, as the described approach in the worst case will re-initiate all the subscriptions
it is still experimental but on my local repros I observe the issue when subscription is dropped by a service node, that is happening somewhere after 10-15 mins.
Considering that we might just not do a ping if a node was offline for more than 10 minutes, but the problem here is it is not controllable behavior, meaning there is no CLI flag specified and it is not described in Filter RFC so it is totally up to an implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to that, I would define how to use store to recover.
Probably something like that (terminology can be improved):
At t2:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious to know how long does it take for websocket to detect timeout, because iirc websockets use TCP as underlying transport same as what is used in other waku implementations as well.
One possibility could be that websocket is having some sort of a very regular heartbeat which makes it detect quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The filter timeout has been set in nwaku to be 5 minutes, so it should get dropped after 5 minutes +/- some buffer time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fryorcraken I would drop it out of this version of RFC
#18 (comment)
this sounds like a more complex feature that should be useful but we don't have any precedent
what if I or someone from the team implements it in js-waku as part of improving Filter and after it we can add it to this RFC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possible mechanism we should describe is regular "refresh" of subscriptions to ensure active subscriptions are still fully synchronised. Think of it as a much more occasional, more expensive "ping". Described here: https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscribe
We do not have to recommend using this mechanism, as it's quite expensive and I'm not sure we'll gain much by implementing it. However, it is a mechanism affecting reliability and I think it's useful to have the entire toolbox briefly described.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it partially covered with:
specs/informational/req-res-reliability.md
Line 120 in 0669283
but there is still room for improvement if following is true: can a subscription on a service node degrade over time / be dropped with regular pings? if so, is it fixable by just re-creating subscription by init query.
so, like service node is fine, just code responsible for Filter needs a bit of turn off and on
do you think it is viable, @jm-clius ?