Investigate slow parachains #4569
I think we are still not at rock-stable 12s blocks for parachains, which is what this plot also reveals.
Yes indeed: https://grafana.teleport.parity.io/goto/TA1mEjPIR?orgId=1 shows some older data. The introspector deployment is currently down; it will be fixed tomorrow ASAP. (CC @AndreiEres)
Regarding why this is happening: AFAIR parachains do not always get stuff backed (collations not getting to the backing group fast enough, or at all), but I expect this to be much better when they switch to async backing.
Using such a long period is tricky, because you have to exclude all past solved incidents on both polkadot and the collator set which would contribute to this skewing.
Running on a network fully under our control, where validators and collators have perfect connectivity, gets us to almost stable 12s and 6s (almost, because availability cores are always cleared at session boundaries, so you're going to miss a slot there). Our real networks are not in a perfect state, so blocks could be missed for a variety of reasons (not saying this is what happens):
Yes, switching to async backing should alleviate some of the problems above, but I don't think all of them.
I will try to allocate some time in the following weeks, but I wouldn't expect a quick resolution on this one.
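To put a rough number on the session-boundary effect mentioned above, here is a minimal sketch. It assumes a 4-hour session and exactly one missed slot per session; both values are assumptions for illustration, not figures taken from this thread, and `avg_block_time` is a hypothetical helper.

```python
def avg_block_time(target_s, session_length_s, missed_slots_per_session=1):
    """Average block time when some slots are lost in every session.

    target_s: ideal parachain block time in seconds (12 for sync backing,
        6 for async backing).
    session_length_s: session length in seconds (ASSUMPTION: 4 hours).
    missed_slots_per_session: slots lost per session (ASSUMPTION: 1).
    """
    slots = session_length_s / target_s
    return session_length_s / (slots - missed_slots_per_session)


SESSION_S = 4 * 3600  # assumed session length, not taken from the thread

for target in (12, 6):
    print(f"target {target}s -> average {avg_block_time(target, SESSION_S):.4f}s")
```

Under these assumptions the session boundary alone costs well under 0.1% of throughput, so it would account for only a small part of any larger observed skew.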
Some more context/data for this, hoping it's helpful:
Polkadot BridgeHub, took 24 seconds: https://bridgehub-polkadot.subscan.io/block/2883345
Kusama, took 12 seconds: https://kusama.subscan.io/block/23792945
Thank you @KarimJedda, I spent some time on this one as well. The approach I took here uses a modified version of the script presented here: #4569 (comment). I looked at all blocks that parity collators built for proposal on
If you add up all those numbers, at least for this snapshot with parity's collators, the average for polkadot-asset-hub is 12.12s. That's just our collators, and the other collators might not be as well behaved, but it does correlate roughly with the polkadot asset hub clock skew in the past 10 days.
Conclusions
Looking at all the data I've got available, I don't think we've got any serious problem on polkadot. With sync backing, the perfect 12s block time is something you only hit in perfect conditions (validators, collators and network); the 2% delta we've got from perfect conditions is explained by the fact that a real network is not perfect. This will be slightly mitigated by the switch to async backing, but even there I don't think a perfect 6s will be achieved, although there is a bit more redundancy.
Some thoughts
Moving forward, at these margins (single-digit divergence from the maximum theoretical performance) I think a closer monitoring system needs to be implemented, to always find the misbehaving validators/collators and notify them or kick them out of the active set. It would be great if we had some programmable tooling that could answer the following questions:
All of this data exists either on-chain, in the logs or in the DHT, but it is not aggregated in a way where you can easily query it to get a full view of the network. For example, https://apps.turboflakes.io/?chain=polkadot#/insights is a good start on what I'm talking about, but it lacks the programmability and is missing a lot of information for the above questions.
Any idea why this happened? Is it likely prioritization on the validator side? Any idea if all the validators are talking to one another, though? 287/331 is way above what we'd notice in availability, but maybe we could have some query that says "validator x never puts validator y into its availability bitfields". Over a long enough time, that would say x cannot really talk to y, or at least that their connection is very slow.
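A minimal sketch of what such a query could look like. The flattened `(signer, backer, bit_set)` observation format is hypothetical: it assumes someone has already joined the on-chain bitfields with the backers of the candidate occupying each core, and `never_available_pairs` is an illustrative name, not an existing tool.

```python
from collections import defaultdict


def never_available_pairs(observations, min_samples=3):
    """Return (signer, backer) pairs where the signer never set its bit.

    observations: iterable of (signer, backer, bit_set) tuples in a
        HYPOTHETICAL pre-joined format: `signer` signed a bitfield while a
        candidate backed by `backer` occupied a core, and `bit_set` tells
        whether the signer marked that core as available.
    min_samples: ignore pairs seen fewer times than this, so a single
        missed bit is not reported as a connectivity problem.
    """
    seen = defaultdict(int)
    hits = defaultdict(int)
    for signer, backer, bit_set in observations:
        seen[(signer, backer)] += 1
        if bit_set:
            hits[(signer, backer)] += 1
    return sorted(
        pair for pair, count in seen.items()
        if count >= min_samples and hits[pair] == 0
    )


# Made-up sample data, for illustration only.
sample = [
    ("x", "y", False), ("x", "y", False), ("x", "y", False),
    ("x", "z", True), ("x", "z", False), ("x", "z", True),
    ("w", "y", True), ("w", "y", True), ("w", "y", True),
]
print(never_available_pairs(sample))
```

A pair showing up here over a long enough window would only be a hint that x has trouble fetching from y; it would not by itself distinguish a missing connection from a very slow one.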
When I wrote that comment it was just a theory, but looking at parity's collators for the blocks in #4569 (comment), there doesn't seem to be any connectivity issue: candidates are always advertised. Given that the backing group size is 5 on polkadot, I think we've got enough redundancy to alleviate any problems there. Connecting to 287 out of 331 authorities is what subp2p-explorer running on my local machine reports. The current theory I discussed with @lexnv is that the rest are behind a NAT/firewall and can't be reached from outside. The good thing is that this does not mean they are not connected to other validators at all, because normally at each session change any two validators A & B will each try to connect to the other, so even if A can't reach B, B will be able to reach A. Prioritization could be an explanation as well.
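As a rough worked example of the argument above: if a pair of validators only fails to connect when neither side can be dialed from outside, the affected share of pairs can be counted directly. This is a sketch under that assumption (it ignores any other way two such nodes might connect), and `unreachable_pair_fraction` is a hypothetical helper.

```python
from math import comb


def unreachable_pair_fraction(total, dialable):
    """Fraction of validator pairs where neither side can be dialed.

    ASSUMPTION: a pair is connected as soon as at least one of the two
    validators accepts inbound connections, so only pairs made up of two
    non-dialable validators are counted as disconnected.
    """
    hidden = total - dialable
    return comb(hidden, 2) / comb(total, 2)


# Figures from the comment above: 287 of 331 authorities were reachable.
fraction = unreachable_pair_fraction(331, 287)
print(f"{comb(331 - 287, 2)} of {comb(331, 2)} pairs ({fraction:.2%})")
```

With the numbers reported here, fewer than 2% of validator pairs would be affected under this assumption, even though about 13% of validators cannot be dialed directly.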
Yeah, that's one of those things we should have tooling for and monitor if we want to reach the maximum parachain block throughput.
If you cannot prove that they have misbehaved, they will not get kicked out of the set. I mean, you can create social pressure by having a list of shame, but that's it.
Yes, that's what I was thinking when I said kick them out.
Just FYI, there is an availability tit-for-tat game in my rewards proposal, which would actually reduce rewards for nodes behind NATs, so that's quite nice.
Discussions in the Fellowship chat did not lead to any actionable outcome. Not sure whom to ping here or what the expected behaviour is.
Polkadot parachains are not producing as many blocks as they should. You can use this script to plot the skew (but it needs adjustments per chain).
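For readers without access to the linked script, the skew can be sketched as follows. This is not that script: `block_time_skew` is a hypothetical helper, and the sample numbers are made up to match a chain that falls 30 days behind over one year at a 12s target.

```python
def block_time_skew(blocks_produced, elapsed_s, target_block_time_s=12.0):
    """Compare produced blocks against the ideal count for the elapsed time.

    Returns (average block time in seconds, missing blocks, lost seconds),
    where lost seconds is the wall-clock time the missing blocks would
    have covered at the target block time.
    """
    expected = elapsed_s / target_block_time_s
    missing = expected - blocks_produced
    average = elapsed_s / blocks_produced
    return average, missing, missing * target_block_time_s


DAY_S = 86_400
BLOCKS_PER_DAY = DAY_S // 12  # 7200 blocks per day at the 12s target

# Illustrative input: 335 days' worth of blocks produced in 365 days.
average, missing, lost_s = block_time_skew(335 * BLOCKS_PER_DAY, 365 * DAY_S)
print(f"average block time: {average:.2f}s")
print(f"missing blocks: {missing:.0f}, lost time: {lost_s / DAY_S:.1f} days")
```

In practice the two inputs would come from the block numbers and timestamps of the first and last block in the window, which is where the per-chain adjustments come in.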
The relay itself is fine:
But PAH is missing 30 days over the year:
And collectives: