Unreachable providers for popular CIDs #49
Comments
On disabling reproviding of third-party blocks on ephemeral nodes: doable, but a bigger lift than just setting Reprovider.Interval to 0. Today, block reproviding is a global flag in Kubo (IPFS Desktop, Brave): we do not distinguish between blocks fetched while browsing websites (temporarily stored in the cache) and blocks imported by the user adding their own data to the local node (either pinned, in MFS, or just in the cache). Both types of data are stored and reprovided by the same code paths, and we can't rely on pinning and MFS to identify user data. That is to say, disabling reproviding only for third-party content is not trivial: to stop reproviding only third-party website data, we would have to introduce separate datastores with different reproviding settings for first-party and third-party blocks in Kubo. A different, somewhat simpler approach would be to keep a single datastore, but instead introduce a new default "auto" Reprovider.Strategy that:
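For reference, the two existing knobs being discussed live in Kubo's JSON config file; a minimal sketch of that section is below (values are illustrative, defaults vary across Kubo versions, and the proposed "auto" strategy is not an existing option today):

```json
{
  "Reprovider": {
    "Interval": "12h",
    "Strategy": "all"
  }
}
```

Setting Interval to "0" disables reproviding entirely, and Strategy currently only distinguishes between values such as "all", "pinned", and "roots", none of which separates first-party from third-party blocks.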
|
@yiannisbot : great data - thanks for surfacing. @lidel : great ideas - thanks for responding. I like your proposal with B. |
Quick update, not directly related to the large number of "unreachable" peers, but rather related to the small number of "reachable" peers in the plot at the top. We're currently monitoring most/all of PL's websites and reporting about some of them here: https://github.com/protocol/network-measurements/tree/master/reports/2023/calendar-week-20/ipfs#website-monitoring. All of these websites are supposed to have at least two stable providers, who pin those websites' CIDs: i) Fleek, ii) PL's pinning cluster. Interestingly, digging a little deeper, it turns out that there are cases where either one, or both of these stable providers do not show as providers of the website CIDs. Pasting some screenshots below (no plots yet). I'll investigate further, but in the meantime:
[Screenshot: drand.love providers] |
Quick update: it turns out that all kubo versions before v0.19.1 have issues with reproviding. The Bifrost and Fleek teams have been informed and asked to update. We'll monitor the situation once their nodes have been updated and report back here before closing this issue. The number of reachable and unreachable providers is also being reported at probelab.io: https://probelab.io/websites/protocol.ai/#website-providers-protocolai (example for the protocol.ai website). |
Is this issue described somewhere (in this issue or in a Kubo issue)? I'd like to make it clear what the problem is that has now been fixed. (I'm not remembering myself...) Thanks! |
@yiannisbot I have not followed this thread closely, but here are the three things on my mind: 1. Kubo
|
Excellent, thanks @Jorropo ! Basically, the issue that @BigLep was asking for is: ipfs/kubo#9722 |
Just confirming that both PL's pinning cluster and Fleek's fleet have upgraded to kubo-v0.20. @dennis-tra when you get a chance, can you produce tables similar to the ones further up (#49 (comment)) to make sure things are working as expected now? At that point, we'll be good to close this issue. As a separate item: do we plan to include these results (on the number of [fleek providers, pinning cluster providers]) as part of the individual websites (e.g., https://probelab.io/websites/filecoin.io/)? We discussed this before, but I can't remember what we agreed. |
I think a graph like this would be great to have 👍 My concern was that we'll always be chasing the most recent PeerIDs to produce these graphs. E.g. now that Fleek and PL's pinning cluster have upgraded to kubo-v0.20, does that come with a new set of PeerIDs? I asked here to get info from Fleek. I think if there's no reliable and automatic way to get these PeerIDs, we won't develop trust in these graphs. |
Our cluster peerIDs don't change between versions |
Thanks, @gmasgras, that's good! Is there a location (perhaps somewhere in an IaC repo) where I could verify this every now and then? Let's say you're scaling up your nodes/peers in a few months; then, we would miss the new peers if we don't get a ping from you or verify this ourselves. I think we can go ahead and produce such graphs that @yiannisbot asked for anyways. I just want to make preparations so that 1) we know if the set of peers has changed and 2) find out the updated list when we notice a change. |
Thanks @dennis-tra ! Let's monitor for a bit longer before closing the issue, unless we continue seeing zero providers from either of the two. |
@yiannisbot and @dennis-tra : I think I am still seeing related issues here. For example, today @momack2 reported to me that she couldn't load https://probelab.io/ with Companion/Kubo. I was able to load it personally, but I believe that is because I had the data cached from viewing the site last night. Digging a little deeper:
https://ipfs.io/ipfs/QmW12bCzQnDWcM9gzEuv7saJVdypCdopHQQRTcEnS6pBXK/ was also hanging. |
Thanks for the heads up! Depending on when exactly that was experienced, it was most likely due to some hiccups we had while making some changes to the website. During that period the site was not reachable more generally :) It should be back up now and load without problems. |
I found that an interesting test, so I tried it too :)
On 2 different nodes I tried: The pl-diagnose tool (test 1, "Is my content on the DHT?") is finding 6 peers:
Just reporting it here. Not sure if it's of value or just noise. |
Thanks Mark! I should note that the site was not configured to be served over IPFS at all until very recently (this week). Checking in more detail, it seems that it's still not pinned on PL's pinning cluster, as per: https://github.com/protocol/bifrost-infra/pull/2606 (private repo, sorry), and the request to pin it on Fleek's cluster was submitted a couple of days ago (cc: @iand), so it might not have been acted upon yet. All that said, it may be the case that the site currently does not have stable providers on the IPFS DHT (which would also explain the large latencies seen over kubo at: https://probelab.io/websites/#websites-web-vitals-heatmap-KUBO-ttfb-p90). Bear with us :) |
I believe this was at the exact time we switched from GitHub Pages to Fleek, which involved a couple of hours of unavailability due to a DNS provider error. Prior to that the site was not accessible via IPFS. There were some announcements in #probe-lab on Filecoin Slack at the time. |
I'm not sure if that timing is coincidental or if something bigger is going on.
Both nodes do
and the other node:
It does look like the diagnose tool is having this same issue this time. Checking it gives me:
By which I assume it found peers providing the data but wasn't able to connect to any of them (because of the error above). Again running:
Does show that we have a new CID now (
This does seem to hint at something bigger going on, and the timing being just a coincidence. |
To be clear to anyone trying to help (thx btw), |
Tried that too, still empty returns. Out of curiosity, a count of
|
Thanks for the input everyone! The fact that there are zero Fleek providers in many cases is not good. We need to investigate further with them. A couple of questions:
|
@yiannisbot anything I can do to clear things up? |
There was a conversation on 2023-06-19 between ProbeLab and some of the Kubo maintainers. We ultimately need to create some issues in Kubo related to code changes here. @dennis-tra is going to create these.

@lidel : one thing I would like to make sure I have a handle on is how much someone needs to change the Kubo defaults to get into this position of providing all blocks. Let me know if my understanding below is incorrect... By default a fresh Kubo installation will use

I basically want to make sure I understand how content like blog.libp2p.io has so many additional providers per https://probelab.io/websites/blog.libp2p.io/#website-trend-providers-bloglibp2pio

Also, how are nodes getting into the "Reachable Relayed" bucket? Are these nodes that are explicitly turning on server mode even though they aren't directly dialable?

For anyone watching this issue, I expect Kubo to address any resulting issues here in early Q3, as otherwise we're failing to meet a baseline practical use case of "IPFS can be used to reliably host/serve static websites".
Thanks for the update @BigLep. I couldn't make it to the meeting, but wanted to follow up with this. Here are a few questions on @lidel 's previous suggestion, which AFAIU is what we're going with.
👍
This requires identifying what we want "online for some time" to mean. We can take several directions based on the node's history of staying online. Was this discussed during the 2023-06-19 meeting? I'm not entirely sure what the item in the parentheses implies exactly in terms of protocol changes.
I agree (B) is a better direction here. If we go with (B), does it mean that we want to give servers the option of manually overriding the default setting if they want to provide content? |
We should analyse the unreachable providers some more.
|
IMO a cleaner and more efficient solution would be to change the protocol and spec (non-breaking change) to include a ttl field in the provider record. Up-to-date DHT servers would discard the provider records after the associated ttl if any, or after the default value if no ttl is provided. The client can be aware of its average uptime. A client that has a low average uptime is likely to provide a small number of CIDs, and can thus use a short reprovide interval (ttl=10 min). Large content providers usually have very long average uptimes, and thus wouldn't change their reproviding interval. Note that reproviding content more often for some set of nodes will increase the load on the DHT servers. |
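Purely as an illustration of the idea (this is not an existing libp2p API or wire format), a DHT server honouring an optional per-record ttl could look roughly like this; the field names and the default value are assumptions:

```go
// Package providerttl sketches the proposed optional TTL on provider records.
// Illustrative only: not the current libp2p provider-record format.
package providerttl

import "time"

// ProviderRecord is a hypothetical record as a DHT server might store it.
type ProviderRecord struct {
	PeerID  string        // provider's peer ID
	AddedAt time.Time     // when the record was last (re)published
	TTL     time.Duration // optional; zero means "not set by the client"
}

// defaultProviderTTL stands in for whatever default expiry servers use today.
const defaultProviderTTL = 24 * time.Hour

// Expired reports whether a server should drop the record at time now.
func Expired(r ProviderRecord, now time.Time) bool {
	ttl := r.TTL
	if ttl == 0 {
		ttl = defaultProviderTTL // no ttl supplied: fall back to the default
	}
	return now.Sub(r.AddedAt) > ttl
}
```

A short-lived client could then publish with a small TTL matching its expected uptime, while long-running providers would keep today's behaviour by omitting the field.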
A few things I think we need here to help with prioritization:

Impact

How much impact is this having (as in how much are users impacted by the fact that we have very few reachable providers for a website CID)? What metrics can we look at to see the impact? I assume we'd expect to see:
Is there some measuring within kad-dht or the Boxo/Kubo code we should do to see how many provider records we had to go through before we got to a peer we could connect with (and measure how long that took)? I want to make sure that as we make improvements here, we have a metric that improves and that translates to direct user experience. Basically I want to substantiate "This significantly lengthens the resolution process to the extent that the whole operation potentially times out" in ipfs/kubo#9982

The main impact I'm aware of is the anecdotes of PL websites not loading for people who use Brave/Kubo or Companion/Kubo. This obviously isn't great, and I want to see it fixed. Knowing how widespread and serious the problem is helps determine how earnest our response is.

How this ends up happening in practice

I would like to make sure we have a shared hypothesis of how we get into this state of many unreachable providers and providers behind relays. Is it what I wrote in #49 (comment)? |
Quick clarification: do you mean with "this" the current situation or the proposed mitigations? In the following, I assume it's the current situation (which, given the rest of your response, seems to make more sense).
I think both are correct. However, one thing to consider is that probably many of the website lookups are resolved via Bitswap. AFAIK there's no way we could know which routing subsystem resolved the content when we query Kubo's gateway. So, I doubt that the retrieval errors would go down significantly (in our measurement). Similarly, the TTFB might also not be as affected by this for the same reason. I can't find the related issue, but I believe I've read somewhere that we want to decrease the number of connected peers for the brave-built-in Kubo node. This change will likely put a spotlight on the issue we're discussing here because we will decrease the chances of Bitswap probabilistically resolving the content.
True, this is just a hypothesis (I added this remark).
Same here. I believe (again, an assumption) that in most of these anecdotal cases, the local Kubo node was not directly connected to the website's provider. This means Bitswap couldn't directly resolve the content. If we share this hypothesis, we'd need to make sure this is also the case in our measurements. Then we might be able to measure if any mitigation has an impact on the retrieval times. However, to truly measure the impact we'd also need to know how often we are, by chance already connected to a website's provider and how often we are not. No idea how we could find this out.
This is probably something @lidel could best comment on? |
Replying with a few thoughts in separate comments, starting with @BigLep 's comments: @BigLep one important thing to distinguish here is that:
Some experiments we have discussed with the ProbeLab team before but didn't do, as we deprioritised them, are:
My personal opinion is that these results would be great to have. We deprioritised them earlier because we didn't know if the results would add value, but it seems they'll now be very valuable. |
I like the |
I think that's already in the brave ipfs node (lowWater/highWater defaults to 20/40 with a grace of 20s). |
Oh wow, thanks @markg85! That's even lower than in the issue that @yiannisbot has found: ipfs/kubo#9420 |
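For reference, the connection manager settings being discussed map to the following part of Kubo's JSON config; the values shown are simply the ones mentioned above for the Brave node and are not verified against Brave's shipped configuration:

```json
{
  "Swarm": {
    "ConnMgr": {
      "Type": "basic",
      "LowWater": 20,
      "HighWater": 40,
      "GracePeriod": "20s"
    }
  }
}
```

With limits that low, the node holds very few standing connections, which reduces the chance that Bitswap happens to be already connected to a provider of the requested content.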
Hi guys. I appreciate the remarks, especially the comment about how Bitswap discovery is likely smoothing things over. I think you all have some good ideas. I personally won't be able to give this more critical thought this week. It's not clear to me yet that we should drop other things that we're doing to prioritize this measurement exercise, but I leave it up to you all to determine if/how to apply resources here during this end-of-quarter and perf crunch time. Another thought is that given the above reasoning, once our measurement tooling peers with one of the providers of the content (Fleek or Collab cluster), the DHT discovery is irrelevant. Given we do multiple runs per website, do all the websites as one task, and let things warm up per https://probelab.io/tools/tiros/ , I assume we quickly start getting into "bitswap discovery" vs. "DHT discovery" mode and thus aren't going to see the impact in our current measurements. |
@BigLep as mentioned earlier, the number 1 item that will allow us to give the right priority to this is understanding the impact of the huge number of unreachable providers. This boils down to:
During the ProbeLab CoLo today, we've identified a few ways that would shed some light here. We'll do an effort estimate tomorrow and report back.
This is true, but for each run the node is restarted. So, it basically comes down to the question of how quickly the node connects to one of the stable providers after being spun up and making the first requests. After that, indeed, it might get everything through Bitswap from that node. |
Copying over some comments from our Co-Lo: We talked about the following measurement methodology: We would have a set of lightweight nodes that support the DHT and Bitswap protocols. We would feed these nodes a static list of CIDs (or IPNS keys) and instruct them to look up provider records in the DHT. We do so by using the API
After we “probed” a CID, we will disconnect from all peers we had found in the provider records. You could argue that disconnecting is not really important because we’re explicitly using the DHT to resolve the CID so that Bitswap cannot interfere. However, when we probe the next CID and find a peer in the provider records that was also a peer for a previous CID, we would likely already be connected to that peer. This means the TTFB would be much shorter because we wouldn’t need to establish a connection first. The above is different enough from our current measurement tools that it wouldn't make sense to squeeze it in somehow. Therefore, we're proposing a new tool. However, we can reuse a lot of the code from our existing infrastructure or from these links that @guillaumemichel provided:
I expect this to take four days of uninterrupted work, which realistically means it's double that. This includes
|
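A minimal sketch of the DHT-only probe described in the previous comment, assuming go-libp2p and go-libp2p-kad-dht in client mode (the CID comes from the command line as a stand-in for the static CID list, and bookkeeping/error handling is stripped down; this is not the actual tool):

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	"github.com/libp2p/go-libp2p/core/peer"
)

func main() {
	ctx := context.Background()

	// Ephemeral host running the DHT in client mode, so the probe never serves records itself.
	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	d, err := dht.New(ctx, h, dht.Mode(dht.ModeClient))
	if err != nil {
		panic(err)
	}

	// Join the public DHT via the default bootstrappers.
	bootstrappers, _ := peer.AddrInfosFromP2pAddrs(dht.DefaultBootstrapPeers...)
	for _, ai := range bootstrappers {
		_ = h.Connect(ctx, ai) // best effort
	}
	_ = d.Bootstrap(ctx)
	time.Sleep(10 * time.Second) // crude: give the routing table a moment to populate

	// One CID from the static list, passed as an argument for this sketch.
	c, err := cid.Decode(os.Args[1])
	if err != nil {
		panic(err)
	}

	// Look up provider records via the DHT only (no Bitswap broadcast), then try to dial
	// each provider and record whether it is actually reachable and how long the dial took.
	provCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
	defer cancel()
	for prov := range d.FindProvidersAsync(provCtx, c, 0) {
		start := time.Now()
		dialCtx, dialCancel := context.WithTimeout(ctx, 15*time.Second)
		dialErr := h.Connect(dialCtx, prov)
		dialCancel()
		fmt.Printf("provider %s reachable=%t dial=%s\n", prov.ID, dialErr == nil, time.Since(start))

		// Disconnect again so the next CID doesn't benefit from an existing connection.
		_ = h.Network().ClosePeer(prov.ID)
	}
}
```

The real tool would loop over the full CID (or IPNS) list, fetch a block to measure TTFB, and export metrics, but the per-provider "found in the DHT vs. actually dialable" bookkeeping above is the part that differs from the existing website measurements.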
@yiannisbot @dennis-tra : understood on the need for new measurement to determine the impact. Do we have an estimate on when this will be completed? |
Since the consequences of this don't seem to be catastrophic, this has been deprioritised for now in favour of the DHT work. We haven't put it on the plan, but I expect that it will be taken up and completed within Q3, or early Q4. Do you agree @dennis-tra ? |
We've recently started measuring the performance of PL websites over kubo. We've been presenting some of these results in our weekly reports and we're also now putting more results at probelab.io (e.g., https://probelab.io/websites/protocol.ai/ for protocol.ai). As a way to get more insight into why the performance is what it is, we have collected the number of providers for each one of them. That will enable us to see if, for instance, there are no providers for a site.
We've found an unexpected result, which might make sense if one gives it a deeper thought: there are a ton of unreachable providers for most of the websites we're monitoring, as shown in the graph below for protocol.ai. Note that protocol.ai should have two stable providers, i.e., the two places where we currently pin the content.
This happens because clients fetch the site, re-provide it and then leave the network, leaving stale records behind. In turn, this means that popular content, which is supposed to be privileged due to the content addressing nature of IPFS, is basically disadvantaged because clients would have to contact tens of "would be" providers before they find one that is actually available.
I'm starting this issue to raise attention to the problem, which should be addressed ASAP, IMO. We've previously discussed a couple of fixes in Slack, such as setting a TTL for provider records equal to the average uptime of the node publishing the provider record. However, this would be a breaking protocol change and would therefore not be easy to deploy before the Composable DHT is in place. Turning off reproviding (temporarily, until we have the Composable DHT) could be another avenue to fix this issue.
Other ideas are more than welcome. Tagging people who contributed to the discussion earlier, or would likely have ideas, or be aware of previous discussion around this issue: @Jorropo @guillaumemichel @aschmahmann @lidel @dennis-tra