Unreachable providers for popular CIDs #49

Open · yiannisbot opened this issue May 18, 2023 · 40 comments

Comments

@yiannisbot
Member

We've recently started measuring the performance of PL websites over kubo. We've been presenting some of these results in our weekly reports, and we're now also publishing more results at probelab.io (e.g., https://probelab.io/websites/protocol.ai/ for protocol.ai). To get more insight into why the performance is what it is, we have also collected the number of providers for each of them. That lets us see, for instance, whether a site has no providers at all.

We've found an unexpected result, which might make sense if one gives it some deeper thought: there are a ton of unreachable providers for most of the websites we're monitoring, as shown in the graph below for protocol.ai. Note that protocol.ai should have two stable providers, i.e., the two nodes where we currently pin the content.

[graph: providers for protocol.ai]

This happens because clients fetch the site, reprovide it, and then leave the network, leaving stale records behind. In turn, this means that popular content, which is supposed to be privileged thanks to IPFS's content addressing, is effectively disadvantaged: clients may have to contact tens of "would-be" providers before they find one that is actually available.

I'm opening this issue to draw attention to the problem, which should be addressed ASAP, IMO. We've previously discussed a couple of fixes in Slack, such as setting a TTL for provider records equal to the average uptime of the node publishing the record. However, this would be a breaking protocol change and would therefore not be easy to deploy before the Composable DHT is in place. Turning off reproviding (temporarily, until we have the Composable DHT) could be another avenue to fix this issue.

Other ideas are more than welcome. Tagging people who contributed to the discussion earlier, or would likely have ideas, or be aware of previous discussion around this issue: @Jorropo @guillaumemichel @aschmahmann @lidel @dennis-tra

@lidel

lidel commented May 18, 2023

On disabling reproviding of third-party blocks on ephemeral nodes

Doable, but a bigger lift than just setting Reprovider.Interval to 0.

Today, block reproviding is a global flag in Kubo (IPFS Desktop, Brave): we do not distinguish between blocks fetched while browsing websites (temporarily stored in the cache) and blocks imported by the user by adding their own data to the local node (either pinned, in MFS, or just in cache). Both types of data are stored and reprovided by the same code paths, and we can't rely on pinning and MFS to identify user data, because ipfs block put and ipfs dag put do not pin by default.

That is to say, disabling reproviding only for third-party content is not trivial: to stop reproviding only third-party website data, we would have to introduce separate datastores with different reproviding settings for first-party and third-party blocks in Kubo.
Content explicitly imported by the user (ipfs add, ipfs dag put, ipfs block put, ipfs dag import), or pinned by the user, would be added/moved to the first-party datastore.

A different, a bit simpler approach would be to keep a single datastore, but instead introduce a new default "auto" Reprovider.Strategy that:

  • always announces pinned content (+implicitly pinned MFS) → ensures user content is always reachable asap
  • announces the remaining blocks in cache (incl. ones that come from browsed websites) ONLY if a node was online for some time (we would add optionalDuration Reprovider.UnpinnedDelay to allow users to adjust the implicit default)
  • TBD how we solve ipfs dag put and ipfs block put or other user content that is not pinned, but expected to "work instantly"
    • (A) we could flip --pin in them to true → breaking change (may surprise users who expect these to not keep garbage around, may lead to services running out of disk space)
    • (B) we could say that the ability for users to set Reprovider.Strategy to all and/or adjust Reprovider.UnpinnedDelay are enough here, ipfs routing provide exists, we could add --all to allow apps/users to manually trigger provide before Reprovider.UnpinnedDelay hits. (feels safer than A, no DoS, worst case a delay in announce on a cold boot)
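
For illustration, here is a minimal Go sketch of the per-key decision such an "auto" strategy could make. isPinned, nodeUptime and unpinnedDelay are hypothetical stand-ins for whatever Kubo would actually wire in (pinner, MFS root walker, config), not existing APIs.

package main

import (
    "fmt"
    "time"

    "github.com/ipfs/go-cid"
)

// shouldReprovide sketches the per-key decision of a hypothetical "auto"
// Reprovider.Strategy: pinned content is always announced, everything else
// only once the node has been online for at least Reprovider.UnpinnedDelay.
func shouldReprovide(c cid.Cid, isPinned func(cid.Cid) bool, nodeUptime, unpinnedDelay time.Duration) bool {
    if isPinned(c) {
        return true // pinned (incl. implicitly pinned MFS): announce asap
    }
    return nodeUptime >= unpinnedDelay // cached third-party blocks: wait
}

func main() {
    c, _ := cid.Decode("QmW12bCzQnDWcM9gzEuv7saJVdypCdopHQQRTcEnS6pBXK")
    notPinned := func(cid.Cid) bool { return false }
    // A node that has only been up for 10 minutes does not yet announce
    // cached blocks when UnpinnedDelay is 1 hour.
    fmt.Println(shouldReprovide(c, notPinned, 10*time.Minute, time.Hour)) // false
}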

@BigLep

BigLep commented May 20, 2023

@yiannisbot : great data - thanks for surfacing.

@lidel : great ideas - thanks for responding. I like your proposal with B.

@yiannisbot
Member Author

Quick update, not directly related to the large number of "unreachable" peers, but rather related to the small number of "reachable" peers in the plot at the top.

We're currently monitoring most/all of PL's websites and reporting about some of them here: https://github.com/protocol/network-measurements/tree/master/reports/2023/calendar-week-20/ipfs#website-monitoring. All of these websites are supposed to have at least two stable providers, who pin those websites' CIDs: i) Fleek, ii) PL's pinning cluster.

Interestingly, digging a little deeper, it turns out that there are cases where either one or both of these stable providers do not show up as providers of the website CIDs. Pasting some screenshots below (no plots yet). I'll investigate further, but in the meantime:

  • I'd love input from @ns4plabs or @gmasgras from the bifrost point of view.
  • any other ideas of what might be happening from the group here.

drand.love providers
[screenshot]
filecoin.io providers
[screenshot]
protocol.ai providers
[screenshot]

@yiannisbot
Member Author

Quick update: it turns out that all kubo versions before v0.19.1 have issues with reproviding. The Bifrost and Fleek teams have been informed and asked to update to kubo-v0.19.1 or later.

We'll monitor the situation once their nodes have been updated and report back here before closing this issue. The number of reachable and unreachable providers is also being reported at probelab.io: https://probelab.io/websites/protocol.ai/#website-providers-protocolai (example for protocol.ai website).

@BigLep

BigLep commented Jun 4, 2023

@yiannisbot:

it turns out that all kubo versions before v0.19.1 have issues with reproviding

Is this issue described somewhere (in this issue or in a Kubo issue)? I'd like to make it clear what the problem is that has now been fixed. (I'm not remembering myself...)

Thanks!

@yiannisbot
Member Author

I don't have a pointer. @Jorropo mentioned this during one of our sync meetings. @Jorropo is there a description of the problem that we could use as a reference?

@Jorropo
Contributor

Jorropo commented Jun 6, 2023

@yiannisbot I have not followed this thread closely; here are the three things on my mind:

1. Kubo < v0.19.1

Our pinning nodes (Bifrost's infra, ...) were all running Kubo < v0.19.1, which has a 5-minute timeout on provide operations. Sadly, this included ProvideMany, which meant that if you were using the accelerated DHT client you had 5 minutes to provide your complete blockstore (which can easily take ~2 hours for a huge blockstore).

This could explain why we don't find OUR nodes even with ipfs dht findprovs -n 10000000 someUnreachablePopularCid. We need to update our pinning nodes before digging deeper (which I believe is happening / has been done now that you've reached out to them).
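
For intuition, here is a simplified Go sketch of that failure mode; it is not the actual Kubo code, and provideManyRouter is a locally defined stand-in for the accelerated DHT client's bulk-provide interface.

package example

import (
    "context"
    "time"

    "github.com/multiformats/go-multihash"
)

// provideManyRouter is a local stand-in; only the method shape matters here.
type provideManyRouter interface {
    ProvideMany(ctx context.Context, keys []multihash.Multihash) error
}

// reprovideAll illustrates the pre-v0.19.1 behaviour: the whole bulk
// reprovide ran under a ~5 minute deadline, so a blockstore that needs
// hours to announce was cancelled with most keys never provided.
func reprovideAll(ctx context.Context, r provideManyRouter, keys []multihash.Multihash) error {
    ctx, cancel := context.WithTimeout(ctx, 5*time.Minute) // far too short for a huge blockstore
    defer cancel()
    return r.ProvideMany(ctx, keys)
}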

2. Bitswap provider records limits

We know that boxo/bitswap will only fetch 10 provider records at a time:
https://github.com/ipfs/boxo/blob/e2fc7f2fd0237afad200d7b0eec8b7a60bdc6644/bitswap/client/internal/providerquerymanager/providerquerymanager.go#L238

providers := pqm.network.FindProvidersAsync(findProviderCtx, k, maxProviders)

https://github.com/ipfs/boxo/blob/e2fc7f2fd0237afad200d7b0eec8b7a60bdc6644/bitswap/client/internal/providerquerymanager/providerquerymanager.go#L17

const maxProviders = 10

From measurements we know there can be hundreds of provider records for these CIDs, and many of them are unreachable.

There is a lot of hardening we could do against this issue, both in the Bitswap client, the DHT client (downloader), the DHT server, and the DHT client (provider).

2.1 track providers over time

An experiment I would like to see would be to track the dead providers for those CIDs and see whether this is a constant set or whether it churns fast. This would give us hints as to whether these are temporary nodes or stable but unreachable ones.

That means, every X minutes (60? idk), do an ipfs dht findprovs -n 1000000 CID and log the results to a file, for all CIDs we are tracking. Also try to connect to all PeerIDs in the set and log the results.
Then, after a few days of data (at least multiple DHT TTL ranges, 3d ~ 1w? idk), graph the PeerIDs that are entering and leaving this set, both for reachable and unreachable peers.
If we see a high level of churn in the unreachable peers, it is likely that those are temporary peers that download the data, advertise it, and then shut down (without cleaning up the advertisements, since we don't do this).
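
For reference, a rough Go sketch of such a polling loop is below (the shell pipeline above works just as well). It assumes an already-constructed libp2p host h and a DHT client rt implementing routing.ContentRouting; logging to a file and graphing are left out.

package tracker

import (
    "context"
    "log"
    "time"

    "github.com/ipfs/go-cid"
    "github.com/libp2p/go-libp2p/core/host"
    "github.com/libp2p/go-libp2p/core/peer"
    "github.com/libp2p/go-libp2p/core/routing"
)

// trackProviders periodically asks the DHT for up to 1,000,000 provider
// records of c, tries to dial each one, and logs which PeerIDs were
// reachable in each sample, so churn in the unreachable set can be graphed.
func trackProviders(ctx context.Context, h host.Host, rt routing.ContentRouting, c cid.Cid, every time.Duration) {
    ticker := time.NewTicker(every)
    defer ticker.Stop()
    for {
        sampleCtx, cancel := context.WithTimeout(ctx, 10*time.Minute)
        var reachable, unreachable []peer.ID
        for info := range rt.FindProvidersAsync(sampleCtx, c, 1000000) {
            if err := h.Connect(sampleCtx, info); err != nil {
                unreachable = append(unreachable, info.ID)
            } else {
                reachable = append(reachable, info.ID)
            }
        }
        cancel()
        log.Printf("cid=%s reachable=%v unreachable=%v", c, reachable, unreachable)
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
        }
    }
}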

2.2 What sampling does FindProvidersAsync use?

A code review of the DHT's code would give us hints: maybe the sampling is skewed and, out of 200 providers with limit = 10, it always returns the same 10. That would be bad, because Bitswap would then continuously try to connect to the same 10 unreachable nodes.

@yiannisbot
Member Author

Excellent, thanks @Jorropo ! Basically, the issue that @BigLep was asking for is: ipfs/kubo#9722

@yiannisbot
Member Author

Just confirming that both PL's pinning cluster and Fleek's fleet have upgraded to kubo-v0.20. @dennis-tra when you get a chance, can you produce tables similar to the ones further up (#49 (comment)) to make sure things are working as expected now? At that point, we'll be good to close this issue.

As a separate item: do we plan to include these results (on the number of [fleek providers, pinning cluster providers]) as part of the individual websites (e.g., https://probelab.io/websites/filecoin.io/)? We discussed this before, but I can't remember what we agreed.

@dennis-tra
Contributor

I think a graph like this would be great to have 👍 My concern was that we'll always be chasing the most recent PeerIDs to produce these graphs. E.g., now that Fleek and PL's pinning cluster have upgraded to kubo-v0.20, does that come with a new set of PeerIDs? I asked here to get info from Fleek.

I think that if there's no reliable and automatic way to get these PeerIDs, we won't develop trust in these graphs.

@gmasgras

gmasgras commented Jun 7, 2023

Does that come with a new set of PeerIDs?

Our cluster peerIDs don't change between versions

@dennis-tra
Contributor

Thanks, @gmasgras, that's good! Is there a location (perhaps somewhere in an IaC repo) where I could verify this every now and then? Let's say you're scaling up your nodes/peers in a few months; then, we would miss the new peers if we don't get a ping from you or verify this ourselves.

I think we can go ahead and produce the graphs that @yiannisbot asked for anyway. I just want to make preparations so that 1) we know if the set of peers has changed and 2) we can find out the updated list when we notice a change.

@dennis-tra
Contributor

Our measurements are not super accurate before 2023-06-06, so the only number that would be concerning to me is filecoin.io on 2023-06-06. The other numbers (before 2023-06-06) should not be over-interpreted. I'll post another update next week.


protocol.ai
[screenshot]

filecoin.io
[screenshot]

drand.love
[screenshot]

@yiannisbot
Member Author

Thanks @dennis-tra ! Let's monitor for a bit longer before closing the issue, unless we continue seeing zero providers from either of the two.

@BigLep

BigLep commented Jun 13, 2023

@yiannisbot and @dennis-tra : I think I am still seeing related issues here. For example, today @momack2 reported to me that she couldn't load https://probelab.io/ with Companion/Kubo. I was able to load it personally, but I believe that is because I had the data cached from viewing the site last night.

Digging a little deeper:

dig +short TXT _dnslink.probelab.io
_dnslink.probelabio.on.fleek.co.
"dnslink=/ipfs/QmW12bCzQnDWcM9gzEuv7saJVdypCdopHQQRTcEnS6pBXK"
ipfs dht findprovs QmW12bCzQnDWcM9gzEuv7saJVdypCdopHQQRTcEnS6pBXK
12D3KooWKzxBzxaNtu9LgVSSt99dVcDrjeNPgS6oT7NBeZ1Fw2UB
12D3KooWMb4fFN5m3Ks4ieYsJ37ZiFpNby6qet4rvMizhRhV3g26
12D3KooWDRYTV5zb7zVzx86MFUcZ2M5wbiMGtFRqsG9RHo1oJpmr
12D3KooWDnhm4MxUYRtPpLQC9yH5F98NY8t5dQAGybo6LpPi8aNH
12D3KooWDooSgX1VtrmdgztWTvARuQZnwAcmPLsF7eEhmqGv8qhK
12D3KooWAsPtiMfsCDocRKwfMXmNRAQEXawsXGzhi2o7oaY9Zgyy
12D3KooWBFHDqg91p4xjS7xTNHLpnmi5sspFGAajnoXCn6kSHRxD
12D3KooWBgBSKDSZ5qEVVzdTjR478GkfhJVJ7G8bL9CRu1ceZ1M4
12D3KooWBzbvHxkV3kBt8TPuMjGwknwsmCyMuy5uBAVPT8NarRXj
12D3KooWHkzMBNpUXG9M9mpPKxoxND2feqHxme6DnZQFbyMBK8Fd
12D3KooWHxzY8TXteMHC1hVBhLdKbqPgDdtcWBwYCVN1aaSzS1hn
12D3KooWEJDGiWXGGwocH192UjCYkyCYw1hE44ita6HcN6QBFk2G
12D3KooWGTCwqogyVRV652pFbfpAjNfSdGHxRzGWXEZEypG7zbAZ
12D3KooWGpDorRskFjbvRJHMGEYa9k8jKt576fqoBicfrP3bz5PY
12D3KooWGrgkcb8VR7ADTR7jaXUHKwFiJRBt73656eWcmU9fu4Ae
12D3KooWM8yq3b4SzkoS5Brb97qYF5HpdsCnYKRQV6xxhrPWj1qL
12D3KooWRK712g4G8nZ41TMndThiZGrqdwcGWhLY62QcmnMboLK8
12D3KooWS5PgiptUJ4CwG3bqCdyEiro2EwYYG5hAQu9GSuNPyQxR
ipfs dht findprovs QmW12bCzQnDWcM9gzEuv7saJVdypCdopHQQRTcEnS6pBXK | xargs ipfs dht findpeer
# EMPTY

https://ipfs.io/ipfs/QmW12bCzQnDWcM9gzEuv7saJVdypCdopHQQRTcEnS6pBXK/ was also hanging.

@yiannisbot
Member Author

yiannisbot commented Jun 14, 2023

Thanks for the heads up! Depending on when exactly that was experienced, it was most likely due to some hiccups we had while making some changes to the website. During that period the site was not reachable more generally :)

It should be back up now and load without problems.

@markg85

markg85 commented Jun 14, 2023

I found that an interesting test, so I tried it too :)

❯ dig +short TXT _dnslink.probelab.io
_dnslink.probelabio.on.fleek.co.
"dnslink=/ipfs/QmNqMcQHZUtgDUXVBKKVdTYVC9CbxkyyR7Pcd5BWDY2Qia"

On 2 different nodes I tried: ipfs dht findprovs QmNqMcQHZUtgDUXVBKKVdTYVC9CbxkyyR7Pcd5BWDY2Qia | xargs ipfs dht findpeer
Both returned nothing.

The pl-diagnose tool (test 1. Is my content on the DHT?) is finding 6 peers:

{
  "error": null,
  "data": {
    "providers": [
      {
        "ID": "12D3KooWSng7jcocuKCmrS6JVvJBm3bsLYsGHM2tPiejtheWjdwx",
        "Addrs": [
          "/ip4/10.10.2.76/udp/4001/quic",
          "/ip4/127.0.0.1/udp/4001/quic",
          "/ip4/149.28.210.113/udp/4001/quic/p2p/12D3KooWSaYiuXw8gGJmBwK3VNGCTdzy3RfG9HkCJibGSbG2k5tN/p2p-circuit",
          "/ip4/10.10.2.76/tcp/4001",
          "/ip6/::1/udp/4001/quic",
          "/ip4/149.28.210.113/tcp/4001/p2p/12D3KooWSaYiuXw8gGJmBwK3VNGCTdzy3RfG9HkCJibGSbG2k5tN/p2p-circuit",
          "/ip6/::1/tcp/4001",
          "/ip4/45.63.49.57/tcp/4001/p2p/12D3KooWCyNM4LDMYjkNLEpgdkL29ysLQdhJaRJZjN63rxjzf5AB/p2p-circuit",
          "/ip4/127.0.0.1/tcp/4001",
          "/ip4/45.63.49.57/udp/4001/quic/p2p/12D3KooWCyNM4LDMYjkNLEpgdkL29ysLQdhJaRJZjN63rxjzf5AB/p2p-circuit"
        ]
      },
      {
        "ID": "12D3KooWJfES5Csgh9e2ZXMJnFUqSeT1L1kTkGtGXM2iBbpMhP28",
        "Addrs": [
          "/ip4/127.0.0.1/tcp/4001",
          "/ip6/::1/tcp/4001",
          "/ip6/::1/udp/4001/quic",
          "/ip4/127.0.0.1/udp/4001/quic",
          "/ip4/167.179.113.48/udp/4001/quic/p2p/12D3KooWKLpjh3pgRk3CkAd46q184NbYSG8DRAdKQngJRkrPQ1kH/p2p-circuit",
          "/ip4/173.212.242.188/tcp/4001/p2p/12D3KooWRQdtPVYvxufMh7xWU9NzyErScmBT3YyzQ5c7DLCEzDjR/p2p-circuit",
          "/ip4/10.11.2.69/tcp/4001",
          "/ip4/167.179.113.48/tcp/4001/p2p/12D3KooWKLpjh3pgRk3CkAd46q184NbYSG8DRAdKQngJRkrPQ1kH/p2p-circuit",
          "/ip4/173.212.242.188/udp/4001/quic/p2p/12D3KooWRQdtPVYvxufMh7xWU9NzyErScmBT3YyzQ5c7DLCEzDjR/p2p-circuit",
          "/ip4/10.11.2.69/udp/4001/quic"
        ]
      },
      {
        "ID": "12D3KooWLSMxgUkAvcnnHUsLJDU8Yo7Y2Nv4nL3qfHM8qCJGPccc",
        "Addrs": [
          "/ip4/10.15.2.91/udp/4001/quic",
          "/ip4/45.32.94.168/tcp/4001/p2p/12D3KooWQXvJPL31XqRcjQ9VyezzevcsmT52EMcqgzTnB4LkFjva/p2p-circuit",
          "/ip4/10.15.2.91/tcp/4001",
          "/ip6/::1/udp/4001/quic",
          "/ip4/127.0.0.1/tcp/4001",
          "/ip6/::1/tcp/4001",
          "/ip4/45.32.94.168/udp/4001/quic/p2p/12D3KooWQXvJPL31XqRcjQ9VyezzevcsmT52EMcqgzTnB4LkFjva/p2p-circuit",
          "/ip6/2002:2d51:2752::2d51:2752/tcp/4001/p2p/12D3KooWM4e964YGB4vkCqR1g1pRpaqiJSmFhrL7ctLw7bg5SAeE/p2p-circuit",
          "/ip4/45.81.39.82/tcp/4001/p2p/12D3KooWM4e964YGB4vkCqR1g1pRpaqiJSmFhrL7ctLw7bg5SAeE/p2p-circuit",
          "/ip4/127.0.0.1/udp/4001/quic"
        ]
      },
      {
        "ID": "12D3KooWGvfULN65snG8NUXQZoYBzqQWf4tszfgyAAA8FXcYkxnu",
        "Addrs": [
          "/ip4/127.0.0.1/tcp/4001",
          "/ip4/45.76.175.24/udp/4001/quic/p2p/12D3KooWSyoWZnbbg4yXjB63BhkdHzRr82DnD5GwbWM8agJoSTVp/p2p-circuit",
          "/ip6/::1/udp/4001/quic",
          "/ip4/10.11.1.124/tcp/4001",
          "/ip4/173.212.221.236/udp/4001/quic/p2p/12D3KooWMMzjNFmAQ2hGkxrsgddnuVb8EHsSPEv1Xjse6eBqWeu9/p2p-circuit",
          "/ip4/10.11.1.124/udp/4001/quic",
          "/ip4/127.0.0.1/udp/4001/quic",
          "/ip4/173.212.221.236/tcp/4001/p2p/12D3KooWMMzjNFmAQ2hGkxrsgddnuVb8EHsSPEv1Xjse6eBqWeu9/p2p-circuit",
          "/ip4/45.76.175.24/tcp/4001/p2p/12D3KooWSyoWZnbbg4yXjB63BhkdHzRr82DnD5GwbWM8agJoSTVp/p2p-circuit",
          "/ip6/::1/tcp/4001"
        ]
      },
      {
        "ID": "12D3KooWKzxBzxaNtu9LgVSSt99dVcDrjeNPgS6oT7NBeZ1Fw2UB",
        "Addrs": [
          "/ip4/127.0.0.1/tcp/4001",
          "/ip6/::1/tcp/4001",
          "/ip6/::1/udp/4001/quic",
          "/ip4/51.15.223.137/udp/4001/quic/p2p/12D3KooWAjDZ4ePoahYcLNwv56gojN6B1u3QFywZychWtqsAmZ3G/p2p-circuit",
          "/ip4/51.15.223.137/tcp/4001/p2p/12D3KooWAjDZ4ePoahYcLNwv56gojN6B1u3QFywZychWtqsAmZ3G/p2p-circuit",
          "/ip4/10.244.1.243/udp/4001/quic",
          "/ip4/127.0.0.1/udp/4001/quic",
          "/ip4/107.173.80.247/udp/4001/quic/p2p/12D3KooWLC9GiBKpW469Dg8RJWin6C2QM4f13Pz5VUKeckfb13L9/p2p-circuit",
          "/ip4/10.244.1.243/tcp/4001"
        ]
      },
      {
        "ID": "12D3KooWMb4fFN5m3Ks4ieYsJ37ZiFpNby6qet4rvMizhRhV3g26",
        "Addrs": [
          "/ip4/188.166.254.30/tcp/4001",
          "/ip4/10.244.1.49/udp/4001/quic",
          "/ip4/10.244.1.49/tcp/4001",
          "/ip6/::1/tcp/4001",
          "/ip4/188.166.254.30/udp/4641/quic",
          "/ip4/127.0.0.1/udp/4001/quic",
          "/ip6/::1/udp/4001/quic",
          "/ip4/127.0.0.1/tcp/4001",
          "/ip4/188.166.254.30/tcp/46939"
        ]
      }
    ]
  }
}

Just reporting it here. Not sure if it's of value or just noise.

@yiannisbot
Member Author

Thanks Mark! I should note that the site was not configured to be served over IPFS at all until very recently (this week). Checking in more detail, it seems that it's still not pinned on PL's pinning cluster, as per https://github.com/protocol/bifrost-infra/pull/2606 (private repo, sorry), and the request to pin it on Fleek's cluster was submitted a couple of days ago (cc: @iand), so it might not have been acted upon yet.

All that said, it can be the case that currently the site does not have stable providers on the IPFS DHT (which would also explain the large latencies seen over kubo at: https://probelab.io/websites/#websites-web-vitals-heatmap-KUBO-ttfb-p90).

Bear with us :)

@iand
Contributor

iand commented Jun 15, 2023

@yiannisbot and @dennis-tra : I think I am still seeing related issues here. For example, today @momack2 reported to me that she couldn't load https://probelab.io/ with Companion/Kubo. I was able to load it personally, but I believe that is because I had the data cached from viewing the site last night.

I believe this was at the exact time we switched from GitHub Pages to Fleek, which involved a couple of hours of unavailability due to a DNS provider error. Prior to that, the site was not accessible via IPFS. There were some announcements in #probe-lab on the Filecoin Slack at the time.

@markg85

markg85 commented Jun 15, 2023

I'm not sure if that timing is coincidental or if something bigger is going on.
Just to verify, I tried it again on 2 nodes.

  1. On Hetzner.
    ipfs dht findprovs QmNqMcQHZUtgDUXVBKKVdTYVC9CbxkyyR7Pcd5BWDY2Qia | xargs ipfs dht findpeer returns nothing.
    The node itself (/ip4/116.203.242.65/tcp/4001/p2p/12D3KooWLVFqkLQGa4rRFmqrVtTnm4CkKrySm8V1cCebxnDmx53N) is dialable, which I verified with pl-diagnose again (awesome tool!).

  2. Local node on fiber internet.
    I explicitly re-verified the open ports, as this sits behind a consumer ISP router. They're open, and pl-diagnose is able to connect to my node just fine.
    But ipfs dht findprovs QmNqMcQHZUtgDUXVBKKVdTYVC9CbxkyyR7Pcd5BWDY2Qia | xargs ipfs dht findpeer still returns nothing.

Both nodes return provider records from findprovs, just no peer addresses.

❯ ipfs dht findprovs QmNqMcQHZUtgDUXVBKKVdTYVC9CbxkyyR7Pcd5BWDY2Qia
12D3KooWAWs3DKyTXc4k9MXeqPW4w68jTGWUfAziSMKije3BiNQs
12D3KooWAYbmuayoXZJgqFNE6uHbD8cJFD53YiEbTf1rPt1AocFk
12D3KooWBYUS2fFj6qwoWwFvbf4vFaxXotY1GNCVCFs3mrdtEhVZ
12D3KooWBbFwQnbWyEsK6sVRATHQGTrBt6GGDnU8UcyCZhmmbJF1
12D3KooWBtE2HtXm1nkgaTKRqLJurQ5ZHhyYz2zGGJA7ZSy6gL1G
12D3KooWCQtWXgVqXoT2Xar94wr7TbMwzj9UNzRtt489zgsiefBC
12D3KooWCf4aBcT4qfvmfb1TuVEFyHCf7oSpJXpaKC8CYYWyM4Um
12D3KooWDJffbv8UTTyi3kpzZgiQzTmQUcNKg5FQvRcCasojNvGN
12D3KooWJ3jBrkTFHCrRLoYSLVqmx5rtTH2Y9ekfdKv2eTFNqVTM
12D3KooWECiG4BHxGHaBvZcDTaLpwYjrt7w82f887buE7ZyCXB8d
12D3KooWGm89QhvwpNgjPhvAvrs66D9ncim4AvNXTAxrCbBBYyDM
12D3KooWGmj7KuYuvyURYzgLJq1kXSB4dMeMQvQCKxrNv8YrcxEv
12D3KooWGvfULN65snG8NUXQZoYBzqQWf4tszfgyAAA8FXcYkxnu
12D3KooWHRS1h1e3uh14uKvZGvUcD1TSjbinzWxYgpHQhjS1FwCG
12D3KooWNGW6w3LKn8zu84C82vyLdfMNu3c9FqWQdSwzJAKcWWUV
12D3KooWJfES5Csgh9e2ZXMJnFUqSeT1L1kTkGtGXM2iBbpMhP28
12D3KooWLCDxcYvx5tUmyguhkzZ45XETzMPizveFJTc97mdGQU5G
12D3KooWLSMxgUkAvcnnHUsLJDU8Yo7Y2Nv4nL3qfHM8qCJGPccc
12D3KooWMb4fFN5m3Ks4ieYsJ37ZiFpNby6qet4rvMizhRhV3g26
12D3KooWSunHX34hxqhRJ382cxKQ8uXtwM445EoW8SyuAwfQyLfZ

and the other node:

❯ ipfs dht findprovs QmNqMcQHZUtgDUXVBKKVdTYVC9CbxkyyR7Pcd5BWDY2Qia
12D3KooWAWs3DKyTXc4k9MXeqPW4w68jTGWUfAziSMKije3BiNQs
12D3KooWBYUS2fFj6qwoWwFvbf4vFaxXotY1GNCVCFs3mrdtEhVZ
12D3KooWBbFwQnbWyEsK6sVRATHQGTrBt6GGDnU8UcyCZhmmbJF1
12D3KooWDJffbv8UTTyi3kpzZgiQzTmQUcNKg5FQvRcCasojNvGN
12D3KooWECiG4BHxGHaBvZcDTaLpwYjrt7w82f887buE7ZyCXB8d
12D3KooWGm89QhvwpNgjPhvAvrs66D9ncim4AvNXTAxrCbBBYyDM
12D3KooWHRS1h1e3uh14uKvZGvUcD1TSjbinzWxYgpHQhjS1FwCG
12D3KooWNGW6w3LKn8zu84C82vyLdfMNu3c9FqWQdSwzJAKcWWUV
12D3KooWLCDxcYvx5tUmyguhkzZ45XETzMPizveFJTc97mdGQU5G
12D3KooWSunHX34hxqhRJ382cxKQ8uXtwM445EoW8SyuAwfQyLfZ
12D3KooWQEJsbohVu1TNWxX3tkRMFhPA2HEozHMRELE8r6JbUpRE
12D3KooWAYbmuayoXZJgqFNE6uHbD8cJFD53YiEbTf1rPt1AocFk
12D3KooWBtE2HtXm1nkgaTKRqLJurQ5ZHhyYz2zGGJA7ZSy6gL1G
12D3KooWCQtWXgVqXoT2Xar94wr7TbMwzj9UNzRtt489zgsiefBC
12D3KooWCf4aBcT4qfvmfb1TuVEFyHCf7oSpJXpaKC8CYYWyM4Um
12D3KooWJ3jBrkTFHCrRLoYSLVqmx5rtTH2Y9ekfdKv2eTFNqVTM
12D3KooWGmj7KuYuvyURYzgLJq1kXSB4dMeMQvQCKxrNv8YrcxEv
12D3KooWGvfULN65snG8NUXQZoYBzqQWf4tszfgyAAA8FXcYkxnu
12D3KooWJfES5Csgh9e2ZXMJnFUqSeT1L1kTkGtGXM2iBbpMhP28
12D3KooWLSMxgUkAvcnnHUsLJDU8Yo7Y2Nv4nL3qfHM8qCJGPccc

It does look like the diagnose tool is having this same issue this time. Checking it gives me:

{
  "error": null,
  "data": {
    "providers": [
      {
        "ID": "12D3KooWAWs3DKyTXc4k9MXeqPW4w68jTGWUfAziSMKije3BiNQs",
        "Addrs": []
      },
      {
        "ID": "12D3KooWAYbmuayoXZJgqFNE6uHbD8cJFD53YiEbTf1rPt1AocFk",
        "Addrs": []
      },
      {
        "ID": "12D3KooWBYUS2fFj6qwoWwFvbf4vFaxXotY1GNCVCFs3mrdtEhVZ",
        "Addrs": []
      },
      {
        "ID": "12D3KooWBbFwQnbWyEsK6sVRATHQGTrBt6GGDnU8UcyCZhmmbJF1",
        "Addrs": []
      },
      {
        "ID": "12D3KooWBtE2HtXm1nkgaTKRqLJurQ5ZHhyYz2zGGJA7ZSy6gL1G",
        "Addrs": []
      },
      {
        "ID": "12D3KooWCQtWXgVqXoT2Xar94wr7TbMwzj9UNzRtt489zgsiefBC",
        "Addrs": []
      },
      {
        "ID": "12D3KooWCf4aBcT4qfvmfb1TuVEFyHCf7oSpJXpaKC8CYYWyM4Um",
        "Addrs": []
      },
      {
        "ID": "12D3KooWDJffbv8UTTyi3kpzZgiQzTmQUcNKg5FQvRcCasojNvGN",
        "Addrs": []
      },
      {
        "ID": "12D3KooWJ3jBrkTFHCrRLoYSLVqmx5rtTH2Y9ekfdKv2eTFNqVTM",
        "Addrs": []
      },
      {
        "ID": "12D3KooWECiG4BHxGHaBvZcDTaLpwYjrt7w82f887buE7ZyCXB8d",
        "Addrs": []
      },
      {
        "ID": "12D3KooWGm89QhvwpNgjPhvAvrs66D9ncim4AvNXTAxrCbBBYyDM",
        "Addrs": []
      },
      {
        "ID": "12D3KooWGmj7KuYuvyURYzgLJq1kXSB4dMeMQvQCKxrNv8YrcxEv",
        "Addrs": []
      },
      {
        "ID": "12D3KooWGvfULN65snG8NUXQZoYBzqQWf4tszfgyAAA8FXcYkxnu",
        "Addrs": []
      },
      {
        "ID": "12D3KooWHRS1h1e3uh14uKvZGvUcD1TSjbinzWxYgpHQhjS1FwCG",
        "Addrs": []
      },
      {
        "ID": "12D3KooWNGW6w3LKn8zu84C82vyLdfMNu3c9FqWQdSwzJAKcWWUV",
        "Addrs": []
      },
      {
        "ID": "12D3KooWJfES5Csgh9e2ZXMJnFUqSeT1L1kTkGtGXM2iBbpMhP28",
        "Addrs": []
      },
      {
        "ID": "12D3KooWLCDxcYvx5tUmyguhkzZ45XETzMPizveFJTc97mdGQU5G",
        "Addrs": []
      },
      {
        "ID": "12D3KooWLSMxgUkAvcnnHUsLJDU8Yo7Y2Nv4nL3qfHM8qCJGPccc",
        "Addrs": []
      },
      {
        "ID": "12D3KooWSunHX34hxqhRJ382cxKQ8uXtwM445EoW8SyuAwfQyLfZ",
        "Addrs": []
      },
      {
        "ID": "12D3KooWSng7jcocuKCmrS6JVvJBm3bsLYsGHM2tPiejtheWjdwx",
        "Addrs": []
      }
    ]
  }
}

By which I assume it found peers providing the data but wasn't able to connect to any of them (hence the empty "Addrs": []).

Again running:

❯ dig +short TXT _dnslink.probelab.io
_dnslink.probelabio.on.fleek.co.
"dnslink=/ipfs/QmdABJRBjLHpKeXKZaUSZTybiAo1zinWbZCTrxrseh7gL9"

This shows that we have a new CID now (QmdABJRBjLHpKeXKZaUSZTybiAo1zinWbZCTrxrseh7gL9, as opposed to QmNqMcQHZUtgDUXVBKKVdTYVC9CbxkyyR7Pcd5BWDY2Qia that I was testing). This new CID gives the same result on my nodes as above. The diagnose tool has slightly better results:

{
  "error": null,
  "data": {
    "providers": [
      {
        "ID": "12D3KooWABVA7j4to3gd5iyXYEcsk75BhA2AZkjdBPXoC4ogodbe",
        "Addrs": []
      },
      {
        "ID": "12D3KooWCzy4VXQKPrTyDBhgNRKTfu2hFLD16UWfRWsfLSsLMb7z",
        "Addrs": []
      },
      {
        "ID": "12D3KooWETxjq5Zd2ykkRR64EocmcSXiGZaBFCfsLqHzAXsUdFBV",
        "Addrs": []
      },
      {
        "ID": "12D3KooWF5kx74Q3wWtA9duGH2TiiszFqujV5nkm6F3TCJUyQppc",
        "Addrs": []
      },
      {
        "ID": "12D3KooWHQFcTDQ9tAmry1Xe7Tr7UQoZwtY4SzHyfhFF3zHL3DHN",
        "Addrs": []
      },
      {
        "ID": "12D3KooWHavUihGWE3RSyqrL8PuMvGQTt2MT94i3cbEp2xPAhzL2",
        "Addrs": []
      },
      {
        "ID": "12D3KooWJUBaTkyoU9LUEE34yRa2xRgvoSEZjvYHppNTW38wv27v",
        "Addrs": []
      },
      {
        "ID": "12D3KooWJVYkh57EV7vyksxVksTheTdNrtAk7yEUgJpRj9XQXhaN",
        "Addrs": [
          "/ip4/157.90.132.176/tcp/4007",
          "/ip4/157.90.132.176/udp/12129/quic",
          "/ip4/157.90.132.176/udp/29824/quic",
          "/ip4/127.0.0.1/tcp/4007",
          "/ip4/127.0.0.1/udp/4007/quic"
        ]
      },
      {
        "ID": "12D3KooWKzxBzxaNtu9LgVSSt99dVcDrjeNPgS6oT7NBeZ1Fw2UB",
        "Addrs": [
          "/ip4/10.244.1.243/tcp/4001",
          "/ip4/51.15.223.137/tcp/4001/p2p/12D3KooWAjDZ4ePoahYcLNwv56gojN6B1u3QFywZychWtqsAmZ3G/p2p-circuit",
          "/ip4/193.8.130.182/tcp/4001/p2p/12D3KooWNmHPmCCDfcbkEE94YzcqWHPBmnCibVn8mfpohkAUKZYR/p2p-circuit",
          "/ip4/193.8.130.182/udp/4001/quic/p2p/12D3KooWNmHPmCCDfcbkEE94YzcqWHPBmnCibVn8mfpohkAUKZYR/p2p-circuit",
          "/ip6/::1/tcp/4001",
          "/ip4/10.244.1.243/udp/4001/quic",
          "/ip4/127.0.0.1/udp/4001/quic",
          "/ip6/::1/udp/4001/quic",
          "/ip4/51.15.223.137/udp/4001/quic/p2p/12D3KooWAjDZ4ePoahYcLNwv56gojN6B1u3QFywZychWtqsAmZ3G/p2p-circuit",
          "/ip4/127.0.0.1/tcp/4001"
        ]
      },
      {
        "ID": "12D3KooWLB6zRFsqqgv5rc37A6ohVEPWsZNQGLMBj4Wp7c9umYJ9",
        "Addrs": []
      },
      {
        "ID": "12D3KooWLj94sjxyu3aurTzyWZjkXZ8fK4R5RR15sZcjEqHBrCSw",
        "Addrs": []
      },
      {
        "ID": "12D3KooWLmDRDBXdvha8iUtMvkLvnKyXu3kvX6Wvnf1p9jWpjsW5",
        "Addrs": []
      },
      {
        "ID": "12D3KooWMNyNxS9ZUVg4XGvQgQPCrPpMdrPfhNJTKg48cFyUjTMR",
        "Addrs": []
      },
      {
        "ID": "12D3KooWMGzqCJWpYjwjDqwB5xRbXLfHs6raQy2aFhfuVBmCy1dJ",
        "Addrs": []
      },
      {
        "ID": "12D3KooWMXbZgGVdZNMwfy9o3G3YDYdEgRyRFvUxbVYpXVeQoaqW",
        "Addrs": []
      },
      {
        "ID": "12D3KooWMb4fFN5m3Ks4ieYsJ37ZiFpNby6qet4rvMizhRhV3g26",
        "Addrs": []
      },
      {
        "ID": "12D3KooWSRuwsRy9gSarV9SAMwMr1zdQqSuxUmPcYD1eak73T6qS",
        "Addrs": []
      },
      {
        "ID": "12D3KooWSqboAoCwcprGSepvTJTqLGMiWhoE6jkGjZKQte8zFdY3",
        "Addrs": []
      },
      {
        "ID": "12D3KooWR98E6JwwpjKh6bv2Bs5iSbg7D6tWG1bVnMipW7CgrA5X",
        "Addrs": []
      },
      {
        "ID": "12D3KooWS6RknQm8Ku36ATdbsRfcNBmzNh5JS2zGD3m5zkhJEXTU",
        "Addrs": [
          "/ip4/157.90.132.176/tcp/4009",
          "/ip4/157.90.132.176/udp/38040/quic",
          "/ip4/127.0.0.1/tcp/4009",
          "/ip4/127.0.0.1/udp/4009/quic",
          "/ip4/157.90.132.176/udp/52829/quic"
        ]
      }
    ]
  }
}

This does seem to hint at something bigger going on, with the timing being just a coincidence.

@Jorropo
Contributor

Jorropo commented Jun 15, 2023

To be clear for anyone trying to help (thx btw), ipfs dht findprovs on its own is not really useful, as you are only sampling a really small part of the list. Try ipfs dht findprovs -n 100000 Qmfoo too, to query all providers.

@markg85

markg85 commented Jun 15, 2023

Tried that too, still empty returns.
ipfs dht findprovs QmdABJRBjLHpKeXKZaUSZTybiAo1zinWbZCTrxrseh7gL9 -n 100000 | xargs ipfs dht findpeer

Out of curiosity a count of findprovs:

❯ ipfs dht findprovs QmdABJRBjLHpKeXKZaUSZTybiAo1zinWbZCTrxrseh7gL9 -n 100000 | wc -l
21

@dennis-tra
Contributor

Quick update with the most recent data:

protocol.ai
[screenshot]

filecoin.io
[screenshot]

drand.love
[screenshot]

@yiannisbot
Member Author

Thanks for the input everyone! The fact that there are zero Fleek providers in many cases is not good. We need to investigate further with them. A couple of questions:

  • @dennis-tra can you remind me if the count includes only reachable providers, or both reachable and unreachable?
  • in @markg85 's samples of provider results, why do we have PeerIDs without the corresponding multiaddresses? These are supposed to be returned together after our latest fixes, no? Or is this a different case and I'm missing something?

@markg85

markg85 commented Jun 19, 2023

@yiannisbot anything I can do to clear things up?
I used pl-diagnose and ipfs 0.20 for the command outputs.

@BigLep

BigLep commented Jun 19, 2023

There was a conversation on 2023-06-19 between ProbeLab and some of the Kubo maintainers. We ultimately need to create some issues in Kubo related to code changes here. @dennis-tra is going to create these.

@lidel : one thing I would like to make sure I have a handle on is how much someone needs to change the Kubo defaults to get into this position of providing all blocks. Let me know if my understanding below is incorrect...

By default a fresh Kubo installation will use Routing.Type = "auto". Assuming the node is directly dialable, it will function as a DHT server. Given the default Reprovider.Strategy = "all", it means that any blocks that these nodes have in their blockstore will have a provider record published to the DHT pointing to the node.

I basically want to make sure I understand how content like blog.libp2p.io has so many additional providers per https://probelab.io/websites/blog.libp2p.io/#website-trend-providers-bloglibp2pio

Also, how are nodes getting into the "Reachable Relayed" bucket? Are these nodes that are explicitly turning on server mode even though they aren't directly dialable?


For anyone watching this issue, I expect Kubo to address any resulting issues here in early Q3, as otherwise we're failing to meet a baseline practical use case of "IPFS can be used to reliably host/serve static websites".

@yiannisbot
Member Author

Thanks for the update @BigLep. I couldn't make it to the meeting, but wanted to follow up with this. Here are a few questions on @lidel 's previous suggestion, which AFAIU is what we're going with.

A different, a bit simpler approach would be to keep a single datastore, but instead introduce a new default "auto" Reprovider.Strategy that:

  • always announces pinned content (+implicitly pinned MFS) → ensures user content is always reachable asap

👍

  • announces the remaining blocks in cache (incl. ones that come from browsed websites) ONLY if a node was online for some time (we would add optionalDuration Reprovider.UnpinnedDelay to allow users to adjust the implicit default)

This requires identifying what we want "online for some time" to mean. We can take several directions based on the node's history of staying online. Was this discussed during the 2023-06-19 meeting? I'm not entirely sure what the item in the parentheses implies exactly in terms of protocol changes.

  • TBD how we solve ipfs dag put and ipfs block put or other user content that is not pinned, but expected to "work instantly"
    • (A) we could flip --pin in them to true → breaking change (may surprise users who expect these to not keep garbage around, may lead to services running out of disk space)
    • (B) we could say that the ability for users to set Reprovider.Strategy to all and/or adjust Reprovider.UnpinnedDelay are enough here, ipfs routing provide exists, we could add --all to allow apps/users to manually trigger provide before Reprovider.UnpinnedDelay hits. (feels safer than A, no DoS, worst case a delay in announce on a cold boot)

I agree (B) is a better direction here. Going with (B), does it mean that we want to give the servers the option of manually overriding the default setting if they want to provide content?

@iand
Contributor

iand commented Jun 20, 2023

We should analyse the unreachable providers some more.

  1. What are their average uptimes? How many times have they been seen in the network?
  2. Are they distinct nodes in the network, or are they rotating peer IDs on the same host? (can we relate IP address to peer ID?)
  3. Can we tell what type of node they are? We presume they are IPFS Companion, but are they?
  4. Do we see peer IDs in the "reachable" category one day and the "unreachable" category the next (or vice versa)? Do we ever see the unreachable peer IDs become reachable?
  5. How would we distinguish this from an attack on PL websites? Could we also monitor some non-PL websites to compare?

@guillaumemichel
Contributor

IMO a cleaner and more efficient solution would be to change the protocol and spec (a non-breaking change) to include a TTL field in the Kademlia Provide request. The new TTL field in the protobuf would be ignored by older nodes and read by the new ones.

Up-to-date DHT servers would discard provider records after the associated TTL, if any, or after the default value if no TTL is provided.

The client can be aware of its average uptime. A client with a low average uptime is likely to provide a small number of CIDs and can thus have a short reprovide interval (ttl = 10 min). Large content providers usually have very long average uptimes, and thus wouldn't change their reproviding interval.

Note that reproviding content more often for some set of nodes will increase the load on DHT servers.
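
To make the idea concrete, here is a hedged Go sketch of a TTL-aware provider store on the server side: records carrying the (proposed, not yet existing) TTL field expire after that TTL, while records from older nodes fall back to the default. The types and the 48h default are illustrative, not the actual go-libp2p-kad-dht provider store.

package providerstore

import (
    "sync"
    "time"
)

// defaultTTL stands in for the current default provider-record expiry; the
// exact value here is illustrative.
const defaultTTL = 48 * time.Hour

// store keeps one expiry timestamp per (key, provider) pair.
type store struct {
    mu      sync.Mutex
    expires map[string]map[string]time.Time // key -> peer ID -> expiry
}

// AddProvider records a provider; ttl <= 0 means the message carried no TTL
// field (an older node), so the default applies.
func (s *store) AddProvider(key, peerID string, ttl time.Duration) {
    if ttl <= 0 {
        ttl = defaultTTL
    }
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.expires == nil {
        s.expires = map[string]map[string]time.Time{}
    }
    if s.expires[key] == nil {
        s.expires[key] = map[string]time.Time{}
    }
    s.expires[key][peerID] = time.Now().Add(ttl)
}

// Providers returns peers whose records have not expired and drops the rest.
func (s *store) Providers(key string) []string {
    s.mu.Lock()
    defer s.mu.Unlock()
    var live []string
    now := time.Now()
    for p, exp := range s.expires[key] {
        if now.Before(exp) {
            live = append(live, p)
        } else {
            delete(s.expires[key], p)
        }
    }
    return live
}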

@BigLep

BigLep commented Jun 20, 2023

A few things I think we need here to help with prioritization:

Impact

How much impact is this having (as in how much are users impacted by the fact that we have very few reachable providers for a website CID)? What metrics can we look at to see the impact? I assume we'd expect to see:

  1. more "IPFS Retrieval Errors"
  2. increased TTFB (I was thinking that comparing against HTTP is one proxy, but at least for ipld.io, it seems Kubo is always faster than HTTP)

Is there some measuring within kad-dht or the Boxo/Kubo code we should do to see how many provider records we had to go through before we got to a peer we could connect with (and how long that took)? I want to make sure that, as we make improvements here, we have a metric that improves and that translates to direct user experience.

Basically I want to substantiate "This significantly lengthens the resolution process to the extent that the whole operation potentially times out" in ipfs/kubo#9982

The main impact I'm aware of are the anecdotes of PL websites not loading for people who use Brave/Kubo or Companion/Kubo. This obviously isn't great, and I want to see it fixed. Knowing how widespread and serious the problem is helps determine how earnest our response is.

How this ends up happening in practice

I would like to make sure we have a shared hypothesis of how we get into this state of many unreachable providers and providers behind relays. Is it what I wrote in #49 (comment) ?

@dennis-tra
Contributor

How much impact is this having?

Quick clarification: do you mean by "this" the current situation or the proposed mitigations? In the following, I assume it's the current situation (which, given the rest of your response, seems to make more sense).

  1. more "IPFS Retrieval Errors"
  2. increased TTFB

I think both are correct. However, one thing to consider is that probably many of the website lookups are resolved via Bitswap. AFAIK there's no way we could know which routing subsystem resolved the content when we query Kubo's gateway. So, I doubt that the retrieval errors would go down significantly (in our measurement). Similarly, the TTFB might also not be as affected by this for the same reason.

I can't find the related issue, but I believe I've read somewhere that we want to decrease the number of connected peers for the brave-built-in Kubo node. This change will likely put a spotlight on the issue we're discussing here because we will decrease the chances of Bitswap probabilistically resolving the content.

Basically I want to substantiate "This significantly lengthens the resolution process to the extent that the whole operation potentially times out"

True, this is just a hypothesis (I added this remark).

The main impact I'm aware of are the anecdotes of PL websites not loading for people who use Brave/Kubo or Companion/Kubo. This obviously isn't great, and I want to see it fixed. Knowing how widespread and serious the problem is helps determine how earnest our response is.

Same here.

I believe (again, an assumption) that in most of these anecdotal cases, the local Kubo node was not directly connected to the website's provider. This means Bitswap couldn't directly resolve the content. If we share this hypothesis, we'd need to make sure this is also the case in our measurements. Then we might be able to measure if any mitigation has an impact on the retrieval times.

However, to truly measure the impact we'd also need to know how often we are, by chance already connected to a website's provider and how often we are not. No idea how we could find this out.

How this ends up happening in practice

This is probably something @lidel could best comment on?

@yiannisbot
Member Author

Replying with a few thoughts in separate comments, starting with @BigLep 's comments:

@BigLep one important thing to distinguish here is that:

  • in this issue we're focusing on DHT performance.
  • in https://probelab.io/websites/ipld.io (and for all other websites we monitor), we're getting kubo's performance - which means that there are very good chances (although we don't know/haven't measured this) the websites (since they're popular, presumably) are served over Bitswap and the negative impact of what we're discussing here is not really felt. (I don't think websites are fed into cid.contact, so I'm not considering that path as a potential solution).

Some experiments that we have discussed with the ProbeLab team before, but didn't run because we deprioritised them, are:

  • build an experiment where we isolate the DHT from Bitswap and see where the websites are served from. Right now, we don't know. @dennis-tra mentioned that this requires some work, but see my next point on whether we can bypass the extra development and get the same results.
  • we have a fleet of nodes that use the DHT only for the DHT Lookup latency experiments. Instead of feeding the fleet with random CIDs (as we do), we feed them the website CIDs. We can then compare the lookup performance we see with: i) what we see at https://probelab.io/websites and (hopefully, approximately) understand if websites are served over Bitswap or not, ii) what impact the huge number of unreachable providers has on our websites if searched explicitly through the DHT.
  • we sniff a random CID from the network (but not a custom/gibberish one, as we currently use for the DHT Lookup) and feed that into the fleet for the DHT Lookup that we have. We can then compare the lookup time for popular CIDs (our websites) and the random CID (which we assume won't be very popular).

My personal opinion is that these results would be great to have. We deprioritised them earlier because we didn't know if the results would add value, but it now seems they'll be very valuable.

@yiannisbot
Member Author

I like the TTL approach that @guillaumemichel is proposing (#49 (comment)) and I think it's something worth doing, but with this alone we'll still be in the dark as to what impact this is having. Figuring out some of the details that @iand brings up (#49 (comment)) would definitely be valuable both for now and for the future (when we have the TTL solution implemented), I believe.

@markg85

markg85 commented Jun 21, 2023

@dennis-tra

I can't find the related issue, but I believe I've read somewhere that we want to decrease the number of connected peers for the brave-built-in Kubo node.

brave/brave-browser#22068

I think that's already in the brave ipfs node (lowWater/highWater defaults to 20/40 with a grace of 20s).

@dennis-tra
Contributor

Oh wow, thanks @markg85! That's even lower than in the issue that @yiannisbot has found: ipfs/kubo#9420

@BigLep

BigLep commented Jun 21, 2023

Hi guys. I appreciate the remarks, especially the comment about how Bitswap discovery is likely smoothing things over. I think you all have some good ideas. I personally won't be able to give this more critical thought this week. It's not clear to me yet that we should drop other things that we're doing to prioritize this measurement exercise, but I leave it up to you all to determine if/how to apply resources here during this end-of-quarter and perf crunch time.

Another thought is that, given the above reasoning, once our measurement tooling peers with one of the providers of the content (Fleek or Collab cluster), the DHT discovery is irrelevant. Given that we do multiple runs per website, do all the websites as one task, and let things warm up per https://probelab.io/tools/tiros/ , I assume we quickly get into "bitswap discovery" vs. "DHT discovery" mode and thus aren't going to see the impact in our current measurements.

@yiannisbot
Member Author

@BigLep as mentioned earlier, the number 1 item that will allow us to give the right priority to this is understanding the impact of the huge number of unreachable providers. This boils down to:

  • the website performance that we see right now is acceptable/not alarming for all websites we monitor.
  • if all responses come from Bitswap, then the picture we're looking at is blurry.
  • if not all responses come from Bitswap, but the DHT is involved in the discovery, then the big number of unreachable providers is somehow magically dealt with by kubo :-D In this case, we're good to park this issue for later investigation.

During the ProbeLab CoLo today, we've identified a few ways that would shed some light here. We'll do an effort estimate tomorrow and report back.


Another thought is that, given the above reasoning, once our measurement tooling peers with one of the providers of the content (Fleek or Collab cluster), the DHT discovery is irrelevant. Given that we do multiple runs per website, do all the websites as one task, and let things warm up per https://probelab.io/tools/tiros/ , I assume we quickly get into "bitswap discovery" vs. "DHT discovery" mode and thus aren't going to see the impact in our current measurements.

This is true, but for each run the node is restarted. So it basically comes down to the question of how quickly, after being spun up and making the first requests, the node connects to one of the stable providers. After that, indeed, it might get everything through Bitswap from that node.

@dennis-tra
Contributor

Copying over some comments from our Co-Lo:

We talked about the following measurement methodology:

We would have a set of lightweight nodes that support the DHT and Bitswap protocols. We would feed these nodes a static list of CIDs (or IPNS keys) and instruct them to look up provider records in the DHT. We do so by using the FindProvidersAsync API, which Bitswap also uses internally when the user requests content in Kubo. This method returns a channel that we then consume. On that channel, we receive provider records in the order in which they are discovered in the DHT. For each discovered provider record, we spawn a new go-routine (with a maximum concurrency of 10, which is the same as in Kubo/go-bitswap, I believe). In each go-routine, we pass the provider information and CID to Bitswap to connect to the remote peer and exchange the blocks for us. We'll measure along the way:

  • Total providers records found
  • Time to look-up the first provider record
  • Time to look-up the first reachable provider
  • Time to first byte from first reachable provider
  • more?

After we have "probed" a CID, we disconnect from all peers we found in the provider records. You could argue that disconnecting is not really important because we're explicitly using the DHT to resolve the CID, so Bitswap cannot interfere. However, if, when we probe the next CID, we find a peer in its provider records that also provided a previous CID, we would likely already be connected to that peer. This means the TTFB would be much shorter because we wouldn't need to establish a connection first.
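
For concreteness, a rough Go sketch of that probing loop follows. It assumes an already-running libp2p host h and a DHT client rt; the Bitswap block fetch and the post-probe disconnect are omitted, and the concurrency limit mirrors bitswap's maxProviders = 10.

package probe

import (
    "context"
    "log"
    "sync"
    "time"

    "github.com/ipfs/go-cid"
    "github.com/libp2p/go-libp2p/core/host"
    "github.com/libp2p/go-libp2p/core/peer"
    "github.com/libp2p/go-libp2p/core/routing"
)

// probeCID consumes provider records for c as they arrive from the DHT and
// dials each one with a concurrency limit of 10, recording the time to the
// first provider record and the time to the first reachable provider.
func probeCID(ctx context.Context, h host.Host, rt routing.ContentRouting, c cid.Cid) {
    start := time.Now()
    var (
        mu             sync.Mutex
        wg             sync.WaitGroup
        total          int
        firstRecord    time.Duration
        firstReachable time.Duration
    )
    sem := make(chan struct{}, 10) // same limit as boxo/bitswap's maxProviders

    for info := range rt.FindProvidersAsync(ctx, c, 1000) {
        total++
        if firstRecord == 0 {
            firstRecord = time.Since(start)
        }
        wg.Add(1)
        sem <- struct{}{}
        go func(info peer.AddrInfo) {
            defer wg.Done()
            defer func() { <-sem }()
            if err := h.Connect(ctx, info); err != nil {
                return // unreachable provider
            }
            mu.Lock()
            if firstReachable == 0 {
                firstReachable = time.Since(start)
            }
            mu.Unlock()
        }(info)
    }
    wg.Wait()

    log.Printf("cid=%s records=%d first_record=%v first_reachable=%v",
        c, total, firstRecord, firstReachable)
}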


The above is so different from our current measurement tools that it wouldn't make sense to squeeze it in somehow. Therefore, we're proposing a new tool. However, we can reuse a lot of code from our existing infrastructure and from the links that @guillaumemichel provided:

I expect this to take four days of uninterrupted work, which realistically means it's double that. This includes

  • the raw development (a lot of code reuse, but also new bits compared to the existing tools like the Bitswap integration)
  • deploying it to our infra (a lot of code-reuse)
  • setting up monitoring (e.g., Grafana)
  • finding out that something doesn't work as expected (measured data doesn't make sense, there's a memory leak, whatever)
  • documenting the methodology (e.g., on our website)
  • making sense of the results (e.g., producing plots for our website)
  • comms around the results and methodology

@BigLep
Copy link

BigLep commented Jul 6, 2023

@yiannisbot @dennis-tra : understood on the need for a new measurement to determine the impact. Do we have an estimate of when this will be completed?

@yiannisbot
Member Author

Since the consequences of this don't seem to be catastrophic, this has been deprioritised for now in favour of the DHT work. We haven't put it on the plan, but I expect that it will be taken up and completed within Q3, or early Q4. Do you agree @dennis-tra ?
