Example: Measure subnet diversity #42
base: main
Conversation
This is nice, but it is not as useful as an example. I had some code to calculate this on the fly for each query; the result can be averaged and stored to help decide where to store the data. I didn't open a PR with that code because it is not a priority, but these numbers match what I observed as well.
Here is the commit 73efb7b
Not saying we need to merge this, but calling it useless is quite a stretch? How would you build countermeasures if you don't know the base subnet distribution?
"not as useful" not useless, it is useful to compare the results from thorough analysis to a quick and iterative one. Just like the size estimation, the base subnet distribution is going to be calculated from the average of previous queries. Remember, the "base subnet distribution" is a dynamic thing like the size estimation, so we shouldn't just hardcode it after an expensive offline check, instead it should be something the client keeps track of from previous queries. If it doesn't have any previous queries, then it just defaults to storing to all responding nodes since that is the conservative choice. |
Given that the BEP_0042 hash function provides a uniform distribution of subnets, outliers can be detected without prior knowledge; it is very similar to the birthday problem. At the same time, you can statistically measure how close the given subnets are to a uniform distribution. Did I have all this math in uni 10 years ago? Yes. Can I still do it? Need to figure this out. A priori assuming that you can only do this with previous data is wrong.
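One way to quantify "how close to uniform", shown only as an illustration of the idea above (a sketch using Pearson's chi-squared statistic, not code from this repo):

```rust
/// Chi-squared statistic of observed /8-subnet counts against a uniform
/// distribution over the 256 possible first octets. Larger values mean
/// the observed subnets are further from uniform.
fn chi_squared_uniform(counts: &[u64; 256]) -> f64 {
    let total: u64 = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    let expected = total as f64 / 256.0; // uniform expectation per subnet
    counts
        .iter()
        .map(|&observed| {
            let diff = observed as f64 - expected;
            diff * diff / expected
        })
        .sum()
}
```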
I don't think you can predict this, you can only observe it. If all DHT nodes started to churn and only one subnet was still standing, are you just going to say "no, this is not what I expect, so I won't use these nodes at all"? I know this is an extreme example, but my point is that the only assumption you can make in real time is that most previous queries saw honest nodes, so you are comparing the current query to the previous distribution of nodes over subnets. Same as comparing the current distance distribution to the average of previous distributions (summarized by the DHT size estimate).
If you want it to work for a testnet too, for example, then you are correct.
Not a bad distribution, except for some unused blocks and some hot spots. 13,400 sample nodes. If you want an easy, mainnet-only dirty rule: if any subnet is more than 20% of all nodes in a bucket (4 in the case of k=20), kick some of them. 20% is very generous; you could easily do 10% (2 in the case of k=20). If you wanna do your full flexible approach, feel free.
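For concreteness, that dirty rule could be sketched roughly as below; this is an illustration only, and `over_represented` is a hypothetical name, not code from this PR:

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// Return the /8 subnets whose share of the bucket exceeds `threshold`
/// (e.g. 0.20 allows at most 4 nodes per subnet when k = 20).
fn over_represented(bucket: &[Ipv4Addr], threshold: f64) -> Vec<u8> {
    let mut counts: HashMap<u8, usize> = HashMap::new();
    for addr in bucket {
        *counts.entry(addr.octets()[0]).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .filter(|&(_, n)| (n as f64) > threshold * bucket.len() as f64)
        .map(|(subnet, _)| subnet)
        .collect()
}
```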
Kicking nodes is never advisable; the measurement should always be "how many of the responding nodes should we store data at, so that they have a similar distribution to the average seen so far"... that's what we already do with distances, and the worst case is that you store at all nodes, which is not bad at all.
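A sketch of that selection idea, assuming an averaged per-subnet share is already being tracked from previous queries; `pick_storers` is a hypothetical helper, not part of the crate:

```rust
use std::collections::HashMap;

/// Pick which responders (by index) to store at, capping each /8 subnet at
/// roughly its historical share. With no history, store at all responders.
fn pick_storers(first_octets: &[u8], average_share: &HashMap<u8, f64>) -> Vec<usize> {
    if average_share.is_empty() {
        // Conservative default: no previous queries, store at everyone.
        return (0..first_octets.len()).collect();
    }
    let mut used: HashMap<u8, usize> = HashMap::new();
    let mut picked = Vec::new();
    for (i, &octet) in first_octets.iter().enumerate() {
        let share = average_share.get(&octet).copied().unwrap_or(0.0);
        // Allow each subnet roughly its historical share of this response
        // set, with a floor of one node so small subnets are never rejected.
        let cap = ((share * first_octets.len() as f64).ceil() as usize).max(1);
        let seen = used.entry(octet).or_insert(0);
        if *seen < cap {
            *seen += 1;
            picked.push(i);
        }
    }
    picked
}
```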
This script looks up random IDs and counts the number of times the found nodes share the same IP /8 subnet (the first byte of the IP).
Here is an example after 160 lookups:
On average, looking up k=20 nodes (November 2024), you can expect
Related to #41
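The counting step described at the top of this description could look roughly like the sketch below; this is one possible interpretation for readers of the thread, not the actual script in this PR:

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// For one lookup, count how many of the found nodes fall into a /8 subnet
/// that another found node already occupies.
fn shared_subnet_count(found: &[Ipv4Addr]) -> usize {
    let mut per_subnet: HashMap<u8, usize> = HashMap::new();
    for addr in found {
        *per_subnet.entry(addr.octets()[0]).or_insert(0) += 1;
    }
    // Every node beyond the first in a /8 subnet counts as a shared hit.
    per_subnet.values().map(|&n| n - 1).sum()
}
```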