strip or anonymize non-critical telemetry from requests #152
Automattic does take credit for this (not WordPress.org, mind you). Some design considerations:
Also, do we pass telemetry if our plugin is not rewriting? Another consideration. We don't pass a UA field at all right now.
Move to Discussions?
@namithj nah -- we can converse here... Then we can propose to Discussions -- nobody other than the core team uses GitHub anyway. Most stick to Slack due to laziness and lack of GitHub knowledge.
I'm not sure that a single constant is the right call for this one. I think this is very similar to the debugging setup.
This gives the user full control while still allowing us to collect telemetry, but only with consent. It also means a shared user and developer experience for the debugging and telemetry features. The code that strips each piece of data would run by default (i.e. telemetry disabled). When the user/site owner has consented for that piece of data to be sent, the stripping is skipped, but the code would still anonymize whatever isn't anonymized already (such as the URL in the user-agent). I don't think we need to have this behind a constant.
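To make that concrete, here's a minimal sketch of consent-gated stripping. The `http_request_args` filter is real WordPress API; `aspirepress_telemetry_consent()` and the overall structure are hypothetical and illustrative only:

```php
<?php
// Sketch only: strip/anonymize telemetry from outgoing api.wordpress.org
// requests unless the site owner has opted in. Runs by default, i.e. with
// telemetry disabled. aspirepress_telemetry_consent() is a hypothetical
// helper that would read a stored consent option.

add_filter( 'http_request_args', function ( $args, $url ) {
	// Only touch requests bound for the update API.
	if ( false === strpos( $url, 'api.wordpress.org' ) ) {
		return $args;
	}

	$consented = aspirepress_telemetry_consent( 'site_url' ); // hypothetical

	if ( ! $consented ) {
		// No consent: drop the site URL from the User-Agent entirely.
		$args['user-agent'] = 'WordPress/' . get_bloginfo( 'version' );
	} else {
		// Consent given: still anonymize, by hashing the host so
		// aggregate counts work without identifying the site.
		$host = wp_parse_url( home_url(), PHP_URL_HOST );
		$args['user-agent'] = sprintf(
			'WordPress/%s; %s',
			get_bloginfo( 'version' ),
			hash( 'sha256', $host )
		);
	}

	return $args;
}, 10, 2 );
```

The same pattern would extend to the POST body fields (plugin lists, user counts, and so on), with one consent flag per category of data.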
Let's put together a list of every telemetry item that gets sent to each endpoint, along with an indicator of whether it's necessary at all, whether it's necessary in cleartext (i.e. can be anonymized), and what part of the request it is (GET/POST/Header along with key name). A table would do nicely. Maybe a wiki page so that others can edit it? I'd start the page, but I don't have write access :)
This quickly generated list might be of use, but we should absolutely verify all the finer details of it before we consider it accurate or complete. If nothing else, it might hint at some requests we didn't think of, or it might trigger an idea of other possible requests that it might have missed. Do not consider the sample values to be accurate representations of what is actually generated. For example, the User Agent sample doesn't include the URL.
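For reference, core's default User-Agent does embed the site URL; it's assembled in the `WP_Http` request defaults roughly like this (paraphrased, verify against current core source):

```php
<?php
// Paraphrased from core's WP_Http request defaults: the stock
// User-Agent carries both the WP version and the site URL.
global $wp_version;
$default_user_agent = 'WordPress/' . $wp_version . '; ' . get_bloginfo( 'url' );
// e.g. "WordPress/6.5; https://example.com"
```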
You can't realistically use the API without telemetry - how are you going to check if there's an update to version 4.2 of XYZ plugin without revealing the obvious? You'd have to completely rewrite the whole API in a way that makes it very inefficient.

My take on it is to simplify the whole question by anonymizing the domain before sending it - hashing it will provide a unique (enough) string that we can still get a count of websites using the updater without actually knowing who they are. At that point, nothing we have is traceable to a particular site, and we only need to store it as aggregate per node in order to get the stats we need. Simple approach, privacy question fully resolved, and no possibility of bad actors using site-specific data.

Is there anything in there that's uniquely identifiable besides the domain? I don't think so. IP address doesn't meet that criteria, and it's important data as-is for knowing network and geography to suggest things like needing another node in Europe. We would need to process it for specific purposes before aggregating it, but that could be done before storing it.

I hadn't thought we would need to store data on a per-site basis at all - aggregating everything would be preferred, but to do that we need to determine what we might need to know in the future so we can record it now. For that, we need to think of (a) meaningful stats and (b) managing a distributed network. The first use case I'd throw in there is knowing how many unique sites in a given region are updating from each node. If more than some threshold of sites from, say, Nepal start updating from servers in California and Germany, we want to work out why there isn't a node they can use that has greater network efficiency for them.

This all leads me to 3 observations:
I think we can resolve every use case we could come up with for having user-identifiable data, but not certain about some individualized data. If we can get what we need with aggregate data only, we should do so. In my example here, if we don't need a count of how many sites are connecting from Nepal, maybe analytics at each node is enough to know how much traffic is being served to certain regions. But what other use cases might there be? We should sort that out before we start turning everything off.
It's not so much about removing all telemetry, but there are absolutely things that aren't needed to determine an appropriate update package, or to check PHP/WP/DB compatibility or plugin dependency fulfilment. Here's just a few examples of things we can likely strip out:
There will be more, but that might demonstrate where, like others have been doing recently, we can afford to offer users a choice that they haven't had in the past. We should evaluate:
I think the approach stands that if you disconnect the data from being identified with a specific site, the issue is resolved without even dropping anything. Not that there aren't things we can drop, just that this is the closest thing to a silver bullet solution that doesn't give away the farm for the value of the aggregate data. It's also way easier to implement than dealing with individual telemetry settings.
No real effect on Core dev. There's a separate "multisite_enabled" for tallying the number of single sites/multisite networks. While in theory it could be used to determine whether to support multisite in the future, that's not going to happen. More for graphs than anything. I agree that there's no privacy concern there. However, it may still be something we shouldn't slurp just because we can.
That doesn't give enough information to isolate IIS, so I wouldn't use the platform flag alone to determine that. I'd collect the server information (visible in Site Health) instead if I wanted to evaluate my Filesystem API maintenance time allocation in the future. I would put this in the category of evaluating the impact of a bug or lack of existing support, rather than whether to continue existing support. Of course, AspirePress isn't WordPress Core, but I do think the original reason for it being collected is relevant to whether there's value in continuing to collect it.
While this can still be behind a flag for sending anonymized data, I agree with what you say about its relevance and usage. We agree that there's data that's useful for making decisions. However, I don't agree that the issue is necessarily resolved by anonymizing the data. There's a ton of software that offers a choice on whether to share anonymous usage data. As you say, it's not that there aren't things we can drop, but I do think we should be clear in how we classify information. That way, we can have the trust of users while also collecting whatever data we feel is necessary or useful.
Great convo. You know, I tend to agree with @toderash -- less is more here. One way: hashing the website rather than providing the actual website would be better. It's still somewhat backward compatible. Could be used for analytics but not identifiable. Having a huge telemetry project seems a bit out of scope unless there is some other PII data in there.
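A minimal sketch of that hashing idea, with one design choice worth flagging: a bare `sha256` of a domain can be reversed with a lookup table of known domains, so a salt (here a hypothetical per-install constant, `ASPIRE_TELEMETRY_SALT`) keeps the ID stable for counting without being trivially reversible:

```php
<?php
// Sketch: replace the site URL with a stable, salted hash so update nodes
// can count unique sites without identifying them.
// ASPIRE_TELEMETRY_SALT is a hypothetical per-install constant.

function aspire_anonymous_site_id( string $home_url ): string {
	$host = wp_parse_url( $home_url, PHP_URL_HOST );
	return hash( 'sha256', ASPIRE_TELEMETRY_SALT . $host );
}

// "WordPress/6.5; https://example.com" would become something like
// "WordPress/6.5; a91bd2..." - same UA shape, so parsers that split
// on "; " keep working, which keeps it loosely backward compatible.
```

With a per-install salt the ID is effectively a random but stable token, so even the node operators can't map it back to a domain - which is the property described above.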
I retitled the issue to reflect a more realistic take on the matter: no one was proposing stripping all telemetry; that would make things useless. In fact, an internal managed corporate install ought to get even more telemetry. But out in the big bad world, especially in a decentralized ecosystem, this info should be on a need-to-know basis, and enumerating what gets sent, where, and when is a good start.
wrt multisite/user counts, I was thinking just of scale rather than binary multisite yes/no. 12-18 years ago, there was greater reason to know that, but with WP being a mature platform, we already know it can scale. Would be good to know what statistical use all of the data is for core teams right now. Key takeaway is that if we have a simple enough approach, it could potentially ship with our first release. From there, we could take more time to work out the necessity of the individual bits of info. It's possible that the simple approach could sanitize the data source enough that it could all be published responsibly, which would be a big win for transparency and for the dev community to benefit from sharing telemetry. |
My opinion is this is something that we don't have to tackle right now. Need to do our research, publish all potential overreaches in collecting data. Educate everyone regarding the pitfalls and then launch a solution. |
Agreed, we have a minimal-effort approach to allow us to defer discussion around major changes. |
The major issue with this is that the discovered hole is mostly just bad programming, and it's what they did with the data that's problematic (data they would have anyway, even without the UA manipulation). So just fixing that hole doesn't accomplish anything. What we should do is create a document that lists all endpoints along with the request data, then figure out which of that data is actually required to produce the response and which is overreach. Only that will help us resolve the issue, so the first step should be compiling this data. We can publish this info for everyone to see and there will be a discussion around it. We can provide a solution only once this data is compiled. Anything before that would simply be plugging holes as they're discovered, which provides a false sense of security.
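As a starting skeleton for that document, here are the main update-API endpoints core talks to, with what's believed to be sent to each. Per the caveats above, treat every cell as an assumption to verify against core source:

| Endpoint | Part of request | Data sent (to verify) |
| --- | --- | --- |
| `api.wordpress.org/core/version-check/1.7/` | Query string + POST body + UA header | WP/PHP/MySQL versions, locale, blog and user counts, multisite flag; site URL in User-Agent |
| `api.wordpress.org/plugins/update-check/1.1/` | POST body + UA header | Full installed plugin list with versions, active plugins, locale; site URL in User-Agent |
| `api.wordpress.org/themes/update-check/1.1/` | POST body + UA header | Installed theme list with versions, active theme, locale; site URL in User-Agent |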
Need to build upon the Gist provided by @costdev and publicly document it (blog post?) with explanations about each piece of data and why it's required.
I tend to agree that an initial step is to document what telemetry WordPress collects and where it goes, for the purposes of discussion. I'm going to make a documentation issue on this using @costdev's Gist as a starting point.
Remember that we need to verify the Gist first. AI is great for quick results/indications, but it can be 50/50 on accuracy for some of these things.
Absolutely, I'll mention that in the ticket.
Given that upstream has shown it's willing to abuse telemetry data for its ongoing vendetta, we need to strip out any data from outgoing requests that isn't essential to the function of the endpoint. We won't use it ourselves, and upstream doesn't deserve to have it. If some of the telemetry items turn out to be required, we should anonymize them as much as possible.