
strip or anonymize non-critical telemetry from requests #152

Open
chuckadams opened this issue Nov 7, 2024 · 21 comments
Labels
enhancement New feature or request needs triage needs to be discussed and worked on question Further information is requested

Comments

@chuckadams commented Nov 7, 2024

Given that upstream has shown it's willing to abuse telemetry data for its ongoing vendetta, we need to strip out any data from outgoing requests that isn't essential to the function of the endpoint. We won't use it ourselves, and upstream doesn't deserve to have it. If some of the telemetry items turn out to be required, we should anonymize it as much as possible.

@asirota (Member) commented Nov 7, 2024

See here.

[screenshot]

The repo

Automattic does take credit for this (not WordPress.org mind you).

Some design considerations

  • a global AP_KILL_ALL_TELEMETRY flag that covers all telemetry data sent to the /check endpoints - we need to enumerate those fields, monitor them as upstream changes, and change-manage them properly (a project all by itself)

  • a selective Telemetry picker in the AU configuration in an advanced features tab

  • what telemetry would AP never wish to suppress, at least in our implementation?

@asirota asirota added enhancement New feature or request question Further information is requested needs triage needs to be discussed and worked on labels Nov 7, 2024
@asirota asirota changed the title Strip telemetry from API requests Strip telemetry from API requests and User Agent Nov 7, 2024
@asirota (Member) commented Nov 7, 2024

Also, do we pass telemetry if our plugin is not rewriting? Another consideration. We don't pass a UA field at all right now.

@namithj (Contributor) commented Nov 7, 2024

Move to Discussions?

@asirota (Member) commented Nov 7, 2024

@namithj nah -- we can converse here... Then we can propose to discussions -- nobody other than core team uses GitHub anyways. Most stick to slack due to laziness and lack of GitHub knowledge.

@costdev (Contributor) commented Nov 8, 2024

I'm not sure that a single constant is the right call for this one.

I think this is very similar to the debugging setup.

  • Two constants:
    • (bool) AP_ENABLE_TELEMETRY - Default: false
    • (array) AP_ENABLE_TELEMETRY_TYPES - Default: []
  • UI:
    • Enable telemetry - All default to unchecked. Checking this one reveals the others.
      • Send X
      • Send Y
      • Send Z

This gives the user full control while still allowing us to collect telemetry, but only with consent. It also means a shared user and developer experience for the debugging and telemetry features.

For the code that strips each piece of data, it would run by default (i.e. telemetry disabled). When the user/site owner has consented for that piece of data to be sent, the stripping is skipped, but the code would still anonymize whatever isn't anonymized already (such as the URL in the user-agent).

I don't think we need to put this behind AP_ENABLE, though. Much like the debug types, users should be able to use the feature even if they're just using the dotorg API.
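
A minimal sketch of how the two-constant scheme above might behave. The constant names come from the proposal; `ap_strip_telemetry()` and the field names are hypothetical, not actual plugin code. Telemetry is stripped by default, and a field survives only when the master switch is on and the site owner has opted into that specific field:

```php
<?php
// Sketch only (hypothetical helper name). Constants default to "send nothing".
if ( ! defined( 'AP_ENABLE_TELEMETRY' ) ) {
    define( 'AP_ENABLE_TELEMETRY', false );    // (bool) master switch
}
if ( ! defined( 'AP_ENABLE_TELEMETRY_TYPES' ) ) {
    define( 'AP_ENABLE_TELEMETRY_TYPES', [] ); // (array) consented field names
}

/**
 * Remove every telemetry field the site owner has not consented to send.
 *
 * @param array $payload          Outgoing request body.
 * @param array $telemetry_fields Field names considered telemetry.
 * @return array Filtered payload.
 */
function ap_strip_telemetry( array $payload, array $telemetry_fields ): array {
    foreach ( $telemetry_fields as $field ) {
        $consented = AP_ENABLE_TELEMETRY
            && in_array( $field, AP_ENABLE_TELEMETRY_TYPES, true );
        if ( ! $consented ) {
            unset( $payload[ $field ] );
        }
    }
    return $payload;
}
```

With the defaults, every listed field is dropped; checking a UI box would add the field name to AP_ENABLE_TELEMETRY_TYPES and let it through.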

@chuckadams (Author) commented Nov 8, 2024

Let's put together a list of every telemetry item that gets sent to each endpoint, along with an indicator of whether it's necessary at all, whether it's necessary in cleartext (i.e. can be anonymized), and what part of the request it is (GET/POST/Header along with key name). A table would do nicely. Maybe a wiki page so that others can edit it?

I'd start the page, but I don't have write access :)

@costdev (Contributor) commented Nov 8, 2024

This quickly generated list might be of use, but we should absolutely verify all the finer details before we consider it accurate or complete. If nothing else, it might hint at some requests we didn't think of, or trigger an idea of other possible requests it might have missed.

Do not consider the sample values to be accurate representations of what is actually generated. For example, the User Agent sample doesn't include the URL.

@toderash commented Nov 8, 2024

You can't realistically use the API without telemetry - how are you going to check if there's an update to version 4.2 of XYZ plugin without revealing the obvious? You'd have to completely rewrite the whole API in a way that makes it very inefficient.

My take on it is to simplify the whole question by anonymizing the domain before sending it - hashing it will provide a unique (enough) string that we can still get a count of websites using the updater without actually knowing who they are. At that point, nothing we have is traceable to a particular site, and we only need to store it as aggregate per node in order to get the stats we need. Simple approach, privacy question fully resolved, and no possibility of bad actors using site-specific data.
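
The hashing idea above could look something like this. The function name and the per-site salt are illustrative assumptions, not existing plugin code; a salted SHA-256 keeps the value stable per site (so unique-site counts still work on the node side) while preventing someone from hashing a dictionary of known domains and matching them back:

```php
<?php
// Sketch only (hypothetical function name): replace the site domain with a
// salted hash before the request leaves the site. The salt would be generated
// once per site and stored, so the same site always yields the same hash.
function ap_anonymize_domain( string $site_url, string $site_salt ): string {
    // Hash only the host part; schemes and paths add nothing useful.
    $host = parse_url( $site_url, PHP_URL_HOST ) ?: $site_url;
    return hash( 'sha256', $site_salt . strtolower( $host ) );
}
```

Whether the salt is per-site or network-wide is a design choice: a per-site salt blocks dictionary attacks entirely, at the cost of the node never being able to correlate two hashes from the same domain under different salts.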

Is there anything in there that's uniquely identifiable besides the domain? I don't think so. The IP address doesn't meet that criterion, and it's important data as-is for knowing network and geography to suggest things like needing another node in Europe. We would need to process it for specific purposes before aggregating it, but that could be done before storing it.

I hadn't thought we would need to store data on a per-site basis at all - aggregating everything would be preferred, but to do that we need to determine what we might need to know in the future so we can record it now. For that, we need to think of (a) meaningful stats and (b) managing a distributed network. The first use case I'd throw in there is knowing how many unique sites in a given region are updating from each node. If more than some threshold of sites from, say, Nepal start updating from servers in California and Germany, we want to work out why there isn't a node they can use that has greater network efficiency for them.

This all leads me to 3 observations:

  1. Withholding telemetry is not a realistic solution
  2. The anonymized data has an important legitimate function
  3. We may have a use case for individualizing some of the data, but not in a user-identifiable way (hashed domain)

I think we can resolve every use case we could come up with for having user-identifiable data, but I'm not certain about some individualized data. If we can get what we need with aggregate data only, we should do so. In my example here, if we don't need a count of how many sites are connecting from Nepal, maybe analytics at each node is enough to know how much traffic is being served to certain regions. But what other use cases might there be? We should sort that out before we start turning everything off.

@costdev (Contributor) commented Nov 8, 2024

It's not so much about removing all telemetry, but there are absolutely things that aren't needed to determine an appropriate update package, or to check PHP/WP/DB compatibility or plugin dependency fulfilment.

Here are just a few examples of things we can likely strip out:

  • blogs => $num_blogs
  • users => get_user_count()
  • platform_flags => [ 'os', 'bits' ]
  • image_support => [ 'gd' => [], 'imagick' => [] ]
  • A list of plugins or themes that include an UpdateURI header - which signifies that they are not on the API.
    • For excluding API results that include plugins/themes which have the same slug as one of the non-API ones, we can handle this on the site itself, not on the API server.

There will be more, but that might demonstrate where, as others have started doing recently, we can afford to offer users a choice they haven't had in the past.
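
For fields like those, a blunt deny-list pass would be enough. This standalone sketch (hypothetical function name, not the real plugin or WordPress API; in practice it would more likely hang off a request filter) drops the example keys from an outgoing payload:

```php
<?php
// Sketch only: drop the non-critical example fields listed above from an
// update-check payload before it's sent. A plain function keeps the idea
// self-contained; the field list would grow as the audit progresses.
function ap_drop_noncritical_fields( array $query ): array {
    $noncritical = [ 'blogs', 'users', 'platform_flags', 'image_support' ];
    return array_diff_key( $query, array_flip( $noncritical ) );
}
```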

We should evaluate:

  1. What's required for the user's site.
  2. What may be important for AspirePress to make good tech decisions later.
    • e.g. PHP version usage so we can bump when we feel it's safe.
  3. What's nice for marketing, but should absolutely be behind a consent flag.
  4. Any other categories.

@toderash commented Nov 8, 2024

  • num_blogs for a multisite install and user count are maybe ego stats, but do they have a bearing on core dev? Probably not enough of one to warrant collecting any longer, but they also have no personal privacy concern associated
  • platform flag is relevant to whether or not to bother supporting IIS
  • image support is relevant for knowing what libraries are available to process uploaded images, not for an individual site, but as a percentage support thing... what's the lowest common denominator out there, and at what point do newer/better options become widely available for image processing?

I think the approach stands that if you disconnect the data from being identified with a specific site, the issue is resolved without even dropping anything. Not that there aren't things we can drop, just that this is the closest thing to a silver bullet solution that doesn't give away the farm for the value of the aggregate data. It's also way easier to implement than dealing with individual telemetry settings.

@costdev (Contributor) commented Nov 8, 2024

num_blogs for a multisite install and user count are maybe ego stats, but do they have a bearing on core dev? Probably not enough of one to warrant collecting anymore, but also has no personal privacy concern associated

No real effect on Core dev. There's a separate "multisite_enabled" for tallying the number of single sites/multisite networks. While in theory it could be used to determine whether to support multisite in the future, that's not going to happen. More for graphs than anything.

I agree that there's no privacy concern there. However, it may still be something we shouldn't slurp just because we can.

platform flag is relevant to whether or not to bother supporting IIS

That doesn't give enough information to isolate IIS, so I wouldn't use the platform flag alone to determine that. I'd collect the server information (visible in Site Health) instead if I wanted to evaluate my Filesystem API maintenance time allocation in the future. I would put this in the category of evaluating the impact of a bug or lack of existing support, rather than whether to continue existing support. Of course, AspirePress isn't WordPress Core, but I do think the original reason for collecting it is relevant to whether we see value in continuing to collect it.

image support is relevant for knowing what libraries are available to process uploaded images, not for an individual site, but as a percentage support thing... what's the lowest common denominator out there, and at what point do newer/better options become widely available for image processing?

While this can still be behind a flag of sending anonymized data, I agree with what you say about its relevance and usage.


We agree that there's data that's useful for making decisions. However, I don't agree that the issue is necessarily resolved by anonymizing the data. There's a ton of software that has a choice on whether to share anonymous usage data. As you say, it's not that there aren't things we can drop, but I do think we should be clear in how we classify information. That way, we can have the trust of users while also collecting whatever data we feel is necessary or useful.

@asirota (Member) commented Nov 8, 2024

Great convo. You know I tend to agree with @toderash -- less is more here. One-way hashing the website rather than providing the actual website would be better. It's still somewhat backward compatible, and could be used for analytics but not identification.

Having a huge telemetry project seems a bit out of scope unless there is some other PII data in there.

@chuckadams chuckadams changed the title Strip telemetry from API requests and User Agent strip or anonymize non-critical telemetry from requests Nov 8, 2024
@chuckadams (Author)

I retitled the issue to reflect a more realistic take on the matter: no one was proposing stripping all telemetry; that would make things useless. In fact, an internal managed corporate install ought to get even more telemetry. But out in the big bad world, especially a decentralized ecosystem, this info should be on a need-to-know basis, and enumerating what gets sent where, and when, is a good start.

@toderash commented Nov 8, 2024

wrt multisite/user counts, I was thinking just of scale rather than a binary multisite yes/no. 12-18 years ago there was greater reason to know that, but with WP being a mature platform, we already know it can scale. It would be good to know what statistical use all of this data has for core teams right now.

Key takeaway is that if we have a simple enough approach, it could potentially ship with our first release. From there, we could take more time to work out the necessity of the individual bits of info. It's possible that the simple approach could sanitize the data source enough that it could all be published responsibly, which would be a big win for transparency and for the dev community to benefit from sharing telemetry.

@namithj (Contributor) commented Nov 8, 2024

My opinion is that this is something we don't have to tackle right now. We need to do our research, publish all potential overreaches in collecting data, educate everyone regarding the pitfalls, and then launch a solution.

@toderash

Agreed, we have a minimal-effort approach to allow us to defer discussion around major changes.

@namithj (Contributor) commented Nov 10, 2024

The major issue with this is that the discovered hole is mostly just bad programming; it's what they did with the data that's problematic (data they would have had anyway, even without the UA manipulation). So just fixing that hole doesn't accomplish anything.

What we should do is create a document that lists all endpoints with their request data, and figure out which of that data is actually required to provide the response and which is overreach. Only this will help us resolve the issue; the first step should be compiling this data.

We can publish this info for everyone to see, and there will be a discussion around it. We can provide a solution only once this data is compiled. Anything before that would simply be plugging holes as they're discovered, which will provide a false sense of security.

@namithj (Contributor) commented Nov 10, 2024

We need to build upon the Gist provided by @costdev and document it publicly (blog post?) with explanations of each piece of data and whether it's required.

@asirota (Member) commented Nov 10, 2024

I tend to agree on an initial step to document where and what telemetry WordPress is collecting, for the purposes of discussion. I'm going to make a documentation issue on this using @costdev's gist as a starting point.

@costdev (Contributor) commented Nov 10, 2024

Remember that we need to verify the Gist first. AI is great for quick results/indications, but its accuracy can be 50/50 on some of these things.

@asirota (Member) commented Nov 10, 2024

Absolutely, I'll mention that in the ticket.
