
FLEDGE: Optimal “Optimization & Performance” reporting for training bidding models #101

abdellah-lamrani-alaoui opened this issue Feb 4, 2021 · 1 comment
Labels: FLEDGE, Non-breaking Feature Request

Comments

abdellah-lamrani-alaoui commented Feb 4, 2021

Hello,

As was recently mentioned in another FLEDGE issue (#93), DSPs and advertisers need access to a reporting capability that allows them to train machine-learning models in order to optimize their campaigns. This reporting is paramount for buy-side actors to be able to learn meaningful information about the contexts driving performance, without gaining information on the browsing behavior of a given individual. As already mentioned in the reporting in SPARROW proposal, these reports should provide information on contextual and interest-based features together:

“Any weakness in leveraging both signals together would undoubtedly hurt both the publisher's revenue and the user experience, exposing them to irrelevant advertisements or worse, unsafe content.”

To sum up, we think the proposal should seek to maximize the useful information buy-side actors get (i.e. what allows them to price impressions most accurately, thus driving investment up and maximizing overall wealth) under a given well-defined privacy constraint (for instance k-anonymity with differential checks). At Scibids, we see this accurate reporting capability as the single main variable explaining whether or not the world's largest advertisers we work for will continue to get the results they expect from programmatic advertising, and thus keep investing in it.

We have worked on the subject and will submit, in the next two weeks, a proposal for what an optimal reporting procedure should look like, but we would be happy to have your thoughts on the ideas below and to see what form they could take in the FLEDGE implementation.

Why we think buy-side actors can get from no useful info to almost 100% of useful info under the same k-anonymity privacy constraint

Our main motivation for proposing something new lies in how hurtful a “blunt” k-anonymity + differential check would be for machine-learning practitioners. Indeed, let's consider the simplest k-anonymity procedure with differential checks:

(blunt method) When asking for a report (e.g., in all generality, a click report) the DSP has to provide columns cj=1...J. Any line (c1=x1,...,cJ=xJ) will then be included only if it concerns at least k users; otherwise it will be discarded. A “remainder” line will indicate the number of clicks associated with the obfuscated lines, provided this remainder line itself concerns at least k different users.
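A minimal sketch of this blunt procedure, assuming a hypothetical row format of (column-value tuple, user id, click count):

```python
from collections import defaultdict

def blunt_report(rows, k):
    """Blunt k-anonymity: publish a line only if its exact column
    combination covers at least k users; fold every suppressed line
    into a single 'remainder' line, itself subject to the same check."""
    users, clicks = defaultdict(set), defaultdict(int)
    for cols, user, n_clicks in rows:
        users[cols].add(user)
        clicks[cols] += n_clicks

    report, rem_users, rem_clicks = {}, set(), 0
    for cols, us in users.items():
        if len(us) >= k:
            report[cols] = clicks[cols]
        else:
            rem_users |= us
            rem_clicks += clicks[cols]
    # The remainder line is only published if it also covers >= k users.
    if len(rem_users) >= k:
        report[("remainder",)] = rem_clicks
    return report
```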

With this method, DSPs would be faced with the impossible task of specifying, for each campaign, which columns they want:

  • asking for too many columns would mean getting no data on a large number of impressions (this will very soon be most of the impressions, given the high cardinality of important fields like “domain” or “placement”);
  • asking for too few columns means building simplistic, underperforming models. Moreover, a priori column selection means that these few columns probably won't be the most discriminating for predicting clicks.

Please note that this problem cannot be alleviated by asking for a different column set if the first set gives unsatisfactory results, since differential checks would greatly affect the result of this second report.

Improvements over this method have already been proposed in the “reporting in SPARROW” proposal, as recalled below.

(RIS method) When asking for a report (e.g., in all generality, a click report) the DSP has to provide an ordered set of columns cj=1...J. The line (c1=x1,...,cJ=xJ) will be included* as is if it concerns at least k users. If it concerns fewer than k users, it will instead be folded into the line (c1=x1,...,cJ-1=xJ-1,cJ=hidden), provided (c1=x1,...,cJ-1=xJ-1) concerns at least k users. The procedure applies recursively.

*with one exception: the line (c1=x1,...,cJ=xJ) concerning the fewest users may still be folded into the (c1=x1,...,cJ-1=xJ-1,cJ=hidden) line when needed, so that the hidden line itself concerns at least k users.
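Ignoring the footnote's top-up exception, the recursive RIS procedure could be sketched as follows (same hypothetical row format as above; a "hidden" line is considered safe as long as its visible prefix covers at least k users):

```python
from collections import defaultdict

HIDDEN = "hidden"

def ris_report(rows, k):
    """RIS-style report: shorten each line's visible prefix until that
    prefix covers at least k users, padding the rest with 'hidden'."""
    users, clicks = defaultdict(set), defaultdict(int)
    for cols, user, n_clicks in rows:
        users[cols].add(user)
        clicks[cols] += n_clicks

    def coverage(prefix):
        """Users covered by all full lines sharing this column prefix."""
        return set().union(*(us for cols, us in users.items()
                             if cols[:len(prefix)] == prefix))

    report = defaultdict(int)
    for cols in users:
        prefix = cols
        while prefix and len(coverage(prefix)) < k:
            prefix = prefix[:-1]
        if prefix:  # an empty prefix means even a fully hidden line is unsafe
            report[prefix + (HIDDEN,) * (len(cols) - len(prefix))] += clicks[cols]
    return dict(report)
```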

RIS provides a first incremental improvement over blunt k-anonymity reporting by introducing a notion of “feature importance”, which allows:

  • getting the report that best fits each actor's needs (some actors would be more interested in knowing the domain information, whereas others would prefer getting the size information);
  • getting more information while still ensuring k-anonymity.

However, this approach does not solve the problem of requiring the DSP to set in stone an a priori importance of the variables, which is going to vary a lot depending on campaign typologies and KPIs. This seems suboptimal, since the reporting entity has full knowledge of the campaign dynamics and is thus much better equipped than the DSP to solve the “provide maximal useful information while respecting k-anonymity” problem. There is actually quite a large literature on “optimal k-anonymization” of a dataset, and we probably have an opportunity to finally use this work for large-scale practical applications.

We are thus going to propose something along these two axes of improvement:

  • Instead of having to provide the order of features (for a new campaign we often don't have a prior on the order of importance of features), we could give the reporting algorithm the option to smartly choose the order of features and provide the k-anonymous report that maximizes a measure of the information in the report (for instance its entropy). (This is known as the optimal k-anonymization problem in the literature, and we are currently looking into adapting the Datafly algorithm. Eager to hear from you if you know of better scalable solutions!)
  • Instead of deleting or marking as hidden the rows that do not respect the k-anonymity threshold, we could add some generalization by bucketizing the features. For example, for the line cnn.com / age=24, if we have fewer than k rows, the reporting algorithm could try with the bucket age in [20,25].
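To make the first axis concrete, here is a hypothetical brute-force sketch that scores every column order by the Shannon entropy of the RIS-style report it would produce and keeps the best one. This only works on tiny inputs; a production version would need a scalable, Datafly-like greedy heuristic. Rows are hypothetical (column-value tuple, user id) pairs:

```python
import itertools
import math
from collections import defaultdict

HIDDEN = "hidden"

def ris_line_counts(rows, order, k):
    """Row counts per published line of a RIS-style report, with columns
    taken in the given order (hierarchical 'hidden' fallback)."""
    users, n_rows = defaultdict(set), defaultdict(int)
    for cols, user in rows:
        key = tuple(cols[i] for i in order)
        users[key].add(user)
        n_rows[key] += 1

    def coverage(prefix):
        return set().union(*(us for key, us in users.items()
                             if key[:len(prefix)] == prefix))

    out = defaultdict(int)
    for key in users:
        prefix = key
        while prefix and len(coverage(prefix)) < k:
            prefix = prefix[:-1]
        if prefix:
            out[prefix + (HIDDEN,) * (len(key) - len(prefix))] += n_rows[key]
    return list(out.values())

def entropy(counts):
    """Shannon entropy of the distribution of rows over published lines."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def best_order(rows, n_cols, k):
    """Exhaustively pick the column order whose report has maximal entropy."""
    return max(itertools.permutations(range(n_cols)),
               key=lambda order: entropy(ris_line_counts(rows, order, k)))
```

With domain as a low-cardinality, high-coverage column, putting it first preserves more lines, so the search picks the domain-first order on its own.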

Let’s take an example with k=2:

If we have:

| domain | age | number of post-view conversions |
| --- | --- | --- |
| cnn.com | 24 | 0 |
| cnn.com | 24 | 1 |
| cnn.com | 23 | 1 |

Instead of:

| domain | age | number of post-view conversions |
| --- | --- | --- |
| cnn.com | hidden | 0 |
| cnn.com | hidden | 1 |
| cnn.com | hidden | 1 |

We would get:

| domain | age | number of post-view conversions |
| --- | --- | --- |
| cnn.com | 20-25 | 0 |
| cnn.com | 20-25 | 1 |
| cnn.com | 20-25 | 1 |
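The bucketing fallback in this example can be sketched as follows (hypothetical names; rows are (domain, age, user id, conversions) tuples, and the whole age column is generalized Datafly-style as soon as any exact-age line fails the k-user check):

```python
from collections import defaultdict

def bucketize(age, width=5):
    """Map an exact age to a width-5 bucket label, e.g. 24 -> '20-25'."""
    lo = age - age % width
    return f"{lo}-{lo + width}"

def generalized_report(rows, k):
    """Try exact ages first; if any line covers fewer than k users,
    re-run the report with ages generalized into buckets."""
    for age_key in (str, bucketize):
        users, lines = defaultdict(set), []
        for domain, age, user, conversions in rows:
            key = (domain, age_key(age))
            users[key].add(user)
            lines.append((key, conversions))
        if all(len(us) >= k for us in users.values()):
            return lines
    return []  # even the bucketed report failed the k-user check
```

Running it on the three-row example above reproduces the bucketed table rather than the hidden one.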

We are still hammering out the details, but we wanted to share our vision before working on a more detailed implementation with a reasonable algorithmic complexity that would fit the large-scale datasets in our industry.

michaelkleber (Collaborator) commented:

This seems like a very interesting idea! I have a few questions.

  1. From a practical point of view, the idea of choosing the best ordering of features seems like something that can only be done by a party that looks at all the data. Our proposed aggregate reporting machinery doesn't have any server that is trusted to know the raw data; it relies on Secure Multi-Party Computation to avoid that. Do you think your idea is compatible with this model of no trusted aggregator?

  2. As with all the Privacy Sandbox APIs, we would prefer to give the ad tech companies that are going to rely on them as much flexibility and choice as we can, resorting to the browser making hard-to-understand decisions only when we have no other option. I don't understand enough about your proposal to know whether this would trigger that hard-to-understand risk. If the same entity got reports on two different days that preserved or redacted different fields, with no way to control or understand why they were different, I worry about their being unhappy. Is there any system that works like this today, or would it be an entirely new type of uncertainty that people would need to get used to?

@JensenPaul added the Non-breaking Feature Request label on Jun 26, 2023.