Replies: 15 comments 20 replies
-
Storing data might be problematic; you might need to comply with GDPR.
Cheers,
Michael
-
On Wed, Jun 1, 2022 at 11:50 AM Christian Muise wrote:
> Compliance is easy, no? Delete on demand, don't retain if requested via the API, etc.
Cheers,
Michael
-
How can you prevent it? Such data can be part of the initial state representation, for instance. You have no control over this, and that's the real problem.
On Wed, Jun 1, 2022 at 11:57 AM Christian Muise wrote:
> Also, there's no plan to store "personal data" -- violation would only come if people start embedding such data in PDDL comments or some such. No IPs, emails, etc.
Cheers,
Michael
-
Assuming it complies with legislation and, as you mention, we provide a clear explanation about the purpose of the data gathered and allow for opt-out, then it seems it can build a good dataset for interesting analysis in the future. Information about the performance of the planners, their logs, and timestamps is quite useful and doesn't contain any sensitive information. Storing PDDLs can be sensitive, but I don't think we are encrypting the communication either, so storage is not the only weak point. Being an open-source project, it can be deployed on local servers in case the PDDL should be kept private.

I'm no expert on the legal & cyber side, so I focus mostly on information that can be useful for teaching and research. Besides planner performance metrics, having PDDLs with their timestamps and an associated run ID to relate to the planner's performance is really useful. I can see how it can be the basis for building automatic feedback mechanisms and understanding common errors while modeling.

Does our current implementation store all the data? As soon as the flower instance is restarted, the related DB is wiped. Where are we storing any info now? This is regarding your comment that the re-implementation of the solver stores everything.
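For concreteness, a retained record along these lines might look roughly like the following; this is a hypothetical sketch only, with illustrative field names rather than the project's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class SolveRecord:
    """One planner call, as it might be retained for teaching/research analysis."""
    run_id: str                  # ties planner performance back to the submitted model
    submitted_at: datetime       # timestamp of the call
    planner: str                 # which solver/endpoint was invoked
    solve_time_s: float          # wall-clock time to solve
    solved: bool                 # whether a plan was found
    log: str                     # planner output / log
    domain_pddl: Optional[str]   # only retained in "store input" modes
    problem_pddl: Optional[str]  # only retained in "store input" modes
```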
-
I would say "no".
On Fri, Jun 3, 2022 at 11:33 AM Christian Muise wrote:
> Precisely the type of data that would be a goldmine for KEPS-like questions. Shouldn't be forced, but it could really open the door to model modelling.
>
> > Does our current implementation store all the data? As soon as the flower instance is restarted, the related DB is wiped. Where are we storing any info now? This is regarding your comment that the re-implementation of the solver stores everything.
>
> Apparently, I was mistaken! Because of the flower views, I'd assumed it was all stored in a persistent DB. If this isn't the case, then this discussion has now morphed to "do we want to change the retention strategy to anything persistent?" ;)
Cheers,
Michael
-
Before discussing how to comply with GDPR, I claim that we should make every effort to not need to comply with it. The reason is that complying is not the problematic part; it's the bureaucracy around it, as I mentioned before. We would need to have people with particular roles, go through certain training, etc.

Here is an explanation of who needs to comply:
https://www.termsfeed.com/blog/need-comply-gdpr/#:~:text=The%20GDPR%20states%20that%20any,be%20compliant%20with%20the%20GDPR

So, if you allow someone to submit information in a PDDL like *(and (name Malte) (last Helmert))* and you store it, you need to comply with GDPR.

If you would like to be able to store PDDLs, there should be other solutions. There can be multiple options:
1. PDDLs stored elsewhere
2. PDDLs are kept but every string is anonymized. Here, the question is whether we want to be able to restore the original.
3. ...
On Mon, Jun 6, 2022 at 9:17 AM Christian Muise wrote:
> I imagine industry for proprietary purposes will go with mode 4 -- on-prem deployment. It's why IBM hosts their own version of GitHub, rather than use the primary version. Would indeed be interested in @ctpelok77's take, since he's dealt with more IBM lawyers than I have ;).
>
> Some very interesting ideas raised, @miquelramirez. As a form of exchange, opening up the compute more or less is an interesting model. "Your data is the product" doesn't need to be a bad thing -- just as long as it's explicit and agreed upon. So low resource on 3 makes sense, medium on 1, and a broader pool of compute for dedicated contribution to the data front in the case of option 2.
>
> I'd like to make it clear that the data isn't being collected for our private purposes, but rather to be packaged up and reflected back to the research community to explore. I don't think we'd get very far trying to sell personalized PDDL data to advertisers ;) (can you imagine the ads for toy blocks and robotic grippers? fun!), but either way we should make this clear.
>
> Since it was brought up above, GDPR would dictate the functionality to remove data from the system. So everything stored should be tied to the unique hash built for the result, and we can assume access to that constitutes ownership -- removal would be a matter of the right DB query to wipe everything associated with that hash, and we can even implement mode 3 using this functionality.
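A minimal sketch of what that hash-keyed removal could look like, assuming a relational store; the table and column names below are purely illustrative, not the project's actual schema:

```python
import sqlite3

def wipe_by_hash(db_path: str, result_hash: str) -> None:
    """Delete everything tied to one submission, keyed by the unique result hash."""
    with sqlite3.connect(db_path) as conn:
        # Hypothetical tables: one for the submitted payloads, one for solver stats.
        conn.execute("DELETE FROM payloads WHERE result_hash = ?", (result_hash,))
        conn.execute("DELETE FROM results WHERE result_hash = ?", (result_hash,))
```

Putting this behind an endpoint that requires the result hash matches the "access constitutes ownership" assumption, and calling it right after a result is retrieved would effectively implement mode 3.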
Cheers,
Michael
-
@nirlipo: I am almost certain that an industry partner would opt out of storing, even just in case.
Cheers,
Michael
-
Hi folks! On meta-data: the trend is to see meta-data as both useful and concerning. The concern is that it might enable identifying people or practices of organizations, revealing details of their operations. So: flexibility and separation of concerns in the software artifacts, so it's very clear what's happening.
-
Okay. FWIW, I reviewed very recently some papers from people worried about someone reverse engineering statistics (like a NN) to identify all kinds of sensitive information (like whether somebody's face is in the dataset). So I have read recently quite a few papers, both theoretical and applied, that demonstrate the (in general) hopelessness of doing so without some very special information.

For those interested, check out this <https://link.springer.com/chapter/10.1007/11681878_14> and the paper "Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures", two didactical, readable examples taken from the literature on the problem posed by the necessity of coming up with privacy-preserving statistical databases.

To further ground the discussion, check out the MATILDA project <https://acems.org.au/news/matilda-stress-tests-algorithms> at the University of Melbourne. And let's look at a real-world example of what the collected metadata could look like, for instance something like this: <https://acems.org.au/sites/default/files/conv-instance-space.jpg>

So from meta-data like that, is the one below a reasonable scenario?

> Suppose competitor B is aware of the general business model of A, where A uses planning -- thanks to a big PR campaign on using AI -- and B has information on the context of A. B could notice how A is solving far more planning problems when there are supply chain issues with some components A uses, and that such solving more problems is related with a deterioration of the time it takes A to solve issues with their clients.

I cannot honestly see how this is possible with modes of operation 1, 3, or 4. Mode 2 does leave an opening, and this example you gave suggests the importance of anonymizing (by scrambling) stuff like names of objects, predicates, actions, etc. In other words, every stored PDDL should go through a filter that maps every NAME token in the PDDL lexer to some random string of characters.

Miquel.
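A minimal sketch of such a scrambling filter is below; it is purely illustrative, assumes a simplified tokenization rather than a real PDDL lexer, and its keyword list is not exhaustive:

```python
import re
import secrets

# Plain-word PDDL keywords that must survive scrambling (illustrative, not exhaustive).
KEYWORDS = {"define", "domain", "problem", "and", "or", "not", "when",
            "forall", "exists", "either", "object", "number"}

def scramble_pddl(text: str) -> str:
    """Replace every NAME token with a random identifier, consistently across the file."""
    mapping: dict[str, str] = {}

    def replace(match: re.Match) -> str:
        tok = match.group(0)
        if tok.startswith(":") or tok.lower() in KEYWORDS:
            return tok  # keep structural keywords like :action, :precondition, and, not, ...
        prefix, name = ("?", tok[1:]) if tok.startswith("?") else ("", tok)
        if name.lower() not in mapping:
            mapping[name.lower()] = "n" + secrets.token_hex(4)
        return prefix + mapping[name.lower()]

    # NAME tokens per a simplified PDDL lexer: identifiers, optionally ':'- or '?'-prefixed.
    return re.sub(r"[:?]?[A-Za-z][A-Za-z0-9_-]*", replace, text)
```

Whether the `mapping` is kept somewhere (so the original can be restored, per option 2 above) or thrown away is exactly the policy question Michael raised.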
-
@miquelramirez: this is no joke. If someone wants to harm you or your organization, they can submit a call with the personal information of EU citizens in mode 2 (I hope I remembered the mode correctly) and report you for a GDPR violation.

Regardless, the cases where you don't need to comply with GDPR when storing information that might include personal information of EU citizens are scarce. Obviously, a valid PDDL could be made to include such information. Hence, if you allow storing it as is, you need to comply with GDPR.

On the topic of meta-data, there are strict rules in organizations such as mine on how long you can retain data and information derived from that data, who the data owner is, etc. I do not expect people will store data or derived info (meta-data) on uncontrolled servers.
On Tue, Jun 7, 2022 at 8:07 AM Hector Palacios wrote:
> Suppose company A uses planning for coordinating operations, and metadata is exposed on the instances they solve and how long it took to solve them. Suppose competitor B is aware of the general business model of A, where A uses planning -- thanks to a big PR campaign on using AI -- and B has information on the context of A. B could notice how A is solving far more planning problems when there are supply chain issues with some components A uses, and that solving more problems is related with a deterioration of the time it takes A to solve issues with their clients.
>
> So, B could attack A by disrupting the supply chain further, by attempting to buy some of those components while launching a campaign saying they are faster at providing similar services.
>
> Examples of A and B could be telco companies, and the components could be as simple as Ethernet endings for cable. If A were optimizing inventory but was sensitive to that, they could be the subject of an attack.
>
> (This example is inspired by an article in the Economist on how highly optimized operations suffered most due to Covid-related disruption of the supply chain, and how countries got protectionist.)
>
> In the training I receive about confidential information, they emphasize that I shall never reveal any issue we might have providing services. Not that I would be aware of that kind of information.
>
> Makes sense?
Cheers,
Michael
-
@miquelramirez: yes, scrambling names would work wrt GDPR.
Cheers,
Michael
-
Such a great discussion -- thank you all! There is *some* talking past one another, but not much. What's clear is that any one mode will be problematic, and very restrictive modes will be necessary in order for some to use it.

> I do not expect people will store data or derived info (meta-data) on uncontrolled servers.

This statement is either categorically false or true depending on your quantifier for "people". Me? Store and publicize every iteration of PDDL that I write along the journey to a finished PDDL. It's fascinating data, and I'm willing to share it with the world. I suspect others would be as well. Company employees? Not a chance in hell, outside of public tutorial stuff.

To recap, these are the 4 proposed modes of operation (1-3 thanks to @miquelramirez):

1. only metadata about the solving process is retained
2. metadata and input are retained
3. nothing is retained, other than what may be temporarily needed for the operation of the server, security, etc.
4. public server is not used, and an on-prem deploy is used instead (viable since it's FOSS)

Having done my time at IBM, I can almost certainly say that 4 (the most conservative mode) would be the choice within companies. It means your data doesn't even fly out of company networks. Any security expert worth a damn would likely find boat-loads of issues with the public server setup, and recommend against using it for company business.

I don't think the threat of someone putting details they shouldn't in their PDDL should thwart mode 2 entirely. We need to be clear what they're giving us if that's what they sign up for, and make it clear how to take the data down (along with having an easy way to do this), but never storing information for fear of what users might include seems a bit overkill. Could we be DDoS'd logistically? Yes, and in a variety of ways. If it becomes an issue, we wipe the DB of all mode-2 data from a problematic time-span and move to a token-only model (you only get mode 2 if you've been trusted and given a token to share -- e.g., an instructor of a class or PI of a lab or researcher or whatever).

The biggest outstanding issue is what might go into the ToS for what qualifies as "metadata". I'd like to err on the side of caution and not have things that can provide identifiable info, but I don't think we need to go so far as to worry about the competitive company scenario above -- it's unrealistic, since any company with real concerns like that *really should not be using the public service*. Doing so is at their own risk, and mode 4 is the obvious choice. Even from an efficiency standpoint, it makes no sense to base your operations on a server with limited resources, shared among several multi-hundred-student classes doing PDDL assignments ;).

Any major objections to the reasoning above? I know there are some decisions in there that will lead to "well, company X just isn't going to use the service", and ultimately I think that's fine/expected for modes 1-3. Company X really shouldn't be using the current server either, since it's controlled by some unknown entity (me) and isn't battle-hardened to protect their PDDL trade secrets.
-
@Christian Muise: "people" == "people in my org", since the question was about that.

@Christian Muise: I think you are missing the point of GDPR. The users that give you the data might not even own it. It does not absolve you of the responsibility of not storing personal data. Hence the need to scramble all labels.

Can you specify in a deployment which modes are permitted for that deployment? Say I want a deployment in which only mode 3 is allowed.
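On that question of restricting modes per deployment: one way to express the policy is a deployment-level setting, along the lines of the sketch below (the environment variable names and helper are hypothetical, not part of the project's actual configuration):

```python
import os
from typing import Optional

# Hypothetical retention modes, mirroring the recap above:
#   1 = metadata only, 2 = metadata + input, 3 = nothing persistent.
ALLOWED_MODES = {
    int(m) for m in os.environ.get("PAAS_ALLOWED_RETENTION_MODES", "1,2,3").split(",")
}
DEFAULT_MODE = int(os.environ.get("PAAS_DEFAULT_RETENTION_MODE", "3"))

def resolve_mode(requested: Optional[int]) -> int:
    """Pick the retention mode for a request, honouring the deployment's policy."""
    mode = requested if requested is not None else DEFAULT_MODE
    if mode not in ALLOWED_MODES:
        raise ValueError(f"retention mode {mode} is not permitted on this deployment")
    return mode
```

A deployment that should only ever run mode 3 would then set `PAAS_ALLOWED_RETENTION_MODES=3`.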
Cheers,
Michael
-
That's not a good example (let's put aside the difference between copyright and personal data protection): GitHub has GDPR officers who went through proper training. Do you want to do that?
On Tue, Jun 7, 2022 at 10:41 AM Christian Muise wrote:
> Label scrambling can't work on non-parseable PDDL -- and there will be some. If someone submits the full copyrighted works of Game of Thrones as a comment to this thread, then GitHub is not automagically at fault. The user violated the copyright, a takedown notice is issued, and the offending data needs to be scrubbed. If we have that process in place, then it adheres to the law, no?
>
> > Can you specify in a deployment which modes are permitted for that deployment? Say I want a deployment in which only mode 3 is allowed.
>
> I would assume if mode 4 is taken, then whoever deploys can set whatever default they want (mode 3 or otherwise). There's nothing proprietary about the server setup. I reckon an on-prem deploy would be modified further so that it could scale/integrate with their infrastructure. If there's some in-house need for metadata, then mode 4+1. If it's meant to just be a remote endpoint generating plans all day, then mode 4+3. But, ultimately, the mode of operation (defaults, availability, etc.) is up to whoever deploys the server. On-prem means it's the company's own employee.
Cheers,
Michael
-
These are all very good questions, which I don't have an answer to. It is
my understanding that with GDPR, it's better to be safe than sorry.
On Thu, Jul 7, 2022, 11:07 PM Christian Muise wrote:
> Obviously not ;). So if I create something and host it on Heroku, who's responsible for the GDPR protocols? What if I openly invite people to submit their models to a repo on GitHub? What if they email models to me directly, and I host a zip of them all on my own website?
>
> What I'm trying to peel apart is the subtlety that delineates a reasonable need to adhere to GDPR full-on, versus those cases where it's just not an issue.
-
What are everyone's thoughts on this?
The legacy solver stores just the time it's taken to solve the problem sent, and no further details. This re-implementation stores everything, largely because that was the simplest model under the new framework. My current thought (but not entirely attached to it):

- Retain the PDDL / full call payload by default, reject anything over a certain size.
- Have an endpoint to wipe the data on a specific call (payload details (like PDDL) removed, but stats like solve time and endpoint retained); see the sketch at the end of this post.
- Have an optional flag for all services that changes the retention strategy to "delete after a day" or "delete after retrieved" or whatever.
- Clearly stipulate on the landing page what is happening with the data, and why it's being collected.

My thinking on the above stems from a couple of things... (1) it's an open project that anyone can clone / deploy on their own (and we should make this as turn-key as possible), which means data is entirely controlled by them; and (2) it's a service providing a transaction of data for compute. The data retained is a contribution to the planning community -- to be released publicly (no IPs, but no scrubbing of PDDL) for analysis in KEPS-like studies -- and the free (as in $$) service is the exchange in return. I'm imagining studies on how a domain goes from blank PDDL to complete working copy, or cross-section analysis of common errors in a class, or whatever.
Tagging some I know may want to contribute, having worked on some version of the solver or taught courses that may use it (please feel free to add anyone else you might think may be interested): @nirlipo @jan-dolejsi @FlorianPommerening @miquelramirez @ctpelok77
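As referenced in the bullet list above, here is a minimal sketch of what the wipe endpoint could look like; it is a Flask-style illustration with hypothetical route, table, and column names, not the project's actual API:

```python
from flask import Flask, jsonify
import sqlite3

app = Flask(__name__)
DB_PATH = "submissions.db"  # hypothetical store

@app.route("/wipe/<result_hash>", methods=["DELETE"])
def wipe(result_hash: str):
    """Remove the stored payload (e.g., PDDL) for one call, keeping solve-time stats."""
    with sqlite3.connect(DB_PATH) as conn:
        # Illustrative schema: null out payload columns, leave stats and endpoint intact.
        conn.execute(
            "UPDATE submissions SET domain_pddl = NULL, problem_pddl = NULL "
            "WHERE result_hash = ?",
            (result_hash,),
        )
    return jsonify({"wiped": result_hash})
```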