Allow a subset of data to be released #40

Open
docsteveharris opened this issue Jun 29, 2017 · 11 comments

@docsteveharris
Contributor

At the moment we need to specify all fields.
We would prefer to specify only the fields that we need.
Unspecified fields would be excluded, and the extract and statistical disclosure control would then work with just those requested.

see https://cchic.slack.com/archives/C2MEV9Y13/p1498490027173478

@docsteveharris
Contributor Author

@dpshelio we might need some early-ish help with this as we are working on 2 data releases for collaborators. Can you have a think about how much time it would take to fix this, and what it would mean for the other work we're asking you to do?

@sinanshi
Contributor

This is just a security check to make sure that no item is ignored. For example, if I forget to put DOB in identifiablevar by mistake, the program will give you an error. You have to explicitly specify that DOB is a non-identifiable var to make the program run. If you think this is not necessary, I can remove the security check.
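A minimal sketch of the kind of check described here, assuming the configuration is a mapping from slot name to a list of variable names. The slot names and the function `check_all_classified` are hypothetical and are not taken from the package.

```python
def check_all_classified(columns, conf):
    """Raise an error if any column in the data has no classification."""
    classified = set()
    # Hypothetical slot names; the real configuration may use others.
    for slot in ("directvar", "keyvar", "sensvar", "nonidentifiablevar"):
        classified.update(conf.get(slot, []))
    missing = sorted(set(columns) - classified)
    if missing:
        raise ValueError("unclassified variables: " + ", ".join(missing))


conf = {
    "directvar": ["nhs_number"],
    "keyvar": ["age"],
    "sensvar": [],
    "nonidentifiablevar": [],
}

# Passes silently because every column is classified.
check_all_classified(["nhs_number", "age"], conf)
```

Calling it with an extra `dob` column that appears in no slot would raise the error, which is the behaviour the check is meant to enforce.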

@sinanshi
Contributor

In this case we can even remove the non-identifiable var slot in the conf file.

sinanshi added a commit that referenced this issue Jun 29, 2017
@docsteveharris
Contributor Author

Thanks. So if I understand rightly

before - I had to explicitly state that dob was non-identifiable (and specify all variables as key, sensitive or non-identifying etc.)
now - I don't need to explicitly state that dob is non-identifiable but if I don't then dob is not released anyway/anywhere

@sinanshi
Contributor

Correct.

@docsteveharris
Contributor Author

If we remove the non-identifiable variable slot then we won't have a way of requesting those variables ...

i.e. the researcher/requester specifies the variables they want and classifies those variables as direct/key/sensitive/non-identifying. We review this and, if happy with the classification, we run it, and that subset of variables is extracted and anonymised as per the classification and k/l configuration. We then hand over the data ...

If we remove the non-identifying label we'll have to add back in those variables manually at the end ...

@sinanshi
Contributor

sinanshi commented Jun 30, 2017

Sorry, I made a mistake in the previous conversation!!

before - I had to explicitly state that dob was non-identifiable (and specify all variables as key, sensitive or non-identifying etc.)

Yes.

now - I don't need to explicitly state that dob is non-identifiable but if I don't then dob is not released anyway/anywhere

No, dob will be released as it is!!!

  • A variable is removed only when we explicitly specify it as a "direct var".
  • A variable is modified only when it is specified as a key/sensitive var.
  • A variable remains untouched (i.e. is released) if it is not specified as a direct/key/sensitive var.
  • The rest are assumed to be non-identifiable by default, i.e. they remain in the release.

Do you think it is necessary to switch the default to "direct identifiable"?

  • direct var: removed from the release.
  • key/sens: modified.
  • non-identifiable: remains.
  • the rest: treated as direct var, i.e. removed.

In the end, the logic becomes -- if the variable does not appear in key/sens/non-identifiable, it will be removed. It makes "direct var" redundant.
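To make the two defaults concrete, here is a small sketch that is not the package's code. The function `disposition`, the slot names and the example variables are all hypothetical. It also illustrates why the "direct var" slot becomes redundant once the default is "remove".

```python
def disposition(var, conf, default):
    """Return what happens to one variable: 'remove', 'modify' or 'keep'."""
    if var in conf.get("direct", []):
        return "remove"
    if var in conf.get("key", []) or var in conf.get("sensitive", []):
        return "modify"
    if var in conf.get("non_identifiable", []):
        return "keep"
    return default  # anything not listed in the configuration


conf = {
    "direct": ["nhs_number"],
    "key": ["age"],
    "sensitive": ["diagnosis"],
    "non_identifiable": ["heart_rate"],
}
variables = ["nhs_number", "age", "diagnosis", "heart_rate", "dob"]

# Current behaviour: unlisted variables are released untouched.
current = {v: disposition(v, conf, default="keep") for v in variables}

# Proposed behaviour: unlisted variables are removed.
proposed = {v: disposition(v, conf, default="remove") for v in variables}

# With the proposed default, deleting the "direct" slot changes nothing.
conf_no_direct = {k: v for k, v in conf.items() if k != "direct"}
proposed_no_direct = {
    v: disposition(v, conf_no_direct, default="remove") for v in variables
}
```

Here `current["dob"]` is `"keep"`, `proposed["dob"]` is `"remove"`, and `proposed` equals `proposed_no_direct`.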

@sinanshi
Contributor

Are we going to run conf files directly from the users? There might be a potential security hazard -- one can put a chunk of code in a conf file. I would not recommend letting users run their own configuration files unless we make the configuration file safer.
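One way to reduce that risk is to treat the configuration purely as data and validate it before use. The sketch below assumes, for illustration only, that the configuration can be expressed as JSON. The thread does not state the real file format, and the slot names and `load_conf` are hypothetical.

```python
import json

# Hypothetical set of permitted slots.
ALLOWED_SLOTS = {"direct", "key", "sensitive", "non_identifiable"}


def load_conf(text):
    """Parse a user-supplied configuration as data and validate its shape."""
    conf = json.loads(text)  # parses data only; nothing is executed
    if not isinstance(conf, dict):
        raise ValueError("configuration must be a mapping")
    unknown = set(conf) - ALLOWED_SLOTS
    if unknown:
        raise ValueError("unknown slots: " + ", ".join(sorted(unknown)))
    for slot, names in conf.items():
        is_name_list = isinstance(names, list) and all(
            isinstance(n, str) for n in names
        )
        if not is_name_list:
            raise ValueError(f"slot {slot!r} must be a list of variable names")
    return conf


conf = load_conf('{"key": ["age"], "non_identifiable": ["heart_rate"]}')
```

Anything that is not a plain list of names under a known slot is rejected before the extract runs.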

@docsteveharris
Contributor Author

I think that we should be semantically consistent, so Directvars should be identifiers. If we use Directvar to specify a variable that we want removed, does it appear in the release but with missing values replacing all the data, or is it just 'dropped'?

If possible

  • direct var: removed from the release (because it is an identifier; could remain as a column of missingness)
  • key/sens: modified (remains)
  • non-identifiable: remains unchanged, but won't be there unless explicitly requested
  • the rest (anything not specifically requested): removed, no column heading, no data at all
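The four rules above could be applied to rows of data roughly as follows. This is only a sketch: `release`, the slot names and the `band_age` modifier are hypothetical, and the real anonymisation of key/sensitive variables is stood in for by a caller-supplied function.

```python
def release(rows, conf, modify):
    """Apply the proposed rules to a list of row dictionaries."""
    direct = set(conf.get("direct", []))
    changed = set(conf.get("key", [])) | set(conf.get("sensitive", []))
    unchanged = set(conf.get("non_identifiable", []))
    out = []
    for row in rows:
        new = {}
        for name, value in row.items():
            if name in direct:
                new[name] = None  # column kept, values set to missing
            elif name in changed:
                new[name] = modify(name, value)
            elif name in unchanged:
                new[name] = value
            # Anything else was not requested, so the column is omitted.
        out.append(new)
    return out


def band_age(name, value):
    """Placeholder modifier: round a value down to its decade."""
    return value // 10 * 10


rows = [{"nhs_number": "123", "age": 47, "heart_rate": 80, "dob": "1970-01-01"}]
conf = {"direct": ["nhs_number"], "key": ["age"], "non_identifiable": ["heart_rate"]}
released = release(rows, conf, band_age)
```

In this example `dob` was never requested, so it does not appear in `released` at all, while `nhs_number` survives only as a column of missing values.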

Users then

  1. list the columns they want
  2. we provide the classification using pre-agreed definitions
  3. we recommend/provide a k-anon/l-div threshold based on their relationship with us
  4. we provide data with a measure of information loss

If the user is unhappy with the information loss they then need to alter the data request (drop columns) or negotiate a lower k-anon/l-div spec based on their relationship and local security arrangements.
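As a rough illustration of the trade-off between the k threshold and information loss, the sketch below enforces k-anonymity by suppressing rows whose combination of key variables occurs fewer than k times, and reports the fraction of rows lost. Both the suppression strategy and this loss measure are assumptions made for the example; the thread does not say how the package does either, and l-diversity is not shown.

```python
from collections import Counter


def suppress_to_k(rows, key_vars, k):
    """Drop rows whose key-variable combination occurs fewer than k times.

    Returns the kept rows and the fraction of rows that were suppressed.
    """
    def key_of(row):
        return tuple(row[v] for v in key_vars)

    counts = Counter(key_of(r) for r in rows)
    kept = [r for r in rows if counts[key_of(r)] >= k]
    loss = 1 - len(kept) / len(rows) if rows else 0.0
    return kept, loss


rows = [
    {"age_band": 40, "sex": "F"},
    {"age_band": 40, "sex": "F"},
    {"age_band": 40, "sex": "F"},
    {"age_band": 70, "sex": "M"},
]
kept, loss = suppress_to_k(rows, ["age_band", "sex"], k=2)
# kept holds the three matching rows; loss is 0.25
```

Raising k or requesting more key variables makes the groups smaller, so more rows are suppressed and the reported loss goes up.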

What do you think?
Is this a big change?

@sinanshi
Contributor

More intuitive for the users. Not such a big change. It's doable.

@sinanshi
Contributor

This probably means that I should also have removed the direct identifier fields; we would then have a master file that double-checks that we weren't being asked for any of these, but that might be a later piece of work.
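Such a double check could be as small as the sketch below. The master list contents and the name `forbidden_requests` are hypothetical.

```python
# Hypothetical master list of direct identifiers that must never be released.
MASTER_DIRECT_IDENTIFIERS = {"nhs_number", "name", "postcode"}


def forbidden_requests(requested):
    """Return the requested variables that appear in the master list."""
    return sorted(set(requested) & MASTER_DIRECT_IDENTIFIERS)


flagged = forbidden_requests(["age", "postcode", "heart_rate"])
```

A non-empty result (here `["postcode"]`) would mean the request has to be rejected or revised before any extract is run.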

@sinanshi sinanshi mentioned this issue Jul 24, 2017