Pseudonymization operator - initial discussion #1118

omri374 · 2023-07-12T05:21:28Z

omri374
Jul 12, 2023
Maintainer

DRAFT

Context / Problem Statement

Pseudonymization is a data management and de-identification procedure by which personally identifiable information fields within a data record are replaced by one or more artificial identifiers, or pseudonyms. A single pseudonym for each replaced field or collection of replaced fields makes the data record less identifiable while remaining suitable for data analysis. (source: Wikipedia)

In the context of Presidio, a pseudonymization operator would replace entity values with synthetic values, while maintaining a 1:1 mapping between the original value and the synthetic value.
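To make the 1:1 requirement concrete, here is a minimal sketch in plain Python. It is not Presidio code: the `pseudonymize_value` helper and the `<ENTITY_TYPE_index>` placeholder format are assumptions chosen only for illustration.

```python
from typing import Dict


# Hypothetical helper, not part of Presidio. It returns the pseudonym for a
# value and creates one on first sight, so repeated values always map to the
# same pseudonym and distinct values map to distinct pseudonyms.
def pseudonymize_value(
    entity_type: str, value: str, mapping: Dict[str, Dict[str, str]]
) -> str:
    per_type = mapping.setdefault(entity_type, {})
    if value not in per_type:
        # Assumed placeholder format: <ENTITY_TYPE_index>
        per_type[value] = f"<{entity_type}_{len(per_type)}>"
    return per_type[value]


mapping: Dict[str, Dict[str, str]] = {}
first = pseudonymize_value("PERSON", "Alice", mapping)
second = pseudonymize_value("PERSON", "Bob", mapping)
again = pseudonymize_value("PERSON", "Alice", mapping)  # same pseudonym as `first`
```

Note that the mapping itself holds the original values, which is why the storage question below matters.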

There are some decision factors to be addressed:

- How are synthetic values generated? Are they generated by the client/user and passed into Presidio Anonymizer, or are they generated as part of the Presidio Anonymizer process?
- Entity mappings between original and synthetic values are sensitive, as they contain PII. How can we make sure users have the ability to store them securely?
- Entity mappings could grow large in large-scale or long-running implementations.
- Entity mappings should not change between calls to Presidio Anonymizer, or when Presidio Anonymizer is called in parallel.
- How should the user perform de-pseudonymization? Is this something that the user/client is responsible for, or should it be supported by Presidio Deanonymizer?
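On the de-pseudonymization question, one possibility is that the client simply inverts the stored mapping. The sketch below assumes a flat `original -> pseudonym` dictionary and plain string matching; `invert_mapping` and `depseudonymize` are hypothetical names, and nothing here reflects how Presidio Deanonymizer works.

```python
import re
from typing import Dict


def invert_mapping(mapping: Dict[str, str]) -> Dict[str, str]:
    """Build pseudonym -> original, failing if the mapping is not 1:1."""
    inverse: Dict[str, str] = {}
    for original, pseudonym in mapping.items():
        if pseudonym in inverse:
            raise ValueError(f"Pseudonym {pseudonym!r} is not unique")
        inverse[pseudonym] = original
    return inverse


def depseudonymize(text: str, mapping: Dict[str, str]) -> str:
    """Restore original values by replacing every known pseudonym in one pass."""
    inverse = invert_mapping(mapping)
    if not inverse:
        return text
    # Longest pseudonyms first, so a pseudonym that is a prefix of another
    # one cannot shadow it in the alternation.
    alternatives = sorted(inverse, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in alternatives))
    return pattern.sub(lambda match: inverse[match.group(0)], text)


restored = depseudonymize(
    "<PERSON_0> met <PERSON_1>",
    {"Alice": "<PERSON_0>", "Bob": "<PERSON_1>"},
)
```

String matching only works when pseudonyms cannot occur naturally in the text; a span-based approach would avoid that assumption.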

Considered options

1. Maintain an entity-mapping object within Presidio Anonymizer + pass a lambda for logic on how to generate a new value

Pros:
- Simplest solution
- No API changes other than the addition of the lambda
Cons:
- Would not maintain persistence across multiple Presidio Anonymizer objects
- Entity mapping is not kept securely, and it adds state to a currently stateless process
- Complexity of passing the synthetic data generation logic through a REST API
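A rough sketch of option 1, assuming a hypothetical `MappingPseudonymizer` class that stands in for an operator holding the mapping as internal state. The class, its `operate` signature, and the generator lambda are illustrative assumptions, not Presidio's operator API.

```python
from typing import Callable, Dict


class MappingPseudonymizer:
    """Hypothetical stand-in for an operator that owns the entity mapping."""

    def __init__(self, generate: Callable[[str, int], str]) -> None:
        # `generate` is the user-supplied lambda: (entity_type, index) -> pseudonym
        self._generate = generate
        self._mapping: Dict[str, Dict[str, str]] = {}

    def operate(self, entity_type: str, value: str) -> str:
        per_type = self._mapping.setdefault(entity_type, {})
        if value not in per_type:
            per_type[value] = self._generate(entity_type, len(per_type))
        return per_type[value]


def make_operator() -> MappingPseudonymizer:
    return MappingPseudonymizer(lambda entity_type, index: f"<{entity_type}_{index}>")


op_a = make_operator()
op_b = make_operator()

a_alice = op_a.operate("PERSON", "Alice")
a_bob = op_a.operate("PERSON", "Bob")
# A second instance has its own state, so "Bob" gets a different pseudonym here.
b_bob = op_b.operate("PERSON", "Bob")
```

The last two calls illustrate the first con: mappings held inside one object are not shared with another object or process.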

2. Have the user pass an entity-mapping dictionary and get an updated entity-mapping as part of the response, while the generation of a new mapping between original and synthetic is the responsibility of Presidio Anonymizer

Pros:
- No need to persist the mapping within Presidio
- The user can decide on the logic for generating a synthetic value through a Python lambda
Cons:
- API changes are required: entity mappings must be passed as input to the anonymizer, and the anonymizer must return the updated mappings.
- Entity mappings could be large, or impossible to pass because of their sensitivity
- Synchronizing entity mappings in parallel computing scenarios is challenging and dependent on the parallelism/scale approach and framework
- Defining the logic for generation is difficult through REST APIs
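Option 2 could look roughly like the following stateless function, which receives the current mapping and returns both the pseudonymized text and an updated copy. This is a sketch under assumptions: `pseudonymize`, the `(entity_type, start, end)` tuples, and the default generator are hypothetical, and entity spans are assumed not to overlap.

```python
import copy
from typing import Callable, Dict, List, Tuple

Mapping = Dict[str, Dict[str, str]]
Entity = Tuple[str, int, int]  # (entity_type, start, end), assumed non-overlapping


def default_generate(entity_type: str, index: int) -> str:
    return f"<{entity_type}_{index}>"


def pseudonymize(
    text: str,
    entities: List[Entity],
    mapping: Mapping,
    generate: Callable[[str, int], str] = default_generate,
) -> Tuple[str, Mapping]:
    """Return (pseudonymized text, updated mapping) without mutating the input."""
    updated = copy.deepcopy(mapping)
    # First pass, in reading order: extend the mapping with unseen values.
    for entity_type, start, end in sorted(entities, key=lambda e: e[1]):
        per_type = updated.setdefault(entity_type, {})
        value = text[start:end]
        if value not in per_type:
            per_type[value] = generate(entity_type, len(per_type))
    # Second pass, right to left: earlier offsets stay valid while replacing.
    result = text
    for entity_type, start, end in sorted(entities, key=lambda e: e[1], reverse=True):
        result = result[:start] + updated[entity_type][text[start:end]] + result[end:]
    return result, updated


out1, mapping1 = pseudonymize("Alice met Bob", [("PERSON", 0, 5), ("PERSON", 10, 13)], {})
# The caller passes the returned mapping back in on the next call.
out2, mapping2 = pseudonymize("Bob called Carol", [("PERSON", 0, 3), ("PERSON", 11, 16)], mapping1)
```

The caller owns the mapping between calls, which is where the size, sensitivity, and synchronization concerns listed above come from.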

3. Have the user pass an entity-mapping dictionary which is meant to already be updated, so the responsibility of generating a new mapping is the client's

Pros:
- No need to persist the mapping within Presidio
- The user can decide on the logic for generating a synthetic value on the client side
- No need to pass the generation logic through the API
- Simple API change required
Cons:
- Requires the implementation of the mapping on the client side
- Entity mappings could be large
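For option 3, the anonymizer side reduces to a lookup, since the client is expected to supply a complete mapping. The sketch below assumes a hypothetical `apply_mapping` function and `(entity_type, start, end)` tuples with non-overlapping spans; the synthetic names are made up by the client.

```python
from typing import Dict, List, Tuple

Entity = Tuple[str, int, int]  # (entity_type, start, end), assumed non-overlapping


def apply_mapping(
    text: str, entities: List[Entity], mapping: Dict[str, Dict[str, str]]
) -> str:
    """Replace each entity with the client-supplied pseudonym; never generate one."""
    result = text
    # Right to left, so offsets of entities further left stay valid.
    for entity_type, start, end in sorted(entities, key=lambda e: e[1], reverse=True):
        value = text[start:end]
        try:
            pseudonym = mapping[entity_type][value]
        except KeyError:
            # The message omits the raw value on purpose, to avoid leaking PII.
            raise KeyError(
                f"No pseudonym supplied for {entity_type} at [{start}:{end}]"
            ) from None
        result = result[:start] + pseudonym + result[end:]
    return result


# The client builds the mapping with whatever generation logic it prefers.
client_mapping = {"PERSON": {"Alice": "Jane Doe", "Bob": "John Roe"}}
output = apply_mapping(
    "Alice met Bob", [("PERSON", 0, 5), ("PERSON", 10, 13)], client_mapping
)
```

Raising on a missing entry is one design choice; falling back to a generic placeholder would be another.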

4. Use the existing Replace operator and maintain the entire mapping logic on the client side

Pros:
- No change needed
- Full flexibility for the client side
Cons:
- Requires the most work on the client side
- Anonymizer features such as old and new spans would need to be handled by the user
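To give a sense of the span bookkeeping the client would take on in this option, here is a minimal sketch that applies replacements and reports where each new value ends up. `replace_and_track` is a hypothetical client-side helper, not a Presidio function, and it assumes non-overlapping spans.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def replace_and_track(
    text: str, replacements: List[Tuple[int, int, str]]
) -> Tuple[str, List[Span]]:
    """Apply (start, end, new_value) replacements and return the new text
    plus the span of each new value in it, in reading order."""
    pieces: List[str] = []
    new_spans: List[Span] = []
    cursor = 0  # position in the original text
    offset = 0  # length of the output built so far
    for start, end, new_value in sorted(replacements):
        if start < cursor:
            raise ValueError("Overlapping spans are not supported in this sketch")
        unchanged = text[cursor:start]
        pieces.append(unchanged)
        offset += len(unchanged)
        pieces.append(new_value)
        new_spans.append((offset, offset + len(new_value)))
        offset += len(new_value)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), new_spans


new_text, spans = replace_and_track(
    "Alice met Bob", [(0, 5, "<PERSON_0>"), (10, 13, "<PERSON_1>")]
)
# new_text is "<PERSON_0> met <PERSON_1>" and spans is [(0, 10), (15, 25)]
```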

Other options?

If you are working on a pseudonymization use case, please share your feedback on this.

@feynmanliang
@lordlinus
@SharonHart

omri374 · 2024-02-13T12:37:04Z

omri374
Feb 13, 2024
Maintainer Author

FYI, we now have a sample for pseudonymization: https://github.com/microsoft/presidio/blob/main/docs/samples/python/pseudonomyzation.ipynb
The sample is aligned with option 2 in this discussion.
