Presidio Image Redactor - improve scalability and design #1049
Hello! I'm trying to get a better understanding of how the image processing flow could happen, from the beginning up to the redaction step. Something along these lines:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


class RecognizerBase(ABC):
    """A class representing an abstract visual objects recognizer."""

    @abstractmethod
    def recognize(self, image: object) -> List["RecognizerResult"]:
        """Recognize visual objects.

        :param image: PIL Image/numpy array to be processed
        :return: List of the recognized objects
        """
        ...


@dataclass
class RecognizerResult:
    """Represents the result of analysing the image with a recognizer."""

    entity_type: str
    recognizer_name: str
    bbox: "Bbox"  # bounding box of the detected object
    text: Optional[str] = None
    polygon: Optional["Polygon"] = None
```

In the text case, for example, the result obtained by TesseractOCR could look like this:

```python
ocr_result = RecognizerResult(
    entity_type="text",
    recognizer_name="TesseractOCR",
    text="My name is John Doe, My phone number is 212-555-5555",
    bbox=...,  # bounding box of the full text region
)

# ...and we get two PII entities from presidio-analyzer:
#   type: PERSON,        start: 11, end: 19, score: 0.85
#   type: PHONE_NUMBER,  start: 40, end: 52, score: 0.75

# Now we need to somehow map them back to the boxes/polygons.
```

And depending on the type of recognizer, the mapping logic may vary (for example, it will be different for TesseractOCR and QR codes). One of the options is to create a separate class for recognizers that deal with text data:

```python
class TextRecognizer(RecognizerBase, ABC):
    @abstractmethod
    def map_pii_to_boxes(
        self, result: List[RecognizerResult], pii: List["PresidioResults"]
    ) -> List[RecognizerResult]:
        """Map PII entities found by presidio-analyzer back to boxes/polygons."""
        ...
```

Or something like that. Do you guys see the image processing flow roughly in the same direction?
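For the TesseractOCR case, the mapping itself could be as simple as intersecting the analyzer's character spans with per-word OCR offsets. Here is a minimal, self-contained sketch; the `WordBox` layout and the helper name are just assumptions for illustration, not existing Presidio types:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordBox:
    """A single OCR word with its character offsets in the full text and its box."""
    text: str
    start: int  # character offset of the word in the concatenated text
    end: int
    bbox: Tuple[int, int, int, int]  # left, top, width, height


def map_pii_to_word_boxes(
    words: List[WordBox], pii_spans: List[Tuple[str, int, int]]
) -> List[Tuple[str, List[Tuple[int, int, int, int]]]]:
    """Map analyzer character spans back to the OCR word boxes they overlap."""
    mapped = []
    for entity_type, start, end in pii_spans:
        boxes = [w.bbox for w in words if w.start < end and w.end > start]
        mapped.append((entity_type, boxes))
    return mapped


# Example: the beginning of "My name is John Doe, ..." split into words
words = [
    WordBox("My", 0, 2, (10, 5, 22, 12)),
    WordBox("name", 3, 7, (36, 5, 40, 12)),
    WordBox("is", 8, 10, (80, 5, 18, 12)),
    WordBox("John", 11, 15, (102, 5, 42, 12)),
    WordBox("Doe,", 16, 20, (148, 5, 40, 12)),
]
# The PERSON span 11..19 overlaps the "John" and "Doe," word boxes
print(map_pii_to_word_boxes(words, [("PERSON", 11, 19)]))
```

Tesseract's word-level output (e.g. `image_to_data`) already gives per-word boxes, so the recognizer would only need to track each word's offset in the concatenated text.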
Context / Problem Statement
The Presidio Image Redactor package was developed as a beta as part of the effort on Presidio V2 in January 2021. It features a simple OCR pipeline which extracts text, parses it, and sends it to the Presidio Analyzer package. Once PII is identified, bounding boxes are matched with `RecognizerResult` objects and those bounding boxes are redacted. Two significant contributions to the package, one focusing on DICOM and the other on QR code scanning, together with the limited accuracy and extensibility of the existing design, make the case for improving the package's performance and its generalizability to new use cases: starting with DICOM and extending to non-textual PII objects such as faces, or semi-textual objects like QR codes and license plate numbers.
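For context, the current usage of the package looks roughly like this, based on the package's documented example (file names are placeholders):

```python
from PIL import Image
from presidio_image_redactor import ImageRedactorEngine

# Load an image containing text with PII (placeholder file name)
image = Image.open("image_with_pii.png")

# OCR -> presidio-analyzer -> match spans to bounding boxes -> fill the boxes
engine = ImageRedactorEngine()
redacted_image = engine.redact(image, (0, 0, 0))
redacted_image.save("image_redacted.png")
```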
This ADR proposes an architecture change to `presidio-image-redactor` to make it more similar to the structure of `presidio-analyzer` and `presidio-anonymizer`, which are used in the text version of Presidio. The proposed high-level flow is described in more detail below.
Like the structure of the `AnalyzerEngine` object, the new `ImageAnalyzerEngine` would contain a list of recognizers, one for each type of detection logic. Each recognizer would contain the logic for its specific kind of detection (see the sketch below).
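As an illustration only, here is a minimal sketch of how such an engine could drive its recognizers, reusing the `RecognizerBase`/`RecognizerResult` sketch from the comment above; the class and method names are assumptions based on this proposal, not the existing API:

```python
from typing import List


class ImageAnalyzerEngine:
    """Runs every registered recognizer against the image and collects the results."""

    def __init__(self, recognizers: List[RecognizerBase]):
        self.recognizers = recognizers

    def analyze(self, image: object) -> List[RecognizerResult]:
        results: List[RecognizerResult] = []
        for recognizer in self.recognizers:
            # Each recognizer (OCR text, QR code, face, ...) contributes its own results
            results.extend(recognizer.recognize(image))
        return results


# Hypothetical usage:
# engine = ImageAnalyzerEngine(recognizers=[TesseractTextRecognizer(), QrCodeRecognizer()])
# results = engine.analyze(image)
```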
The output of the `ImageAnalyzerEngine` would then be passed to the `ImageAnonymizerEngine`, which would expose operators such as `redact` or `validate`, and users would be able to create additional types of operators (see the sketch below).
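For illustration, a custom operator could be as small as the sketch below; the `BlurBoxOperator` class, its `operate` method, and the way the engine would register it are assumptions for the sake of the example, not the existing API:

```python
from typing import List, Tuple

from PIL import Image


class BlurBoxOperator:
    """A hypothetical operator that pixelates bounding boxes instead of filling them."""

    name = "blur"

    def operate(
        self, image: Image.Image, bboxes: List[Tuple[int, int, int, int]]
    ) -> Image.Image:
        result = image.copy()
        for left, top, width, height in bboxes:
            box = (left, top, left + width, top + height)
            # Downscale and upscale the region to obscure its content
            region = result.crop(box).resize((8, 8)).resize((width, height))
            result.paste(region, box)
        return result


# Hypothetical registration and use:
# engine = ImageAnonymizerEngine(operators=[BlurBoxOperator()])
# anonymized = engine.anonymize(image, analyzer_results, operator="blur")
```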
Consequences
Links