Commit

Add blog posts to documentation
eriknovak committed May 22, 2024
1 parent d7e8da1 commit 84a2a09
Showing 4 changed files with 175 additions and 0 deletions.
6 changes: 6 additions & 0 deletions docs/blog/.authors.yml
@@ -0,0 +1,6 @@
authors:
  eriknovak:
    name: Erik Novak
    description: Creator
    avatar: https://avatars.githubusercontent.com/u/9943382
    url: https://github.com/eriknovak
2 changes: 2 additions & 0 deletions docs/blog/index.md
@@ -0,0 +1,2 @@
# Blog

164 changes: 164 additions & 0 deletions docs/blog/posts/anonymizing-documents.md
@@ -0,0 +1,164 @@
---
date: 2024-05-22
authors: [eriknovak]
description: >
  Our package can be used to anonymize multiple documents.
categories:
  - Tutorial
---

# Anonymizing documents

The `anonipy` package was designed for anonymizing text. However, a lot of text
data is stored in document form, such as PDFs, Word documents, and others. Copying
the text out of each document by hand can be cumbersome, so the `anonipy` package
provides utility functions that extract the text from the documents.


In this blog post, we explain how `anonipy` can be used to anonymize texts in
document form.

!!! info "Prerequisites"
    To use the `anonipy` package, we must have Python version 3.8 or higher
    installed on the machine.

## Installation

Before we start, we must first install the `anonipy` package. To do that, run the
following command in the terminal:

```bash
pip install anonipy
```

This will install the `anonipy` package, which contains all of the required modules.

## Document anonymization

### Extracting the text from the document

Next, we will use the `anonipy` package to anonymize the text in the document.
First, we must extract the text. This can be done using the package's utility
function `open_file`. It uses the [textract](https://textract.readthedocs.io/en/stable/) package to extract the text from different types of documents.


To extract the text, use the following code:

```python
from anonipy.utils.file_system import open_file

file_text = open_file(file_path)
```

where `file_path` is the path to the document we want to anonymize. The `open_file`
function opens the document, extracts its content, and returns it as a string.

Once this is done, we can anonymize the text in the usual way.

### Extracting personal information from the text

Now we can identify and extract personal information from the text. We do this
by using `EntityExtractor`, an extractor that leverages the [GLiNER](https://github.com/urchade/GLiNER) span-based NER models.

When applied to a text, the extractor returns the processed text and the extracted entities.

```python
from anonipy.constants import LANGUAGES
from anonipy.anonymize.extractors import EntityExtractor

# define the labels to be extracted and their types
labels = [
    {"label": "name", "type": "string"},
    {"label": "social security number", "type": "custom"},
    {"label": "date of birth", "type": "date"},
    {"label": "date", "type": "date"},
]

# initialize the entity extractor; score_th is the score threshold
# (0.5 is an assumed example value, adjust it to your data)
entity_extractor = EntityExtractor(
    labels, lang=LANGUAGES.ENGLISH, score_th=0.5
)
# extract the entities from the original text
doc, entities = entity_extractor(file_text)
```
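
To give an intuition for the `score_th` argument, here is a minimal sketch that
assumes it acts as a minimum confidence score, as thresholds in NER models
typically do. The candidate dictionaries and their scores are made up for
illustration and are not the objects returned by `EntityExtractor`.

```python
# Hypothetical candidate entities with confidence scores (illustrative only).
candidates = [
    {"text": "John Doe", "label": "name", "score": 0.92},
    {"text": "Monday", "label": "date", "score": 0.41},
    {"text": "1985-03-02", "label": "date of birth", "score": 0.77},
]

# Assumed behaviour: keep only candidates scoring at or above the threshold.
score_th = 0.5
kept = [c for c in candidates if c["score"] >= score_th]

print([c["text"] for c in kept])  # ['John Doe', '1985-03-02']
```

Under this assumption, a higher threshold yields fewer but more reliable
entities, while a lower one catches more personal information at the cost of
more false positives.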

To display the entities in the original text, we can use the `display` method:

```python
entity_extractor.display(doc)
```


### Preparing the anonymization mapping

Next, we prepare the anonymization mapping. We do this by using the generators
module of the `anonipy` package. The generators are used to generate
substitutes for the entities.

For example, we can use `MaskLabelGenerator` to generate substitutes using
language models to solve a `mask-filling` problem, i.e. finding the words that
would be probabilistically suitable to replace the entity in the text.
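
To make the mask-filling idea concrete, the following minimal sketch shows only
the first half of the problem: replacing an entity span with a mask token. It is
not how `MaskLabelGenerator` is implemented; the `mask_entity` helper, the
`<mask>` token, and the example sentence are assumptions chosen for illustration.

```python
def mask_entity(text: str, start: int, end: int, mask_token: str = "<mask>") -> str:
    """Hypothetical helper: replace text[start:end] with a mask token."""
    return text[:start] + mask_token + text[end:]


text = "John Doe visited the clinic on Monday."

# Locate the entity span we want to hide.
start = text.index("John Doe")
end = start + len("John Doe")

masked = mask_entity(text, start, end)
print(masked)  # <mask> visited the clinic on Monday.
```

A masked language model would then propose words for the masked position, and
one of the suitable candidates becomes the substitute for the entity.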

The full list of available generators can be found [here](/documentation/notebooks/02-generators).

Furthermore, we use the `PseudonymizationStrategy` to anonymize the text. More
on anonymization strategies can be found [here](/documentation/notebooks/03-strategies).


```python
from anonipy.anonymize.generators import (
    MaskLabelGenerator,
    DateGenerator,
    NumberGenerator,
)
from anonipy.anonymize.strategies import PseudonymizationStrategy

# initialize the generators
mask_generator = MaskLabelGenerator()
date_generator = DateGenerator()
number_generator = NumberGenerator()

# prepare the anonymization mapping
def anonymization_mapping(text, entity):
    # names and other free-text entities are replaced via mask filling
    if entity.type == "string":
        return mask_generator.generate(entity, text)
    if entity.label == "date":
        return date_generator.generate(entity, output_gen="middle_of_the_month")
    if entity.label == "date of birth":
        return date_generator.generate(entity, output_gen="middle_of_the_year")
    if entity.label == "social security number":
        return number_generator.generate(entity)
    # fallback for any entity not covered above
    return "[REDACTED]"

# initialize the pseudonymization strategy
pseudo_strategy = PseudonymizationStrategy(mapping=anonymization_mapping)
```

### Anonymizing the text

Once we prepare the anonymization strategy, we can use it to anonymize the text.

```python
# anonymize the original text
anonymized_text, replacements = pseudo_strategy.anonymize(file_text, entities)
```
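
Conceptually, anonymizing the text means swapping each entity span for its
substitute. The sketch below illustrates that step in plain Python. It is not
the `anonipy` implementation, and the `apply_replacements` helper and the
dictionary keys (`start`, `end`, `substitute`) are assumptions that may differ
from what `anonymize` returns in `replacements`.

```python
def apply_replacements(text, replacements):
    """Hypothetical helper: swap each [start, end) span for its substitute."""
    # Work from the end of the text to the start so that earlier
    # start/end indices stay valid after each replacement.
    for item in sorted(replacements, key=lambda r: r["start"], reverse=True):
        text = text[: item["start"]] + item["substitute"] + text[item["end"] :]
    return text


text = "John Doe was born on 1985-03-02."
replacements = [
    {"start": 0, "end": 8, "substitute": "Mark Smith"},
    {"start": 21, "end": 31, "substitute": "1985-07-01"},
]

print(apply_replacements(text, replacements))
# Mark Smith was born on 1985-07-01.
```

Processing the spans right to left matters because substitutes rarely have the
same length as the original entities.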

### Saving the anonymized text

Finally, we can save the anonymized text to a file. This can be done using the
`write_file` function from the `anonipy.utils.file_system` module.

```python
from anonipy.utils.file_system import write_file

write_file(anonymized_text, output_file, encode="utf-8")
```

Here, `output_file` is the path to the file where the anonymized text will be saved.
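
One convenient way to choose `output_file` is to derive it from the input path.
The following is a minimal sketch using only the standard library; the
`build_output_path` helper, the `.anonymized.txt` suffix, and the example path
are assumptions for illustration and are not part of `anonipy`.

```python
from pathlib import Path


def build_output_path(file_path: str, suffix: str = ".anonymized.txt") -> Path:
    """Hypothetical helper: place the output next to the input document."""
    source = Path(file_path)
    # Drop the original extension (e.g. ".pdf") and append the new suffix.
    return source.with_name(source.stem + suffix)


output_file = build_output_path("reports/patient_letter.pdf")
print(output_file.name)  # patient_letter.anonymized.txt
```

If `write_file` expects a plain string rather than a `Path` object, pass
`str(output_file)` instead.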


## Conclusion

In this blog post, we showed how to anonymize documents using the `anonipy` package.
We first used the `open_file` utility function to extract the content of the document
and store it as a string. We then used the `EntityExtractor` to identify and extract
personal information from the text, and the `PseudonymizationStrategy` in combination
with various generators to anonymize the text. Finally, we used the `write_file`
function to save the anonymized text to a file.

This process is straightforward and can be applied to any document type that
`open_file` can read. Furthermore, it can be expanded to process multiple documents
written in the same language at once. Stay tuned to see how this can be done in the future!
3 changes: 3 additions & 0 deletions mkdocs.yml
@@ -45,6 +45,7 @@ edit_uri: ""

# Plugins
plugins:
  - blog
  - search
  - mkdocs-jupyter:
      include: ["*.ipynb"]
@@ -78,5 +79,7 @@ nav:
- Generators: documentation/notebooks/02-generators.ipynb
- Strategies: documentation/notebooks/03-strategies.ipynb
- Utility: documentation/notebooks/04-utility.ipynb
- Blog:
- blog/index.md

- Development: development.md
