Commit

Merge pull request #19 from eriknovak/feature/pattern-extractor
Adds a new Pattern Extractor and updates the package documentation
eriknovak authored Jul 16, 2024
2 parents d0af713 + 56f700a commit 5510030
Showing 67 changed files with 3,746 additions and 4,386 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -139,6 +139,6 @@ data/**/
!data/README.md

notebooks
!docs/documentation/notebooks
!docs/how-to-guides/notebooks

scripts
16 changes: 8 additions & 8 deletions CHANGELOG.md
@@ -1,30 +1,30 @@
anonipy-0.0.8 (2024-06-17)
### anonipy-0.0.8 (2024-06-17)

- Add automatic date format detection support to DateGenerator

anonipy-0.0.7 (2024-06-06)
### anonipy-0.0.7 (2024-06-06)

- Upgrade gliner-spacy to have cleaner code
- Add function to help manual post-anonymization replacement fixing

anonipy-0.0.6 (2024-05-31)
### anonipy-0.0.6 (2024-05-31)

- Add GPU support and entity scores to EntityExtractor
- Standardize the function naming in strategies

anonipy-0.0.5 (2024-05-29)
### anonipy-0.0.5 (2024-05-29)

- Re-implement file reading methods + add unit tests
- Expand the test environment on all OS

anonipy-0.0.4 (2024-05-27)
### anonipy-0.0.4 (2024-05-27)

- Add unit tests
- Fix the LANGUAGES constant
- Refine the Entity implementation
- Update documentation

anonipy-0.0.3 (2024-05-22)
### anonipy-0.0.3 (2024-05-22)

- Add read_json function
- Add write_json function
@@ -33,11 +33,11 @@ anonipy-0.0.3 (2024-05-22)
- Reduce the number of viable suggestions used to create a substitute in MaskLabelGenerator
- Add the entity label to the replacements in strategies

anonipy-0.0.2 (2024-05-22)
### anonipy-0.0.2 (2024-05-22)

- Add write_file function
- Add blog to the documentation

anonipy-0.0.1 (2024-05-21)
### anonipy-0.0.1 (2024-05-21)

- Initial release
30 changes: 14 additions & 16 deletions README.md
@@ -30,26 +30,24 @@

The anonipy package is a python package for data anonymization. It is designed to be simple to use and highly customizable, supporting different anonymization strategies. Powered by LLMs.

## Requirements
## Requirements
Before starting the project make sure these requirements are available:

- [python]. The python programming language (v3.8, v3.9, v3.10, v3.11).

## 💾 Install
## Install

```bash
pip install anonipy
```

## ⬆️ Upgrade
## Upgrade

```bash
pip install anonipy --upgrade
```

## 🔎 Example

The details of the example can be found in the [Overview](https://eriknovak.github.io/anonipy/documentation/notebooks/00-overview.ipynb).
## Example

```python
original_text = """\
@@ -77,14 +75,14 @@ Use the language detector to detect the language of the text:
```python
from anonipy.utils.language_detector import LanguageDetector

lang_detector = LanguageDetector()
language = lang_detector(original_text)
language_detector = LanguageDetector()
language = language_detector(original_text)
```

Prepare the entity extractor and extract the personal information from the original text:

```python
from anonipy.anonymize.extractors import EntityExtractor
from anonipy.anonymize.extractors import NERExtractor

# define the labels to be extracted and anonymized
labels = [
@@ -94,14 +92,14 @@ labels = [
{"label": "date", "type": "date"},
]

# language taken from the language detector
entity_extractor = EntityExtractor(labels, lang=language, score_th=0.5)
# initialize the NER extractor for the language and labels
extractor = NERExtractor(labels, lang=language, score_th=0.5)

# extract the entities from the original text
doc, entities = entity_extractor(original_text)
doc, entities = extractor(original_text)

# display the entities in the original text
entity_extractor.display(doc)
extractor.display(doc)
```

Use generators to create substitutes for the entities:
@@ -123,9 +121,9 @@ def anonymization_mapping(text, entity):
if entity.type == "string":
return llm_generator.generate(entity, temperature=0.7)
if entity.label == "date":
return date_generator.generate(entity, output_gen="middle_of_the_month")
return date_generator.generate(entity, output_gen="MIDDLE_OF_THE_MONTH")
if entity.label == "date of birth":
return date_generator.generate(entity, output_gen="middle_of_the_year")
return date_generator.generate(entity, output_gen="MIDDLE_OF_THE_YEAR")
if entity.label == "social security number":
return number_generator.generate(entity)
return "[REDACTED]"
@@ -143,7 +141,7 @@ pseudo_strategy = PseudonymizationStrategy(mapping=anonymization_mapping)
anonymized_text, replacements = pseudo_strategy.anonymize(original_text, entities)
```
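
As a rough illustration of what applying a list of replacements to a text involves, here is a minimal, self-contained sketch. It is not anonipy's implementation: the `apply_replacements` helper and the dictionary keys (`start_index`, `end_index`, `anonymized_text`) are assumptions made only for this example.

```python
from typing import Dict, List


def apply_replacements(text: str, replacements: List[Dict]) -> str:
    """Replace character spans in `text` (hypothetical helper, not anonipy's API)."""
    # Work from the last span to the first so that earlier offsets stay valid
    ordered = sorted(replacements, key=lambda r: r["start_index"], reverse=True)
    for r in ordered:
        text = text[: r["start_index"]] + r["anonymized_text"] + text[r["end_index"] :]
    return text


sample = "Name: John Doe, born 15-01-1985"
sample_replacements = [
    {"start_index": 6, "end_index": 14, "anonymized_text": "[REDACTED]"},
    {"start_index": 21, "end_index": 31, "anonymized_text": "01-07-1985"},
]
print(apply_replacements(sample, sample_replacements))
```

Processing the spans from right to left avoids recomputing offsets when a substitute is longer or shorter than the text it replaces.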

## 📖 Acknowledgements
## Acknowledgements

[Anonipy](https://eriknovak.github.io/anonipy/) is developed by the
[Department for Artificial Intelligence](http://ailab.ijs.si/) at the
30 changes: 10 additions & 20 deletions anonipy/__init__.py
@@ -1,25 +1,15 @@
"""
anonipy
The anonipy package provides utilities for data anonymization.
Submodules
----------
anonymize :
The package containing anonymization classes and functions.
utils :
The package containing utility classes and functions.
definitions :
The object definitions used within the package.
constants :
The constant values used to help with data anonymization.
"""`Anonipy` is a text anonymization package.
The `anonipy` package provides utilities for data anonymization. It provides
a set of modules and utilities for (1) identifying relevant information
that needs to be anonymized, (2) generating substitutes for the identified
information, and (3) strategies for anonymizing the identified information.
How to use the documentation
----------------------------
Documentation is available in two forms: docstrings provided
with the code and a loose standing reference guide, available
from `the anonipy homepage <https://eriknovak.github.io/anonipy>`.
Modules:
anonymize: The module containing the anonymization submodules and utility.
utils: The module containing utility classes and functions.
definitions: The module containing predefined types used across the package.
constants: The module containing the predefined constants used across the package.
"""

30 changes: 12 additions & 18 deletions anonipy/anonymize/__init__.py
@@ -1,29 +1,23 @@
"""
anonymize
"""Module containing the anonymization modules and utility.
The module provides a set of anonymization utilities.
The `anonymize` module provides a set of anonymization modules and utility,
including `extractors`, `generators`, and `strategies`. In addition, it provides
methods for anonymizing text based on a list of replacements.
Submodules
----------
extractors :
The module containing the extractor classes
generators :
The module containing the generator classes
strategies :
The module containing the strategy classes
regex :
The module containing the regex patterns
Modules:
extractors: The module containing the extractor classes.
generators: The module containing the generator classes.
strategies: The module containing the strategy classes.
Methods
-------
anonymize()
Methods:
anonymize(text, replacements):
Anonymize the text based on the replacements.
"""

from . import extractors
from . import generators
from . import strategies
from . import regex
from .helpers import anonymize

__all__ = ["extractors", "generators", "strategies", "regex", "anonymize"]
__all__ = ["extractors", "generators", "strategies", "anonymize"]
23 changes: 12 additions & 11 deletions anonipy/anonymize/extractors/__init__.py
@@ -1,18 +1,19 @@
"""
extractors
"""Module containing the `extractors`.
The module provides a set of extractors used in the library.
The `extractors` module provides a set of extractors used to identify relevant
information within a document.
Classes
-------
ExtractorInterface :
The class representing the extractor interface
EntityExtractor :
The class representing the entity extractor
Classes:
NERExtractor: The class representing the named entity recognition (NER) extractor.
PatternExtractor: The class representing the pattern extractor.
MultiExtractor: The class representing the multi extractor.
"""

from .interface import ExtractorInterface
from .entity_extractor import EntityExtractor
from .multi_extractor import MultiExtractor
from .ner_extractor import NERExtractor
from .pattern_extractor import PatternExtractor


__all__ = ["ExtractorInterface", "EntityExtractor"]
__all__ = ["ExtractorInterface", "MultiExtractor", "NERExtractor", "PatternExtractor"]
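
The hunks shown here do not include the `PatternExtractor` implementation itself. As a rough illustration of the general idea behind pattern-based extraction, the sketch below matches labelled regular expressions against a text with Python's `re` module and records each match as a span. The `find_pattern_spans` function and its output format are hypothetical and are not anonipy's API.

```python
import re
from typing import Dict, List


def find_pattern_spans(text: str, patterns: Dict[str, str]) -> List[dict]:
    """Return labelled spans for every regex match (hypothetical helper)."""
    spans = []
    for label, pattern in patterns.items():
        for match in re.finditer(pattern, text):
            spans.append(
                {
                    "label": label,
                    "text": match.group(0),
                    "start_index": match.start(),
                    "end_index": match.end(),
                }
            )
    # Order the spans by their position in the text
    return sorted(spans, key=lambda s: s["start_index"])


spans = find_pattern_spans(
    "Visit on 2024-07-16, SSN 123-45-6789.",
    {"date": r"\d{4}-\d{2}-\d{2}", "ssn": r"\d{3}-\d{2}-\d{4}"},
)
for span in spans:
    print(span["label"], span["text"])
```

Spans in this form carry the character offsets that a later replacement step would need.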