detection-of-personal-data is a CLI tool that detects sensitive personal data, including names, contact information, health details, identification numbers, and financial details.
Users can supply a variety of text files (e.g., .txt, .csv), which the tool then processes, returning a JSON document. The JSON not only indicates whether personal information is present but also provides tags for the detected data.
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
A regular expression is a method used in programming for pattern matching. Regular expressions provide a flexible and concise means to match strings of text.
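As an illustration of pattern matching on text, here is a minimal sketch using Python's standard `re` module. The two patterns and the `find_candidates` helper are hypothetical and deliberately simplistic; they are not the rules this tool actually uses.

```python
import re

# Illustrative patterns only (assumptions, not the project's real rules).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def find_candidates(text: str) -> dict[str, list[str]]:
    """Return substrings that look like e-mail addresses or phone numbers."""
    return {
        "email": EMAIL_RE.findall(text),
        "phone": PHONE_RE.findall(text),
    }


sample = "Contact Jane at jane.doe@example.com or +1 555 010 9999."
print(find_candidates(sample))
```

Patterns like these are cheap to run but only catch well-structured data; free-form items such as names usually need a statistical or model-based approach.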
State-of-the-art Machine Learning for PyTorch, TensorFlow and JAX. Transformers provides APIs to easily download and train state-of-the-art pretrained models.
Retrieve command help with:
poetry run detection-of-personal-data pii-detect --help
Usage: detection-of-personal-data pii-detect [OPTIONS]
Represents cli 'pii_detect' command
Options:
  -i, --input TEXT               path to text file  [required]
  -o, --output TEXT              output directory where json file will be
                                 written  [default: .]
  -tr, --thresh <TEXT FLOAT>...  the minimum probability of private data for
                                 labels
  -f, --force                    overwrite existing file
  --dry-run                      passthrough, will not write anything
  --help                         Show this message and exit.
Example:
poetry run detection-of-personal-data pii-detect \
-tr person 0.3 \
-tr passport 0.3 \
-i ./tests/data/inputs_test/text \
-o ./tests/data/outputs -f
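To make the effect of the `-tr` label/probability pairs more concrete, the sketch below shows one way per-label thresholds could be applied. It is an assumption about the behaviour, not the tool's actual implementation: the function names, the detection dictionaries with `label` and `score` keys, and the fallback threshold of 0.5 are all hypothetical.

```python
def parse_thresholds(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Turn (label, probability) pairs, as given via -tr, into a mapping."""
    thresholds: dict[str, float] = {}
    for label, value in pairs:
        prob = float(value)
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"threshold for {label!r} must be in [0, 1]")
        thresholds[label] = prob
    return thresholds


def filter_detections(
    detections: list[dict],
    thresholds: dict[str, float],
    default: float = 0.5,  # hypothetical fallback for labels without -tr
) -> list[dict]:
    """Keep detections whose score reaches the threshold for their label."""
    return [
        d for d in detections
        if d["score"] >= thresholds.get(d["label"], default)
    ]


thresholds = parse_thresholds([("person", "0.3"), ("passport", "0.3")])
detections = [  # made-up scores, for illustration only
    {"label": "person", "score": 0.42},
    {"label": "passport", "score": 0.12},
    {"label": "email", "score": 0.70},
]
print(filter_detections(detections, thresholds))
```

With these made-up scores, the `person` entry passes its 0.3 threshold, the `passport` entry does not, and `email` is compared against the assumed default.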
The repository targets Python 3.9 and higher.
The repository uses Poetry for Python packaging and dependency management. Make sure it is properly installed before you begin.
curl -sSL https://install.python-poetry.org | python3
You can follow the link below on how to install and configure Docker on your local machine:
The project is built with Poetry. Initialize it using:
poetry install
⚠️ Ensure your code complies with our linters to pass CI checks.
Code linting is performed by flake8.
poetry run flake8 --count --show-source --statistics
Static type check is performed by mypy.
poetry run mypy .
To improve code quality, our workflows run additional linters. If you want them to succeed in CI, run these checks locally as well.
Markdown linting is performed by markdownlint-cli.
markdownlint "**/*.md"
Docker linting is performed by hadolint.
hadolint Dockerfile
⚠️ Be sure to write passing tests so that CI checks succeed.
Unit testing is performed by the pytest testing framework.
poetry run pytest -v
Build a local docker image using the following command line:
docker build -t detection-of-personal-data .
Once built, you can run the container locally with the following command line:
docker run -ti --rm detection-of-personal-data
Please check out the OKP4 health files: