For a detailed conceptual understanding of handling sensitive data on the Databricks Data Intelligence Platform, please refer to the following blog post:
Path to Data Protection and Compliance with Databricks Data Intelligence Platform
- Unity Catalog enabled Databricks Workspaces
- User with Workspace Admin Privilege
- Unity Catalog enabled cluster with DBR 13.3 or above
The solution accelerator includes sample scripts for the following tasks
- Generate fake PII data using the Python Faker library
- Detect and tag PII in Unity Catalog governed tables using the Databricks Labs project DiscoverX and Presidio
- Encrypt columns (including free text columns)
- Apply dynamic column level masking
- Clone the repository into a Databricks Workspace Repo
- Create UC catalogs and schemas (examples below)
- DIZ Catalog: example
- DIZ Schema: default
- Dev Catalog: example_dev
- Dev Schema: default
- Prod Catalog: example_prod
- Prod Schema: default
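
The catalog and schema setup above can be scripted. The following is a minimal sketch, not part of the accelerator itself: the helper `build_ddl` is a hypothetical name, and it assumes the metastore lets you create catalogs without specifying a managed location. It only builds the SQL statements; executing them requires a Databricks notebook, where `spark` is predefined.

```python
# Example catalog -> schema pairs taken from the list above.
CATALOG_SCHEMAS = {
    "example": "default",       # DIZ
    "example_dev": "default",   # Dev
    "example_prod": "default",  # Prod
}


def build_ddl(catalog_schemas):
    """Return CREATE CATALOG / CREATE SCHEMA statements (hypothetical helper)."""
    statements = []
    for catalog, schema in catalog_schemas.items():
        statements.append(f"CREATE CATALOG IF NOT EXISTS `{catalog}`")
        statements.append(f"CREATE SCHEMA IF NOT EXISTS `{catalog}`.`{schema}`")
    return statements


ddl_statements = build_ddl(CATALOG_SCHEMAS)
for stmt in ddl_statements:
    print(stmt)
    # In a Databricks notebook, uncomment the next line to run the statement:
    # spark.sql(stmt)
```

The `IF NOT EXISTS` clauses make the script safe to re-run if some of the catalogs or schemas already exist.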
- Create an account group for privileged users (example: prod-privileged-users) and add the users to it. This group is used for fine-grained access controls.
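
To illustrate how such a group can drive column-level masking, here is a hedged sketch that builds a Unity Catalog masking function and attaches it to a column. It is not the accelerator's own notebook code. The table `example_prod.default.customers`, the column `email`, the function name `mask_pii`, and the helper `build_mask_statements` are all assumptions made for illustration; only the group name comes from the example above.

```python
def build_mask_statements(table, column, mask_function, group_name):
    """Return SQL that creates a masking function and applies it to a column."""
    # Members of the privileged group see the real value; everyone else
    # sees a fixed placeholder string.
    create_function = (
        f"CREATE OR REPLACE FUNCTION {mask_function}(col_value STRING) "
        f"RETURN CASE WHEN is_account_group_member('{group_name}') "
        f"THEN col_value ELSE '*****' END"
    )
    apply_mask = (
        f"ALTER TABLE {table} ALTER COLUMN {column} SET MASK {mask_function}"
    )
    return [create_function, apply_mask]


mask_statements = build_mask_statements(
    table="example_prod.default.customers",       # hypothetical table
    column="email",                               # hypothetical column
    mask_function="example_prod.default.mask_pii",  # hypothetical function name
    group_name="prod-privileged-users",
)
for stmt in mask_statements:
    print(stmt)
    # In a Databricks notebook, uncomment the next line to run the statement:
    # spark.sql(stmt)
```

Keeping the group name as a parameter mirrors the `privileged_group_name` job parameter, so the same logic can be pointed at a different group per environment.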
- Run the notebooks
Option 1: Run the notebooks manually in sequence.
- Run all the notebooks (except notebook `4. Prod CLM enforcement.py`) on a cluster in Assigned (Single User) access mode running DBR 13.3 or above.
Option 2: Create a workflow and create a Task for each Notebook
- Set the following job parameters

| Name | Value |
| --- | --- |
| diz_catalog | example |
| diz_schema | default |
| num_rows | 1000 |
| prod_catalog | example_prod |
| dev_catalog | example_dev |
| target_schema | default |
| free_text | freetext |
| privileged_group_name | prod-privileged-users |

Option 3: Create a workflow from the JSON definition using the Jobs API create endpoint or the Databricks CLI. Note that recreating this job requires you to update the highlighted identifiers with the right values.
- 3a. Copy the contents of the [JSON definition](workflow/create_databricks_job.json) file and replace the string "[email protected]" with your Databricks username.
- 3b. Create a job (via the Databricks CLI or the Databricks REST API) using the JSON (example: `databricks jobs create --json '<json content>'`).
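
For orientation, the sketch below shows roughly how a Jobs API 2.1 payload with job-level parameters and one notebook task per notebook might be assembled. It is not the repository's `workflow/create_databricks_job.json`; use that file for the actual job. The helper `build_job_payload`, the job name, and the notebook paths are hypothetical, and compute settings are omitted, so the payload would need a cluster specification before being submitted.

```python
import json

# Job parameters from the table above (values are sent as strings).
JOB_PARAMETERS = {
    "diz_catalog": "example",
    "diz_schema": "default",
    "num_rows": "1000",
    "prod_catalog": "example_prod",
    "dev_catalog": "example_dev",
    "target_schema": "default",
    "free_text": "freetext",
    "privileged_group_name": "prod-privileged-users",
}


def build_job_payload(name, notebook_paths, parameters):
    """Build a job definition that runs the notebooks one after another."""
    tasks = []
    previous_key = None
    for index, path in enumerate(notebook_paths, start=1):
        task = {
            "task_key": f"task_{index}",
            "notebook_task": {"notebook_path": path},
        }
        if previous_key is not None:
            # Chain the tasks so they run in sequence.
            task["depends_on"] = [{"task_key": previous_key}]
        tasks.append(task)
        previous_key = task["task_key"]
    return {
        "name": name,
        "parameters": [
            {"name": key, "default": value} for key, value in parameters.items()
        ],
        "tasks": tasks,
    }


payload = build_job_payload(
    name="sensitive-data-demo",  # hypothetical job name
    notebook_paths=[
        "/Workspace/path/to/notebook_1",  # hypothetical paths
        "/Workspace/path/to/notebook_2",
    ],
    parameters=JOB_PARAMETERS,
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be inspected and adapted before being passed to the `databricks jobs create --json` command shown in step 3b.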
- Observe the results: Browse the bronze and silver tables in the specified catalog/schema.
- The views/opinions expressed here are our own and do not necessarily represent the views/opinions of Databricks.
- The sample code provided is intended to aid in getting started and may not be production-ready. The code comes with no guarantees, warranties, or support. Use it at your own risk.