Commit
Replace the Lucene-based filter with a fuzzy dictionary filter (#185)
* #176 A fuzzy dictionary filter to replace the Lucene dictionary filter to remove dependencies.
jzonthemtn authored Dec 26, 2024
1 parent b8596a6 commit 98547d3
Showing 113 changed files with 211,125 additions and 879 deletions.
Binary file removed data/index-data/cities.bz2
Binary file not shown.
Binary file removed data/index-data/counties.bz2
Binary file not shown.
Binary file removed data/index-data/hospitals-abbreviations.bz2
Binary file not shown.
Binary file removed data/index-data/hospitals.bz2
Binary file not shown.
Binary file removed data/index-data/names.bz2
Binary file not shown.
10 changes: 0 additions & 10 deletions data/index-data/sources

This file was deleted.

Binary file removed data/index-data/states.bz2
Binary file not shown.
Binary file removed data/index-data/surnames.bz2
Binary file not shown.
Binary file removed data/indexes/cities/_0.cfe
Binary file not shown.
Binary file removed data/indexes/cities/_0.cfs
Binary file not shown.
Binary file removed data/indexes/cities/_0.si
Binary file not shown.
Binary file removed data/indexes/cities/segments_2
Binary file not shown.
Empty file removed data/indexes/cities/write.lock
Empty file.
Binary file removed data/indexes/counties/_0.cfe
Binary file not shown.
Binary file removed data/indexes/counties/_0.cfs
Binary file not shown.
Binary file removed data/indexes/counties/_0.si
Binary file not shown.
Binary file removed data/indexes/counties/segments_1
Binary file not shown.
Binary file removed data/indexes/counties/segments_2
Binary file not shown.
Empty file removed data/indexes/counties/write.lock
Empty file.
Binary file removed data/indexes/hospital-abbreviations/_0.cfe
Binary file not shown.
Binary file removed data/indexes/hospital-abbreviations/_0.cfs
Binary file not shown.
Binary file removed data/indexes/hospital-abbreviations/_0.si
Binary file not shown.
Binary file removed data/indexes/hospital-abbreviations/segments_1
Binary file not shown.
Binary file removed data/indexes/hospital-abbreviations/segments_2
Binary file not shown.
Empty file.
Binary file removed data/indexes/hospitals/_0.cfe
Binary file not shown.
Binary file removed data/indexes/hospitals/_0.cfs
Binary file not shown.
Binary file removed data/indexes/hospitals/_0.si
Binary file not shown.
Binary file removed data/indexes/hospitals/segments_2
Binary file not shown.
Empty file removed data/indexes/hospitals/write.lock
Empty file.
Binary file removed data/indexes/names/_0.cfe
Binary file not shown.
Binary file removed data/indexes/names/_0.cfs
Binary file not shown.
Binary file removed data/indexes/names/_0.si
Binary file not shown.
Binary file removed data/indexes/names/segments_2
Binary file not shown.
Empty file removed data/indexes/names/write.lock
Empty file.
Binary file removed data/indexes/states/_0.cfe
Binary file not shown.
Binary file removed data/indexes/states/_0.cfs
Binary file not shown.
Binary file removed data/indexes/states/_0.si
Binary file not shown.
Binary file removed data/indexes/states/segments_1
Binary file not shown.
Binary file removed data/indexes/states/segments_2
Binary file not shown.
Empty file removed data/indexes/states/write.lock
Empty file.
Binary file removed data/indexes/surnames/_2.fdt
Binary file not shown.
Binary file removed data/indexes/surnames/_2.fdx
Binary file not shown.
Binary file removed data/indexes/surnames/_2.fnm
Binary file not shown.
Binary file removed data/indexes/surnames/_2.si
Binary file not shown.
Binary file removed data/indexes/surnames/_2_Lucene50_0.doc
Binary file not shown.
Binary file removed data/indexes/surnames/_2_Lucene50_0.tim
Binary file not shown.
Binary file removed data/indexes/surnames/_2_Lucene50_0.tip
Binary file not shown.
Binary file removed data/indexes/surnames/segments_2
Binary file not shown.
Empty file removed data/indexes/surnames/write.lock
Empty file.
2 changes: 1 addition & 1 deletion docs/docs/filter_policies/filters.md
Original file line number Diff line number Diff line change
@@ -21,7 +21,7 @@ Phileas uses several methods to identify phEyeFilter's names.
| [First Names](filters/persons_names/first-names.md) | Identifies common first names |
| [Surnames](filters/persons_names/surnames.md) | Identifies common surnames |
| [Person's Names (NER)](filters/persons_names/ph-eye) | Identifies full names using natural language processing analysis |
| [Physician's Names (NER)](filters/persons_names/physician-names-ner.md) | Identifies physician names using natural language processing analysis |
| [Physician's Names (NER)](filters/persons_names/physician-names) | Identifies physician names using natural language processing analysis |

### Other Filters

Original file line number Diff line number Diff line change
@@ -16,12 +16,12 @@ At least one of `terms` or `files` must be provided.
### Optional Parameters

| Parameter | Description | Default Value |
| ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------- |
| ---------------- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| --------------------- |
| `enabled` | When set to false, the filter will be disabled and not applied | `true` |
| `ignored` | A list of terms to be ignored by the filter. | None |
| `fuzzy` | When set to true, the dictionary will employ fuzzy comparisons. Use the `sensitivity` parameter to control the level of fuzziness. Setting this value to false will disable fuzziness and provide a higher level of performance. | `false` |
| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `off` meaning only exact matches, `low`, `medium`, and `high`. Only applies when `fuzzy` is set to `true`. | `medium` |
| `classification` | Used to apply an arbitrary label to the identifier, such as "patient-id", or "account-number." | `"custom-identifier"` |
| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. Only applies when `fuzzy` is set to `true`. | `medium` |
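To make the `fuzzy` and `sensitivity` parameters in the table above concrete, here is a minimal Java sketch of edit-distance matching. It is not the `FuzzyDictionaryFilter` added by this commit: the class name, the case-insensitive comparison, and the mapping of `low`/`medium`/`high` to maximum edit distances of 1/2/3 are all assumptions chosen for illustration.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not the FuzzyDictionaryFilter introduced by this commit.
final class FuzzySensitivitySketch {

    // Assumed mapping from sensitivity level to a maximum edit distance.
    enum Sensitivity {
        LOW(1), MEDIUM(2), HIGH(3);

        final int maxDistance;

        Sensitivity(int maxDistance) {
            this.maxDistance = maxDistance;
        }
    }

    // Two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                current[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            // Swap rows so that 'previous' always holds the most recently completed row.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    // True when the token is within the allowed distance of any dictionary term.
    // With fuzzy disabled the allowed distance is 0, i.e. exact (case-insensitive) matches only.
    static boolean matches(String token, List<String> terms, boolean fuzzy, Sensitivity sensitivity) {
        String needle = token.toLowerCase(Locale.ROOT);
        int allowed = fuzzy ? sensitivity.maxDistance : 0;
        for (String term : terms) {
            if (levenshtein(needle, term.toLowerCase(Locale.ROOT)) <= allowed) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, `matches("Jon", List.of("John"), true, Sensitivity.LOW)` is true because one insertion turns "jon" into "john", while the same call with `fuzzy` set to `false` accepts exact matches only. A linear scan like this is far slower than a Bloom filter lookup, which is consistent with the table's note that disabling fuzziness gives better performance.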

### Filter Strategies

1 change: 1 addition & 0 deletions docs/docs/filter_policies/filters/locations/cities.md
@@ -14,6 +14,7 @@ This filter has no required parameters.
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| `cityFilterStrategies` | A list of filter strategies. | None |
| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. | `medium` |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |
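As a rough illustration of what the `capitalized` option describes, the sketch below gates a candidate term on its first letter. The class and method names are hypothetical, and the code is not taken from the Phileas source; it only mirrors the behavior stated in the table row above.

```java
// Hypothetical sketch of the `capitalized` gate described in the documentation.
final class CapitalizedCheckSketch {

    // When 'capitalized' is false every token is eligible for redaction.
    // When it is true, only tokens whose first character is uppercase are eligible.
    static boolean eligible(String token, boolean capitalized) {
        if (!capitalized) {
            return true;
        }
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }
}
```

With `capitalized` set to `true`, "Boston" would remain a candidate while "boston" would be skipped; with the default `false`, both would be considered.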

### Filter Strategies

3 changes: 2 additions & 1 deletion docs/docs/filter_policies/filters/locations/counties.md
@@ -11,9 +11,10 @@ This filter has no required parameters.
### Optional Parameters

| Parameter | Description | Default Value |
| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
|--------------------------|---------------------------------------------------------------------------------------------------------------------------------------|---------------|
| `countyFilterStrategies` | A list of filter strategies. | None |
| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. | `medium` |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |

### Filter Strategies

Original file line number Diff line number Diff line change
@@ -14,6 +14,7 @@ This filter has no required parameters.
| -------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| `hospitalAbbreviationFilterStrategies` | A list of filter strategies. | None |
| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. | `medium` |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |

### Filter Strategies

1 change: 1 addition & 0 deletions docs/docs/filter_policies/filters/locations/hospitals.md
@@ -14,6 +14,7 @@ This filter has no required parameters.
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| `hospitalFilterStrategies` | A list of filter strategies. | None |
| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. | `medium` |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |

### Filter Strategies

Original file line number Diff line number Diff line change
@@ -15,6 +15,7 @@ This filter has no required parameters.
| `stateAbbreviationsFilterStrategies` | A list of filter strategies. | None |
| `enabled` | When set to false, the filter will be disabled and not applied | `true` |
| `ignored` | A list of terms to be ignored by the filter. | None |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |

### Filter Strategies

1 change: 1 addition & 0 deletions docs/docs/filter_policies/filters/locations/states.md
@@ -15,6 +15,7 @@ This filter has no required parameters.
| `stateFilterStrategies` | A list of filter strategies. | None |
| `enabled` | When set to false, the filter will be disabled and not applied | `true` |
| `ignored` | A list of terms to be ignored by the filter. | None |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |

### Filter Strategies

Original file line number Diff line number Diff line change
@@ -16,6 +16,7 @@ This filter has no required parameters.
| `firstNameFilterStrategies` | A list of filter strategies. | None |
| `enabled` | When set to false, the filter will be disabled and not applied | `true` |
| `ignored` | A list of terms to be ignored by the filter. | None |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |

### Filter Strategies

Original file line number Diff line number Diff line change
@@ -15,6 +15,7 @@ This filter has no required parameters.
| `physicianNameFilterStrategies` | A list of filter strategies. | None |
| `enabled` | When set to false, the filter will be disabled and not applied | `true` |
| `ignored` | A list of terms to be ignored by the filter. | None |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |

### Filter Strategies

Original file line number Diff line number Diff line change
@@ -16,6 +16,7 @@ This filter has no required parameters.
| `surnameFilterStrategies` | A list of filter strategies. | None |
| `enabled` | When set to false, the filter will be disabled and not applied | `true` |
| `ignored` | A list of terms to be ignored by the filter. | None |
| `capitalized` | Whether or not the first letter of the term must be capitalized to be redacted. | `false` |

### Filter Strategies

5 changes: 0 additions & 5 deletions phileas-core/pom.xml
@@ -57,11 +57,6 @@
<artifactId>phileas-services-alerts</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>${lucene.version}</version>
</dependency>
<dependency>
<groupId>com.googlecode.libphonenumber</groupId>
<artifactId>libphonenumber</artifactId>
Original file line number Diff line number Diff line change
@@ -21,7 +21,7 @@
import ai.philterd.phileas.model.filter.Filter;
import ai.philterd.phileas.model.filter.FilterConfiguration;
import ai.philterd.phileas.model.filter.rules.dictionary.BloomFilterDictionaryFilter;
import ai.philterd.phileas.model.filter.rules.dictionary.LuceneDictionaryFilter;
import ai.philterd.phileas.model.filter.rules.dictionary.FuzzyDictionaryFilter;
import ai.philterd.phileas.model.policy.Policy;
import ai.philterd.phileas.model.policy.filters.CustomDictionary;
import ai.philterd.phileas.model.policy.filters.Identifier;
@@ -794,37 +794,33 @@ public List<Filter> getFiltersForPolicy(final Policy policy, final Map<String, M
.withWindowSize(phileasConfiguration.spanWindowSize())
.build();

if(customDictionary.isFuzzy()) {
// Only enable the filter if there is at least one term present.
if(!terms.isEmpty()) {

LOGGER.info("Custom fuzzy dictionary contains {} terms.", terms.size());

final SensitivityLevel sensitivityLevel = SensitivityLevel.fromName(customDictionary.getSensitivity());
final String classification = customDictionary.getClassification();
final boolean capitalized = false;

enabledFilters.add(new LuceneDictionaryFilter(FilterType.CUSTOM_DICTIONARY, filterConfiguration, sensitivityLevel,
terms, capitalized, classification, index));

} else {

final boolean capitalized = customDictionary.isCapitalized();
LOGGER.info("Custom dictionary contains {} terms.", terms.size());

// Only enable the filter if there is at least one term.
// TODO: #112 Don't use a bloom filter for a small number of terms.
if(!terms.isEmpty()) {
if(customDictionary.isFuzzy()) {

final String classification = customDictionary.getClassification();
final SensitivityLevel sensitivityLevel = SensitivityLevel.fromName(customDictionary.getSensitivity());
enabledFilters.add(new FuzzyDictionaryFilter(FilterType.CUSTOM_DICTIONARY, filterConfiguration, sensitivityLevel, terms, capitalized));

} else {

// Use a bloom filter when the dictionary is not fuzzy.
enabledFilters.add(new BloomFilterDictionaryFilter(FilterType.CUSTOM_DICTIONARY, filterConfiguration, terms, classification));

}

} else {
LOGGER.warn("Custom dictionary contains no terms and will not be enabled.");
}

index++;

}

index++;

}

} else {
@@ -833,7 +829,7 @@ public List<Filter> getFiltersForPolicy(final Policy policy, final Map<String, M

}

// Lucene dictionary filters.
// Fuzzy dictionary filters.

if(policy.getIdentifiers().hasFilter(FilterType.LOCATION_CITY) && policy.getIdentifiers().getCity().isEnabled()) {

@@ -855,7 +851,7 @@ public List<Filter> getFiltersForPolicy(final Policy policy, final Map<String, M
final SensitivityLevel sensitivityLevel = policy.getIdentifiers().getCity().getSensitivityLevel();
final boolean capitalized = policy.getIdentifiers().getCity().isCapitalized();

final Filter filter = new LuceneDictionaryFilter(FilterType.LOCATION_CITY, filterConfiguration, phileasConfiguration.indexesDirectory() + "cities", sensitivityLevel, capitalized);
final Filter filter = new FuzzyDictionaryFilter(FilterType.LOCATION_CITY, filterConfiguration, sensitivityLevel, capitalized);
enabledFilters.add(filter);
filterCache.get(policy.getName()).put(FilterType.LOCATION_CITY, filter);

@@ -883,7 +879,7 @@ public List<Filter> getFiltersForPolicy(final Policy policy, final Map<String, M
final SensitivityLevel sensitivityLevel = policy.getIdentifiers().getCounty().getSensitivityLevel();
final boolean capitalized = policy.getIdentifiers().getCounty().isCapitalized();

final Filter filter = new LuceneDictionaryFilter(FilterType.LOCATION_COUNTY, filterConfiguration, phileasConfiguration.indexesDirectory() + "counties", sensitivityLevel, capitalized);
final Filter filter = new FuzzyDictionaryFilter(FilterType.LOCATION_COUNTY, filterConfiguration, sensitivityLevel, capitalized);
enabledFilters.add(filter);
filterCache.get(policy.getName()).put(FilterType.LOCATION_COUNTY, filter);

@@ -911,7 +907,7 @@ public List<Filter> getFiltersForPolicy(final Policy policy, final Map<String, M
final SensitivityLevel sensitivityLevel = policy.getIdentifiers().getState().getSensitivityLevel();
final boolean capitalized = policy.getIdentifiers().getState().isCapitalized();

final Filter filter = new LuceneDictionaryFilter(FilterType.LOCATION_STATE, filterConfiguration, phileasConfiguration.indexesDirectory() + "states", sensitivityLevel, capitalized);
final Filter filter = new FuzzyDictionaryFilter(FilterType.LOCATION_STATE, filterConfiguration, sensitivityLevel, capitalized);
enabledFilters.add(filter);
filterCache.get(policy.getName()).put(FilterType.LOCATION_STATE, filter);

@@ -939,7 +935,7 @@ public List<Filter> getFiltersForPolicy(final Policy policy, final Map<String, M
final SensitivityLevel sensitivityLevel = policy.getIdentifiers().getHospital().getSensitivityLevel();
final boolean capitalized = policy.getIdentifiers().getHospital().isCapitalized();

final Filter filter = new LuceneDictionaryFilter(FilterType.HOSPITAL, filterConfiguration, phileasConfiguration.indexesDirectory() + "hospitals", sensitivityLevel, capitalized);
final Filter filter = new FuzzyDictionaryFilter(FilterType.HOSPITAL, filterConfiguration, sensitivityLevel, capitalized);
enabledFilters.add(filter);
filterCache.get(policy.getName()).put(FilterType.HOSPITAL, filter);

@@ -967,7 +963,7 @@ public List<Filter> getFiltersForPolicy(final Policy policy, final Map<String, M
final SensitivityLevel sensitivityLevel = policy.getIdentifiers().getHospitalAbbreviation().getSensitivityLevel();
final boolean capitalized = policy.getIdentifiers().getHospitalAbbreviation().isCapitalized();

final Filter filter = new LuceneDictionaryFilter(FilterType.HOSPITAL_ABBREVIATION, filterConfiguration, phileasConfiguration.indexesDirectory() + "hospital-abbreviations", sensitivityLevel, capitalized);
final Filter filter = new FuzzyDictionaryFilter(FilterType.HOSPITAL_ABBREVIATION, filterConfiguration, sensitivityLevel, capitalized);
enabledFilters.add(filter);
filterCache.get(policy.getName()).put(FilterType.HOSPITAL_ABBREVIATION, filter);

@@ -995,7 +991,7 @@ public List<Filter> getFiltersForPolicy(final Policy policy, final Map<String, M
final SensitivityLevel sensitivityLevel = policy.getIdentifiers().getFirstName().getSensitivityLevel();
final boolean capitalized = policy.getIdentifiers().getFirstName().isCapitalized();

final Filter filter = new LuceneDictionaryFilter(FilterType.FIRST_NAME, filterConfiguration, phileasConfiguration.indexesDirectory() + "names", sensitivityLevel, capitalized);
final Filter filter = new FuzzyDictionaryFilter(FilterType.FIRST_NAME, filterConfiguration, sensitivityLevel, capitalized);
enabledFilters.add(filter);
filterCache.get(policy.getName()).put(FilterType.FIRST_NAME, filter);

@@ -1023,7 +1019,7 @@ public List<Filter> getFiltersForPolicy(final Policy policy, final Map<String, M
final SensitivityLevel sensitivityLevel = policy.getIdentifiers().getSurname().getSensitivityLevel();
final boolean capitalized = policy.getIdentifiers().getSurname().isCapitalized();

final LuceneDictionaryFilter filter = new LuceneDictionaryFilter(FilterType.SURNAME, filterConfiguration, phileasConfiguration.indexesDirectory() + "surnames", sensitivityLevel, capitalized);
final Filter filter = new FuzzyDictionaryFilter(FilterType.SURNAME, filterConfiguration,sensitivityLevel, capitalized);
enabledFilters.add(filter);
filterCache.get(policy.getName()).put(FilterType.SURNAME, filter);
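The custom-dictionary hunk earlier in this file appears to restructure the selection so that the empty-terms check comes first and the `fuzzy` flag then picks between `FuzzyDictionaryFilter` and `BloomFilterDictionaryFilter`. The sketch below restates that decision in isolation. It is a reading of the interleaved diff lines, not code from the commit; `DictionaryFilterChoiceSketch`, `Kind`, and `choose` are hypothetical names.

```java
import java.util.Collection;
import java.util.Optional;

// Hypothetical sketch of the filter selection for a custom dictionary.
final class DictionaryFilterChoiceSketch {

    // Stand-ins for FuzzyDictionaryFilter and BloomFilterDictionaryFilter.
    enum Kind { FUZZY, BLOOM }

    // Returns no filter for an empty dictionary; otherwise a fuzzy filter
    // when fuzzy matching is requested and a Bloom filter when it is not.
    static Optional<Kind> choose(Collection<String> terms, boolean fuzzy) {
        if (terms.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(fuzzy ? Kind.FUZZY : Kind.BLOOM);
    }
}
```

Checking for an empty dictionary before branching on `fuzzy` means both filter kinds share the same guard, so a dictionary with no terms is skipped regardless of how it is configured.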


0 comments on commit 98547d3