How do you generate SSNs without dashes? #2283

Open
npatki opened this issue Nov 7, 2024 · 1 comment
Labels: question (General question about the software), under discussion (Issue is currently being discussed)

npatki commented Nov 7, 2024

Filing this question on behalf of a user from a private thread.

The auto-generated plan sees that a column is PII, but it is SSN without the dashes …. Will sdtype = ssn still work given no dashes, or is there a custom generator on our side to be developed?


npatki commented Nov 7, 2024

Assuming that this is the metadata you have for your ssn column:

"my_ssn_column": {
  "sdtype": "ssn",
  "pii": true
}

Then by default, SDV synthesizers will generate random SSN values that contain dashes, e.g. 236-57-5670. This happens because SDV uses the Faker library for PII anonymization, and Faker's SSN generator produces SSNs with dashes (see the Faker documentation).

The fix: Override the column

Luckily, you can override the anonymization method. Instead of using Faker's SSN generator, you supply a generic generator that combines 9 random digits. To do this, create a generic transformer and apply it using the synthesizer's update_transformers method.

from rdt.transformers.pii import AnonymizedFaker
from sdv.single_table import GaussianCopulaSynthesizer

# a generic generator that creates combinations of 9 digits
my_ssn_transformer = AnonymizedFaker(
    provider_name=None,
    function_name='bothify',
    function_kwargs={'text': '#########'}
)

# apply this generator to the ssn column of your synthesizer
# (metadata and data are assumed to be loaded already)
synthesizer = GaussianCopulaSynthesizer(metadata)

synthesizer.auto_assign_transformers(data)
synthesizer.update_transformers({
    'my_ssn_column': my_ssn_transformer
})

synthesizer.fit(data)

Now the synthetic data will contain SSN values without dashes: 375766167
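
If you want to confirm the format before relying on it, a small check like the one below may help. It is only a sketch using the Python standard library: `fill_digit_template` is a hypothetical stand-in that mimics what the '#########' template does (each '#' becomes a random digit) so the example runs without SDV installed, and `all_undashed_ssns` is a hypothetical helper, not part of SDV or RDT.

```python
import random
import re

# Exactly nine ASCII digits, no dashes or other separators.
SSN_NO_DASHES = re.compile(r"[0-9]{9}")


def fill_digit_template(template, rng):
    """Hypothetical stand-in for the '#########' template:
    replace each '#' with a random digit, keep other characters."""
    return "".join(
        str(rng.randint(0, 9)) if ch == "#" else ch for ch in template
    )


def all_undashed_ssns(values):
    """Return True if every value is a 9-digit string without dashes."""
    return all(SSN_NO_DASHES.fullmatch(str(v)) is not None for v in values)


rng = random.Random(0)
samples = [fill_digit_template("#########", rng) for _ in range(5)]

print(all_undashed_ssns(samples))          # values built from the template
print(all_undashed_ssns(["236-57-5670"]))  # a dashed value fails the check
```

With a fitted synthesizer, you would presumably pass the sampled column instead, for example the 'my_ssn_column' values from synthesizer.sample(num_rows=10). Keep the column as text: if it is stored as an integer, values with leading zeros would lose digits and fail the check.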
