hash id values are not detected and generated properly and are also not randomized. #2307

Open
Ilevy80 opened this issue Nov 21, 2024 · 1 comment
Labels: question, under discussion

Comments

Ilevy80 commented Nov 21, 2024

Hello,

I am trying to generate mock CSV files using real data from an existing CSV. My use case involves continuously generating these CSV files, which I later ingest into another system. Each generated CSV needs to be unique while still adhering to the patterns and structure of the original data.
Here is my code:

import pandas as pd
import numpy as np
from sdv.metadata import Metadata
from sdv.single_table import GaussianCopulaSynthesizer

np.random.seed(12)
data = pd.read_csv('customer_data.csv', sep=';')

metadata = Metadata.detect_from_dataframe(
    data=data,
    table_name='test')

synthesizer = GaussianCopulaSynthesizer(metadata)

synthesizer.fit(data)
synthetic_data = synthesizer.sample(10)

synthetic_data.to_csv('synthetic_data.csv', index=False, sep=';')

Parts of the CSV contain columns with hash-like values. For example:

TRANSID
004560009F78964B55AC1EEFA2EA073A7E21BF43
005040009F78964B55AC1EDFA2EA2758C8B2C075
005040009F78964B55AC1EDFA2EA2758C8B2C075

The problem I am facing is as follows:

  1. Hash Generation:
  • The values generated for TRANSID are not actual hashes. Instead, I get values like sdv-pii-y3j8g, sdv-pii-efvwa, etc.
  2. Reproducibility Issue:
  • The generated values for TRANSID are always the same across executions. For instance, I consistently get sdv-pii-y3j8g, sdv-pii-efvwa, etc.

I have tried several approaches, including the suggestions in this ticket, but nothing has worked so far. Additionally, I attempted to update the column with a custom regex_format like this:

metadata.update_column(
    column_name='TRANSID',
    sdtype='id',
    regex_format='[A-Fa-f0-9]{40}')

While this approach produces hash-like values, they are still identical across executions and look like this:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAd
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

My Questions:

  1. Is it possible to work with hash-like values in SDV, ensuring they follow the correct format (e.g., [A-Fa-f0-9]{40})?
  2. If yes, can SDV detect correlations between repeated hashes in the original dataset (as these hashes often represent IDs) and generate mock data with repeated hashes in the appropriate contexts?

Thank you for your help!

Ilevy80 added the new and question labels Nov 21, 2024
Ilevy80 changed the title from "HASH id values are not detected and generated properly and are also not randomized." to "hash id values are not detected and generated properly and are also not randomized." Nov 21, 2024
Contributor

srinify commented Nov 22, 2024

Hi @Ilevy80 👋

I have a few questions to better understand your requirements for synthetic data.

  • Do you want your synthetic data to mirror the exact same values in your real data (which would follow the hash pattern) or do you want new values that follow the hash pattern?
  • Can you expand more on what you mean by "correlations between repeated hashes in the original dataset"? Would you like the synthetic data to mirror the same frequencies of hash ID values in your real data? Or correlations between rows belonging to a specific hash ID and other columns? Or something else entirely?
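To make one reading of these two questions concrete (new values that follow the hash pattern, with repeats kept where the real data has repeats), here is a minimal sketch in plain Python. It is not an SDV feature and not something SDV does internally; `random_hash` and `remap_ids` are hypothetical helper names, and the uppercase-only alphabet is an assumption based on the TRANSID sample in the original post.

```python
import random
import re

# Uppercase hex, 40 characters, as in the TRANSID sample (an assumption).
HEX_40 = re.compile(r"[A-F0-9]{40}")


def random_hash(rng, length=40):
    """Return a random uppercase hex string of the given length."""
    return "".join(rng.choice("0123456789ABCDEF") for _ in range(length))


def remap_ids(real_ids, rng=None):
    """Replace each distinct real ID with a fresh random hex string.

    Repeated IDs map to the same new value, so the pattern of which
    rows share an ID is preserved while the values themselves change.
    """
    if rng is None:
        rng = random.Random()  # unseeded, so values differ between runs
    mapping = {}
    used = set()
    remapped = []
    for real_id in real_ids:
        if real_id not in mapping:
            new_id = random_hash(rng)
            while new_id in used:  # guard against an (unlikely) collision
                new_id = random_hash(rng)
            used.add(new_id)
            mapping[real_id] = new_id
        remapped.append(mapping[real_id])
    return remapped
```

Passing a seeded `random.Random(...)` makes the output repeatable; leaving `rng` unset gives different values on every run, which is the behaviour asked for in the original post.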

When using the SDV, the sdtype assigned to each column, and potentially the pre-processing transformer that the SDV uses for it, plays a significant role in how synthetic values are generated. Depending on your requirements, I can provide more directed guidance on both of these!

Re: reproducibility (your 2nd question)

Which parts of the code are being re-run each time? If you want different synthetic data from the same synthesizer, then we recommend running fit() once, optionally saving the synthesizer object to disk, and then only re-running sample() each time you want more synthetic data.

If you run fit() then sample() on every run, we don't guarantee that different synthetic data will be generated. Your best bet is to re-run only the sampling part of your code. This is the easiest to see with columns that are assigned the id sdtype, where the SDV will generate entirely new values each time (compared to the categorical sdtype, where the SDV only uses pre-existing values in your real data).
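The fit-once, sample-many workflow described above could look roughly like the sketch below. `write_batches` and the output file naming are hypothetical; the function only assumes an already-fitted synthesizer whose `sample()` returns a DataFrame, as in the original post. Whether a synthesizer that is saved right after `fit()` and reloaded in a new process continues or restarts its random sequence is not confirmed in this thread, so the sketch keeps all sampling in one process.

```python
from pathlib import Path


def write_batches(synthesizer, n_batches, rows_per_batch, out_dir):
    """Write several CSVs from ONE already-fitted synthesizer.

    fit() is deliberately not called here: only sample() is repeated.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n_batches):
        batch = synthesizer.sample(num_rows=rows_per_batch)
        path = out_dir / f"synthetic_data_{i}.csv"  # hypothetical naming
        batch.to_csv(path, index=False, sep=";")
        paths.append(path)
    return paths


# Wiring it up with the code from the original post (needs sdv installed):
#   synthesizer = GaussianCopulaSynthesizer(metadata)
#   synthesizer.fit(data)                      # once
#   synthesizer.save("synthesizer.pkl")        # optional
#   write_batches(synthesizer, 5, 10, "out")   # sample() only
```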

srinify added the under discussion label and removed the new label Nov 22, 2024
srinify self-assigned this Nov 22, 2024