hash id values are not detected and generated properly and are also not randomized. #2307

Ilevy80 · 2024-11-21T13:32:21Z

Hello,

I am trying to generate mock CSV files using real data from an existing CSV. My use case involves continuously generating these CSV files, which I later ingest into another system. Each generated CSV needs to be unique while still adhering to the patterns and structure of the original data.
Here is my code:

import pandas as pd
import numpy as np
from sdv.metadata import Metadata
from sdv.single_table import GaussianCopulaSynthesizer

np.random.seed(12)
data = pd.read_csv('customer_data.csv', sep=';')

metadata = Metadata.detect_from_dataframe(
    data=data,
    table_name='test')

synthesizer = GaussianCopulaSynthesizer(metadata)

synthesizer.fit(data)
synthetic_data = synthesizer.sample(10)

synthetic_data.to_csv('synthetic_data.csv', index=False, sep=';')

Parts of the CSV contain columns with hash-like values. For example:

TRANSID
004560009F78964B55AC1EEFA2EA073A7E21BF43
005040009F78964B55AC1EDFA2EA2758C8B2C075
005040009F78964B55AC1EDFA2EA2758C8B2C075

The problem I am facing is as follows:

Hash Generation:

The values generated for TRANSID are not actual hashes. Instead, I get values like:
sdv-pii-y3j8g, sdv-pii-efvwa, etc.

Reproducibility Issue:

The generated values for TRANSID are always the same across executions. For instance, I consistently get:
sdv-pii-y3j8g, sdv-pii-efvwa, etc.

I have tried several approaches, including the suggestions in this ticket, but nothing has worked so far. Additionally, I attempted to update the column with a custom regex_format like this:

metadata.update_column(
    column_name='TRANSID',
    sdtype='id',
    regex_format='[A-Fa-f0-9]{40}')

While this approach produces hash-like values, they are still identical across executions and look like this:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAd
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

My Questions:

Is it possible to work with hash-like values in SDV, ensuring they follow the correct format (e.g., [A-Fa-f0-9]{40})?
If yes, can SDV detect correlations between repeated hashes in the original dataset (as these hashes often represent IDs) and generate mock data with repeated hashes in the appropriate contexts?

Thank you for your help!


srinify · 2024-11-22T20:21:23Z

Hi @Ilevy80 👋

I have a few questions to better understand your requirements for synthetic data.

Do you want your synthetic data to mirror the exact same values in your real data (which would follow the hash pattern) or do you want new values that follow the hash pattern?
Can you expand more on what you mean by "correlations between repeated hashes in the original dataset"? Would you like the synthetic data to mirror the same frequencies of hash ID values in your real data? Or correlations between rows belonging to a specific hash ID and other columns? Or something else entirely?
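
To make the "same frequencies" reading concrete, here is a minimal sketch that is independent of SDV and is not an SDV feature. It assumes the goal is brand-new 40-character hex IDs where rows that shared an ID in the real data still share one afterwards. The helper name `remap_ids` is hypothetical, and the sample values are the three TRANSID rows quoted in the original post.

```python
import re
import secrets
from collections import Counter

# Same pattern the original post passes as regex_format.
HEX40 = re.compile(r"[A-Fa-f0-9]{40}")


def remap_ids(real_ids):
    """Replace each distinct ID with a fresh random 40-char hex string.

    Rows that shared an ID in the input share the new ID in the output,
    so the repeat structure (frequency of each ID) is preserved.
    """
    mapping = {}
    used = set()
    for value in real_ids:
        if value not in mapping:
            new_id = secrets.token_hex(20).upper()  # 20 bytes -> 40 hex chars
            while new_id in used:  # guard against an (extremely unlikely) collision
                new_id = secrets.token_hex(20).upper()
            used.add(new_id)
            mapping[value] = new_id
    return [mapping[v] for v in real_ids]


real = [
    "004560009F78964B55AC1EEFA2EA073A7E21BF43",
    "005040009F78964B55AC1EDFA2EA2758C8B2C075",
    "005040009F78964B55AC1EDFA2EA2758C8B2C075",
]
new = remap_ids(real)

# Compare how often each distinct ID occurs before and after remapping.
print(sorted(Counter(real).values()), sorted(Counter(new).values()))
# prints: [1, 2] [1, 2]
```

This only covers ID frequencies. Whether the synthetic rows behind a repeated ID should also resemble each other in the other columns is the second reading of the question, and this sketch does not address it.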

When using the SDV, the sdtypes and, potentially, the pre-processing transformers assigned to each column play a significant role in how synthetic values are generated. Depending on your requirements, I can provide more directed guidance on both of these!

Re: reproducibility (your 2nd question)

Which parts of the code are being re-run each time? If you want different synthetic data from the same synthesizer, then we recommend running fit() once, optionally saving the synthesizer object to disk, and then only re-running sample() each time you want more synthetic data.

If you run fit() and then sample() on every run, we don't guarantee that different synthetic data will be generated. Your best bet is to re-run only the sampling part of your code. This is easiest to see with columns that are assigned the id sdtype, where the SDV generates entirely new values each time (compared to the categorical sdtype, where the SDV only reuses values that already exist in your real data).
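
The "fit once, sample many times" workflow described above could be sketched roughly as follows. The helper `write_batches`, the output directory, and the file naming are hypothetical and not part of SDV. The only assumption about the synthesizer is that it is already fitted and exposes `sample(num_rows=...)` returning a pandas-style object with `to_csv`.

```python
from pathlib import Path


def write_batches(synthesizer, n_batches, rows_per_batch, out_dir="batches"):
    """Sample several batches from one already-fitted synthesizer.

    `synthesizer` is assumed to expose `sample(num_rows=...)` and to return
    an object with a pandas-style `to_csv` method.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(n_batches):
        # No fit() call here: each iteration only draws a new sample.
        batch = synthesizer.sample(num_rows=rows_per_batch)
        target = out_path / f"synthetic_data_{i:03d}.csv"
        batch.to_csv(target, index=False, sep=";")
        written.append(target)
    return written
```

With the code from the original post, this would be called once after `synthesizer.fit(data)`, for example `write_batches(synthesizer, n_batches=5, rows_per_batch=10)`, instead of re-running the whole script for every CSV.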

Ilevy80 added the "new" and "question" labels Nov 21, 2024

Ilevy80 changed the title ~~HASH id values are not detected and generated properly and are also not randomized.~~ hash id values are not detected and generated properly and are also not randomized. Nov 21, 2024

srinify added the "under discussion" label and removed the "new" label Nov 22, 2024

srinify self-assigned this Nov 22, 2024
