I am trying to generate mock CSV files using real data from an existing CSV. My use case involves continuously generating these CSV files, which I later ingest into another system. Each generated CSV needs to be unique while still adhering to the patterns and structure of the original data.
Here is my code:
import pandas as pd
import numpy as np
from sdv.metadata import Metadata
from sdv.single_table import GaussianCopulaSynthesizer
np.random.seed(12)
data = pd.read_csv('customer_data.csv', sep=';')
metadata = Metadata.detect_from_dataframe(
    data=data,
    table_name='test')
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(10)
synthetic_data.to_csv('synthetic_data.csv', index=False, sep=';')
Parts of the CSV contain columns with hash-like values. For example:
The values generated for TRANSID are not actual hashes. Instead, I get values like:
sdv-pii-y3j8g, sdv-pii-efvwa, etc.
Reproducibility Issue:
The generated values for TRANSID are always the same across executions. For instance, I consistently get:
sdv-pii-y3j8g, sdv-pii-efvwa, etc.
I have tried several approaches, including the suggestions in this ticket, but nothing has worked so far. I also tried updating the column with a custom regex_format. That approach does produce hash-like values, but they are still identical across executions.
Is it possible to work with hash-like values in SDV, ensuring they follow the correct format (e.g., [A-Fa-f0-9]{40})?
If yes, can SDV detect correlations between repeated hashes in the original dataset (as these hashes often represent IDs) and generate mock data with repeated hashes in the appropriate contexts?
Thank you for your help!
Ilevy80 changed the title from "HASH id values are not detected and generated properly and are also not randomized." to "hash id values are not detected and generated properly and are also not randomized." on Nov 21, 2024
I have a few questions to better understand your requirements for synthetic data.
Do you want your synthetic data to mirror the exact same values in your real data (which would follow the hash pattern) or do you want new values that follow the hash pattern?
Can you expand more on what you mean by "correlations between repeated hashes in the original dataset"? Would you like the synthetic data to mirror the same frequencies of hash ID values in your real data? Or correlations between rows belonging to a specific hash ID and other columns? Or something else entirely?
When using the SDV, the sdtype and, potentially, the pre-processing transformer assigned to each column play a significant role in how synthetic values are generated. Depending on your requirements, I can provide more directed guidance on both of these!
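To make the first question concrete, here is a minimal sketch in plain Python (not SDV, and not something proposed elsewhere in this thread) of the "new values that follow the hash pattern" option. It assumes the hash column can be treated as an opaque ID: each distinct real value is swapped for a freshly generated 40-character hex string, and repeated values reuse the same replacement, so the frequency of each ID is mirrored exactly. The names `new_hash` and `remap_hashes` and the sample values are hypothetical.

```python
import re
import secrets
from collections import Counter

# The format asked about in the issue: 40 hexadecimal characters.
HEX40 = re.compile(r"[A-Fa-f0-9]{40}")


def new_hash() -> str:
    """Return a fresh 40-character hex string (20 random bytes)."""
    return secrets.token_hex(20)


def remap_hashes(real_ids):
    """Replace each distinct real ID with a new hash, reusing it on repeats."""
    mapping = {}
    remapped = []
    for real_id in real_ids:
        if real_id not in mapping:
            mapping[real_id] = new_hash()
        remapped.append(mapping[real_id])
    return remapped


# Hypothetical stand-ins for real TRANSID values; the first one repeats.
real = ["a" * 40, "b" * 40, "a" * 40, "c" * 40, "a" * 40]
synthetic = remap_hashes(real)

# Same repeat structure as the real column, but different values.
print(Counter(real).most_common(1)[0][1], Counter(synthetic).most_common(1)[0][1])
```

Because `secrets` is not seeded, every run yields different values. Note that this only relabels existing rows; it does not model how a hash ID relates to other columns, which is the part the second question is about.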
Re: reproducibility (your 2nd question)
Which parts of the code are being re-run each time? If you want different synthetic data from the same synthesizer, then we recommend running fit() once, optionally saving the synthesizer object to disk, and then only re-running sample() each time you want more synthetic data.
If you run fit() and then sample() on every run, we don't guarantee that different synthetic data will be generated. Your best bet is to re-run only the sampling part of your code. This is easiest to see with columns that are assigned the id sdtype, where the SDV generates entirely new values each time (compared to the categorical sdtype, where the SDV only uses pre-existing values from your real data).
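As a rough sketch of that workflow: fit and save once, then call sample() repeatedly on the same object. The helpers `fit_once` and `sample_batches` are hypothetical names, written against any object that exposes `fit(data)`, `save(filepath=...)` and `sample(num_rows=...)`, which is how SDV single-table synthesizers are called. All batches are drawn from one in-memory synthesizer, because this thread does not say how a freshly loaded synthesizer's random state behaves.

```python
def fit_once(synthesizer, real_data, path):
    """Fit the synthesizer a single time and save it to disk."""
    synthesizer.fit(real_data)
    synthesizer.save(filepath=path)
    return synthesizer


def sample_batches(synthesizer, num_rows, num_batches):
    """Draw several batches from one fitted synthesizer without refitting."""
    return [synthesizer.sample(num_rows=num_rows) for _ in range(num_batches)]


# Intended use with the script from the issue (assumes `metadata` and `data`
# are defined as in that script; file names are placeholders):
#
#   from sdv.single_table import GaussianCopulaSynthesizer
#   synthesizer = fit_once(GaussianCopulaSynthesizer(metadata), data, 'synthesizer.pkl')
#   for i, batch in enumerate(sample_batches(synthesizer, 10, 3)):
#       batch.to_csv(f'synthetic_data_{i}.csv', index=False, sep=';')
```

The point of the split is that fit() runs once while sample() runs as often as new files are needed.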