What is conditional sampling?
The SDV models generate new synthetic data: synthetic rows that do not refer to the original. But sometimes you may want to fix some values.
Example: You are a college administrator. The SDV model generates all types of synthetic students, but for your project, you are only interested in science and commerce students with work experience.
Using conditional sampling, you can specify the exact, fixed values that you need. The SDV model will then synthesize the rest of the data.
How do I apply conditional sampling?
In order to use conditional sampling, you must first have an SDV model. Let's create one using the demo dataset that describes students.
Creating a model
```python
from sdv.demo import load_tabular_demo

# load data from SDV's demo datasets
metadata, student_data = load_tabular_demo('student_placements', metadata=True)
student_data.head()
```
```python
from sdv.tabular import GaussianCopula

# create a GaussianCopula model to synthesize the demo data
model = GaussianCopula(table_metadata=metadata)
model.fit(student_data)
```
Using this model, we can sample with conditions.
Applying conditions
Use a `Condition` object to specify the exact values you want. You provide a dictionary that maps column names to the exact values you want, along with the number of rows to synthesize.
```python
from sdv.sampling.tabular import Condition

# 100 science students with work experience
science_students = Condition(
    column_values={'high_spec': 'Science', 'work_experience': True}, num_rows=100)

# 200 commerce students with work experience
commerce_students = Condition(
    column_values={'high_spec': 'Commerce', 'work_experience': True}, num_rows=200)
```
When you sample from your SDV model, you can now use the sample_conditions function and pass in a list of conditions.
And that's it – the SDV will take care of the rest!
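As a minimal sketch of that call, the helper below wraps `sample_conditions` and reports how many rows came back. The helper name `sample_requested_students` is hypothetical; it assumes `model` is a fitted single-table SDV model (v0.16.x) and that `conditions` is a list of `Condition` objects like the two defined earlier.

```python
def sample_requested_students(model, conditions):
    """Sample every condition in one call and report the row count.

    Assumes `model` is a fitted single-table SDV model that exposes
    `sample_conditions`, and `conditions` is a list of Condition objects.
    """
    synthetic_data = model.sample_conditions(conditions=conditions)
    print(f'Received {len(synthetic_data)} synthetic rows')
    return synthetic_data


# Usage with the objects defined earlier (not run here):
# synthetic_data = sample_requested_students(
#     model, [science_students, commerce_students])
# synthetic_data.head()
```

With the two conditions above, a complete run should return 300 rows: 100 science students and 200 commerce students, all with work experience.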
Are the correlations between columns preserved?
Yes! When you apply conditions, the SDV models will take those values into account when generating the rest of the data.
For example, you might observe that science students have lower test scores than their counterparts. The resulting synthetic data will preserve that general correlation (while not adhering strictly to the same values).
In this example, the test score (represented by column high_perc) tends to be lower for science students. The conditionally sampled synthetic data preserves this general trend.
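To build intuition for why fixing one column shifts the others, here is a small sketch of conditioning in a two-variable Gaussian. This is textbook math, not SDV's actual implementation; it treats the conditioned column as numeric for simplicity, and all numbers are made up rather than taken from the demo dataset.

```python
import math


def conditional_normal(mean_x, mean_y, std_x, std_y, corr, x_value):
    """Mean and standard deviation of Y given X = x_value (bivariate normal)."""
    cond_mean = mean_y + corr * (std_y / std_x) * (x_value - mean_x)
    cond_std = std_y * math.sqrt(1.0 - corr ** 2)
    return cond_mean, cond_std


# Toy values: a negative correlation pulls the conditional mean of Y down
# when X is fixed above its own mean.
cond_mean, cond_std = conditional_normal(
    mean_x=0.0, mean_y=65.0, std_x=1.0, std_y=10.0, corr=-0.5, x_value=1.0)
print(cond_mean, cond_std)
```

The conditional mean moves in the direction of the correlation while the remaining spread shrinks, which is why the sampled values follow the trend without being pinned to a single number.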
How can I efficiently use conditional sampling?
Conditional sampling is currently only available for single-table models (GaussianCopula, CTGAN, CopulaGAN and TVAE). We're working to bring it to multi-table and time series models too. The conditional sampling API is the same for all models, but the efficiency varies.
Choosing the right model
The GaussianCopula model is the most efficient at conditional sampling because this feature is built directly into the core ML algorithm. All other models may be slower at conditional sampling because they use a reject sampling approach: they sample rows without any conditions and then discard the invalid rows.
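The reject sampling idea can be sketched in plain Python. This is an illustration only, not SDV's code: `sample_row` is a hypothetical stand-in for a model's unconditioned sampler, and its category list and probabilities are invented.

```python
import random


def sample_row(rng):
    """Hypothetical unconditioned sampler: one synthetic student row."""
    return {
        'high_spec': rng.choice(['Science', 'Commerce', 'Arts']),
        'work_experience': rng.random() < 0.3,
    }


def reject_sample(conditions, num_rows, rng):
    """Draw rows without conditions and keep only the ones that match.

    Loops until enough rows match, so it assumes the conditions are
    actually reachable by `sample_row`.
    """
    kept, drawn = [], 0
    while len(kept) < num_rows:
        row = sample_row(rng)
        drawn += 1
        if all(row[column] == value for column, value in conditions.items()):
            kept.append(row)
    return kept, drawn


rng = random.Random(0)
rows, drawn = reject_sample(
    {'high_spec': 'Science', 'work_experience': True}, 100, rng)
print(f'kept {len(rows)} rows after drawing {drawn}')
```

Every discarded row is wasted work, which is why this approach gets slower as the conditions get harder to satisfy.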
There are a few other instances where conditional sampling may be inefficient:
- If your model has many constraints, especially if you are conditionally sampling on a column that is involved in a constraint
- If you are conditionally sampling on a very rare value. We recommend using the GaussianCopula especially for this case.
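A rough back-of-the-envelope calculation shows why rare values are costly under reject sampling. Assuming each unconditioned row matches the condition independently with probability p, you need about n / p draws on average to collect n matching rows. The function name and the probabilities below are illustrative only.

```python
def expected_draws(num_rows, match_probability):
    """Average number of unconditioned rows needed to keep `num_rows` matches."""
    if not 0 < match_probability <= 1:
        raise ValueError('match_probability must be in (0, 1]')
    return num_rows / match_probability


# Compare a common value, an uncommon value, and a very rare value.
for probability in (0.5, 0.1, 0.001):
    print(probability, expected_draws(100, probability))
```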
Tuning the parameters
In some cases, your model may not be able to finish conditionally sampling all rows. In this case, it will return the rows it could finish and print a warning.
UserWarning: Only able to sample 150 rows for the given conditions.
If you see this warning, use the max_tries_per_batch parameter to control how many attempts the model makes to create synthetic data. The default value is 100. You can increase it to give the model a chance to finish conditional sampling, noting that this will also increase the time it takes to generate the data.
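The trade-off behind this parameter can be sketched as a bounded retry loop. The function below is a hypothetical illustration that borrows the parameter name `max_tries_per_batch`; it is not SDV's implementation, and `draw_batch` and `is_valid` are stand-ins for a model's sampler and the condition check.

```python
import random
import warnings


def sample_with_max_tries(draw_batch, is_valid, num_rows, max_tries_per_batch=100):
    """Retry until enough valid rows are found or the tries run out."""
    valid_rows = []
    for _ in range(max_tries_per_batch):
        if len(valid_rows) >= num_rows:
            break
        # Only ask for the rows that are still missing.
        candidates = draw_batch(num_rows - len(valid_rows))
        valid_rows.extend(row for row in candidates if is_valid(row))
    if len(valid_rows) < num_rows:
        warnings.warn(
            f'Only able to sample {len(valid_rows)} rows for the given conditions.')
    return valid_rows


rng = random.Random(1)
partial_rows = sample_with_max_tries(
    draw_batch=lambda n: [rng.random() for _ in range(n)],
    is_valid=lambda value: value < 0.05,  # a deliberately rare condition
    num_rows=200,
    max_tries_per_batch=3,
)
print(len(partial_rows))
```

Raising the number of tries gives the loop more chances to fill the missing rows, at the cost of more sampling time.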
Saving intermediate results
By default, the models try to sample all values for each condition at the same time. You can choose to batch results if you want to track incremental progress and be able to recover results if the program crashes. See the batch sampling discussion for more details.
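The general idea of batching can be sketched without any SDV calls: sample in chunks and append each chunk to a file, so that finished rows survive a crash. Everything here is hypothetical, including the `sample_in_batches` helper and the `fake_draw_batch` stand-in; it also assumes the sampler always returns exactly the number of rows requested.

```python
import csv
import os
import tempfile


def sample_in_batches(draw_batch, num_rows, batch_size, output_path):
    """Write rows to a CSV file one batch at a time and report progress."""
    written = 0
    writer = None
    with open(output_path, 'w', newline='') as handle:
        while written < num_rows:
            batch = draw_batch(min(batch_size, num_rows - written))
            if writer is None:
                # Create the writer once the column names are known.
                writer = csv.DictWriter(handle, fieldnames=list(batch[0].keys()))
                writer.writeheader()
            writer.writerows(batch)
            handle.flush()  # push finished rows to disk before continuing
            written += len(batch)
            print(f'{written}/{num_rows} rows saved')
    return written


def fake_draw_batch(n):
    """Stand-in for a model's sampler; returns n identical toy rows."""
    return [{'high_spec': 'Science', 'work_experience': True} for _ in range(n)]


output_path = os.path.join(tempfile.mkdtemp(), 'partial_samples.csv')
total = sample_in_batches(
    fake_draw_batch, num_rows=250, batch_size=100, output_path=output_path)
```

Smaller batches give finer-grained progress and less lost work after a failure, but add more write overhead.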
I found a bug / have a feature request. What can I do?
Please file an issue on GitHub, and be sure to include the version of the SDV you are using. Code snippets and stack traces will help us debug. Any other info about your use case (what are you trying to achieve with synthetic data?) will help us prioritize new feature requests.
You can also ask questions and connect with the SDV community by joining our Slack!
Last Updated: July 22, 2022 (SDV v0.16.0)