Creating synthetic data | Conditional sampling #741

npatki · 2022-03-22T15:26:38Z

npatki
Mar 22, 2022
Maintainer

Last Updated: July 22, 2022 (SDV v0.16.0)

What is conditional sampling?

The SDV models generate new synthetic data: rows that do not correspond to any row in the original data. Sometimes, though, you may want to fix certain values.

Example: You are a college administrator. The SDV model generates all types of synthetic students, but for your project, you are only interested in science and commerce students with work experience.

Using conditional sampling, you can specify the exact, fixed values that you need. The SDV model will then synthesize the rest of the data.

How do I apply conditional sampling?

In order to use conditional sampling, you must first have an SDV model. Let's create one using the demo dataset that describes students.

Creating a model

from sdv.demo import load_tabular_demo
 
# load data from SDV's demo datasets
metadata, student_data = load_tabular_demo('student_placements', metadata=True)
student_data.head()

from sdv.tabular import GaussianCopula

# create a GaussianCopula model to synthesize the demo data 
model = GaussianCopula(table_metadata=metadata)
model.fit(student_data)

Using this model, we can sample with conditions.

Applying conditions

Use a Condition object to specify the exact values you want. Pass in a dictionary that maps column names to their fixed values, along with the number of rows to synthesize.

from sdv.sampling.tabular import Condition

# 100 science students with work experience
science_students = Condition(
    column_values={'high_spec': 'Science', 'work_experience': True},
    num_rows=100,
)

# 200 commerce students with work experience
commerce_students = Condition(
    column_values={'high_spec': 'Commerce', 'work_experience': True},
    num_rows=200,
)

When you sample from your SDV model, you can now use the sample_conditions function and pass in a list of conditions.

all_conditions = [science_students, commerce_students]
synthetic_data = model.sample_conditions(conditions=all_conditions)

And that's it – the SDV will take care of the rest!
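If you'd like to double-check the result, you can count how many rows match each condition. The helper below is a minimal sketch and not part of the SDV: count_matching is a hypothetical name, and the toy rows are made up so the example runs without a model.

```python
def count_matching(records, column_values):
    """Count the rows whose columns equal every fixed value in column_values."""
    return sum(
        all(row.get(col) == val for col, val in column_values.items())
        for row in records
    )


# Made-up stand-in for the sampled rows (not real SDV output).
toy_records = [
    {'high_spec': 'Science', 'work_experience': True},
    {'high_spec': 'Science', 'work_experience': True},
    {'high_spec': 'Commerce', 'work_experience': True},
]

science_count = count_matching(
    toy_records, {'high_spec': 'Science', 'work_experience': True})
commerce_count = count_matching(
    toy_records, {'high_spec': 'Commerce', 'work_experience': True})
print(science_count, commerce_count)
```

With real output, you could pass synthetic_data.to_dict('records') as the records and compare each count against the num_rows you requested.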

Are the correlations between columns preserved?

Yes! When you apply conditions, the SDV models will take those values into account when generating the rest of the data.

For example, you might observe that science students have lower test scores than their counterparts. The resulting synthetic data will preserve that general correlation (while not adhering strictly to the same values).

In this example, the test score (represented by column high_perc) tends to be lower for science students. The conditionally sampled synthetic data preserves this general trend.
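One simple way to check a trend like this yourself is to compare the average high_perc per high_spec group in the real and the synthetic data. The sketch below assumes rows are available as plain dictionaries. mean_by_group is a hypothetical helper, and the toy scores are invented for illustration only.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(records, group_col, value_col):
    """Return the average of value_col for each distinct value of group_col."""
    groups = defaultdict(list)
    for row in records:
        groups[row[group_col]].append(row[value_col])
    return {group: mean(values) for group, values in groups.items()}


# Invented toy rows, not taken from the demo dataset.
toy_rows = [
    {'high_spec': 'Science', 'high_perc': 60.0},
    {'high_spec': 'Science', 'high_perc': 64.0},
    {'high_spec': 'Commerce', 'high_perc': 70.0},
    {'high_spec': 'Commerce', 'high_perc': 74.0},
]

group_means = mean_by_group(toy_rows, 'high_spec', 'high_perc')
print(group_means)
```

Running the same helper on student_data.to_dict('records') and on synthetic_data.to_dict('records') lets you compare the two sets of group averages side by side. Expect similar trends, not identical numbers.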

How can I efficiently use conditional sampling?

Conditional sampling is currently only available for single-table models (GaussianCopula, CTGAN, CopulaGAN and TVAE). We're working to bring it to multi-table and time series models too. The conditional sampling API is the same for all models, but the efficiency varies.

Choosing the right model

The GaussianCopula model is the most efficient at conditional sampling because this feature is built directly into the core ML algorithm. All other models may be slower at conditional sampling because they use a rejection sampling-based approach: they sample rows without any conditions and then discard the rows that do not match the conditions.
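To make the sample-then-discard idea concrete, here is a toy sketch in plain Python. It is not the SDV's implementation: rejection_sample and draw_toy_student are hypothetical names, and the category weights are invented. It only shows why every discarded row costs extra work.

```python
import random


def rejection_sample(draw_row, column_values, num_rows, max_draws=10_000):
    """Draw unconditioned rows and keep only those matching column_values."""
    accepted = []
    draws = 0
    while len(accepted) < num_rows and draws < max_draws:
        row = draw_row()
        draws += 1
        if all(row[col] == val for col, val in column_values.items()):
            accepted.append(row)
    return accepted, draws


rng = random.Random(0)


def draw_toy_student():
    # Hypothetical unconditioned generator; the probabilities are made up.
    return {
        'high_spec': rng.choices(
            ['Commerce', 'Science', 'Arts'], weights=[5, 4, 1])[0],
        'work_experience': rng.random() < 0.35,
    }


rows, draws = rejection_sample(
    draw_toy_student,
    {'high_spec': 'Science', 'work_experience': True},
    num_rows=20,
)
print(len(rows), 'rows kept after', draws, 'draws')
```

The number of draws is typically several times the number of rows kept, and the gap grows as the condition gets harder to satisfy.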

There are a few other instances where conditional sampling may be inefficient:

If your model has many constraints, especially if you are conditionally sampling on a column that is involved in a constraint.
If you are conditionally sampling on a very rare value. We especially recommend the GaussianCopula model in this case.
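A bit of arithmetic shows why rare values hurt approaches that discard rows. If a fraction p of unconditioned rows matches your condition, you need about n / p rows on average to keep n of them. The function below is a hypothetical back-of-the-envelope helper, not an SDV call.

```python
import math


def expected_unconditioned_rows(num_rows, match_rate):
    """Average number of unconditioned rows needed to keep num_rows matches."""
    if not 0 < match_rate <= 1:
        raise ValueError('match_rate must be in (0, 1]')
    return math.ceil(num_rows / match_rate)


print(expected_unconditioned_rows(100, 0.5))   # 200
print(expected_unconditioned_rows(100, 0.01))  # 10000
```

So a value that appears in 1% of rows takes roughly 50 times more sampling work than one that appears in half of them.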

Tuning the parameters

In some cases, your model may not be able to finish conditionally sampling all rows. When that happens, it returns the rows it could finish and prints a warning.

UserWarning: Only able to sample 150 rows for the given conditions.

If you see this warning, use the max_tries_per_batch parameter to control how many attempts the model makes to create synthetic data. The default value is 100. You can increase it to give the model a chance to finish conditional sampling, noting that this will also increase the time it takes to generate the data.

synthetic_data = model.sample_conditions(conditions=all_conditions, max_tries_per_batch=200)
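If you are not sure which value to pick, one option is to retry with progressively larger values until you get all the rows you asked for. The sketch below is an assumption about how you might wrap this, not an SDV feature: sample_until_complete is a hypothetical helper, and fake_sample is a stub with invented behaviour so the example runs without a model.

```python
def sample_until_complete(sample_fn, requested_rows,
                          tries_schedule=(100, 200, 400)):
    """Call sample_fn with increasing max-tries values until enough rows return."""
    rows = []
    used = None
    for max_tries in tries_schedule:
        rows = sample_fn(max_tries)
        used = max_tries
        if len(rows) >= requested_rows:
            break
    return rows, used


def fake_sample(max_tries):
    # Stub standing in for the model: pretends each try yields one valid row,
    # up to 300 rows in total (invented behaviour).
    return list(range(min(max_tries, 300)))


rows, used = sample_until_complete(fake_sample, requested_rows=300)
print(len(rows), used)
```

With a real model, sample_fn could be lambda tries: model.sample_conditions(conditions=all_conditions, max_tries_per_batch=tries). Keep in mind that every retry samples from scratch, so the total time adds up.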

Saving intermediate results

By default, the models try to sample all values for each condition at the same time. You can choose to batch results if you want to track incremental progress and be able to recover results if the program crashes.

synthetic_data = model.sample_conditions(conditions=all_conditions, batch_size=20)

See the batch sampling discussion for more details.
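To see why batching helps with recovery, consider this general illustration in plain Python. It does not show how the SDV implements batch_size. generate_in_batches and make_toy_batch are hypothetical names, and the rows are made up. Each finished batch is written to disk right away, so a crash loses at most the batch in progress.

```python
import csv
import os
import tempfile


def generate_in_batches(make_batch, total_rows, batch_size, output_path):
    """Generate rows batch by batch, writing each finished batch to a CSV file."""
    written = 0
    with open(output_path, 'w', newline='') as handle:
        writer = None
        while written < total_rows:
            batch = make_batch(min(batch_size, total_rows - written))
            if not batch:
                break  # nothing more could be generated
            if writer is None:
                writer = csv.DictWriter(handle, fieldnames=list(batch[0].keys()))
                writer.writeheader()
            writer.writerows(batch)
            handle.flush()  # push this batch to disk before starting the next
            written += len(batch)
    return written


def make_toy_batch(n):
    # Made-up rows standing in for one batch of synthetic data.
    return [{'high_spec': 'Science', 'work_experience': True} for _ in range(n)]


output_path = os.path.join(tempfile.mkdtemp(), 'partial_rows.csv')
written = generate_in_batches(
    make_toy_batch, total_rows=50, batch_size=20, output_path=output_path)
print(written)
```

Smaller batches mean less lost work after a crash, at the cost of more frequent writes.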

I found a bug / have a feature request. What can I do?

Please file an issue on GitHub, and be sure to include the version of the SDV you are using. Code snippets and stack traces will help us debug. Any other info about your use case (what are you trying to achieve with synthetic data?) will help us prioritize new feature requests.

You can also ask questions and connect with the SDV community by joining our Slack!
