An Open Source Project from the Data to AI Lab, at MIT
Benchmarking framework for Synthetic Data Generators
- Website: https://sdv.dev
- Documentation: https://sdv.dev/SDV
- Repository: https://github.com/sdv-dev/SDGym
- License: MIT
- Development Status: Pre-Alpha
Synthetic Data Gym (SDGym) is a framework to benchmark the performance of synthetic data generators based on SDV and SDMetrics.
SDGym is a part of The Synthetic Data Vault project.
A Synthetic Data Generator is a Python function (or method) that takes as input some data, which we call the real data, learns a model from it, and outputs new synthetic data that has the same structure and similar mathematical properties as the real one.
Please refer to the synthesizers documentation for instructions about how to implement your own Synthetic Data Generator and integrate with SDGym. You can also read about how to use the ones already included in SDGym and see how to run them.
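To make the definition above concrete, here is a minimal sketch of a function with that shape: it takes the real data, "learns" a trivial model (the values observed in each column) and returns new data with the same structure. It is an illustration only and not one of the synthesizers shipped with SDGym. The name `independent_resampler`, the `seed` parameter and the use of plain lists of row dicts instead of DataFrames are assumptions made to keep the example free of dependencies.

```python
import random


def independent_resampler(real_data, metadata=None, seed=0):
    """Toy synthesizer: resample every column independently of the others.

    real_data maps a table name to a list of row dicts (a stand-in for the
    tables a real synthesizer would receive). metadata is accepted for
    signature compatibility but is not used by this sketch.
    """
    rng = random.Random(seed)
    synthetic = {}
    for table_name, rows in real_data.items():
        # "Learn" the model: remember the values observed in each column.
        columns = list(rows[0].keys()) if rows else []
        observed = {col: [row[col] for row in rows] for col in columns}
        # Sample: draw every cell from the values seen in its own column.
        synthetic[table_name] = [
            {col: rng.choice(observed[col]) for col in columns}
            for _ in rows
        ]
    return synthetic


# Hypothetical example table, invented for this illustration.
real = {
    "patients": [
        {"age": 34, "smoker": "yes"},
        {"age": 51, "smoker": "no"},
        {"age": 47, "smoker": "no"},
    ]
}
synthetic = independent_resampler(real, seed=42)
```

The output keeps the table names, column names and row count of the input, but relationships between columns are lost because each column is sampled on its own. A useful synthesizer has to model those relationships as well.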
SDGym evaluates the performance of Synthetic Data Generators using single-table, multi-table and time series datasets stored as CSV files alongside an SDV Metadata JSON file.
Further details about the list of available datasets and how to add your own datasets to the collection can be found in the datasets documentation.
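As a rough illustration of the "CSV files alongside a metadata JSON file" layout, the sketch below writes a tiny dataset to a temporary folder and reads it back using only the standard library. The file name `metadata.json`, the `tables` / `path` / `fields` keys and the `load_dataset` helper are assumptions made for this example; the datasets documentation remains the reference for the actual format.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_dataset(folder):
    """Read metadata.json and every CSV table it lists from a dataset folder."""
    folder = Path(folder)
    with open(folder / "metadata.json") as f:
        metadata = json.load(f)
    tables = {}
    for table_name, table_meta in metadata["tables"].items():
        # Each table entry points at the CSV file that holds its rows.
        with open(folder / table_meta["path"], newline="") as f:
            tables[table_name] = list(csv.DictReader(f))
    return tables, metadata


# Build a tiny example dataset in a temporary folder and load it back.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "patients.csv").write_text("age,smoker\n34,yes\n51,no\n")
    example_metadata = {
        "tables": {
            "patients": {
                "path": "patients.csv",  # assumed key, for illustration
                "fields": {
                    "age": {"type": "numerical"},
                    "smoker": {"type": "categorical"},
                },
            }
        }
    }
    (tmp / "metadata.json").write_text(json.dumps(example_metadata))
    tables, metadata = load_dataset(tmp)
```

Note that `csv.DictReader` returns every value as a string, so a real loader would also use the field types in the metadata to convert columns.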
SDGym can be installed using the following commands:
Using pip:
pip install sdgym
Using conda:
conda install -c sdv-dev -c conda-forge sdgym
For more installation options, please visit the SDGym installation guide.
SDGym evaluates Synthetic Data Generators, which are Python functions (or classes) that take as input some data, which we call the real data, learn a model from it, and output new synthetic data that has the same structure and similar mathematical properties as the real one.
As an example, let us define a synthesizer function that applies the GaussianCopula model from SDV with a gaussian distribution:
from sdv.tabular import GaussianCopula


def gaussian_copula(real_data, metadata):
    # Fit a GaussianCopula model to the first table listed in the metadata.
    gc = GaussianCopula(default_distribution='gaussian')
    table_name = metadata.get_tables()[0]
    gc.fit(real_data[table_name])
    # Return the synthetic table under the same table name as the input.
    return {table_name: gc.sample()}
ℹ️ You can learn how to create your own synthesizer function here. |
---|
We can now try to evaluate this function on the asia and alarm datasets:
import sdgym
scores = sdgym.run(synthesizers=gaussian_copula, datasets=['asia', 'alarm'])
ℹ️ You can learn about different arguments for sdgym.run function here. |
---|
The output of the sdgym.run function will be a pd.DataFrame containing the results obtained by your synthesizer on each dataset.
synthesizer | dataset | modality | metric | score | metric_time | model_time |
---|---|---|---|---|---|---|
gaussian_copula | asia | single-table | BNLogLikelihood | -2.842690 | 2.762427 | 0.752364 |
gaussian_copula | alarm | single-table | BNLogLikelihood | -20.223178 | 7.009401 | 3.173832 |
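
Once the results are available, they can be aggregated per synthesizer. The sketch below reproduces the two rows shown above as plain records (the form that pandas' `scores.to_dict('records')` would return) and computes a mean score and a total elapsed time. The `summarize` helper and the choice of aggregates are assumptions made for illustration and are not part of SDGym.

```python
from collections import defaultdict

# The two result rows from the table above, as plain records.
rows = [
    {"synthesizer": "gaussian_copula", "dataset": "asia",
     "modality": "single-table", "metric": "BNLogLikelihood",
     "score": -2.842690, "metric_time": 2.762427, "model_time": 0.752364},
    {"synthesizer": "gaussian_copula", "dataset": "alarm",
     "modality": "single-table", "metric": "BNLogLikelihood",
     "score": -20.223178, "metric_time": 7.009401, "model_time": 3.173832},
]


def summarize(rows):
    """Mean score and total (metric + model) time per synthesizer."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["synthesizer"]].append(row)
    summary = {}
    for name, group in grouped.items():
        summary[name] = {
            "mean_score": sum(r["score"] for r in group) / len(group),
            "total_time": sum(r["metric_time"] + r["model_time"] for r in group),
        }
    return summary


summary = summarize(rows)
```

Averaging a score across datasets is only meaningful when every row uses the same metric, as is the case here; mixed metrics should be grouped by the `metric` column first.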
If you want to run the SDGym benchmark on the SDGym Synthesizers, you can directly pass the corresponding class, or a list of classes, to the sdgym.run function.
For example, if you want to run the complete benchmark suite to evaluate all the existing synthesizers you can run (:warning: this will take a lot of time to run!):
import sdgym
from sdgym.synthesizers import (
    CLBN, CopulaGAN, CTGAN, HMA1, Identity, Independent,
    MedGAN, PAR, PrivBN, SDV, TableGAN, TVAE,
    Uniform, VEEGAN)

all_synthesizers = [
    CLBN,
    CTGAN,
    CopulaGAN,
    HMA1,
    Identity,
    Independent,
    MedGAN,
    PAR,
    PrivBN,
    SDV,
    TVAE,
    TableGAN,
    Uniform,
    VEEGAN,
]
# Omitting the datasets argument runs the complete benchmark suite.
scores = sdgym.run(synthesizers=all_synthesizers)
For further details about all the arguments and possibilities that the benchmark function offers, please refer to the benchmark documentation.
- Datasets used in SDGym are detailed here.
- How to write a synthesizer is detailed here.
- How to use the benchmark function is detailed here.
- Detailed leaderboard results for all the releases are available here.