An Open Source Project from the Data to AI Lab, at MIT

Benchmarking framework for Synthetic Data Generators

Website: https://sdv.dev
Documentation: https://sdv.dev/SDV
Repository: https://github.com/sdv-dev/SDGym
License: MIT
Development Status: Pre-Alpha

Overview

Synthetic Data Gym (SDGym) is a framework to benchmark the performance of synthetic data generators based on SDV and SDMetrics.

SDGym is part of The Synthetic Data Vault project.

What is a Synthetic Data Generator?

A Synthetic Data Generator is a Python function (or method) that takes as input some data, which we call the real data, learns a model from it, and outputs new synthetic data that has the same structure and similar mathematical properties as the real one.
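To make the definition concrete, here is a minimal sketch of such a function. It is not one of the SDGym synthesizers: the name `independent_resampler` and the `seed` parameter are hypothetical, and the sketch assumes the real data arrives as a dict that maps table names to pandas DataFrames. The "model" it learns is just the set of observed values of each column, so the output keeps the structure and the per-column distributions but loses any correlation between columns.

```python
import numpy as np
import pandas as pd


def independent_resampler(real_data, metadata=None, seed=0):
    """Toy generator: resample every column independently of the others.

    Assumes ``real_data`` is a dict mapping table names to DataFrames.
    ``metadata`` is accepted but ignored in this sketch.
    """
    rng = np.random.default_rng(seed)
    synthetic = {}
    for table_name, table in real_data.items():
        columns = {}
        for column in table.columns:
            # Draw with replacement from the observed values of this column.
            # This keeps the marginal distribution but drops correlations.
            columns[column] = rng.choice(table[column].to_numpy(), size=len(table))
        synthetic[table_name] = pd.DataFrame(columns)
    return synthetic


# Hypothetical toy table used only to exercise the function.
real = {'demo': pd.DataFrame({'age': [23, 35, 47, 51], 'city': ['A', 'B', 'A', 'C']})}
synthetic = independent_resampler(real)
```

The returned value has the same table names, column names and number of rows as the input, which is the "same structure" requirement described above.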

Please refer to the synthesizers documentation for instructions about how to implement your own Synthetic Data Generator and integrate with SDGym. You can also read about how to use the ones already included in SDGym and see how to run them.

Benchmark datasets

SDGym evaluates the performance of Synthetic Data Generators using single-table, multi-table and time series datasets stored as CSV files alongside an SDV Metadata JSON file.
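The layout described above (CSV tables next to a metadata JSON file) can be illustrated with a small standard-library loader. This is only a sketch under stated assumptions: the file name `metadata.json`, the `tables` and `path` keys, and the `load_dataset` helper are hypothetical placeholders, not the actual SDV Metadata schema or an SDGym function. Refer to the datasets documentation for the real format.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_dataset(folder):
    """Read metadata.json and every CSV table it lists from ``folder``.

    The metadata layout used here is an assumption made for illustration.
    """
    folder = Path(folder)
    metadata = json.loads((folder / 'metadata.json').read_text())
    tables = {}
    for table_name, table_meta in metadata['tables'].items():
        # Each table entry points at a CSV file stored in the same folder.
        with open(folder / table_meta['path'], newline='') as csv_file:
            tables[table_name] = list(csv.DictReader(csv_file))
    return tables, metadata


# Build a tiny throwaway dataset to exercise the loader.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / 'demo.csv').write_text('age,city\n23,A\n35,B\n')
    (folder / 'metadata.json').write_text(
        json.dumps({'tables': {'demo': {'path': 'demo.csv'}}})
    )
    tables, metadata = load_dataset(folder)
```

Keeping the metadata next to the CSV files means a dataset can be moved or shared as a single folder.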

Further details about the list of available datasets and how to add your own datasets to the collection can be found in the datasets documentation.

Install

SDGym can be installed using the following commands:

Using pip:

pip install sdgym

Using conda:

conda install -c sdv-dev -c conda-forge sdgym

For more installation options, please visit the SDGym installation guide.

Usage

Benchmarking your own Synthesizer

SDGym evaluates Synthetic Data Generators, which are Python functions (or classes) that take as input some data, which we call the real data, learn a model from it, and output new synthetic data that has the same structure and similar mathematical properties as the real one.

As an example, let us define a synthesizer function that applies the GaussianCopula model from SDV with a Gaussian distribution.

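from sdv.tabular import GaussianCopula


def gaussian_copula(real_data, metadata):
    # Use a Gaussian distribution as the default for the modeled columns.
    gc = GaussianCopula(default_distribution='gaussian')
    # This example handles a single table: the first one listed in the metadata.
    table_name = metadata.get_tables()[0]
    gc.fit(real_data[table_name])
    # Return the synthetic data keyed by table name, like the input.
    return {table_name: gc.sample()}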

ℹ️ You can learn how to create your own synthesizer function in the synthesizers documentation (SYNTHESIZERS.md).

We can now try to evaluate this function on the asia and alarm datasets:

import sdgym

scores = sdgym.run(synthesizers=gaussian_copula, datasets=['asia', 'alarm'])

ℹ️ You can learn about the different arguments of the sdgym.run function in the benchmark documentation (BENCHMARK.md).

The output of the sdgym.run function will be a pd.DataFrame containing the results obtained by your synthesizer on each dataset.

synthesizer	dataset	modality	metric	score	metric_time	model_time
gaussian_copula	asia	single-table	BNLogLikelihood	-2.842690	2.762427	0.752364
gaussian_copula	alarm	single-table	BNLogLikelihood	-20.223178	7.009401	3.173832
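Because the result is a regular pd.DataFrame, it can be reshaped with standard pandas operations. The sketch below rebuilds the two example rows by hand, so it runs without executing the benchmark. The variable names `score_table` and `total_model_time` are illustrative only.

```python
import pandas as pd

# Rows copied from the example output above.
scores = pd.DataFrame({
    'synthesizer': ['gaussian_copula', 'gaussian_copula'],
    'dataset': ['asia', 'alarm'],
    'modality': ['single-table', 'single-table'],
    'metric': ['BNLogLikelihood', 'BNLogLikelihood'],
    'score': [-2.842690, -20.223178],
    'metric_time': [2.762427, 7.009401],
    'model_time': [0.752364, 3.173832],
})

# One row per synthesizer and one column per dataset.
score_table = scores.pivot_table(index='synthesizer', columns='dataset', values='score')

# Sum of the model_time column per synthesizer across all datasets.
total_model_time = scores.groupby('synthesizer')['model_time'].sum()
```

With several synthesizers in the results, the pivoted table gives a side-by-side comparison per dataset.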

Benchmarking the SDGym Synthesizers

If you want to run the SDGym benchmark on the SDGym Synthesizers you can directly pass the corresponding class, or a list of classes, to the sdgym.run function.

For example, if you want to run the complete benchmark suite to evaluate all the existing synthesizers, you can run the following (warning: this will take a long time to run):

from sdgym.synthesizers import (
    CLBN, CopulaGAN, CTGAN, HMA1, Identity, Independent,
    MedGAN, PAR, PrivBN, SDV, TableGAN, TVAE,
    Uniform, VEEGAN)

all_synthesizers = [
    CLBN,
    CTGAN,
    CopulaGAN,
    HMA1,
    Identity,
    Independent,
    MedGAN,
    PAR,
    PrivBN,
    SDV,
    TVAE,
    TableGAN,
    Uniform,
    VEEGAN,
]
scores = sdgym.run(synthesizers=all_synthesizers)

For further details about all the arguments and possibilities that the benchmark function offers, please refer to the benchmark documentation.

Additional References

Datasets used in SDGym are detailed in DATASETS.md.
How to write a synthesizer is detailed in SYNTHESIZERS.md.
How to use the benchmark function is detailed in BENCHMARK.md.
Detailed leaderboard results for all the releases are available here.

The Synthetic Data Vault

This repository is part of The Synthetic Data Vault Project.

Website: https://sdv.dev
Documentation: https://sdv.dev/SDV
