Generation of Realistic Tabular data
with pretrained Transformer-based language models
Our GReaT framework uses pretrained Transformer-based large language models to synthesize realistic tabular data. New samples can be generated in a few lines of code through an easy-to-use API. Please see our publication for more details.
The GReaT framework can be installed with pip and requires Python >= 3.9:
pip install be-great
In the example below, we show how the GReaT approach is used to generate synthetic tabular data for the California Housing dataset.
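from be_great import GReaT
from sklearn.datasets import fetch_california_housing
# Load the California Housing dataset as a pandas DataFrame
data = fetch_california_housing(as_frame=True).frame
# Fine-tune a pretrained distilgpt2 model on the table
model = GReaT(llm='distilgpt2', batch_size=32, epochs=25)
model.fit(data)
# Draw 100 synthetic rows from the fine-tuned model
synthetic_data = model.sample(n_samples=100)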
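To give a rough idea of how a language model can work with a table at all, the sketch below illustrates the kind of row-to-text encoding described in our publication: each row becomes a short sentence of "column is value" clauses in random order, and generated sentences are parsed back into rows. This is a simplified illustration using only the standard library, not the code inside be_great; the helper names `encode_row` and `decode_row` are hypothetical.

```python
import random

def encode_row(row, rng=None):
    """Serialize one row (a dict) as comma-separated "column is value" clauses."""
    rng = rng or random.Random()
    items = list(row.items())
    rng.shuffle(items)  # random feature order, so no column ordering is baked in
    return ", ".join(f"{name} is {value}" for name, value in items)

def decode_row(text):
    """Parse a sentence produced by encode_row back into a dict of strings."""
    row = {}
    for clause in text.split(", "):
        name, sep, value = clause.partition(" is ")
        if sep:  # skip malformed clauses that lack the " is " separator
            row[name] = value
    return row

# Three columns from the California Housing dataset, with example values.
row = {"MedInc": 8.3252, "HouseAge": 41.0, "AveRooms": 6.98}
sentence = encode_row(row, random.Random(0))
decoded = decode_row(sentence)
print(sentence)
print(decoded)
```

Note that the decoded values are strings and would still need to be cast back to their column types. This simple parser also assumes that values never contain ", " or " is ", which real data may violate.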
If you use GReaT, please link or cite our work:
@inproceedings{borisov2023language,
title={Language Models are Realistic Tabular Data Generators},
author={Vadim Borisov and Kathrin Sessler and Tobias Leemann and Martin Pawelczyk and Gjergji Kasneci},
booktitle={The Eleventh International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=cEygmQNOeI}
}
We sincerely thank the HuggingFace 🤗 framework.