This repository contains data transformation benchmarks for several domains: smart building, COVID-19 & machine logs, and commercial datasets.
- **Smart Building**: The smart building dataset is curated from real-world domain-specific databases, comprising 105 cases from 21 energy companies in the United States.
- **COVID-19 & Machine Log**: The COVID-19 dataset is a real-world data repository maintained by Johns Hopkins University {Link}. The machine log dataset is curated from various operating systems such as macOS, Android, etc.
- **Commercial dataset-1**: Eight groups of single-table transformations from the existing Auto-Pipeline benchmark {Link}.
- **Commercial dataset-2**: Eight groups of join transformations from the existing Auto-Pipeline benchmark {Link}.
The table below records the number of groups and the total number of cases for each dataset in the benchmark.
| Datasets | Number of Groups | Total Cases |
|---|---|---|
| Smart Building | 15 | 105 |
| COVID-19 & Machine Log | 2 | 6 |
| Commercial-1 | 8 | 8 |
| Commercial-2 | 8 | 8 |
- **Group Number** is the number of each group in the benchmark dataset, denoted as 'Group_ID'. Each group has one target schema and multiple source datasets; each source dataset must be transformed to conform to the target schema separately.
- **Target Data Name** is the name given to the transformed dataset, represented as "Target_{$Group_ID}", e.g., Target_1 is the target data name for Group 1.
- **Target Data Schema** is the expected schema of the target dataset.
- **Target Data Description** is a domain-specific explanation of the dataset.
- **Source Data Name** is the name given to the source dataset to be transformed, represented as "Source{$Group_ID}_{$Source_ID}", e.g., Source1_2 is the 2nd source dataset in Group 1.
- **Source Data Schema** is the schema of the source dataset to be transformed.
- **Source Data Description** is a domain-specific explanation of the dataset.
- **Schema Change Hints** are hints about schema changes from source to target.
- **5 Samples of Source Data** are sample values for the source schema.
- **Ground Truth SQL** is used to validate the SQL generated by large language models.
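Taken together, each benchmark case can be pictured as a record with the fields above. The sketch below is purely illustrative — the field names and sample values are hypothetical, and the actual keys produced by the repository's conversion script may differ:

```python
# Illustrative sketch of one benchmark case; the real key names and values
# may differ -- this only mirrors the fields described above.
case = {
    "group_id": 1,
    "target_data_name": "Target_1",
    "target_data_schema": ["building_id", "timestamp", "energy_kwh"],  # hypothetical columns
    "target_data_description": "Hourly energy readings per building.",
    "source_data_name": "Source1_2",
    "source_data_schema": ["bld", "time", "kwh"],  # hypothetical columns
    "source_data_description": "Raw meter export.",
    "schema_change_hints": "Rename columns; no joins needed.",
    "source_samples": [["B1", "2020-01-01 00:00", 12.5]],  # 5 rows in the real data
    "ground_truth_sql": "SELECT bld AS building_id, time AS timestamp, "
                        "kwh AS energy_kwh FROM source;",
}

print(sorted(case.keys()))
```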
We developed scripts to use the benchmark. The steps are:
- Clone the repository:

      git clone https://github.com/asu-cactus/ChatGPTwithSQLscript.git
- Navigate to the project directory:

      cd ChatGPTwithSQLscript
- Install the required packages:

      pip install -r requirements.txt
- Create a `config.py` file and put your OpenAI API key there:

      OPENAI_API_KEY = "{your_openai_api_key}"
- Pick a dataset from {'Smart Building', 'COVID-19 & Machine Log', 'Commercial dataset-1', 'Commercial dataset-2'}. To view the .xlsx files, you can download them and open them with Microsoft Excel or any other supporting tool. Save these files in a folder and change `excel_file_path` and `json_file_path` in `excel2json.py` accordingly:

      # Path to the Excel file
      excel_file_path = '<dataset_you_picked>.xlsx'
      # Path to save the JSON file
      json_file_path = '<dataset_you_picked>.json'
- Run the `excel2json.py` script to convert the .xlsx benchmark dataset to .json format:

      python excel2json.py
- Set the configurations such as `template_option`, source, and target in `run.py`. Here's an example:

      template_option = 1
      target_id, max_target_id = 26, 26
      source_id, max_source_id = 1, 1

  `target_id` and `max_target_id` specify the first and last group the script will iterate over; `source_id` and `max_source_id` specify the first and last source.
- Run the `run.py` script to start the experiment:

      python run.py
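The .xlsx-to-.json conversion in step 6 can be pictured as turning spreadsheet rows into one JSON object per row. The sketch below is not the repository's actual `excel2json.py` — it uses hard-coded rows in place of a real spreadsheet read (which would typically go through a library such as openpyxl or pandas) and only shows the shape of the conversion:

```python
import json

# Hypothetical header and rows standing in for a real .xlsx read;
# the actual benchmark columns follow the Structure of Datasets above.
header = ["Group_ID", "Target Data Name", "Source Data Name"]
rows = [
    [1, "Target_1", "Source1_1"],
    [1, "Target_1", "Source1_2"],
]

# One JSON object per spreadsheet row, keyed by the header
records = [dict(zip(header, row)) for row in rows]
json_text = json.dumps(records, indent=2)
print(json_text)
```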
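The ID settings in `run.py` imply a simple nested iteration over the selected groups and sources. Here is a sketch of that loop only — not the actual `run.py`, which would additionally load each case and obtain a SQL transformation from the language model:

```python
# Sketch of the iteration implied by the run.py settings; the real script
# does more work per (group, source) pair than collecting it in a list.
template_option = 1
target_id, max_target_id = 26, 26
source_id, max_source_id = 1, 1

pairs = []
for group in range(target_id, max_target_id + 1):
    for source in range(source_id, max_source_id + 1):
        pairs.append((group, source))  # e.g. process Source{group}_{source}

print(pairs)  # with the settings above: [(26, 1)]
```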
You can build your own dataset following the Structure of Datasets above to extend this benchmark, then execute the steps described earlier.
Note: any new dataset must follow the dataset structure explained above.