What is the proper way of mixing datasets provided by Enrico? What size should it be?
Enrico's sets: https://github.com/google-research/FLAN/tree/main/flan/v2#download
The mixture percentages: https://github.com/google-research/FLAN/blob/main/flan/v2/run_example.py
For now I use:
    import datasets

    # Load the five submixes.
    cot_submix = datasets.load_dataset('conceptofmind/cot_submix_original')
    dialog_submix = datasets.load_dataset('conceptofmind/dialog_submix_original')
    niv2_submix = datasets.load_dataset('conceptofmind/niv2_submix_original')
    flan2021_submix = datasets.load_dataset('conceptofmind/flan2021_submix_original')
    t0_submix = datasets.load_dataset('conceptofmind/t0_submix_original')

    # Split each submix by template type.
    cot_zsopt = cot_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
    cot_fsopt = cot_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')

    dialog_zsopt = dialog_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
    dialog_fsopt = dialog_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')

    niv2_zsopt = niv2_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
    niv2_fsopt = niv2_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')

    flan_zsopt = flan2021_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
    flan_fsopt = flan2021_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')
    flan_zsnoopt = flan2021_submix['train'].filter(lambda x: x['template_type'] == 'zs_noopt')
    flan_fsnoopt = flan2021_submix['train'].filter(lambda x: x['template_type'] == 'fs_noopt')

    t0_zsopt = t0_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
    t0_fsopt = t0_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')
    t0_zsnoopt = t0_submix['train'].filter(lambda x: x['template_type'] == 'zs_noopt')
    t0_fsnoopt = t0_submix['train'].filter(lambda x: x['template_type'] == 'fs_noopt')

    # One entry per (submix, template type); order must match `probabilities`.
    all_datasets = [
        flan_zsopt, flan_fsopt, flan_zsnoopt, flan_fsnoopt,  #
        t0_zsopt, t0_fsopt, t0_zsnoopt, t0_fsnoopt,  #
        niv2_zsopt, niv2_fsopt,  #
        cot_zsopt, cot_fsopt,  #
        dialog_zsopt, dialog_fsopt,
    ]

    # Each submix weight is split evenly across its template types.
    probabilities = [
        0.4/4, 0.4/4, 0.4/4, 0.4/4,  #
        0.32/4, 0.32/4, 0.32/4, 0.32/4,  #
        0.2/2, 0.2/2,  #
        0.05/2, 0.05/2,  #
        0.03/2, 0.03/2,
    ]

    flan2022_submix = datasets.interleave_datasets(
        datasets=all_datasets,
        probabilities=probabilities,
        seed=567,
        stopping_strategy='first_exhausted',
    )
    flan2022_submix.to_csv('flan2022_submix.csv')
Size of final dataset is 3699512.
Is it correct?
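One way to sanity-check the mixture before running it is sketched below. It rebuilds the per-split probabilities from the submix weights used above and confirms they sum to 1. It also gives a rough size estimate for `stopping_strategy='first_exhausted'`, on the assumption that sampling stops roughly when the split with the smallest size-to-probability ratio runs out. The helper names and the split sizes in the usage example are hypothetical, and the estimate is only an approximation, not the library's exact behavior.

```python
# Submix weights and template types as used in the post above.
SUBMIX_WEIGHTS = {
    "flan2021": (0.40, ["zs_opt", "fs_opt", "zs_noopt", "fs_noopt"]),
    "t0": (0.32, ["zs_opt", "fs_opt", "zs_noopt", "fs_noopt"]),
    "niv2": (0.20, ["zs_opt", "fs_opt"]),
    "cot": (0.05, ["zs_opt", "fs_opt"]),
    "dialog": (0.03, ["zs_opt", "fs_opt"]),
}


def per_split_probabilities(weights):
    """Split each submix weight evenly across its template types."""
    probs = {}
    for name, (weight, templates) in weights.items():
        for template in templates:
            probs[f"{name}_{template}"] = weight / len(templates)
    return probs


def estimate_first_exhausted_size(sizes, probs):
    """Rough expected row count when sampling stops at the first exhausted split.

    Assumption: a split with n rows drawn with probability p runs out after
    about n / p total draws, so the smallest ratio sets the stopping point.
    """
    return min(sizes[key] / probs[key] for key in probs)


probs = per_split_probabilities(SUBMIX_WEIGHTS)
assert abs(sum(probs.values()) - 1.0) < 1e-9, "mixture weights must sum to 1"

# Hypothetical split sizes, for illustration only.
example_sizes = {key: 1000 for key in probs}
estimated_rows = estimate_first_exhausted_size(example_sizes, probs)
print(f"{len(probs)} splits, estimated mixed size: {estimated_rows:.0f} rows")
```

Plugging in the real `len(...)` of each filtered split gives a number to compare against the reported final size; a large gap would suggest one small split is ending the interleaving early.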
@takiholadi Yes, this looks correct!
@takiholadi Do you use the output as is, or do you uniformise the prompts across the dataset?