Code for mixing Enrico sets? #81

Open
takiholadi opened this issue Jun 6, 2023 · 2 comments


@takiholadi

takiholadi commented Jun 6, 2023

What is the proper way to mix the datasets provided by Enrico, and how large should the resulting mixture be?

Enrico sets: https://github.com/google-research/FLAN/tree/main/flan/v2#download
The mixture percentage: https://github.com/google-research/FLAN/blob/main/flan/v2/run_example.py

For now I use:

```python
import datasets

# Load the five submixes from the Hugging Face Hub.
cot_submix = datasets.load_dataset('conceptofmind/cot_submix_original')
dialog_submix = datasets.load_dataset('conceptofmind/dialog_submix_original')
niv2_submix = datasets.load_dataset('conceptofmind/niv2_submix_original')
flan2021_submix = datasets.load_dataset('conceptofmind/flan2021_submix_original')
t0_submix = datasets.load_dataset('conceptofmind/t0_submix_original')

# Split each submix by template type.
# CoT, Dialog and NIv2 are filtered to the two *_opt types only.
cot_zsopt = cot_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
cot_fsopt = cot_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')

dialog_zsopt = dialog_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
dialog_fsopt = dialog_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')

niv2_zsopt = niv2_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
niv2_fsopt = niv2_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')

# Flan 2021 and T0 are split into all four template types.
flan_zsopt = flan2021_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
flan_fsopt = flan2021_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')
flan_zsnoopt = flan2021_submix['train'].filter(lambda x: x['template_type'] == 'zs_noopt')
flan_fsnoopt = flan2021_submix['train'].filter(lambda x: x['template_type'] == 'fs_noopt')

t0_zsopt = t0_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
t0_fsopt = t0_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')
t0_zsnoopt = t0_submix['train'].filter(lambda x: x['template_type'] == 'zs_noopt')
t0_fsnoopt = t0_submix['train'].filter(lambda x: x['template_type'] == 'fs_noopt')

# The order here must match the order of `probabilities` below.
all_datasets = [
    # Flan 2021
    flan_zsopt,
    flan_fsopt,
    flan_zsnoopt,
    flan_fsnoopt,
    # T0
    t0_zsopt,
    t0_fsopt,
    t0_zsnoopt,
    t0_fsnoopt,
    # NIv2
    niv2_zsopt,
    niv2_fsopt,
    # CoT
    cot_zsopt,
    cot_fsopt,
    # Dialog
    dialog_zsopt,
    dialog_fsopt,
]

# Each submix weight is split evenly across its template types.
probabilities = [
    # Flan 2021: 0.4
    0.4/4, 0.4/4, 0.4/4, 0.4/4,
    # T0: 0.32
    0.32/4, 0.32/4, 0.32/4, 0.32/4,
    # NIv2: 0.2
    0.2/2, 0.2/2,
    # CoT: 0.05
    0.05/2, 0.05/2,
    # Dialog: 0.03
    0.03/2, 0.03/2,
]

# Sample from the 14 subsets until the first one runs out of rows.
flan2022_submix = datasets.interleave_datasets(
    datasets=all_datasets,
    probabilities=probabilities,
    seed=567,
    stopping_strategy='first_exhausted',
)

flan2022_submix.to_csv('flan2022_submix.csv')
```
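
As a sanity check on the hand-written `probabilities` list, the same 14 values can be derived from the per-submix weights and then verified to sum to 1. This is only a sketch: the weights and template types are copied from the snippet above, while the names `submix_weights` and `submix_templates` are hypothetical and not part of the FLAN codebase.

```python
import math

# Submix weights as used in the snippet above (hypothetical variable names).
submix_weights = {
    "flan2021": 0.40,
    "t0": 0.32,
    "niv2": 0.20,
    "cot": 0.05,
    "dialog": 0.03,
}

# Template types kept per submix, in the same order as `all_datasets` above.
submix_templates = {
    "flan2021": ["zs_opt", "fs_opt", "zs_noopt", "fs_noopt"],
    "t0": ["zs_opt", "fs_opt", "zs_noopt", "fs_noopt"],
    "niv2": ["zs_opt", "fs_opt"],
    "cot": ["zs_opt", "fs_opt"],
    "dialog": ["zs_opt", "fs_opt"],
}

# Split each submix weight evenly across its template types.
probabilities = [
    weight / len(submix_templates[name])
    for name, weight in submix_weights.items()
    for _ in submix_templates[name]
]

# One probability per filtered subset, and together they must sum to 1.
assert len(probabilities) == 14
assert math.isclose(sum(probabilities), 1.0)
```

Deriving the list this way keeps it in step with the weights if a submix or template type is later added or dropped.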

The final dataset has 3699512 rows.

Is it correct?
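
On the size question: with `stopping_strategy='first_exhausted'`, the length of the result is not fixed in advance. It depends on which subset runs out first given its size and sampling probability. The toy simulation below illustrates this under a simplified model (draw a source by probability, consume one row, stop once any source is empty). It is not the `datasets` implementation, and the function name and toy sizes are made up for illustration.

```python
import random


def simulate_first_exhausted(lengths, probabilities, seed=0):
    """Return how many rows are drawn before the first source is used up.

    Simplified model of a 'first_exhausted' interleave; assumes every
    length is positive and every probability is greater than zero.
    """
    rng = random.Random(seed)
    remaining = list(lengths)
    sources = range(len(lengths))
    total = 0
    while True:
        # Pick a source according to the sampling probabilities.
        i = rng.choices(sources, weights=probabilities)[0]
        remaining[i] -= 1
        total += 1
        if remaining[i] == 0:
            # This source has yielded all of its rows, so sampling stops.
            return total


# Hypothetical toy sizes: two sources of 1000 rows, sampled 90% / 10%.
toy_lengths = [1000, 1000]
toy_probabilities = [0.9, 0.1]
total = simulate_first_exhausted(toy_lengths, toy_probabilities)

# Rough estimate: the source with the smallest length / probability ratio
# tends to run out first, which bounds the overall size.
estimate = min(n / p for n, p in zip(toy_lengths, toy_probabilities))
print(total, round(estimate))
```

Under this model, a subset that is small relative to its sampling probability ends the interleave early, so the final row count mainly reflects that one subset rather than the combined size of all of them.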

@shayne-longpre
Collaborator

@takiholadi Yes, this looks correct!

@vince62s

@takiholadi Do you use the output as is, or do you uniformise the prompts across the dataset?
