[FT] Enable batched dataset_filter #322

Open
chuandudx opened this issue Sep 21, 2024 · 0 comments
Labels
feature request New feature/request

chuandudx commented Sep 21, 2024

Issue encountered

When defining a custom dataset_filter in a custom LightevalTaskConfig, I wanted to specify an hf_filter that filters the dataset by language.

It seems that, by default, the examples are not processed in batches.

More specifically, I tried to implement a vectorized filtering function, which did not work unless batched=True. However, this value seems difficult to control.

My initial language filter was:

def create_language_filter(target_language):
    def language_filter(examples):
        return [language == target_language for language in examples['language']]
    return language_filter

With dataset = dataset.filter(dataset_filter, batched=False), the dataset was not actually filtered by language during testing. With dataset = dataset.filter(dataset_filter, batched=True), the filtering was successful.

Testing code is below. Maybe this is not representative of how the lighteval task runs?

from itertools import islice

# `language` and `dataset` are assumed to be defined earlier in the test script.
dataset_filter = create_language_filter(language)
dataset = dataset.filter(dataset_filter, batched=True)  # switch between False and True
for i, sample in enumerate(islice(dataset, 5)):
    print(f"\nSample {i + 1}:")
    print(f"Language: {sample['language']}")
    print(f"Text: {sample['text'][:100]}...")
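
A possible explanation for the silent no-op, sketched with plain dicts rather than the real datasets machinery: with batched=False the function receives a single example, so examples['language'] is a string and the list comprehension iterates over its characters. The result is a non-empty list, which is truthy, so every row would be kept if the caller only checks the truthiness of the return value. That the filter interprets the return value this way is an assumption here, and the sample values "en" and "fr" are made up for illustration.

```python
def create_language_filter(target_language):
    def language_filter(examples):
        return [language == target_language for language in examples["language"]]
    return language_filter


language_filter = create_language_filter("en")

# batched=True style input: a dict of columns, each holding a list of values.
batch = {"language": ["en", "fr", "en"]}
batched_mask = language_filter(batch)

# batched=False style input: a single row, so the value is a plain string.
row = {"language": "fr"}
unbatched_result = language_filter(row)

print(batched_mask)            # [True, False, True]
print(unbatched_result)        # [False, False]: one entry per character of "fr"
print(bool(unbatched_result))  # True: a non-empty list is truthy
```

If this is what happens, the "fr" row passes the filter even though no comparison in the list succeeded.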

Therefore, I modified the function as follows. However, could the evaluation be slower because the filter processes one example at a time?

def create_language_filter(target_language):
    def language_filter(examples):
        if isinstance(examples['language'], list):
            return [language == target_language for language in examples['language']]
        else:
            return examples['language'] == target_language
    return language_filter

Solution/Feature

I am wondering whether there is interest in any of the following:

  1. exposing this parameter so that it is easily configurable
  2. setting batched to True by default
  3. finding another way to run the filtering so that this isn't an issue
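
On the last point, one possible workaround that needs no change to lighteval is a small adapter that lets a vectorized filter run in per-example mode. The sketch below is hypothetical: as_per_example is not part of lighteval or datasets, and it assumes each example arrives as a dict-like mapping of column name to value. It still runs one example at a time, so it does not recover the speed of batched=True.

```python
def create_language_filter(target_language):
    # Vectorized filter: expects a batch (dict of column name -> list of values).
    def language_filter(examples):
        return [language == target_language for language in examples["language"]]
    return language_filter


def as_per_example(batched_filter):
    # Hypothetical helper: wrap a single example into a batch of size one,
    # run the vectorized filter, and return the single boolean it produces.
    def per_example_filter(example):
        batch = {column: [value] for column, value in example.items()}
        return batched_filter(batch)[0]
    return per_example_filter


dataset_filter = as_per_example(create_language_filter("en"))

# Made-up rows standing in for dataset examples.
rows = [
    {"language": "en", "text": "hello"},
    {"language": "fr", "text": "bonjour"},
]
kept = [row for row in rows if dataset_filter(row)]
print(kept)  # [{'language': 'en', 'text': 'hello'}]
```

The wrapped function returns a plain bool per example, so the same vectorized filter can be written once and reused in either mode.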

Thank you!

@chuandudx chuandudx added the feature request New feature/request label Sep 21, 2024