More specifically, I tried to implement a vectorized filtering function, which did not work unless `batched=True`; however, it seems difficult to control this value.
My initial language filter was:

```python
def create_language_filter(target_language):
    def language_filter(examples):
        return [language == target_language for language in examples['language']]
    return language_filter
```
If I run `dataset = dataset.filter(dataset_filter, batched=False)`, the dataset is actually not filtered by language during testing. When I ran `dataset = dataset.filter(dataset_filter, batched=True)`, the filtering was successful.
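The `batched=False` failure mode can be reproduced in plain Python, without the `datasets` library: in that mode the filter function receives a single example, so `examples['language']` is a string, the comprehension iterates over its characters, and the resulting non-empty list is always truthy, so every row is kept.

```python
def create_language_filter(target_language):
    def language_filter(examples):
        return [language == target_language for language in examples['language']]
    return language_filter

f = create_language_filter("en")

# batched=True style input: examples['language'] is a list of values,
# so we get one boolean per example, as intended.
print(f({"language": ["en", "fr"]}))  # [True, False]

# batched=False style input: examples['language'] is a single string, so the
# comprehension iterates over its characters; the non-empty list of booleans
# is truthy, so the example is kept regardless of its language.
print(bool(f({"language": "fr"})))  # True
```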
Testing code is below. Maybe this is not representative of how the lighteval task runs?
```python
from itertools import islice

dataset_filter = create_language_filter(language)
dataset = dataset.filter(dataset_filter, batched=True)  # switch between False and True

for i, sample in enumerate(islice(dataset, 5)):
    print(f"\nSample {i + 1}:")
    print(f"Language: {sample['language']}")
    print(f"Text: {sample['text'][:100]}...")
```
Therefore, I modified the function as follows so it handles both cases, but could the evaluation be slower due to single-example processing during filtering?
```python
def create_language_filter(target_language):
    def language_filter(examples):
        if isinstance(examples['language'], list):
            return [language == target_language for language in examples['language']]
        else:
            return examples['language'] == target_language
    return language_filter
```
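As a sanity check, here is how the two call patterns exercise the dual-mode filter, with plain dicts standing in for what `datasets` passes to the function in each mode:

```python
def create_language_filter(target_language):
    # Dual-mode filter: handles both a batch (column lists) and a single example.
    def language_filter(examples):
        if isinstance(examples['language'], list):
            return [language == target_language for language in examples['language']]
        return examples['language'] == target_language
    return language_filter

f = create_language_filter("en")

# batched=True style call: one dict holding column lists.
assert f({"language": ["en", "fr", "en"]}) == [True, False, True]

# batched=False style calls: one dict per example, one boolean back.
rows = [{"language": "en"}, {"language": "fr"}]
kept = [row for row in rows if f(row)]
assert kept == [{"language": "en"}]
```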
Solution/Feature
I am wondering if there is interest in:
- exposing this parameter so it is easily configurable,
- setting `batched` to `True` by default, or
- another way to run the filtering such that this isn't an issue?
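For the first option, here is a rough sketch of what an exposed knob could look like; all names below (`TaskConfigSketch`, `filter_batched`, `apply_filter`) are hypothetical illustrations, not lighteval's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TaskConfigSketch:
    # Hypothetical stand-ins for the task config fields discussed above.
    hf_filter: Optional[Callable] = None
    filter_batched: bool = False  # proposed knob; False matches current behavior

def apply_filter(dataset, cfg: TaskConfigSketch):
    # Apply the user's filter with the configured batching mode.
    if cfg.hf_filter is None:
        return dataset
    return dataset.filter(cfg.hf_filter, batched=cfg.filter_batched)
```

A user's task config could then opt into batched filtering explicitly while existing configs keep today's behavior.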
Thank you!
Issue encountered
When defining a custom `dataset_filter` in a custom `LightevalTaskConfig`, I wanted to specify an `hf_filter` which filters the dataset by language. It seems that by default, we do not process the examples in batches:
lighteval/src/lighteval/utils/utils.py, line 239 (commit 7295c78)