A concise summary with specific recommendations for selecting PPL sample size in multilingual datasets:
When measuring perplexity (PPL) in multilingual models, the number of samples needed per language increases with the diversity and size of the dataset. However, there are diminishing returns as the number of languages grows, particularly when languages share structural or linguistic similarities.
Benchmarks like XTREME and WMT suggest that 500-1,000 samples per language is often sufficient for accurate evaluation. This captures a representative sample of each language's linguistic features without overwhelming computational resources. As the number of languages grows, it is common to reduce the sample size per language proportionally, especially when certain languages dominate the dataset or overlap significantly in their characteristics.
In the XTREME benchmark, English uses 10,000 samples, while each of the 40+ other languages uses 1,000-2,000 samples to maintain feasibility across multilingual tasks. Similarly, WMT reduces sample sizes for lower-resource languages, scaling from several thousand for high-resource languages to a few hundred or 1,000 per language when handling many languages. Both examples demonstrate a practical approach to balancing resource usage and linguistic coverage (XTREME, WMT Papers).
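To make the sample-size question concrete, the sketch below estimates perplexity from per-sample mean negative log-likelihoods and attaches a confidence interval on the mean NLL, so you can see whether a given sample count is already precise enough. The function name `ppl_with_ci` and the normal-approximation interval are assumptions for illustration, not from any benchmark's code:

```python
import math
import statistics

def ppl_with_ci(nlls, z=1.96):
    """Perplexity from per-sample mean NLLs, with a normal-approximation
    confidence interval (default ~95%) mapped through exp().

    nlls: list of mean negative log-likelihoods, one value per sample.
    """
    n = len(nlls)
    mean_nll = statistics.fmean(nlls)
    se = statistics.stdev(nlls) / math.sqrt(n)  # standard error of the mean
    ppl = math.exp(mean_nll)
    # exp() is monotonic, so the NLL interval maps directly to a PPL interval.
    return ppl, (math.exp(mean_nll - z * se), math.exp(mean_nll + z * se))
```

If the interval is still wide after 500-1,000 samples for a given language, that language likely needs more data; if it is tight well before that, you can stop early.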
- Start with 500-1,000 samples per language: This size is commonly used in NLP tasks to balance performance and resource efficiency, ensuring that linguistic coverage is broad enough.
- Scale based on the number of languages: For datasets with many languages (e.g., 40+), reduce the per-language sample size proportionally, as benchmarks like XTREME do; our early-stopping result below suggests that on the order of 100-125 samples per language can suffice.
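The proportional-scaling recommendation above can be sketched as a small budgeting helper. The function name `samples_per_language` and the specific heuristic (hold the total sample budget roughly constant, never drop below a floor) are illustrative assumptions, not a rule from XTREME or WMT:

```python
def samples_per_language(n_languages, base=1000, floor=100):
    """Scale the per-language sample count down as the language count grows,
    keeping the total evaluation budget roughly constant.

    base:  per-language sample size for a small language set (see the
           500-1,000 recommendation above).
    floor: minimum per-language sample size, so low-resource languages
           are never evaluated on a vanishingly small sample.
    """
    total_budget = base * 10  # assumed overall budget; tune to your compute
    return max(floor, min(base, total_budget // n_languages))
```

For example, with the defaults this yields the full 1,000 samples per language for a handful of languages, but scales down toward a few hundred once the dataset covers 40+ languages.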
An early-stopping metric was designed and tested with the Salamandra 2B model for measuring the quality of quantizations by way of KL divergence. We found that, on our dataset, 48 chunks (roughly 11.5k tokens, or 125 samples per language) were sufficient to capture the variability in the dataset at 95% confidence using a Bayesian approximation.
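The early-stopping idea can be sketched as follows. This is not the Bayesian approximation used in the experiment above; it substitutes a simpler normal-approximation stopping rule on the running mean of per-chunk KL divergences, and the function name `kl_early_stop` and its thresholds are illustrative assumptions:

```python
import math
import statistics

def kl_early_stop(chunk_kls, rel_tol=0.05, z=1.96, min_chunks=8):
    """Return the first chunk count n at which the running mean KL divergence
    is estimated to within rel_tol relative error at ~95% confidence
    (normal approximation; the experiment above used a Bayesian one).

    chunk_kls: per-chunk mean KL divergence between the quantized model's
               output distribution and the reference model's.
    Returns None if no prefix of the data meets the criterion.
    """
    for n in range(min_chunks, len(chunk_kls) + 1):
        window = chunk_kls[:n]
        mean = statistics.fmean(window)
        se = statistics.stdev(window) / math.sqrt(n)
        if z * se <= rel_tol * mean:  # CI half-width within tolerance
            return n
    return None
```

With a stopping rule like this, evaluation halts as soon as the KL estimate stabilizes, which is how a budget such as 48 chunks can be justified rather than chosen by hand.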
- XTREME: A Massively Multilingual Benchmark for Evaluating Cross-lingual Generalization
- Authors: Hu et al.
- Year: 2020
- Source: arXiv
- Summary: XTREME evaluates models across many languages and scales down sample sizes to maintain feasibility while preserving coverage across languages.
- WMT: Workshop on Machine Translation Shared Tasks
- Source: WMT Papers
- Summary: WMT tasks often reduce sample sizes per language as the number of target languages grows, demonstrating that smaller samples can still yield accurate model evaluations.