- [2023.11.22] We have supported many API-based models, including Baidu, ByteDance, Huawei, and 360. Welcome to the Models section for more details.
- [2023.11.20] Thanks to helloyongyang for supporting evaluation with LightLLM as the backend. Welcome to Evaluation With LightLLM for more details.
- [2023.11.13] We are delighted to announce the release of OpenCompass v0.1.8. This version enables local loading of evaluation benchmarks, thereby eliminating the need for an internet connection. Please note that with this update, you must re-download all evaluation datasets to ensure accurate and up-to-date results.
- [2023.11.06] We have supported several API-based models, including ChatGLM Pro@Zhipu, ABAB-Chat@MiniMax, and Xunfei. Welcome to the Models section for more details.
- [2023.10.24] We release a new benchmark for evaluating LLMs' capabilities in multi-turn dialogue. Welcome to BotChat for more details.
- [2023.09.26] We update the leaderboard with Qwen, one of the best-performing open-source models currently available. Welcome to our homepage for more details.
- [2023.09.20] We update the leaderboard with InternLM-20B. Welcome to our homepage for more details.
- [2023.09.19] We update the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B. Welcome to our homepage for more details.
- [2023.09.18] We have released long context evaluation guidance.
- [2023.09.08] We update the leaderboard with Baichuan-2/Tigerbot-2/Vicuna-v1.5. Welcome to our homepage for more details.
- [2023.09.06] The Baichuan2 team adopts OpenCompass to evaluate their models systematically. We deeply appreciate the community's dedication to transparency and reproducibility in LLM evaluation.
- [2023.09.02] We have supported the evaluation of Qwen-VL in OpenCompass.
- [2023.08.25] The TigerBot team adopts OpenCompass to evaluate their models systematically. We deeply appreciate the community's dedication to transparency and reproducibility in LLM evaluation.
- [2023.08.21] Lagent has been released, a lightweight framework for building LLM-based agents. We are working with the Lagent team to support the evaluation of general tool-use capabilities. Stay tuned!
- [2023.08.18] We have supported evaluation for multi-modality learning, including MMBench, SEED-Bench, COCO-Caption, Flickr-30K, OCR-VQA, ScienceQA, and more. The leaderboard is on the way. Feel free to try multi-modality evaluation with OpenCompass!
- [2023.08.18] The dataset card is now online. New evaluation benchmarks are welcome to join OpenCompass!
- [2023.08.11] Model comparison is now online. We hope this feature offers deeper insights!
- [2023.08.11] We have supported LEval.
- [2023.08.10] OpenCompass is compatible with LMDeploy. Now you can follow this instruction to evaluate accelerated models provided by TurboMind.
- [2023.08.10] We have supported Qwen-7B and XVERSE-13B! Go to our leaderboard for more results! More models are welcome to join OpenCompass.
- [2023.08.09] Several new datasets (CMMLU, TydiQA, SQuAD2.0, DROP) have been updated on our leaderboard! More datasets are welcome to join OpenCompass.
- [2023.08.07] We have added a script for users to evaluate the inference results of MMBench-dev.
- [2023.08.05] We have supported GPT-4! Go to our leaderboard for more results! More models are welcome to join OpenCompass.
- [2023.07.27] We have supported CMMLU! More datasets are welcome to join OpenCompass.