Aussie AI
Model Evaluation
Last Updated 2 March 2025
by David Spuler, Ph.D.
Leaderboards (Model Evaluation)
- Hugging Face, 2024, Open LLM Leaderboard, https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
- LMSYS, 2024, LMSYS Chatbot Arena Leaderboard, https://chat.lmsys.org/?leaderboard https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard (Rankings derived from pairwise human votes via Elo-style ratings; a minimal Elo update sketch follows this list.)
- Hugging Face, 2024, LLM-Perf Leaderboard, https://huggingface.co/spaces/optimum/llm-perf-leaderboard
- Hugging Face, 2024, Big Code Models Leaderboard, https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
- University of California, Berkeley, 2024, Berkeley Function-Calling Leaderboard, https://gorilla.cs.berkeley.edu/leaderboard.html https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard
- Samuel J. Paech, 3 Jan 2024 (v2), EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models, https://arxiv.org/abs/2312.06281 https://github.com/EQ-bench/EQ-Bench https://eqbench.com/
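Arena-style leaderboards such as the LMSYS Chatbot Arena rank models by aggregating pairwise human votes into Elo-style ratings. The sketch below shows a plain Elo update over head-to-head votes as an illustration only; it is not the leaderboard's actual implementation, and the model names, initial ratings, and K-factor are hypothetical.

```python
# Minimal Elo-style rating sketch for pairwise model votes.
# Illustrative only -- not the LMSYS Chatbot Arena implementation.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift ratings after one head-to-head vote (winner beat loser)."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - exp_win)
    ratings[loser] -= k * (1.0 - exp_win)

# Hypothetical vote log: (winning model, losing model) pairs.
votes = [("model-a", "model-b"), ("model-b", "model-c"), ("model-a", "model-c")]
ratings = {"model-a": 1000.0, "model-b": 1000.0, "model-c": 1000.0}
for winner, loser in votes:
    update_elo(ratings, winner, loser)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # leaderboard order
```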
Benchmarks for Model Evaluation
- Sean Williams, James Huckle, 30 May 2024, Easy Problems That LLMs Get Wrong, https://arxiv.org/abs/2405.19616 Code: https://github.com/autogenai/easy-problems-that-llms-get-wrong
- Raghav Jain, Daivik Sojitra, Arkadeep Acharya, Sriparna Saha, Adam Jatowt, Sandipan Dandapat, December 2023, Do Language Models Have a Common Sense regarding Time? Revisiting Temporal Commonsense Reasoning in the Era of Large Language Models, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing https://aclanthology.org/2023.emnlp-main.418/ PDF: https://aclanthology.org/2023.emnlp-main.418.pdf
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You, 3 Jun 2024, MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures, https://arxiv.org/abs/2406.06565 (A ground-truth benchmark mixture that correlates with crowd preference rankings; a minimal benchmark-scoring sketch appears after this list.)
- Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, Bryan Perozzi, 13 Jun 2024, Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning, https://arxiv.org/abs/2406.09170
- Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese, 12 Jun 2024, MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases, https://arxiv.org/abs/2406.10290
- Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back, 16 Jul 2024, Reasoning with Large Language Models, a Survey, https://arxiv.org/abs/2407.11511
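Most of the benchmarks listed above reduce to the same basic loop: pose a fixed set of questions to the model and score its answers against references. The sketch below illustrates that loop with plain exact-match accuracy; the `generate()` callable and the toy dataset are hypothetical placeholders, and real benchmarks use more careful answer extraction and metrics.

```python
# Minimal benchmark-evaluation sketch: exact-match accuracy over a QA set.
# The generate() callable and the dataset are hypothetical placeholders.

from typing import Callable, List, Tuple

def exact_match_accuracy(
    generate: Callable[[str], str],   # LLM inference function (hypothetical)
    dataset: List[Tuple[str, str]],   # (question, reference answer) pairs
) -> float:
    """Fraction of questions where the model's answer matches the reference."""
    correct = 0
    for question, reference in dataset:
        prediction = generate(question).strip().lower()
        if prediction == reference.strip().lower():
            correct += 1
    return correct / len(dataset)

# Toy usage with a stubbed model:
toy_dataset = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]
stub_model = lambda prompt: "4" if "2 + 2" in prompt else "Paris"
print(exact_match_accuracy(stub_model, toy_dataset))  # 1.0
```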
Research on Model Evaluation
- Sean Williams, James Huckle, 30 May 2024, Easy Problems That LLMs Get Wrong, https://arxiv.org/abs/2405.19616 Code: https://github.com/autogenai/easy-problems-that-llms-get-wrong
- Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan, Xiangru Tang, Kevin A. Wang, Genta Indra Winata, François Yvon, Andy Zou, 23 May 2024, Lessons from the Trenches on Reproducible Evaluation of Language Models, https://arxiv.org/abs/2405.14782 (Model evaluation theory and practice with the lm-eval test harness tool.)
- Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
- Raghav Jain, Daivik Sojitra, Arkadeep Acharya, Sriparna Saha, Adam Jatowt, Sandipan Dandapat, December 2023, Do Language Models Have a Common Sense regarding Time? Revisiting Temporal Commonsense Reasoning in the Era of Large Language Models, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing https://aclanthology.org/2023.emnlp-main.418/ PDF: https://aclanthology.org/2023.emnlp-main.418.pdf
- Yifan Wei, Yisong Su, Huanhuan Ma, Xiaoyan Yu, Fangyu Lei, Yuanzhe Zhang, Jun Zhao, Kang Liu, 8 Oct 2023, MenatQA: A New Dataset for Testing the Temporal Comprehension and Reasoning Abilities of Large Language Models, https://arxiv.org/abs/2310.05157
- George Cybenko, Joshua Ackerman, Paul Lintilhac, 16 Apr 2024, TEL'M: Test and Evaluation of Language Models, https://arxiv.org/abs/2404.10200
- Gayathri Saranathan, Mahammad Parwez Alam, James Lim, Suparna Bhattacharya, Soon Yee Wong, Martin Foltin, Cong Xu, 2024, DELE: Data Efficient LLM Evaluation, Hewlett Packard Labs, Navigating and Addressing Data Problems for Foundation Models (DPFM) Workshop, ICLR 2024, https://openreview.net/pdf?id=I8bsxPWLNF
- Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, Yinfei Yang, 17 Mar 2024 (v2), Compressing LLMs: The Truth is Rarely Pure and Never Simple, https://arxiv.org/abs/2310.01382 Code: https://github.com/VITA-Group/llm-kick (A set of tasks to evaluate LLMs.)
- Aaditya Naik, Adam Stein, Yinjun Wu, Mayur Naik, Eric Wong, April 2024, TorchQL: A Programming Framework for Integrity Constraints in Machine Learning, Proc. ACM Program. Lang., Vol. 8, No. OOPSLA1, Article 124, PDF: https://dl.acm.org/doi/pdf/10.1145/3649841
- Tal Peretz, 15 Nov 2023, The Developer's Guide to Production-Grade LLM Apps: Advanced Techniques for Maximizing LLM Performance, https://buildingaistuff.com/p/the-developers-guide-to-production
- Jiawei Zhang, Tianyu Pang, Chao Du, Yi Ren, Bo Li, Min Lin, 22 Jan 2024, Benchmarking Large Multimodal Models against Common Corruptions, https://arxiv.org/abs/2401.11943 Code: https://github.com/sail-sg/MMCBench
- Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu, Leonidas Guibas, Dahua Lin, Gordon Wetzstein, Jan 2024, GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation, https://arxiv.org/abs/2401.04092 Code: https://github.com/3DTopia/GPTEval3D Project: https://gpteval3d.github.io/
- Lan Chu, Jan 2024, LLM Output — Evaluating, debugging, and interpreting, Towards AI, https://pub.towardsai.net/llm-output-evaluating-debugging-and-interpreting-f3bd29e7d14d
- Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You, 3 Jun 2024, MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures, https://arxiv.org/abs/2406.06565
- Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang, Seonghyeon Ye, Bill Yuchen Lin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo, 9 Jun 2024, The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models, https://arxiv.org/abs/2406.05761 Code: https://github.com/prometheus-eval/prometheus-eval/tree/main/BiGGen-Bench
- Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, Yejin Choi, 7 Jun 2024, WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild, https://arxiv.org/abs/2406.04770 Code: https://hf.co/spaces/allenai/WildBench
- Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, Bryan Perozzi, 13 Jun 2024, Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning, https://arxiv.org/abs/2406.09170
- Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese, 12 Jun 2024, MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases, https://arxiv.org/abs/2406.10290
- Tianle Li, Wei-Lin Chiang, Lisa Dunlap, May 20, 2024, Introducing Hard Prompts Category in Chatbot Arena, https://lmsys.org/blog/2024-05-17-category-hard/
- Louis Bouchard, Jun 24, 2024, LLM Evals: What, why, when and how, https://www.louisbouchard.ai/llm-evals/
- Clémentine Fourrier, May 23, 2024, Let's talk about LLM evaluation, https://huggingface.co/blog/clefourrier/llm-evaluation
- Jeffrey Ip, November 7, 2023, How to Evaluate LLM Applications: The Complete Guide, https://www.confident-ai.com/blog/how-to-evaluate-llm-applications
- Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang, 27 Jun 2024, From Efficient Multimodal Models to World Models: A Survey, https://arxiv.org/abs/2407.00118 (A survey of multimodal models with coverage of many optimization techniques.)
- Anirban Ghoshal, July 3, 2024, AWS approach to RAG evaluation could help enterprises reduce AI spending, https://www.infoworld.com/article/3715629/aws-new-approach-to-rag-evaluation-could-help-enterprises-reduce-ai-spending.html
- Tianyi Tang, Yiwen Hu, Bingqian Li, Wenyang Luo, Zijing Qin, Haoxiang Sun, Jiapeng Wang, Shiyi Xu, Xiaoxue Cheng, Geyang Guo, Han Peng, Bowen Zheng, Yiru Tang, Yingqian Min, Yushuo Chen, Jie Chen, Yuanqian Zhao, Luran Ding, Yuhao Wang, Zican Dong, Chunxuan Xia, Junyi Li, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen, 8 Jul 2024, LLMBox: A Comprehensive Library for Large Language Models, https://arxiv.org/abs/2407.05563 Code: https://github.com/RUCAIBox/LLMBox
- Jin Peng Zhou, Christian K. Belardi, Ruihan Wu, Travis Zhang, Carla P. Gomes, Wen Sun, Kilian Q. Weinberger, 8 Jul 2024, On Speeding Up Language Model Evaluation, https://arxiv.org/abs/2407.06172
- HELM, July 2024 (accessed), A holistic framework for evaluating foundation models, Stanford University, https://crfm.stanford.edu/helm/lite/latest/
- Juan Pablo Bottaro, April 25, 2024, Musings on building a Generative AI product, https://www.linkedin.com/blog/engineering/generative-ai/musings-on-building-a-generative-ai-product?_l=en_US
- Angie Boggust, Venkatesh Sivaraman, Yannick Assogba, Donghao Ren, Dominik Moritz, Fred Hohman, 6 Aug 2024, Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments, https://arxiv.org/abs/2408.03274
- Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
- Andrew Ng, Sep 2024, X post, https://x.com/AndrewYNg/status/1829190549842321758 (Dropping token prices for LLMs means developers can focus on the app layer.)
- Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg, 6 Aug 2024 (v3), RULER: What's the Real Context Size of Your Long-Context Language Models? https://arxiv.org/abs/2404.06654 https://github.com/hsiehjackson/RULER
- Lior Solomon, Sep 2024, Gen AI testing strategies and tools, https://medium.com/ai-in-grc/gen-ai-testing-strategies-and-tools-257383e5cbfb
- Michael Nuñez, September 9, 2024, LightEval: Hugging Face’s open-source solution to AI’s accountability problem, https://venturebeat.com/ai/lighteval-hugging-faces-open-source-solution-to-ais-accountability-problem/
- Michael Nuñez, September 13, 2024, Microsoft’s Windows Agent Arena: Teaching AI assistants to navigate your PC, https://venturebeat.com/ai/microsofts-windows-agent-arena-teaching-ai-assistants-to-navigate-your-pc/
- Flow AI, Sep 2024, Flow Judge: An Open Small Language Model for LLM System Evaluations, https://www.flow-ai.com/blog/flow-judge
- Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, Rohan Anil, Ethan Dyer, Siamak Shakeri, Roopali Vij, Harsh Mehta, Vinay Ramasesh, Quoc Le, Ed Chi, Yifeng Lu, Orhan Firat, Angeliki Lazaridou, Jean-Baptiste Lespiau, Nithya Attaluri, Kate Olszewska, 20 Sep 2024 (v2), Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries, https://arxiv.org/abs/2409.12640 (Long context model evaluation dataset.)
- Anthony C. Ou, Feb 2024, Large Language Model Routing with Benchmark Datasets, Master's Thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, https://dspace.mit.edu/handle/1721.1/153846
- Cameron R. Wolfe, Ph.D., Dec 02, 2024, Finetuning LLM Judges for Evaluation: The Prometheus suite, JudgeLM, PandaLM, AutoJ, and more..., https://cameronrwolfe.substack.com/p/finetuned-judge
- Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, Yiqun Liu, 10 Dec 2024 (v2), LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, https://arxiv.org/abs/2412.05579 https://github.com/CSHaitao/Awesome-LLMs-as-Judges (Survey of LLM-as-a-judge evaluation; a minimal judging sketch appears at the end of this list.)
- Liam Seymour, Basar Kutukcu, Sabur Baidya, 19 Dec 2024, Large Language Models on Small Resource-Constrained Systems: Performance Characterization, Analysis and Trade-offs, https://arxiv.org/abs/2412.15352 https://github.com/LiamS57/orin-llm-testing
- Jie He, Nan Hu, Wanqiu Long, Jiaoyan Chen, Jeff Z. Pan, 22 Dec 2024, MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge, https://arxiv.org/abs/2412.17032 https://github.com/probe2/multi-hop/ (Model evaluation of reasoning abilities.)
- Latent Space, Dec 28, 2024, The 2025 AI Engineering Reading List: We picked 50 papers/models/blogs across 10 fields in AI Eng: LLMs, Benchmarks, Prompting, RAG, Agents, CodeGen, Vision, Voice, Diffusion, Finetuning. If you're starting from scratch, start here. https://www.latent.space/p/2025-papers
- Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu, 24 Dec 2024, LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating, https://arxiv.org/abs/2412.18424
- Y Li, H Jiang, Q Wu, X Luo, S Ahn, C Zhang, AH Abdi, Dec 2024, SharedContextBench: Evaluating Long-Context Methods in KV Cache Reuse, 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024), https://neurips2024-enlsp.github.io/papers/paper_93.pdf (Evaluating model performance with KV cache compression.)
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
- Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
- Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back, 16 Jul 2024, Reasoning with Large Language Models, a Survey, https://arxiv.org/abs/2407.11511
- Lucas C. Cordeiro, Matthew L. Daggitt, Julien Girard-Satabin, Omri Isac, Taylor T. Johnson, Guy Katz, Ekaterina Komendantskaya, Augustin Lemesle, Edoardo Manino, Artjoms Šinkarovs, Haoze Wu, 10 Jan 2025, Neural Network Verification is a Programming Language Challenge, https://arxiv.org/abs/2501.05867
- Dr. Marcel Müller, Jan 2025, Why Generative-AI Apps’ Quality Often Sucks and What to Do About It: How to get from PoCs to tested high-quality applications in production, https://towardsdatascience.com/why-generative-ai-apps-quality-often-sucks-and-what-to-do-about-it-f84407f263c3
- Bharani Subramaniam, 13 February 2025, Emerging Patterns in Building GenAI Products, https://martinfowler.com/articles/gen-ai-patterns/
- Nikhil, February 26, 2025, How to Compare Two LLMs in Terms of Performance: A Comprehensive Web Guide for Evaluating and Benchmarking Language Models, https://www.marktechpost.com/2025/02/26/how-to-compare-two-llms-in-terms-of-performance-a-comprehensive-web-guide-for-evaluating-and-benchmarking-language-models/
- Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
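Several of the entries above (e.g., the LLMs-as-Judges survey and the finetuned-judge write-up) evaluate outputs by prompting a stronger LLM to grade them. The sketch below shows only the general pattern; the call_judge() hook, the rubric wording, and the 1-5 scale are hypothetical and are not taken from any of the cited systems.

```python
# Minimal LLM-as-a-judge sketch: ask a judge model to grade a response 1-5.
# call_judge() is a hypothetical hook for whatever inference API you use.

import re
from typing import Callable

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer's correctness and helpfulness on a scale of 1 to 5.
Reply with only the number."""

def judge_response(
    call_judge: Callable[[str], str],  # judge-model inference function (hypothetical)
    question: str,
    answer: str,
) -> int:
    """Return the judge's 1-5 score, or 0 if no score could be parsed."""
    reply = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 0

# Toy usage with a stubbed judge that always answers "4".
stub_judge = lambda prompt: "4"
print(judge_response(stub_judge, "What causes tides?", "Mostly the Moon's gravity."))  # 4
```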