Aussie AI

Model Evaluation

  • Last Updated 2 March, 2025
  • by David Spuler, Ph.D.

Leaderboards (Model Evaluation)

Benchmarks for Model Evaluation

  • Sean Williams, James Huckle, 30 May 2024, Easy Problems That LLMs Get Wrong, https://arxiv.org/abs/2405.19616 Code: https://github.com/autogenai/easy-problems-that-llms-get-wrong
  • Raghav Jain, Daivik Sojitra, Arkadeep Acharya, Sriparna Saha, Adam Jatowt, Sandipan Dandapat, December 2023, Do Language Models Have a Common Sense regarding Time? Revisiting Temporal Commonsense Reasoning in the Era of Large Language Models, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing https://aclanthology.org/2023.emnlp-main.418/ PDF: https://aclanthology.org/2023.emnlp-main.418.pdf
  • Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
  • Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You, 3 Jun 2024, MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures, https://arxiv.org/abs/2406.06565
  • Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, Bryan Perozzi, 13 Jun 2024, Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning, https://arxiv.org/abs/2406.09170
  • Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese, 12 Jun 2024, MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases, https://arxiv.org/abs/2406.10290
  • Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back, 16 Jul 2024, Reasoning with Large Language Models, a Survey, https://arxiv.org/abs/2407.11511

Research on Model Evaluation

More AI Research

Read more about: