Aussie AI
Limitations of LLMs
Last Updated 12 December, 2024
by David Spuler, Ph.D.
LLMs can do some amazing new things, but they also have a lot of limitations. This article is a deep dive into limitations in various categories:
- Risks and safety
- Reasoning limitations
- Computational limitations
Safety Risks and Limitations
Your average LLM has problems with:
- Inaccuracies or misinformation (wrong facts or omissions)
- Biases (of many types)
- Insensitivity (e.g. when writing eulogies)
- Gullibility (not challenging the input text)
- Hallucinations (plausible-looking made-up facts)
- Confabulation (wrongly merging two sources)
- Dangerous or harmful answers (e.g. wrong mushroom picking advice)
- Plagiarism (in its training data set)
- Paraphrasing (plagiarism-like)
- Sensitive topics (the LLM requires training on each and every one)
- Training data quality ("Garbage in, garbage out")
- Alignment (people have purpose; LLMs only have language)
- Security (e.g. "jailbreaks")
- Refusal (knowing when it should)
- Personally Identifiable Information (PII) (e.g., emails or phone numbers in training data)
- Proprietary data leakage (e.g., trade secrets in an article used in a training data set)
- Surfacing inaccurate or outdated information
- Over-confidence (it knows not what it says)
- Veneer of authority (users tend to believe the words)
- Use for nefarious purposes (e.g., by hackers)
- Transparency (of the data, of the guardrails, of how it works, etc.)
- Privacy issues (sure, but Googling online has similar issues, so this isn't as new as everyone says)
- Legal issues (copyright violations, patentability, copyrightability, and more)
- Regulatory issues (inconsistent)
- Unintended consequences
Reasoning Limitations
Let's begin with some of the limitations that have largely been solved:
- Words about words (e.g. "words", "sentences", etc.)
- Writing style, tone, reading level, etc.
- Ending responses nicely with stop tokens and max tokens (see the decoding-loop sketch after this list)
- Tool integrations (e.g. clocks, calendars, calculators)
- Cut-off date for training data sets
- Long contexts
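Stop tokens and max tokens are worth making concrete. A decoder emits one token at a time and halts either when the model produces an end-of-sequence ("stop") token, or when a hard cap on new tokens is reached. Below is a minimal Python sketch of that loop; the model is faked with a canned token stream, and the next_token function and token ids are placeholders for illustration, not any real API:

    import itertools

    EOS_TOKEN = 2          # conventional end-of-sequence token id (model-specific)
    MAX_NEW_TOKENS = 50    # the "max tokens" setting: a hard cap on output length

    # Toy stand-in for the model: emits a fixed token stream, then EOS forever.
    # In a real engine this would be one forward pass of the Transformer.
    _canned = itertools.chain([11, 42, 7, 99], itertools.repeat(EOS_TOKEN))

    def next_token(context):
        return next(_canned)

    def generate(prompt_tokens):
        output = list(prompt_tokens)
        for _ in range(MAX_NEW_TOKENS):
            token = next_token(output)
            if token == EOS_TOKEN:   # the model chose to stop: a clean ending
                break
            output.append(token)
        # If the loop exhausts MAX_NEW_TOKENS first, the reply is truncated
        # mid-sentence: the classic "didn't end nicely" failure mode.
        return output[len(prompt_tokens):]

    print(generate([1, 5, 3]))   # -> [11, 42, 7, 99]

The "largely solved" part is the first branch: modern models are well trained to emit the stop token at a natural endpoint, so the max-token cap is now mostly a cost guardrail rather than the usual way a response ends.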
Some other issues:
- Explainability
- Attribution (source citations)
- Logical reasoning
- Planning
- Probabilistic, non-deterministic methods (outputs are sampled from a probability distribution, not computed deterministically)
- Mathematical reasoning
- Banal, bland, or overly formal writing
- Math word problems
- Crosswords and other word puzzles (e.g. anagrams, alliteration)
- Repetition (e.g., if it has nothing new to add, it may repeat a prior answer, rather than admitting that)
- Specialized domains (e.g. jargon, special meanings of words)
- Prompt engineering requirements (awkward wordings! Nobody really talks like that.)
- Oversensitivity to prompt variations (and yet, sadly, prompt engineering works)
- Ambiguity (of input queries)
- Over-explaining
- Nonsense answers
- Americanisms (e.g., word spellings and implied meanings, cultural issues like "football", etc.)
- Model "drift" (decline in accuracy over time)
- Non-repeatability (same question, different answer; see the sampling sketch after this list)
- Novice assumption (not identifying a user's higher level of knowledge from words in the questions; dare I say it's a kind of "AI-splaining")
- Words and meanings are not the same thing.
- Gibberish output (usually a bug; Transformers are just C++ programs, you know)
- Lack of common sense (although I know some people like that, too)
- Lack of a "world model"
- Lack of a sense of personal context (they don't understand what it means to be a person)
- Time/temporal reasoning (the concept of things happening in sequence is tricky)
- 3D scene visualization (LLMs struggle to understand the relationship between objects in the real world)
- Sarcasm and satire (e.g. articles espousing the benefits of "eating rocks")
- Spin, biased viewpoints, and outright disinformation/deception (of source content)
- Going rogue (usually a bug, or is it?)
- Trick questions (e.g., queries that look like common online puzzles, but aren't quite the same)
- Falling back on training data (overly complex answers)
- Detecting intentional deception or other malfeasance by users
- LLMs asking follow-up questions to clarify user requests (this capability has been improving quickly)
- Not correctly prioritizing parts of the request (i.e., given multiple requests in a prompt instruction, it doesn't always automatically know which things are most important to you)
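On the non-repeatability and probabilistic-method items above: an LLM's output layer is a probability distribution over the vocabulary, and decoders typically sample from that distribution rather than always taking the top token. Here is a toy sketch of temperature sampling; the token strings and scores are made-up illustrative numbers:

    import math, random

    def softmax_with_temperature(logits, temperature=1.0):
        # Scale scores by temperature; higher temperature flattens the distribution.
        scaled = [x / temperature for x in logits]
        m = max(scaled)   # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(tokens, logits, temperature=1.0):
        probs = softmax_with_temperature(logits, temperature)
        return random.choices(tokens, weights=probs, k=1)[0]

    tokens = ["Paris", "London", "Rome"]
    logits = [3.0, 1.5, 1.0]   # made-up next-token scores for illustration

    # Greedy decoding always picks the top-scoring token: repeatable.
    print(max(zip(logits, tokens))[1])                      # Paris, every run

    # Sampled decoding is not repeatable: same scores, different draws.
    print([sample(tokens, logits, temperature=1.0) for _ in range(5)])

Setting the temperature near zero (or using greedy decoding) makes runs repeatable, at the cost of blander, less varied output, which connects to the "banal, bland" complaint above.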
Computational Limitations
There's really only one big problem with AI computation: it's slooow. Hence the need for all of those expensive GPU chips (a back-of-envelope speed calculation follows the list below). This leads to problems with:
- Expensive cloud data center execution
- AI phone execution problems (e.g., frozen phone, battery depletion, overheating)
- AI PC execution problems (big models are still too slow to run)
- Training data set requirements (they need to feed on lots of tokens)
- Environmental impact (e.g., by one estimate, a ten-fold need of extra data center electricity for AI answers compared to non-AI internet searches)
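To put rough numbers on "slooow": a common rule of thumb is that generating one token with a dense Transformer costs about 2 FLOPs per model parameter (one multiply and one add per weight). The sketch below uses that rule plus illustrative, assumed hardware figures; it is back-of-envelope, not a benchmark:

    # Back-of-envelope inference speed, using the common ~2 FLOPs per parameter
    # per generated token rule of thumb for a dense Transformer decoder.
    # All hardware numbers are illustrative assumptions, not measurements.

    params = 70e9                     # a 70B-parameter model
    flops_per_token = 2 * params      # ~140 GFLOPs for each generated token

    gpu_flops = 300e12                # assume ~300 TFLOPS of usable compute
    print(gpu_flops / flops_per_token)        # ~2143 tokens/sec if compute-bound

    # But decoding is usually memory-bound: every token re-reads all weights.
    bytes_per_param = 2               # fp16/bf16 weights
    bandwidth = 2e12                  # assume ~2 TB/s of memory bandwidth
    print(bandwidth / (params * bytes_per_param))   # ~14 tokens/sec if memory-bound

The second estimate is the telling one: token-by-token decoding must stream all the weights from memory for every token, so memory bandwidth, not raw FLOPS, usually sets the speed limit, which is why big models choke on phones and PCs.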
More Research on Limitations
Research papers that cover various other AI limitations:
- Sean Williams, James Huckle, 30 May 2024, Easy Problems That LLMs Get Wrong, https://arxiv.org/abs/2405.19616 Code: https://github.com/autogenai/easy-problems-that-llms-get-wrong
- Abdelrahman "Boda" Sadallah, Daria Kotova, Ekaterina Kochmar, 15 Mar 2024, Are LLMs Good Cryptic Crossword Solvers? https://arxiv.org/abs/2403.12094 Code: https://github.com/rdeits/cryptics
- Jonas Wallat, Adam Jatowt, Avishek Anand, March 2024, Temporal Blind Spots in Large Language Models, WSDM '24: Proceedings of the 17th ACM International Conference on Web Search and Data Mining, Pages 683–692, https://arxiv.org/abs/2401.12078, https://doi.org/10.1145/3616855.3635818, https://dl.acm.org/doi/abs/10.1145/3616855.3635818
- Juntu Zhao, Junyu Deng, Yixin Ye, Chongxuan Li, Zhijie Deng, Dequan Wang, 22 Sep 2023 (modified 11 Feb 2024), Lost in Translation: Conceptual Blind Spots in Text-to-Image Diffusion Models, ICLR 2024, https://openreview.net/forum?id=vb3O9jxTLc
- Victoria Basmov, Yoav Goldberg, Reut Tsarfaty, 11 Apr 2024 (v2), Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds, https://arxiv.org/abs/2305.14785
- Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Jonathan St. Onge, Mikaela Fudolig, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds, 23 Feb 2024 (v2), A blind spot for large language models: Supradiegetic linguistic information, https://arxiv.org/abs/2306.06794
- Michael King, July 24, 2023, Large Language Models are Extremely Bad at Creating Anagrams, https://www.techrxiv.org/doi/full/10.36227/techrxiv.23712309.v1
- George Cybenko, Joshua Ackerman, Paul Lintilhac, 16 Apr 2024, TEL'M: Test and Evaluation of Language Models, https://arxiv.org/abs/2404.10200
- Yotam Wolf, Noam Wies, Yoav Levine, Amnon Shashua, 2023, Fundamental limitations of alignment in large language models, arXiv preprint arXiv:2304.11082, https://arxiv.org/abs/2304.11082
- Mikhail Burtsev, Martin Reeves, Adam Job, 2023, The working limitations of large language models, MIT Sloan Management Review, 65(1):1–5, https://sloanreview.mit.edu/article/the-working-limitations-of-large-language-models/
- Michael O'Neill, Mark Connor, 6 Jul 2023, Amplifying Limitations, Harms and Risks of Large Language Models, https://arxiv.org/abs/2307.04821
- Karl, May 10, 2023, Large Language Models: Reasoning Capabilities and Limitations, https://medium.com/@glovguy/large-language-models-reasoning-capabilities-and-limitations-951cee0ac642
- The PyCoach Apr 20, 2024, The False Promises of AI: How tech companies are fooling us https://medium.com/artificial-corner/the-false-promises-of-ai-fe23124e0fb9
- Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao, 5 Feb 2024. A Survey on Transformer Compression. https://arxiv.org/abs/2402.05964 (Model compression survey paper with focus on pruning, quantization, knowledge distillation, and efficient architecture design.)
- Bill Doerrfeld, Feb 6, 2024, Does Using AI Assistants Lead to Lower Code Quality? https://devops.com/does-using-ai-assistants-lead-to-lower-code-quality/
- Piotr Wojciech Mirowski, Juliette Love, Kory W. Mathewson, Shakir Mohamed, 3 Jun 2024 (v2), A Robot Walks into a Bar: Can Language Models Serve as Creativity Support Tools for Comedy? An Evaluation of LLMs' Humour Alignment with Comedians, https://arxiv.org/abs/2405.20956 (The unfunny fact that AI is bad at humor.)
- Rafe Brena, May 24, 2024, 3 Key Differences Between Human and Machine Intelligence You Need to Know: AI is an alien intelligence https://pub.towardsai.net/3-key-differences-between-human-and-machine-intelligence-you-need-to-know-7a34dcee2cd3 (Good article about how LLMs don't have "emotions" or "intelligence" and they don't "pause".)
- Amanda Silberling, August 27, 2024, Why AI can’t spell ‘strawberry’, https://techcrunch.com/2024/08/27/why-ai-cant-spell-strawberry/
- Kyle Wiggers, July 6, 2024, Tokens are a big reason today’s generative AI falls short, https://techcrunch.com/2024/07/06/tokens-are-a-big-reason-todays-generative-ai-falls-short/
- Xinyi Hou, Yanjie Zhao, Haoyu Wang, 3 Aug 2024, Voices from the Frontier: A Comprehensive Analysis of the OpenAI Developer Forum, https://arxiv.org/abs/2408.01687
- Andrew Blair-Stanek, Nils Holzenberger, Benjamin Van Durme, 7 Feb 2024 (v2), OpenAI Cribbed Our Tax Example, But Can GPT-4 Really Do Tax? https://arxiv.org/abs/2309.09992
- Radhika Rajkumar, Sept. 6, 2024, What AI can't do, digital twins, and swiveling laptop screens, https://www.zdnet.com/article/what-ai-cant-do-digital-twins-and-swiveling-laptop-screens/
- Victor Tangermann, Sep 13, 2024, OpenAI's New "Strawberry" AI Is Still Making Idiotic Mistakes, https://futurism.com/openai-strawberry-o1-mistakes
- Michael Nuñez, November 11, 2024, AI’s math problem: FrontierMath benchmark shows how far technology still has to go, https://venturebeat.com/ai/ais-math-problem-frontiermath-benchmark-shows-how-far-technology-still-has-to-go/
- Dynomight, Nov 2024, Something weird is happening with LLMs and chess, https://dynomight.net/chess/
- Evan Doyle, Nov 14, 2024, AI Makes Tech Debt More Expensive, https://www.gauge.sh/blog/ai-makes-tech-debt-more-expensive
- H Xu, Z Bi, H Tseng, X Song, P Feng, From Transformers to the Future: An In-Depth Exploration of Modern Language Model Architectures, https://osf.io/n8r5j/download
- Yekun Ke, Yingyu Liang, Zhenmei Shi, Zhao Song, Chiwun Yang, 8 Dec 2024, Curse of Attention: A Kernel-Based Perspective for Why Transformers Fail to Generalize on Time Series Forecasting and Beyond, https://arxiv.org/abs/2412.06061
- Sam Liberty, Oct 15, 2024, Why AI Can’t Crack the NYT Connections Puzzle (Yet), https://medium.com/design-bootcamp/why-ai-cant-crack-the-nyt-connections-puzzle-yet-7bd3e00b4087