Aussie AI

Limitations of LLMs

  • Last Updated 12 December 2024
  • by David Spuler, Ph.D.

LLMs can do some amazing new things, but they also have a lot of limitations. This article is a deep dive into those limitations across three categories:

  • Risks and safety
  • Reasoning limitations
  • Computational limitations

Safety Risks and Limitations

Your average LLM has problems with:

  • Inaccuracies or misinformation (wrong facts or omissions)
  • Biases (of many types)
  • Insensitivity (e.g. when writing eulogies)
  • Gullibility (not challenging the input text)
  • Hallucinations (plausible-looking made-up facts)
  • Confabulation (wrongly merging two sources)
  • Dangerous or harmful answers (e.g. wrong mushroom picking advice)
  • Plagiarism (in its training data set)
  • Paraphrasing (plagiarism-like)
  • Sensitive topics (the LLM requires training on each and every one)
  • Training data quality ("Garbage in, garbage out")
  • Alignment (people have purpose; LLMs only have language)
  • Security (e.g. "jailbreaks")
  • Refusal (knowing when it should decline to answer)
  • Personally Identifiable Information (PII) (e.g., emails or phone numbers in training data; see the scanning sketch after this list)
  • Proprietary data leakage (e.g., trade secrets in an article used in a training data set)
  • Surfacing inaccurate or outdated information
  • Over-confidence (it knows not what it says)
  • Veneer of authority (users tend to believe the words)
  • Use for nefarious purposes (e.g., by hackers)
  • Transparency (of the data, of the guardrails, of how it works, etc.)
  • Privacy issues (sure, but Googling online has similar issues, so this isn't as new as everyone says)
  • Legal issues (copyright violations, patentability, copyrightability, and more)
  • Regulatory issues (inconsistent rules across jurisdictions)
  • Unintended consequences
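
On the PII item above: a minimal sketch of the kind of regex scan a data pipeline might run to flag emails and phone numbers before text lands in a training set. The patterns and the flag_pii function are illustrative only; real PII detection needs much broader coverage (names, addresses, ID numbers) and usually a trained NER model.

    import re

    # Illustrative patterns only; real-world PII scanning needs far more coverage.
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

    def flag_pii(text):
        """Return (kind, match) pairs for possible PII found in the text."""
        hits = [("email", m) for m in EMAIL_RE.findall(text)]
        hits += [("phone", m) for m in PHONE_RE.findall(text)]
        return hits

    sample = "Contact Jane at jane.doe@example.com or +61 2 5550 1234."
    print(flag_pii(sample))
    # [('email', 'jane.doe@example.com'), ('phone', '+61 2 5550 1234')]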

Reasoning Limitations

Let's begin with some of the limitations that have largely been solved:

  • Words about words (e.g. "words", "sentences", etc.)
  • Writing style, tone, reading level, etc.
  • Ending responses nicely with stop tokens and max tokens (see the decoding sketch after this list)
  • Tool integrations (e.g. clocks, calendars, calculators)
  • Cut-off date for training data sets
  • Long contexts
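
On stop tokens and max tokens: a minimal decoding-loop sketch showing the two ways a response ends: either the model emits the stop token (a clean ending) or the max-token budget runs out (possibly mid-sentence). The ToyModel and its next_token call are stand-ins for a real forward pass, not any particular library's API.

    def decode(model, prompt_ids, stop_id, max_new_tokens=256):
        """Greedy decoding sketch: stop cleanly on the stop token,
        or truncate when the max-token budget is exhausted."""
        ids = list(prompt_ids)
        for _ in range(max_new_tokens):
            next_id = model.next_token(ids)  # stand-in for a real forward pass
            if next_id == stop_id:           # model chose to end the response
                break
            ids.append(next_id)
        return ids                           # may be cut off if the budget ran out

    class ToyModel:
        """Fake model that emits a fixed token stream, then the stop token."""
        def __init__(self, stream, stop_id):
            self.stream, self.stop_id = iter(stream), stop_id
        def next_token(self, ids):
            return next(self.stream, self.stop_id)

    STOP = 0
    print(decode(ToyModel([12, 7, 42], STOP), prompt_ids=[5, 9], stop_id=STOP))
    # [5, 9, 12, 7, 42] -- ended cleanly at the stop token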

Other issues remain open or only partly solved:

  • Explainability
  • Attribution (source citations)
  • Logical reasoning
  • Planning
  • Probabilistic, non-deterministic methods (token sampling; see the sampling sketch after this list)
  • Mathematical reasoning
  • Banal, bland, or overly formal writing
  • Math word problems
  • Crosswords and other word puzzles (e.g. anagrams, alliteration)
  • Repetition (e.g., if it has nothing new to add, it may repeat a prior answer, rather than admitting that)
  • Specialized domains (e.g. jargon, special meanings of words)
  • Prompt engineering requirements (awkward wordings! Nobody really talks like that.)
  • Oversensitivity to prompt variations (and yet, sadly, prompt engineering works)
  • Ambiguity (of input queries)
  • Over-explaining
  • Nonsense answers
  • Americanisms (e.g., word spellings and implied meanings, cultural issues like "football", etc.)
  • Model "drift" (decline in accuracy over time)
  • Non-repeatability (same question, different answer)
  • Novice assumption (not identifying a user's higher level of knowledge from words in the questions; dare I say it's a kind of "AI-splaining")
  • Words and meanings are not the same thing.
  • Gibberish output (usually a bug; Transformers are just C++ programs, you know)
  • Lack of common sense (although I know some people like that, too)
  • Lack of a "world model"
  • Lack of a sense of personal context (they don't understand what it means to be a person)
  • Time/temporal reasoning (the concept of things happening in sequence is tricky)
  • 3D scene visualization (LLMs struggle to understand the relationship between objects in the real world)
  • Sarcasm and satire (e.g. articles espousing the benefits of "eating rocks")
  • Spin, biased viewpoints, and outright disinformation/deception (of source content)
  • Going rogue (usually a bug, or is it?)
  • Trick questions (e.g., queries that look like common online puzzles, but aren't quite the same)
  • Falling back on training data (overly complex answers)
  • Detecting intentional deception or other malfeasance by users
  • LLMs asking follow-up questions to clarify user requests (this capability has been improving quickly)
  • Not correctly prioritizing parts of the request (i.e., given multiple requests in a prompt instruction, it doesn't always automatically know which things are most important to you)
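
On the probabilistic-method and non-repeatability items above: a small, runnable sketch of temperature sampling over a toy next-token distribution. Greedy (temperature-zero) decoding always picks the same token; any temperature above zero samples from the softmax, which is why the same question can produce a different answer on each run. The logit values are made up for illustration.

    import math, random

    def sample_token(logits, temperature=1.0):
        """Pick a token index from logits; temperature 0 means greedy argmax."""
        if temperature == 0.0:
            return max(range(len(logits)), key=lambda i: logits[i])
        scaled = [x / temperature for x in logits]
        m = max(scaled)                          # subtract max for numeric stability
        exps = [math.exp(x - m) for x in scaled]
        probs = [e / sum(exps) for e in exps]
        return random.choices(range(len(logits)), weights=probs)[0]

    logits = [2.0, 1.8, 0.5]  # toy scores for three candidate tokens

    print([sample_token(logits, 0.0) for _ in range(5)])  # always [0, 0, 0, 0, 0]
    print([sample_token(logits, 1.0) for _ in range(5)])  # varies run to run

Even at temperature zero, deployed systems can still vary between runs because of batching and floating-point non-associativity on GPUs, so greedy decoding narrows the repeatability gap without fully closing it.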

Computational Limitations

There's really only one big problem with AI computation: it's slooow. Hence the need for all those expensive GPU chips (some back-of-envelope arithmetic follows the list below). This leads to problems with:

  • Cloud data center execution (expensive GPU time)
  • AI phone execution problems (e.g., frozen phone, battery depletion, overheating)
  • AI PC execution problems (big models are still too slow to run)
  • Training data set requirements (they need to feed on lots of tokens)
  • Environmental impact (e.g., one estimate puts the data center electricity used by an AI answer at ten times that of a non-AI internet search)
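
To put rough numbers on "slooow": a widely used back-of-envelope rule is roughly 2 FLOPs per model parameter per generated token for a forward pass. The sketch below applies that rule to two illustrative model sizes; the results are order-of-magnitude estimates, not benchmarks, and they ignore the attention term and memory bandwidth, which often dominates in practice.

    # Back-of-envelope inference cost: ~2 FLOPs per parameter per generated token.
    def flops_per_response(params, tokens):
        return 2 * params * tokens

    for name, params in [("7B model", 7e9), ("70B model", 70e9)]:
        f = flops_per_response(params, tokens=500)
        print(f"{name}: ~{f:.0e} FLOPs for a 500-token answer")
    # 7B model: ~7e+12 FLOPs for a 500-token answer
    # 70B model: ~7e+13 FLOPs for a 500-token answer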
