Large language models prove poor imitators of humans


Researchers have found that large language models (LLMs), despite their advanced capabilities, fall short in replicating the nuanced and often irrational ways humans make decisions. A recent study published in the Proceedings of the National Academy of Sciences reveals that while LLMs can mimic average human behavior in certain economic games, they fail to capture the diversity and unpredictability of individual choices, a key aspect of human nature. This work, conducted by a team at the University of Basel, highlights a significant gap between artificial intelligence and genuine human cognition, suggesting that the path to creating true digital twins of human personalities remains long and complex.

The study delved into the decision-making patterns of both humans and LLMs using established behavioral economics experiments. These “games” are designed to probe the strategic thinking, risk tolerance, and cooperative tendencies of participants. While the models demonstrated an ability to approximate the collective, aggregated behavior of human groups, they struggled to reproduce the specific, often idiosyncratic, choices made by individuals. This distinction is critical, as it underscores the difference between predicting group averages and understanding the subtle, sometimes illogical, factors that drive a single person’s actions. The findings suggest that current AI lacks the underlying cognitive and emotional frameworks that lead to the rich variety of human responses in identical situations.

Experimental Design and Methodology

To test the decision-making of LLMs against their human counterparts, the researchers employed a series of well-known behavioral games. These included scenarios designed to test fairness, such as the Ultimatum Game, and others focused on trust and reciprocity, like the Trust Game. In total, the study analyzed the responses of several prominent LLMs to 12 different economic games, comparing them against a large, pre-existing dataset of human decisions in those same scenarios. The human data was drawn from numerous previous studies, providing a robust baseline of typical human behavior.

Game-Theoretic Frameworks

The experiments were grounded in game theory, a branch of mathematics concerned with strategic decision-making. Each game presented the LLMs with a situation where they had to make a choice that would affect not only their own outcome but also that of another player. For instance, in the Dictator Game, one player decides how to split a sum of money with another, providing a measure of altruism. The researchers configured the LLMs to “act” as participants in these games, generating responses based on the prompts they were given. This allowed for a direct comparison between the machine-generated decisions and the documented human ones.
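
The study's exact prompts and model-querying code are not reproduced here, but a minimal sketch of what such a setup could look like is shown below, assuming a hypothetical query_llm helper in place of whatever chat API the researchers actually used; the prompt wording and the answer parsing are illustrative guesses, not the paper's own materials.

```python
# Illustrative sketch only: the study's actual prompts and model calls are not
# published here. `query_llm` is a hypothetical stand-in for whatever chat API
# the researchers used; swap in a real client to run the game for real.

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; returns a canned reply so the sketch runs offline."""
    return "I will give 4 points to the other player."

def play_dictator_game(endowment: int = 10) -> int:
    """Ask the model to split an endowment and extract how much it gives away."""
    prompt = (
        "You are a participant in an economics experiment. "
        f"You have {endowment} points. You may give any whole number of points "
        f"(0 to {endowment}) to an anonymous other player and keep the rest. "
        "How many points do you give? Reply with a single number."
    )
    reply = query_llm(prompt)
    # Take the first integer that appears in the reply as the offer,
    # clamped to the valid range.
    tokens = reply.replace(".", " ").replace(",", " ").split()
    numbers = [int(t) for t in tokens if t.isdigit()]
    offer = numbers[0] if numbers else 0
    return max(0, min(endowment, offer))

if __name__ == "__main__":
    print("Simulated dictator offer:", play_dictator_game())
```

Repeating a call like this many times, across the different games and prompt framings, would yield the machine-generated choice distributions that can then be set against the human data.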

Evaluating Model Performance

The core of the analysis involved comparing the distribution of choices made by the LLMs with the distribution of choices made by humans. The key metric was not simply whether the models could find an “optimal” strategy, but whether their range of decisions mirrored the variety seen in people. While some models were fine-tuned to align with human preferences, the study found that even these adjusted AIs could not escape the tendency to produce more uniform and predictable responses than the human players. The human dataset, in contrast, was characterized by a wide spectrum of behaviors, from highly selfish to purely altruistic, and everything in between.

Divergence in Individual Behavior

The most significant finding of the study was the failure of LLMs to capture the heterogeneity of human behavior. At the aggregate level, the models appeared to perform reasonably well. For example, the average amount of money offered in the Ultimatum Game by an LLM might closely match the average offer made by a human. However, this surface-level similarity masked a deeper discrepancy: the models consistently underestimated the variance in human choices. People in these experiments make all sorts of decisions: some are hyper-rational, others are driven by emotion, and many are simply inconsistent. This “noise” in human decision-making is not a flaw, but a feature of our complex cognitive makeup.
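
A toy comparison with made-up numbers (not the study's data) makes the distinction concrete: two sets of offers can share the same average while differing sharply in spread.

```python
import statistics

# Toy numbers for illustration only; these are not the study's data.
# Both sets of offers (out of 10) share the same mean, but the "human"
# sample is far more dispersed than the "LLM" sample.
human_offers = [0, 1, 3, 5, 5, 5, 7, 9, 10, 5]
llm_offers = [5, 5, 4, 5, 6, 5, 5, 5, 5, 5]

for label, offers in [("human", human_offers), ("LLM", llm_offers)]:
    print(f"{label:>5}: mean={statistics.mean(offers):.1f}, "
          f"stdev={statistics.stdev(offers):.2f}")
```

Here both samples average 5 out of 10, yet the simulated “human” sample is several times more dispersed, which is the kind of heterogeneity the models failed to reproduce.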

The research team noted that the LLMs tended to converge on a single, seemingly rational strategy, or a narrow range of strategies. They lacked the ability to simulate the outliers—the unusually generous or surprisingly spiteful decisions that are a hallmark of human populations. This suggests that the models, trained on vast amounts of text data, are adept at identifying common patterns but are less equipped to understand the exceptions. The study’s authors argue that this limitation is fundamental, stemming from the fact that LLMs learn from statistical patterns in language, not from lived experience or genuine emotional states.

Implications for Artificial Intelligence

The findings have profound implications for the future of AI, particularly in applications that require a deep understanding of human behavior. For example, in fields like mental health, where AI is being explored as a tool for therapy and support, the inability to model individual personality and emotional nuance could be a major stumbling block. A therapy bot that can only mimic the “average” human response is unlikely to be effective in addressing the specific needs of a person in crisis. Similarly, in economics and finance, AI models used to predict market behavior could be led astray if they fail to account for the irrational exuberance or panic that often drives investor decisions.

This research also calls into question the feasibility of creating “digital twins”—highly realistic AI simulations of specific individuals. The goal of a digital twin is to create a model that can predict how a person would behave in a given situation. However, if LLMs cannot even replicate the general diversity of human behavior, creating a faithful copy of a single, unique individual seems a distant prospect. The study suggests that a different architectural approach may be needed, one that goes beyond pattern recognition and incorporates more fundamental aspects of human psychology, such as memory, emotion, and personal history.

Future Research Directions

The authors of the paper propose several avenues for future research. One key area is the development of more sophisticated methods for evaluating LLM behavior. Simply comparing average outcomes is not enough. New metrics are needed that can capture the full distribution of behaviors, including the outliers and the inconsistencies. Additionally, they suggest that future studies could explore the impact of different training data on LLM performance. It may be possible to create models that are better at imitating human diversity by training them on datasets specifically designed to highlight behavioral variance.
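
The paper does not prescribe a particular metric, but one plausible direction is to compare whole distributions rather than averages, for example with the Wasserstein distance or a two-sample Kolmogorov-Smirnov test. The sketch below uses SciPy and the same illustrative offer lists as above; it is an assumption about what such an evaluation could look like, not the study's own method.

```python
# One possible distribution-level comparison; the study's own metrics may differ.
# Requires SciPy. The offer lists are illustrative placeholders, not real data.
from scipy.stats import ks_2samp, wasserstein_distance

human_offers = [0, 1, 3, 5, 5, 5, 7, 9, 10, 5]
llm_offers = [5, 5, 4, 5, 6, 5, 5, 5, 5, 5]

# Wasserstein distance: the "work" needed to reshape one empirical
# distribution of offers into the other. Two samples with identical means
# can still be far apart if their shapes differ.
print("Wasserstein distance:", wasserstein_distance(human_offers, llm_offers))

# Two-sample Kolmogorov-Smirnov test: could these samples plausibly come
# from the same underlying distribution?
stat, p_value = ks_2samp(human_offers, llm_offers)
print(f"KS statistic = {stat:.2f}, p-value = {p_value:.3f}")
```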

Exploring Alternative AI Architectures

Another promising direction is the exploration of alternative AI architectures. The current generation of LLMs is based on the transformer architecture, which is highly effective for language processing but may not be the best tool for modeling human decision-making. Researchers could investigate hybrid models that combine the strengths of LLMs with insights from cognitive science and psychology. For example, an AI could be designed with separate modules for rational thought and emotional response, allowing it to generate a wider and more realistic range of behaviors. The ultimate goal is to create AI that can not only predict what people will do on average, but also understand the rich tapestry of individual human experience.
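
As a purely speculative illustration of that modular idea, and not something the paper implements, a decision generator that mixes a deterministic “rational” policy with a noisy “emotional” one will, by construction, produce a wider spread of choices than either module alone:

```python
import random

# Speculative sketch, not the paper's proposal: mix a deterministic
# "rational" module with a noisy "emotional" one for each decision.

def rational_offer(endowment: int) -> int:
    """A payoff-maximising dictator keeps everything for itself."""
    return 0

def emotional_offer(endowment: int) -> int:
    """An impulsive response: anywhere from spiteful to fully generous."""
    return random.randint(0, endowment)

def hybrid_offer(endowment: int = 10, emotion_weight: float = 0.6) -> int:
    """Choose a module at random per decision, producing population-like spread."""
    if random.random() < emotion_weight:
        return emotional_offer(endowment)
    return rational_offer(endowment)

if __name__ == "__main__":
    print("Sample of hybrid offers:", [hybrid_offer() for _ in range(20)])
```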
