Strategic Tradeoffs Between Humans and AI in Multi-Agent Bargaining

With John J. Horton.
Revise & Resubmit at Econometrica.
Extended abstract at the ACM Conference on Economics & Computation (EC ’26).
Job Market Paper. WISE 2025 Best Paper Award.

Markets increasingly accommodate large language models (LLMs) as autonomous decision-making agents. As this transition occurs, it becomes critical to evaluate how these agents behave relative to their human and task-specific statistical predecessors. In this work, we present results from an empirical study comparing humans (N=216), multiple frontier LLMs, and customized Bayesian agents in dynamic multi-player bargaining games under identical conditions. Bayesian agents extract the highest surplus with aggressive trade proposals that are frequently rejected. Humans and LLMs achieve comparable aggregate surplus within their groups, but exhibit different trading strategies. LLMs favor conservative, concessionary proposals that are usually accepted by other LLMs, while humans propose trades that are consistent with fairness norms but are more likely to be rejected. These findings highlight that performance parity—a common benchmark in agent evaluation—can mask substantive procedural differences in how LLMs behave in complex multi-agent interactions.

General Social Agents

With Peyman Shahidi, Gili Rusak, Andrey Fradkin, and John J. Horton.
Forthcoming chapter in The Economics of Transformative AI.

Useful social science theories predict behavior across settings. However, applying a theory to make predictions in new settings is challenging: rarely can it be done without ad hoc modifications to account for setting-specific factors. We argue that AI agents put in simulations of those novel settings offer an alternative for applying theory, requiring minimal or no modifications. We present an approach for building such “general” agents that use theory-grounded natural language instructions, existing empirical data, and knowledge acquired by the underlying AI during training. To demonstrate the approach in settings where no data from that data-generating process exists—as is often the case in applied prediction problems—we design a heterogeneous population of 883,200 novel games. AI agents are constructed using human data from a small set of conceptually related but structurally distinct “seed” games. In preregistered experiments, on average, agents predict initial human play in a random sample of 1,500 games from the population better than (i) a cognitive hierarchy model, (ii) game-theoretic equilibria, and (iii) out-of-the-box agents. For a small set of separate novel games, these simulations predict responses from a new sample of human subjects better even than the most plausibly relevant published human data.

The Coasean Singularity? Demand, Supply, and Market Design with AI Agents

Press: Marginal Revolution; MIT Initiative on the Digital Economy.

AI agents—autonomous systems that perceive, reason, and act on behalf of human principals—are poised to transform digital markets by dramatically reducing transaction costs. This chapter evaluates the economic implications of this transition, adopting a consumer-oriented view of agents as market participants that can search, negotiate, and transact directly. From the demand side, agent adoption reflects derived demand: users trade off decision quality against effort reduction, with outcomes mediated by agent capability and task context. On the supply side, firms will design, integrate, and monetize agents, with outcomes hinging on whether agents operate within or across platforms. At the market level, agents create efficiency gains from lower search, communication, and contracting costs, but also introduce frictions such as congestion and price obfuscation. By lowering the costs of preference elicitation, contract enforcement, and identity verification, agents expand the feasible set of market designs but also raise novel regulatory challenges. While the net welfare effects remain an empirical question, the rapid onset of AI-mediated transactions presents a unique opportunity for economic research to inform real-world policy and market design.

Automated Social Science: Language Models as Scientist and Subjects

With Kehang Zhu and John J. Horton.
Reject and Resubmit at the Quarterly Journal of Economics.
Extended abstract at the ACM Conference on Economics & Computation (EC ’26).
EC ’26 Exemplary Paper Award.

With John J. Horton and Apostolos Filippas.
Revise and Resubmit at the Review of Economics and Statistics.
Extended abstract at the ACM Conference on Economics & Computation (EC ’24).

We present an approach for automatically generating and testing, in silico, social scientific hypotheses. This automation is made possible by recent advances in large language models (LLMs), but the key feature of the approach is the use of structural causal models. Structural causal models provide a language to state hypotheses, a blueprint for constructing LLM-based agents, an experimental design, and a plan for data analysis. The fitted structural causal model becomes an object available for prediction or the planning of follow-on experiments. We demonstrate the approach with several scenarios: a negotiation, a bail hearing, a job interview, and an auction. In each case, causal relationships are both proposed and tested by the system, finding evidence for some and not others. We provide evidence that the insights from these simulations of social interactions are not available to the LLM purely through direct elicitation. When given its proposed structural causal model for each scenario, the LLM is good at predicting the signs of estimated effects, but it cannot reliably predict the magnitudes of those estimates. In the auction experiment, the in silico simulation results closely match the predictions of auction theory, but elicited predictions of the clearing prices from the LLM are inaccurate. However, the LLM’s predictions are dramatically improved if the model can condition on the fitted structural causal model. In short, the LLM knows more than it can (immediately) tell.

Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus?

With Eaman Jahani, Joe Zhang, Hong-Yi TuYe, Mohammed Alsobay, Christos Nicolaides, Siddharth Suri, and David Holtz.
Information Systems Research, 2026.

We argue that newly-developed large language models (LLMs), because of how they are trained and designed, are implicit computational models of humans—a Homo silicus. LLMs can be used like economists use Homo economicus: they can be given endowments, information, preferences, and so on, and then their behavior can be explored in scenarios via simulation. Experiments using this approach, derived from Charness and Rabin (2002), Kahneman et al. (1986), Samuelson and Zeckhauser (1988), Oprea (2024b), and Horton (2025), show qualitatively similar results to the original, and when they differ, it is often generative for future research. We discuss potential applications, conceptual issues, and why this approach can inform the study of humans.

Prompt Adaptation as a Dynamic Complement in Generative AI Systems

Press: MIT Sloan's Ideas Made to Matter; Marginal Revolution; Columbia Business School Insights.

As generative AI systems rapidly improve, a key question emerges: how do users adapt to these changes, and when does such adaptation matter for realizing performance gains? This paper studies prompt adaptation—how users adjust their inputs in response to evolving model behavior—using a common experimental design applied to two preregistered tasks with 3,750 total participants who submitted nearly 37,000 prompts. We show that the importance of prompt adaptation depends critically on task structure. In a task with fixed evaluation criteria and an unambiguous goal, user prompt adaptation accounts for roughly half of the performance gains from a model upgrade. In contrast, in an open-ended creative task where the space of acceptable outputs is effectively unbounded and quality is subjective, performance improvements are driven primarily by model capability; prompt adaptation plays a limited role. We further show that automated prompt rewriting cannot generally substitute for human adaptation: when aligned with task objectives, it can modestly improve performance, but when misaligned, it can actively undermine the gains from model improvements. Together, these findings position prompt adaptation as a dynamic complement whose importance depends on task structure and system design, and suggest that without it, a substantial share of the economic value created by advances in generative models may go unrealized.

Self-supervised Preference Learning for Multimodal Foundation Models

With Akshata Tiwari, Jillian Ross, and Andrew Lo.
Under Review at NeurIPS.

With Angela L. Duckworth, Katherine L. Milkman, and 26 others.
Proceedings of the National Academy of Sciences, 2025.

Preference optimization for multimodal foundation models typically draws its signal from external judgment: human annotations, model-based scoring, or task-specific supervision. We propose a self-supervised alternative based on a simple principle: if two images depict similar underlying data, their descriptions should be similar. We instantiate this idea for time-series data rendered as charts—a setting that is simple, broadly important, and well-suited to clean downstream tasks. Two charts are neighbors if they trace similar patterns in the underlying data; we compute these neighborhoods directly from the data, generate candidate descriptions of each image and its neighbors using the model itself, and rank candidates by their agreement with the neighborhood to form preference pairs. We apply this method to direct preference optimization on two open-source foundation models, Qwen2-VL-7B and Gemma-3-4B, across three domains: financial price paths, electrocardiograms, and seismic traces. The learned signal transfers zero-shot to held-out tasks. Relative to the supervised fine-tuning baseline, the detection of large price-moves increases by 31%, crash-risk prediction by 14%, electrocardiogram abnormality detection by 10%, and seismic event detection by 3.3%. Probes show the gains reflect stronger alignment between visual inputs and textual outputs, and ablations confirm that neither generic preference optimization nor learned-feature neighborhoods produce the same gains.

A national megastudy shows that email nudges to elementary school teachers boost student math achievement, particularly when personalized

With Linnea Gandhi and Angela L. Duckworth.
Current Directions in Psychological Science, 2024.

In response to the alarming recent decline in US math achievement, we conducted a national megastudy in which 140,461 elementary school teachers who collectively taught 2,992,027 students were randomly assigned to receive a variety of behaviorally informed email nudges aimed at improving students' progress in math. Specifically, we partnered with the nonprofit educational platform Zearn Math to compare the impact of 15 different interventions with a reminder-only megastudy control condition. All 16 conditions entailed weekly emails delivered to teachers over 4 weeks in the fall of 2021. The best-performing intervention, which encouraged teachers to log into Zearn Math for an updated report on how their students were doing that week, produced a 5.06% increase in students' math progress (3.30% after accounting for the winner’s curse). In exploratory analyses, teachers who received any behaviorally informed email nudge (vs. a reminder-only megastudy control) saw their students' math progress boosted by an average of 1.89% during the 4-week intervention period; emails referencing personalized data (i.e., classroom-specific statistics) outperformed emails that did not by 2.26%. While small in size, these intervention effects were consistent across school socioeconomic status and school type (public, private, etc.) and, further, persisted in the 8-week post-intervention period. Collectively, these findings underscore both how difficult it is to change behavior and the need for large-scale, rigorous, empirical research of the sort undertaken in this megastudy.

Effect Size Magnification. No Variable is as Important as the One You’re Thinking About—While You’re Thinking About It

With Crystal Qian, Kehang Zhu, John J. Horton, Vivian Tsai, James Wexler, and Nithum Thain.
ACM Conference on Intelligent User Interfaces (IUI), 2026.

The goal of psychological science is to discover truths about human nature, and the typical form of empirical insights is a simple statement of the form x relates to y. We suggest that such “one-liners” imply much larger x-y relationships than those we typically study. Given the multitude of factors that compete and interact to influence any human outcome, small effect sizes should not surprise us. And yet they do—as evidenced by the persistent and systematic underpowering of research studies in psychological science. We suggest an explanation. Effect size magnification is the tendency to exaggerate the importance of the variable under investigation because of the momentary neglect of others. Although problematic, this attentional focus serves a purpose akin to that of the eye’s fovea. We see a particular x-y relationship with greater acuity when it is the center of our attention. Debiasing remedies are not straightforward, but we recommend (a) recalibrating expectations about the effect sizes we study, (b) proactively exploring moderators and boundary conditions, and (c) periodically toggling our focus from the x variable we happen to study to the non-x variables we do not.

Selected Work in Progress

AI Agents as Flexible Commitment Devices: Evidence from the Field

with Alex Moehring.

How Complex are People? Using AI Simulations to Measure the Dimensionality of Human Behaviors

with Matthew O. Jackson, Yutong Xie, Walter Yuan, and Qiaozhu Mei.

Taste-as-Program: Scalable Preference Elicitation with Artificial Intelligence

with Gili Rusak and John J. Horton.

Mass In Silico Replication of Social Science Experiments

with Jacob Snyder, John J. Horton, and Abhishek Nagaraj.

What Causes Price Bubbles? Theory and Evidence from Simulations and the Lab

with Cady Ngo, Christopher Avery, and Colin Camerer.

Strategic Tradeoffs Between Humans and AI in Multi-Agent Bargaining

With John J. Horton.
Revise & Resubmit at Econometrica.
Extended abstract at the ACM Conference on Economics & Computation (EC ‘26).
Job Market Paper WISE 2025 Best Paper Award

Markets increasingly accommodate large language models (LLMs) as autonomous decision-making agents. As this transition occurs, it becomes critical to evaluate how these agents behave relative to their human and task-specific statistical predecessors. In this work, we present results from an empirical study comparing humans (N=216), multiple frontier LLMs, and customized Bayesian agents in dynamic multi-player bargaining games under identical conditions. Bayesian agents extract the highest surplus with aggressive trade proposals that are frequently rejected. Humans and LLMs achieve comparable aggregate surplus within their groups, but exhibit different trading strategies. LLMs favor conservative, concessionary proposals that are usually accepted by other LLMs, while humans propose trades that are consistent with fairness norms but are more likely to be rejected. These findings highlight that performance parity—a common benchmark in agent evaluation—can mask substantive procedural differences in how LLMs behave in complex multi-agent interactions.

General Social Agents

With Peyman Shahidi, Gili Rusak, Andrey Fradkin, and John J. Horton.
Forthcoming chapter in The Economics of Transformative AI

Useful social science theories predict behavior across settings. However, applying a theory to make predictions in new settings is challenging: rarely can it be done without ad hoc modifications to account for setting-specific factors. We argue that AI agents put in simulations of those novel settings offer an alternative for applying theory, requiring minimal or no modifications. We present an approach for building such “general” agents that use theory-grounded natural language instructions, existing empirical data, and knowledge acquired by the underlying AI during training. To demonstrate the approach in settings where no data from that data-generating process exists—as is often the case in applied prediction problems—we design a heterogeneous population of 883,200 novel games. AI agents are constructed using human data from a small set of conceptually related but structurally distinct “seed” games. In preregistered experiments, on average, agents predict initial human play in a random sample of 1,500 games from the population better than (i) a cognitive hierarchy model, (ii) game-theoretic equilibria, and (iii) out-of-the-box agents. For a small set of separate novel games, these simulations predict responses from a new sample of human subjects better even than the most plausibly relevant published human data.

The Coasean Singularity? Demand, Supply, and Market Design with AI Agents

Abstract Paper Press: Marginal Revolution Press: MIT Initiative on the Digital Economy

AI agents—autonomous systems that perceive, reason, and act on behalf of human principals—are poised to transform digital markets by dramatically reducing transaction costs. This chapter evaluates the economic implications of this transition, adopting a consumer-oriented view of agents as market participants that can search, negotiate, and transact directly. From the demand side, agent adoption reflects derived demand: users trade off decision quality against effort reduction, with outcomes mediated by agent capability and task context. On the supply side, firms will design, integrate, and monetize agents, with outcomes hinging on whether agents operate within or across platforms. At the market level, agents create efficiency gains from lower search, communication, and contracting costs, but also introduce frictions such as congestion and price obfuscation. By lowering the costs of preference elicitation, contract enforcement, and identity verification, agents expand the feasible set of market designs but also raise novel regulatory challenges. While the net welfare effects remain an empirical question, the rapid onset of AI-mediated transactions presents a unique opportunity for economic research to inform real-world policy and market design.

Automated Social Science: Language Models as Scientist and Subjects

With Kehang Zhu and John J. Horton.
Reject and Resubmit at the Quarterly Journal of Economics.
Extended abstract at the ACM Conference on Economics & Computation (EC ‘26).
EC ‘26 Exemplary Paper Award

With John J. Horton and Apostolos Filippas.
Revise and Resubmit at the Review of Economics and Statistics.
Extended abstract at the ACM Conference on Economics & Computation (EC ‘24).

We present an approach for automatically generating and testing, in silico, social scientific hypotheses. This automation is made possible by recent advances in large language models (LLM), but the key feature of the approach is the use of structural causal models. Structural causal models provide a language to state hypotheses, a blueprint for constructing LLM-based agents, an experimental design, and a plan for data analysis. The fitted structural causal model becomes an object available for prediction or the planning of follow-on experiments. We demonstrate the approach with several scenarios: a negotiation, a bail hearing, a job interview, and an auction. In each case, causal relationships are both proposed and tested by the system, finding evidence for some and not others. We provide evidence that the insights from these simulations of social interactions are not available to the LLM purely through direct elicitation. When given its proposed structural causal model for each scenario, the LLM is good at predicting the signs of estimated effects, but it cannot reliably predict the magnitudes of those estimates. In the auction experiment, the in silico simulation results closely match the predictions of auction theory, but elicited predictions of the clearing prices from the LLM are inaccurate. However, the LLM’s predictions are dramatically improved if the model can condition on the fitted structural causal model. In short, the LLM knows more than it can (immediately) tell.

Large Language Models as Simulated Economic Agents: What we can learn from Homo Silicus

With Eaman Jahani, Joe Zhang, Hong-Yi TuYe, Mohammed Alsobay, Christos Nicolaides, Siddharth Suri, and David Holtz.
Information Systems Research, 2026.

We argue that newly-developed large language models (LLMs), because of how they are trained and designed, are implicit computational models of humans—a Homo silicus. LLMs can be used like economists use Homo economicus: they can be given endowments, information, preferences, and so on, and then their behavior can be explored in scenarios via simulation. Experiments using this approach, derived from Charness and Rabin (2002), Kahneman et al. (1986), Samuelson and Zeckhauser (1988), Oprea (2024b), and Horton (2025), show qualitatively similar results to the original, and when they differ, it is often generative for future research. We discuss potential applications, conceptual issues, and why this approach can inform the study of humans.

Prompt Adaptation as a Dynamic Complement in Generative AI Systems

Abstract Paper Press: MIT Sloan's Ideas Made to Matter Press: Marginal Revolution Press: Columbia Business School Insights

As generative AI systems rapidly improve, a key question emerges: how do users adapt to these changes, and when does such adaptation matter for realizing performance gains? This paper studies prompt adaptation—how users adjust their inputs in response to evolving model behavior—using a common experimental design applied to two preregistered tasks with 3,750 total participants who submitted nearly 37,000 prompts. We show that the importance of prompt adaptation depends critically on task structure. In a task with fixed evaluation criteria and an unambiguous goal, user prompt adaptation accounts for roughly half of the performance gains from a model upgrade. In contrast, in an open-ended creative task where the space of acceptable outputs is effectively unbounded and quality is subjective, performance improvements are driven primarily by model capability; prompt adaptation plays a limited role. We further show that automated prompt rewriting cannot generally substitute for human adaptation: when aligned with task objectives, it can modestly improve performance, but when misaligned, it can actively undermine the gains from model improvements. Together, these findings position prompt adaptation as a dynamic complement whose importance depends on task structure and system design, and suggest that without it, a substantial share of the economic value created by advances in generative models may go unrealized.

Self-supervised Preference Learning for Multimodal Foundation Models

With Akshata Tiwari, Jillian Ross, and Andrew Lo.
Under Review at NeurIPS.

with Angela L. Duckworth, Katherine L. Milkman, and 26 others).
Proceedings of the National Academy of Sciences, 2025.

Preference optimization for multimodal foundation models typically draws its signal from external judgment: human annotations, model-based scoring, or task-specific supervision. We propose a self-supervised alternative based on a simple principle: if two images depict similar underlying data, their descriptions should be similar. We instantiate this idea for time-series data rendered as charts—a setting that is simple, broadly important, and well-suited to clean downstream tasks. Two charts are neighbors if they trace similar patterns in the underlying data; we compute these neighborhoods directly from the data, generate candidate descriptions of each image and its neighbors using the model itself, and rank candidates by their agreement with the neighborhood to form preference pairs. We apply this method to direct preference optimization on two open-source foundation models, Qwen2-VL-7B and Gemma-3-4B, across three domains: financial price paths, electrocardiograms, and seismic traces. The learned signal transfers zero-shot to held-out tasks. Relative to the supervised fine-tuning baseline, the detection of large price-moves increases by 31%, crash-risk prediction by 14%, electrocardiogram abnormality detection by 10%, and seismic event detection by 3.3%. Probes show the gains reflect stronger alignment between visual inputs and textual outputs, and ablations confirm that neither generic preference optimization nor learned-feature neighborhoods produce the same gains.

A national megastudy shows that email nudges to elementary school teachers boost student math achievement, particularly when personalized

With Linnea Gandhi and Angela L. Duckworth.
Current Directions in Psychological Science, 2024.

In response to the alarming recent decline in US math achievement, we conducted a national megastudy in which 140,461 elementary school teachers who collectively taught 2,992,027 students were randomly assigned to receive a variety of behaviorally informed email nudges aimed at improving students' progress in math. Specifically, we partnered with the nonprofit educational platform Zearn Math to compare the impact of 15 different interventions with a reminder-only megastudy control condition. All 16 conditions entailed weekly emails delivered to teachers over 4-wk in the fall of 2021. The best-performing intervention, which encouraged teachers to log into Zearn Math for an updated report on how their students were doing that week, produced a 5.06% increase in students' math progress (3.30% after accounting for the winner’s curse). In exploratory analyses, teachers who received any behaviorally informed email nudge (vs. a reminder-only megastudy control) saw their students' math progress boosted by an average of 1.89% during the 4-wk intervention period; emails referencing personalized data (i.e., classroom-specific statistics) outperformed emails that did not by 2.26%. While small in size, these intervention effects were consistent across school socioeconomic status and school type (public, private, etc.) and, further, persisted in the 8-wk post-intervention period. Collectively, these findings underscore both how difficult it is to change behavior and the need for large-scale, rigorous, empirical research of the sort undertaken in this megastudy.

Effect Size Magnification. No Variable is as Important as the One You’re Thinking About—While You’re Thinking About It

With Crystal Qian, Kehang Zhu, John J. Horton, Vivian Tsai, James Wexler, and Nithum Thain.
ACM Conference on Intelligent User Interfaces (IUI), 2026.

The goal of psychological science is to discover truths about human nature, and the typical form of empirical insights is a simple statement of the form x relates to y. We suggest that such “one-liners” imply much larger x-y relationships than those we typically study. Given the multitude of factors that compete and interact to influence any human outcome, small effect sizes should not surprise us. And yet they do—as evidenced by the persistent and systematic underpowering of research studies in psychological science. We suggest an explanation. Effect size magnification is the tendency to exaggerate the importance of the variable under investigation because of the momentary neglect of others. Although problematic, this attentional focus serves a purpose akin to that of the eye’s fovea. We see a particular x-y relationship with greater acuity when it is the center of our attention. Debiasing remedies are not straightforward, but we recommend (a) recalibrating expectations about the effect sizes we study, (b) proactively exploring moderators and boundary conditions, and (c) periodically toggling our focus from the x variable we happen to study to the non-x variables we do not.

Peer-Reviewed Publications

AI Agents as Flexible Commitment Devices: Evidence from the Field

with Alex Moehring.

How Complex are People? Using AI Simulations to Measure the Dimensionality of Human Behaviors

with Matthew O. Jackson, Yutong Xie, Walter Yuan, and Qiaozhu Mei.

Taste-as-Program: Scalable Preference Elicitation with Artificial Intelligence

with Gili Rusak and John J. Horton.

Mass In Silico Replication of Social Science Experiments

with Jacob Snyder, John J. Horton, and Abhishek Nagaraj.

What Causes Price Bubbles? Theory and Evidence from Simulations and the Lab

with Cady Ngo, Christopher Avery, and Colin Camerer.

Strategic Tradeoffs Between Humans and AI in Multi-Agent Bargaining

With John J. Horton.
Revise & Resubmit at Econometrica.
Extended abstract at the ACM Conference on Economics & Computation (EC ‘26).
Job Market Paper WISE 2025 Best Paper Award

Markets increasingly accommodate large language models (LLMs) as autonomous decision-making agents. As this transition occurs, it becomes critical to evaluate how these agents behave relative to their human and task-specific statistical predecessors. In this work, we present results from an empirical study comparing humans (N=216), multiple frontier LLMs, and customized Bayesian agents in dynamic multi-player bargaining games under identical conditions. Bayesian agents extract the highest surplus with aggressive trade proposals that are frequently rejected. Humans and LLMs achieve comparable aggregate surplus within their groups, but exhibit different trading strategies. LLMs favor conservative, concessionary proposals that are usually accepted by other LLMs, while humans propose trades that are consistent with fairness norms but are more likely to be rejected. These findings highlight that performance parity—a common benchmark in agent evaluation—can mask substantive procedural differences in how LLMs behave in complex multi-agent interactions.

General Social Agents

With Peyman Shahidi, Gili Rusak, Andrey Fradkin, and John J. Horton.
Forthcoming chapter in The Economics of Transformative AI

Useful social science theories predict behavior across settings. However, applying a theory to make predictions in new settings is challenging: rarely can it be done without ad hoc modifications to account for setting-specific factors. We argue that AI agents put in simulations of those novel settings offer an alternative for applying theory, requiring minimal or no modifications. We present an approach for building such “general” agents that use theory-grounded natural language instructions, existing empirical data, and knowledge acquired by the underlying AI during training. To demonstrate the approach in settings where no data from that data-generating process exists—as is often the case in applied prediction problems—we design a heterogeneous population of 883,200 novel games. AI agents are constructed using human data from a small set of conceptually related but structurally distinct “seed” games. In preregistered experiments, on average, agents predict initial human play in a random sample of 1,500 games from the population better than (i) a cognitive hierarchy model, (ii) game-theoretic equilibria, and (iii) out-of-the-box agents. For a small set of separate novel games, these simulations predict responses from a new sample of human subjects better even than the most plausibly relevant published human data.

The Coasean Singularity? Demand, Supply, and Market Design with AI Agents

Abstract Paper Press: Marginal Revolution Press: MIT Initiative on the Digital Economy

AI agents—autonomous systems that perceive, reason, and act on behalf of human principals—are poised to transform digital markets by dramatically reducing transaction costs. This chapter evaluates the economic implications of this transition, adopting a consumer-oriented view of agents as market participants that can search, negotiate, and transact directly. From the demand side, agent adoption reflects derived demand: users trade off decision quality against effort reduction, with outcomes mediated by agent capability and task context. On the supply side, firms will design, integrate, and monetize agents, with outcomes hinging on whether agents operate within or across platforms. At the market level, agents create efficiency gains from lower search, communication, and contracting costs, but also introduce frictions such as congestion and price obfuscation. By lowering the costs of preference elicitation, contract enforcement, and identity verification, agents expand the feasible set of market designs but also raise novel regulatory challenges. While the net welfare effects remain an empirical question, the rapid onset of AI-mediated transactions presents a unique opportunity for economic research to inform real-world policy and market design.

Automated Social Science: Language Models as Scientist and Subjects

With Kehang Zhu and John J. Horton.
Reject and Resubmit at the Quarterly Journal of Economics.
Extended abstract at the ACM Conference on Economics & Computation (EC ’26).
EC ’26 Exemplary Paper Award

With John J. Horton and Apostolos Filippas.
Revise and Resubmit at the Review of Economics and Statistics.
Extended abstract at the ACM Conference on Economics & Computation (EC ’24).

We present an approach for automatically generating and testing, in silico, social scientific hypotheses. This automation is made possible by recent advances in large language models (LLMs), but the key feature of the approach is the use of structural causal models. Structural causal models provide a language to state hypotheses, a blueprint for constructing LLM-based agents, an experimental design, and a plan for data analysis. The fitted structural causal model becomes an object available for prediction or the planning of follow-on experiments. We demonstrate the approach with several scenarios: a negotiation, a bail hearing, a job interview, and an auction. In each case, the system both proposes and tests causal relationships, finding evidence for some and not others. We provide evidence that the insights from these simulations of social interactions are not available to the LLM purely through direct elicitation. When given its proposed structural causal model for each scenario, the LLM is good at predicting the signs of estimated effects, but it cannot reliably predict the magnitudes of those estimates. In the auction experiment, the in silico simulation results closely match the predictions of auction theory, but elicited predictions of the clearing prices from the LLM are inaccurate. However, the LLM’s predictions are dramatically improved if the model can condition on the fitted structural causal model. In short, the LLM knows more than it can (immediately) tell.

Large Language Models as Simulated Economic Agents: What we can learn from Homo Silicus

With Eaman Jahani, Joe Zhang, Hong-Yi TuYe, Mohammed Alsobay, Christos Nicolaides, Siddharth Suri, and David Holtz.
Information Systems Research, 2026.

We argue that newly developed large language models (LLMs), because of how they are trained and designed, are implicit computational models of humans: a Homo silicus. LLMs can be used like economists use Homo economicus: they can be given endowments, information, preferences, and so on, and then their behavior can be explored in scenarios via simulation. Experiments using this approach, derived from Charness and Rabin (2002), Kahneman et al. (1986), Samuelson and Zeckhauser (1988), Oprea (2024b), and Horton (2025), show qualitatively similar results to the original, and when they differ, it is often generative for future research. We discuss potential applications, conceptual issues, and why this approach can inform the study of humans.

Prompt Adaptation as a Dynamic Complement in Generative AI Systems

Press: MIT Sloan's Ideas Made to Matter; Marginal Revolution; Columbia Business School Insights

As generative AI systems rapidly improve, a key question emerges: how do users adapt to these changes, and when does such adaptation matter for realizing performance gains? This paper studies prompt adaptation—how users adjust their inputs in response to evolving model behavior—using a common experimental design applied to two preregistered tasks with 3,750 total participants who submitted nearly 37,000 prompts. We show that the importance of prompt adaptation depends critically on task structure. In a task with fixed evaluation criteria and an unambiguous goal, user prompt adaptation accounts for roughly half of the performance gains from a model upgrade. In contrast, in an open-ended creative task where the space of acceptable outputs is effectively unbounded and quality is subjective, performance improvements are driven primarily by model capability; prompt adaptation plays a limited role. We further show that automated prompt rewriting cannot generally substitute for human adaptation: when aligned with task objectives, it can modestly improve performance, but when misaligned, it can actively undermine the gains from model improvements. Together, these findings position prompt adaptation as a dynamic complement whose importance depends on task structure and system design, and suggest that without it, a substantial share of the economic value created by advances in generative models may go unrealized.

Self-supervised Preference Learning for Multimodal Foundation Models

With Akshata Tiwari, Jillian Ross, and Andrew Lo.
Under Review at NeurIPS.

With Angela L. Duckworth, Katherine L. Milkman, and 26 others.
Proceedings of the National Academy of Sciences, 2025.

Preference optimization for multimodal foundation models typically draws its signal from external judgment: human annotations, model-based scoring, or task-specific supervision. We propose a self-supervised alternative based on a simple principle: if two images depict similar underlying data, their descriptions should be similar. We instantiate this idea for time-series data rendered as charts, a setting that is simple, broadly important, and well-suited to clean downstream tasks. Two charts are neighbors if they trace similar patterns in the underlying data; we compute these neighborhoods directly from the data, generate candidate descriptions of each image and its neighbors using the model itself, and rank candidates by their agreement with the neighborhood to form preference pairs. We apply this method to direct preference optimization on two open-source foundation models, Qwen2-VL-7B and Gemma-3-4B, across three domains: financial price paths, electrocardiograms, and seismic traces. The learned signal transfers zero-shot to held-out tasks. Relative to the supervised fine-tuning baseline, the detection of large price moves increases by 31%, crash-risk prediction by 14%, electrocardiogram abnormality detection by 10%, and seismic event detection by 3.3%. Probes show the gains reflect stronger alignment between visual inputs and textual outputs, and ablations confirm that neither generic preference optimization nor learned-feature neighborhoods produce the same gains.

A national megastudy shows that email nudges to elementary school teachers boost student math achievement, particularly when personalized

With Linnea Gandhi and Angela L. Duckworth.
Current Directions in Psychological Science, 2024.

In response to the alarming recent decline in US math achievement, we conducted a national megastudy in which 140,461 elementary school teachers who collectively taught 2,992,027 students were randomly assigned to receive a variety of behaviorally informed email nudges aimed at improving students' progress in math. Specifically, we partnered with the nonprofit educational platform Zearn Math to compare the impact of 15 different interventions with a reminder-only megastudy control condition. All 16 conditions entailed weekly emails delivered to teachers over 4 weeks in the fall of 2021. The best-performing intervention, which encouraged teachers to log into Zearn Math for an updated report on how their students were doing that week, produced a 5.06% increase in students' math progress (3.30% after accounting for the winner’s curse). In exploratory analyses, teachers who received any behaviorally informed email nudge (vs. a reminder-only megastudy control) saw their students' math progress boosted by an average of 1.89% during the 4-week intervention period; emails referencing personalized data (i.e., classroom-specific statistics) outperformed emails that did not by 2.26%. While small in size, these intervention effects were consistent across school socioeconomic status and school type (public, private, etc.) and, further, persisted in the 8-week post-intervention period. Collectively, these findings underscore both how difficult it is to change behavior and the need for large-scale, rigorous, empirical research of the sort undertaken in this megastudy.

Effect Size Magnification. No Variable is as Important as the One You’re Thinking About—While You’re Thinking About It

With Crystal Qian, Kehang Zhu, John J. Horton, Vivian Tsai, James Wexler, and Nithum Thain.
ACM Conference on Intelligent User Interfaces (IUI), 2026.

The goal of psychological science is to discover truths about human nature, and the typical form of empirical insights is a simple statement of the form x relates to y. We suggest that such “one-liners” imply much larger x-y relationships than those we typically study. Given the multitude of factors that compete and interact to influence any human outcome, small effect sizes should not surprise us. And yet they do—as evidenced by the persistent and systematic underpowering of research studies in psychological science. We suggest an explanation. Effect size magnification is the tendency to exaggerate the importance of the variable under investigation because of the momentary neglect of others. Although problematic, this attentional focus serves a purpose akin to that of the eye’s fovea. We see a particular x-y relationship with greater acuity when it is the center of our attention. Debiasing remedies are not straightforward, but we recommend (a) recalibrating expectations about the effect sizes we study, (b) proactively exploring moderators and boundary conditions, and (c) periodically toggling our focus from the x variable we happen to study to the non-x variables we do not.

Book Chapters

AI Agents as Flexible Commitment Devices: Evidence from the Field

with Alex Moehring.

How Complex are People? Using AI Simulations to Measure the Dimensionality of Human Behaviors

with Matthew O. Jackson, Yutong Xie, Walter Yuan, and Qiaozhu Mei.

Taste-as-Program: Scalable Preference Elicitation with Artificial Intelligence

with Gili Rusak and John J. Horton.

Mass In Silico Replication of Social Science Experiments

with Jacob Snyder, John J. Horton, and Abhishek Nagaraj.

What Causes Price Bubbles? Theory and Evidence from Simulations and the Lab

with Cady Ngo, Christopher Avery, and Colin Camerer.

Strategic Tradeoffs Between Humans and AI in Multi-Agent Bargaining

With John J. Horton.
Revise & Resubmit at Econometrica.
Extended abstract at the ACM Conference on Economics & Computation (EC ‘26).
Job Market Paper WISE 2025 Best Paper Award

Markets increasingly accommodate large language models (LLMs) as autonomous decision-making agents. As this transition occurs, it becomes critical to evaluate how these agents behave relative to their human and task-specific statistical predecessors. In this work, we present results from an empirical study comparing humans (N=216), multiple frontier LLMs, and customized Bayesian agents in dynamic multi-player bargaining games under identical conditions. Bayesian agents extract the highest surplus with aggressive trade proposals that are frequently rejected. Humans and LLMs achieve comparable aggregate surplus within their groups, but exhibit different trading strategies. LLMs favor conservative, concessionary proposals that are usually accepted by other LLMs, while humans propose trades that are consistent with fairness norms but are more likely to be rejected. These findings highlight that performance parity—a common benchmark in agent evaluation—can mask substantive procedural differences in how LLMs behave in complex multi-agent interactions.

Refereed Conference Proceedings

AI Agents as Flexible Commitment Devices: Evidence from the Field

with Alex Moehring.

How Complex are People? Using AI Simulations to Measure the Dimensionality of Human Behaviors

with Matthew O. Jackson, Yutong Xie, Walter Yuan, and Qiaozhu Mei.

Taste-as-Program: Scalable Preference Elicitation with Artificial Intelligence

with Gili Rusak and John J. Horton.

Mass In Silico Replication of Social Science Experiments

with Jacob Snyder, John J. Horton, and Abhishek Nagaraj.

What Causes Price Bubbles? Theory and Evidence from Simulations and the Lab

with Cady Ngo, Christopher Avery, and Colin Camerer.

Strategic Tradeoffs Between Humans and AI in Multi-Agent Bargaining

With John J. Horton.
Revise & Resubmit at Econometrica.
Extended abstract at the ACM Conference on Economics & Computation (EC ‘26).
Job Market Paper WISE 2025 Best Paper Award

Markets increasingly accommodate large language models (LLMs) as autonomous decision-making agents. As this transition occurs, it becomes critical to evaluate how these agents behave relative to their human and task-specific statistical predecessors. In this work, we present results from an empirical study comparing humans (N=216), multiple frontier LLMs, and customized Bayesian agents in dynamic multi-player bargaining games under identical conditions. Bayesian agents extract the highest surplus with aggressive trade proposals that are frequently rejected. Humans and LLMs achieve comparable aggregate surplus within their groups, but exhibit different trading strategies. LLMs favor conservative, concessionary proposals that are usually accepted by other LLMs, while humans propose trades that are consistent with fairness norms but are more likely to be rejected. These findings highlight that performance parity—a common benchmark in agent evaluation—can mask substantive procedural differences in how LLMs behave in complex multi-agent interactions.

General Social Agents

With Peyman Shahidi, Gili Rusak, Andrey Fradkin, and John J. Horton.
Forthcoming chapter in The Economics of Transformative AI

Useful social science theories predict behavior across settings. However, applying a theory to make predictions in new settings is challenging: rarely can it be done without ad hoc modifications to account for setting-specific factors. We argue that AI agents put in simulations of those novel settings offer an alternative for applying theory, requiring minimal or no modifications. We present an approach for building such “general” agents that use theory-grounded natural language instructions, existing empirical data, and knowledge acquired by the underlying AI during training. To demonstrate the approach in settings where no data from that data-generating process exists—as is often the case in applied prediction problems—we design a heterogeneous population of 883,200 novel games. AI agents are constructed using human data from a small set of conceptually related but structurally distinct “seed” games. In preregistered experiments, on average, agents predict initial human play in a random sample of 1,500 games from the population better than (i) a cognitive hierarchy model, (ii) game-theoretic equilibria, and (iii) out-of-the-box agents. For a small set of separate novel games, these simulations predict responses from a new sample of human subjects better even than the most plausibly relevant published human data.

The Coasean Singularity? Demand, Supply, and Market Design with AI Agents

Abstract Paper Press: Marginal Revolution Press: MIT Initiative on the Digital Economy

AI agents—autonomous systems that perceive, reason, and act on behalf of human principals—are poised to transform digital markets by dramatically reducing transaction costs. This chapter evaluates the economic implications of this transition, adopting a consumer-oriented view of agents as market participants that can search, negotiate, and transact directly. From the demand side, agent adoption reflects derived demand: users trade off decision quality against effort reduction, with outcomes mediated by agent capability and task context. On the supply side, firms will design, integrate, and monetize agents, with outcomes hinging on whether agents operate within or across platforms. At the market level, agents create efficiency gains from lower search, communication, and contracting costs, but also introduce frictions such as congestion and price obfuscation. By lowering the costs of preference elicitation, contract enforcement, and identity verification, agents expand the feasible set of market designs but also raise novel regulatory challenges. While the net welfare effects remain an empirical question, the rapid onset of AI-mediated transactions presents a unique opportunity for economic research to inform real-world policy and market design.

Automated Social Science: Language Models as Scientist and Subjects

With Kehang Zhu and John J. Horton.
Reject and Resubmit at the Quarterly Journal of Economics.
Extended abstract at the ACM Conference on Economics & Computation (EC ‘26).
EC ‘26 Exemplary Paper Award

With John J. Horton and Apostolos Filippas.
Revise and Resubmit at the Review of Economics and Statistics.
Extended abstract at the ACM Conference on Economics & Computation (EC ‘24).

We present an approach for automatically generating and testing, in silico, social scientific hypotheses. This automation is made possible by recent advances in large language models (LLM), but the key feature of the approach is the use of structural causal models. Structural causal models provide a language to state hypotheses, a blueprint for constructing LLM-based agents, an experimental design, and a plan for data analysis. The fitted structural causal model becomes an object available for prediction or the planning of follow-on experiments. We demonstrate the approach with several scenarios: a negotiation, a bail hearing, a job interview, and an auction. In each case, causal relationships are both proposed and tested by the system, finding evidence for some and not others. We provide evidence that the insights from these simulations of social interactions are not available to the LLM purely through direct elicitation. When given its proposed structural causal model for each scenario, the LLM is good at predicting the signs of estimated effects, but it cannot reliably predict the magnitudes of those estimates. In the auction experiment, the in silico simulation results closely match the predictions of auction theory, but elicited predictions of the clearing prices from the LLM are inaccurate. However, the LLM’s predictions are dramatically improved if the model can condition on the fitted structural causal model. In short, the LLM knows more than it can (immediately) tell.

Large Language Models as Simulated Economic Agents: What we can learn from Homo Silicus

With Eaman Jahani, Joe Zhang, Hong-Yi TuYe, Mohammed Alsobay, Christos Nicolaides, Siddharth Suri, and David Holtz.
Information Systems Research, 2026.

We argue that newly-developed large language models (LLMs), because of how they are trained and designed, are implicit computational models of humans—a Homo silicus. LLMs can be used like economists use Homo economicus: they can be given endowments, information, preferences, and so on, and then their behavior can be explored in scenarios via simulation. Experiments using this approach, derived from Charness and Rabin (2002), Kahneman et al. (1986), Samuelson and Zeckhauser (1988), Oprea (2024b), and Horton (2025), show qualitatively similar results to the original, and when they differ, it is often generative for future research. We discuss potential applications, conceptual issues, and why this approach can inform the study of humans.

Prompt Adaptation as a Dynamic Complement in Generative AI Systems

Abstract Paper Press: MIT Sloan's Ideas Made to Matter Press: Marginal Revolution Press: Columbia Business School Insights

As generative AI systems rapidly improve, a key question emerges: how do users adapt to these changes, and when does such adaptation matter for realizing performance gains? This paper studies prompt adaptation—how users adjust their inputs in response to evolving model behavior—using a common experimental design applied to two preregistered tasks with 3,750 total participants who submitted nearly 37,000 prompts. We show that the importance of prompt adaptation depends critically on task structure. In a task with fixed evaluation criteria and an unambiguous goal, user prompt adaptation accounts for roughly half of the performance gains from a model upgrade. In contrast, in an open-ended creative task where the space of acceptable outputs is effectively unbounded and quality is subjective, performance improvements are driven primarily by model capability; prompt adaptation plays a limited role. We further show that automated prompt rewriting cannot generally substitute for human adaptation: when aligned with task objectives, it can modestly improve performance, but when misaligned, it can actively undermine the gains from model improvements. Together, these findings position prompt adaptation as a dynamic complement whose importance depends on task structure and system design, and suggest that without it, a substantial share of the economic value created by advances in generative models may go unrealized.

Self-supervised Preference Learning for Multimodal Foundation Models

With Akshata Tiwari, Jillian Ross, and Andrew Lo.
Under Review at NeurIPS.

With Angela L. Duckworth, Katherine L. Milkman, and 26 others.
Proceedings of the National Academy of Sciences, 2025.

Preference optimization for multimodal foundation models typically draws its signal from external judgment: human annotations, model-based scoring, or task-specific supervision. We propose a self-supervised alternative based on a simple principle: if two images depict similar underlying data, their descriptions should be similar. We instantiate this idea for time-series data rendered as charts—a setting that is simple, broadly important, and well-suited to clean downstream tasks. Two charts are neighbors if they trace similar patterns in the underlying data; we compute these neighborhoods directly from the data, generate candidate descriptions of each image and its neighbors using the model itself, and rank candidates by their agreement with the neighborhood to form preference pairs. We apply this method to direct preference optimization on two open-source foundation models, Qwen2-VL-7B and Gemma-3-4B, across three domains: financial price paths, electrocardiograms, and seismic traces. The learned signal transfers zero-shot to held-out tasks. Relative to the supervised fine-tuning baseline, the detection of large price-moves increases by 31%, crash-risk prediction by 14%, electrocardiogram abnormality detection by 10%, and seismic event detection by 3.3%. Probes show the gains reflect stronger alignment between visual inputs and textual outputs, and ablations confirm that neither generic preference optimization nor learned-feature neighborhoods produce the same gains.

A national megastudy shows that email nudges to elementary school teachers boost student math achievement, particularly when personalized

With Linnea Gandhi and Angela L. Duckworth.
Current Directions in Psychological Science, 2024.

In response to the alarming recent decline in US math achievement, we conducted a national megastudy in which 140,461 elementary school teachers who collectively taught 2,992,027 students were randomly assigned to receive a variety of behaviorally informed email nudges aimed at improving students' progress in math. Specifically, we partnered with the nonprofit educational platform Zearn Math to compare the impact of 15 different interventions with a reminder-only megastudy control condition. All 16 conditions entailed weekly emails delivered to teachers over 4 wk in the fall of 2021. The best-performing intervention, which encouraged teachers to log into Zearn Math for an updated report on how their students were doing that week, produced a 5.06% increase in students' math progress (3.30% after accounting for the winner’s curse). In exploratory analyses, teachers who received any behaviorally informed email nudge (vs. a reminder-only megastudy control) saw their students' math progress boosted by an average of 1.89% during the 4-wk intervention period; emails referencing personalized data (i.e., classroom-specific statistics) outperformed emails that did not by 2.26%. While small in size, these intervention effects were consistent across school socioeconomic status and school type (public, private, etc.) and, further, persisted in the 8-wk post-intervention period. Collectively, these findings underscore both how difficult it is to change behavior and the need for large-scale, rigorous, empirical research of the sort undertaken in this megastudy.

Effect Size Magnification. No Variable is as Important as the One You’re Thinking About—While You’re Thinking About It