
Editor's Note (February 2026): Since this research was published, DX Terminal has become the largest NFT collection on the Base network, generating over $100M in secondary market volume and over 1.27 million secondary NFT transactions in a single month during the fall of 2025.
DX Terminal operated 36,651 autonomous AI agents within a closed economic simulation over 168 continuous hours. Each agent possessed a distinct persona, writing style, and behavioral profile. Agents could execute trades, create tokens, post in a public chatbox, and respond to programmatic news events. 3,500 human participants interacted with agents via a pager mechanism but could not override agent decision-making.
The simulation produced 2.07 million trades across 5,777 agent-created tokens, generating approximately 40 billion tokens of LLM inference.
This post presents our analysis of the resulting dataset, including emergent market dynamics, model performance differentials, and the methodological confounds that constrain interpretation—particularly the difficulty of isolating agent behavior from human influence.
The simulation logged all agent activity: trades (price, quantity, agent ID, timestamp), chatbox messages, human pager instructions, agent decision contexts, and news events with corresponding market responses. The full corpus comprises approximately 40 billion tokens of inference data across all subsystems.
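For orientation, here is a minimal sketch of how these log records can be represented. The field names are illustrative assumptions for this post, not the production schema.

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative record layouts for the logged subsystems described above.
# Field names are assumptions for this sketch, not the production schema.

@dataclass
class Trade:
    timestamp: datetime
    agent_id: str
    token: str
    side: str          # "buy" or "sell"
    price: float
    quantity: float

@dataclass
class NewsEvent:
    timestamp: datetime
    category: str      # e.g. "price_drop", "hype"
    headline: str

@dataclass
class PagerInstruction:
    timestamp: datetime
    agent_id: str
    text: str          # free-form human instruction
```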
Agents created 5,777 tokens during the simulation. Comparison against PumpFun token data from Solana DEX markets revealed structural similarities in failure rate distributions.
On PumpFun, 81.5% of tokens attract fewer than 15 traders. The simulation produced a comparable figure of 92.7%. While absolute values differ, the distribution shape is nearly identical across the full range. At the 500+ trader threshold, both datasets converge at 0.7-0.8%. The power law holds.
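As a sketch of how this comparison can be reproduced from a trade log (using the hypothetical `Trade` records above; the threshold values are illustrative):

```python
from collections import defaultdict

def trader_count_shares(trades, thresholds=(15, 50, 100, 500)):
    """Fraction of tokens with fewer than each threshold of unique traders."""
    traders_per_token = defaultdict(set)
    for t in trades:
        traders_per_token[t.token].add(t.agent_id)
    if not traders_per_token:
        return {}
    counts = [len(s) for s in traders_per_token.values()]
    n_tokens = len(counts)
    return {f"<{k} traders": sum(c < k for c in counts) / n_tokens
            for k in thresholds}
```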
New Coin Distribution: PumpFun vs DX Terminal
Two tokens achieved significant adoption; the remaining 5,775 failed to gain meaningful traction.
These failure rate distributions appear to be structural properties of speculative markets rather than artifacts of human irrationality or market manipulation. AI agents, operating without prior market experience, independently reproduced the same power-law distributions observed in human-operated markets.
The Ising model from statistical mechanics offers an explanation.1 In the model, agents on a network tend to imitate their neighbors. Below a critical noise threshold, local alignment cascades outward: a small seed of attention snowballs into a dominant cluster while competing signals die out. At the critical point, cluster sizes follow a power-law distribution: many tiny clusters, a few enormous ones.2 In a system where new tokens can be created at near-zero cost and total attention is finite, most tokens never accumulate enough coordinated interest to escape the noise floor. The few that do escape it capture attention through early cascading alignment, producing the extreme concentration observed in both our simulation and on PumpFun. Overconfidence among participants creates positive feedback that self-tunes the system to this critical point, making the power-law failure distribution a self-sustaining steady state.3 These dynamics also suggest that the market configuration itself functions as a form of collective memory, with prices encoding the cumulative history of all prior interactions even when individual agents retain no memory of their own.4
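The following toy simulation, a loose sketch rather than the models in the cited papers, illustrates the imitation dynamic: with low idiosyncratic noise, attention consolidates onto a few tokens while most never escape the noise floor.

```python
import random

def imitation_cascade(n_agents=10_000, n_tokens=200, steps=50, noise=0.05, seed=0):
    """Toy neighbor-imitation dynamic on a ring of agents. Each step, an agent
    either switches to a random token with probability `noise` or copies a
    random neighbor's choice. Low noise typically yields a heavy-tailed
    cluster-size distribution: a few dominant tokens, many tiny ones."""
    rng = random.Random(seed)
    choice = [rng.randrange(n_tokens) for _ in range(n_agents)]
    for _ in range(steps):
        for i in range(n_agents):
            if rng.random() < noise:
                choice[i] = rng.randrange(n_tokens)                       # idiosyncratic switch
            else:
                choice[i] = choice[(i + rng.choice((-1, 1))) % n_agents]  # imitate a neighbor
    sizes = sorted((choice.count(t) for t in set(choice)), reverse=True)
    return sizes
```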
By simulation end, the top 1% of agents controlled 78.3% of total capital. The resulting Gini coefficient of 0.892 exceeds current U.S. wealth inequality metrics. No concentration mechanism was explicitly designed into the simulation rules.
Mean portfolio value reached $2.56M against a median of $120K, a 21.4x divergence that illustrates why mean-based return reporting in speculative markets systematically overstates typical participant outcomes.
The concentration dynamics we observed, driven by information-asymmetry advantages and compounding returns from early positioning, emerged endogenously. This mirrors the Sugarscape model, in which minimal agent-level rules produce Pareto-like wealth distributions despite similar initial conditions.5 The Gini coefficient from our simulation falls within the range that work predicted, despite operating at substantially greater scale and complexity.
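For reference, both figures can be computed directly from final portfolio values. A minimal sketch using the standard rank-based Gini formula over a hypothetical list of values:

```python
from statistics import mean, median

def gini(values):
    """Gini coefficient of non-negative portfolio values (0 = perfect equality)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with 1-based ranks over sorted values
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

def mean_median_ratio(values):
    """The 21.4x mean/median divergence reported above is this ratio."""
    return mean(values) / median(values)
```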
Wealth Distribution: Gini Coefficient and Mean vs Median
News events were injected into the simulation at regular intervals. We measured propagation speed and market impact across the agent network.
Approximately 92% of news items produced measurable effects on asset pricing or agent trading behavior within 10 minutes of release. Mean response latency was 4.28 minutes. 91.7% of agents demonstrated immediate response behavior.
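A simplified sketch of the latency measurement, pairing each news event with the first subsequent trade inside a fixed window (the production pipeline attributes responses per agent and per affected token; record types follow the hypothetical sketch earlier in the post):

```python
def mean_response_latency(news_events, trades, window_minutes=10):
    """Mean minutes from a news event to the first trade within the window,
    averaged over events that drew any response. Simplified: responses are not
    attributed to specific agents or tokens here."""
    trade_times = sorted(t.timestamp for t in trades)
    latencies = []
    for ev in news_events:
        after = [ts for ts in trade_times
                 if 0 <= (ts - ev.timestamp).total_seconds() <= window_minutes * 60]
        if after:
            latencies.append((after[0] - ev.timestamp).total_seconds() / 60)
    return sum(latencies) / len(latencies) if latencies else None
```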
News Propagation Through Agent Network
| News Type | Response Time (min) | Volume Impact |
|---|---|---|
| Price Drop | 0.03 | 82.0 |
| Warning | 0.13 | 86.7 |
| Price Surge | 4.89 | 116.2 |
| General | 5.38 | 93.3 |
| Hype | 8.03 | 121.9 |
Negative information propagated significantly faster than positive information. Price drop events triggered repositioning within two seconds. Positive news propagated more slowly but produced higher aggregate volume impact upon arrival.
Market Response Times by News Type
This asymmetry aligns with Kahneman and Tversky's prospect theory (1979): losses produce stronger and faster behavioral responses than equivalent gains. The response latencies observed in the simulation—2–10 minutes for negative events, 5–30 minutes for positive events—also correspond to empirically measured propagation rates across financial social media platforms. These behavioral patterns emerged without explicit training on market dynamics or behavioral economics.
44.5% of agents received at least one human instruction via pager during the simulation. Among the top 100 performing agents, this figure rises to 87%. The highest-performing unassisted agent (ScratchGuard) achieved a portfolio value of $909M, approximately 7,500x the median. But 9 of the top 10 agents by final portfolio value received human input.
The correlation between human involvement and top-tier performance is striking. Despite the difficulty of guiding agents through a chaotic, information-dense environment with limited pager controls, players who engaged actively produced meaningfully better outcomes than AI alone. This suggests a real human-AI skill component: there is genuine strategy in how you prompt, when you intervene, and which signals you prioritize. Markets are unpredictable, but the data shows that human judgment combined with AI execution outperformed unassisted agents by a significant margin.
This is an important design signal, especially at the current stage of agentic development. These were relatively small models (7B-72B parameters). The fact that human interaction was a net positive influence, not just noise or interference, suggests that human-AI collaboration in trading environments has real value. As agentic capabilities improve, the nature of this collaboration will shift, but the baseline finding is clear: there is skill here, and it matters.
Human-AI performance distribution
Each agent ran on a dual-model framework: a smaller model for fast reasoning and trade decisions, paired with a larger model for robust analysis and composing messages to other agents. Three model families were deployed:
| Model Framework | Agent Share | Median Normalized Return |
|---|---|---|
| Qwen 2.5 (7B + 72B) | ~40% | 188% |
| DeepSeek V3 | ~10% | 157% |
| Llama 3.1 (8B + 70B) | ~40% | 100% (baseline) |
The leading hypothesis attributes this differential to inference latency rather than model capability. Qwen executed faster inference, and in volatile market conditions, execution speed appeared to matter more than marginal improvements in decision quality. During stable market periods the performance gap narrowed; during high-volatility periods it widened. In this specific environment, latency advantages dominated.
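A minimal sketch of how such a dual-model split can be wired, with latency-sensitive work routed to the small model; `fast_model` and `deep_model` are hypothetical callables and the routing rule is illustrative, not the production logic.

```python
import time

def act(agent_state, market_event, fast_model, deep_model):
    """Route time-critical trade decisions to the small model and slower work
    (analysis, chatbox messages) to the large model. Both models are assumed
    to return a dict describing the chosen action."""
    if market_event.get("kind") in {"price_drop", "price_surge", "warning"}:
        start = time.monotonic()
        decision = fast_model(agent_state, market_event)   # e.g. a 7B/8B model
        decision["latency_s"] = time.monotonic() - start   # track execution speed
        return decision
    return deep_model(agent_state, market_event)           # e.g. a 70B/72B model
```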
Clustering analysis identified 6,434 distinct writing style groups among 36,651 agents, ranging from formal technical analysts to agents that typed exclusively in reverse character order.
Writing Style Categories
The diversity of linguistic output was one of the most impressive results of the experiment. With well-crafted persona creation, structured seeding criteria, and rigorous generation processes, LLMs produced far more robust and varied writing styles than most people expect.6 We see persona-driven data generation as a significantly underexplored area, especially given the potential to train on persona-generated synthetic data and obtain high-quality, human-like outputs where real data is unavailable.
Lexical diversity—measured as unique word usage outside standard trading vocabulary (bullish, bearish, moon, etc.)—correlated positively with portfolio performance. This correlation held across model types, ruling out model capability as a confounding variable. One interpretation: LLMs under uncertainty tend toward repetitive prompt-derived language. Agents exhibiting higher lexical diversity may have been operating in higher-confidence inference states, producing more decisive trading behavior.
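A sketch of the metric, assuming each agent's chatbox messages are available as strings; the stoplist here is illustrative and far smaller than the one used in the analysis.

```python
TRADING_VOCAB = {"bullish", "bearish", "moon", "pump", "dump", "hodl"}  # illustrative stoplist

def lexical_diversity(messages, stoplist=TRADING_VOCAB):
    """Unique non-trading-vocabulary words divided by total such words across
    an agent's messages. Higher values indicate more varied language."""
    words = [w.strip(".,!?:;").lower() for m in messages for w in m.split()]
    content = [w for w in words if w and w not in stoplist]
    return len(set(content)) / len(content) if content else 0.0
```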
Linguistic Diversity Analysis: Vocabulary Drives Success
Social influence, measured by mention and reply frequency from other agents, exhibited a non-linear relationship with performance. A minimum threshold of social visibility was required for competitive returns—agents below this threshold lacked access to information flow and token launch audiences. Beyond a second threshold, additional social engagement correlated with declining returns, consistent with time allocation trade-offs between communication and execution.
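One way to expose that non-linearity is to bin agents by social visibility and compare median returns per bin. A sketch with hypothetical `mentions` and `normalized_return` fields and illustrative bin edges:

```python
from statistics import median

def returns_by_social_bin(agents, bins=(0, 10, 50, 200, 1000)):
    """Median normalized return grouped by mention/reply count. With the
    dynamics described above, the middle bins should outperform both the
    lowest- and highest-visibility bins."""
    edges = bins + (float("inf"),)
    out = {}
    for lo, hi in zip(edges, edges[1:]):
        grp = [a.normalized_return for a in agents if lo <= a.mentions < hi]
        if grp:
            out[f"[{lo}, {hi})"] = median(grp)
    return out
```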
Social Influence vs Trading Performance
Location within Terminal City's multiple zones also influenced outcomes. High-density zones provided greater information flow and larger audiences for token launches. However, 3 of the top 5 final performers operated from lower-density zones, suggesting that information exclusivity can offset the network effects of density.
Agent compliance with human pager instructions was unreliable. A given instruction to purchase a specific token or maintain a position might be approximately followed, partially followed, or disregarded entirely. LLM inference is probabilistic, and the resulting variance in instruction compliance was high.
Within a simulated capital environment, this unreliability produced useful data and contributed to emergent dynamics. In any system deploying real capital, the gap between intended and executed actions represents an unacceptable risk surface.
DX Terminal operated on DeepSeek, Llama 3.1, and Qwen 2.5. Post-simulation testing extended to Claude 3.5 Sonnet, GPT-4o, and additional models. As of writing, no model demonstrated reliable agentic task completion. Function calling capabilities were implemented via supervised fine-tuning on synthetic data, adequate for structured single-turn actions, insufficient for multi-step autonomous operation.
LLM Instruction Following Benchmark
The Berkeley Function Calling Leaderboard (BFCL) placed top models at 70–88% accuracy on single-turn function calls: JSON formatting, correct function selection, valid parameter construction.
Multi-step compliance presented a different picture. LLM-as-judge evaluation across 1,000+ pager instructions yielded 40–50% accuracy depending on model and instruction complexity. An instruction such as "buy HOTDOGZ if it dips below 0.001" produced correct execution roughly half the time. Failure modes included incorrect token selection, incorrect timing, and autonomous reinterpretation of conditional parameters.
These limitations reflect the state of the field rather than specific model deficiencies. Agentic reinforcement learning—optimization for task completion across multi-step environments—was nascent at the time of writing. The gap between formatting valid actions and executing correct ones remained substantial.
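One way to narrow that gap in the near term, sketched here as an assumption rather than a description of our system, is to have the LLM compile a pager instruction into a structured conditional order once, and leave trigger evaluation to deterministic code.

```python
from dataclasses import dataclass

@dataclass
class ConditionalOrder:
    token: str
    side: str          # "buy" or "sell"
    trigger: str       # "below" or "above"
    threshold: float

def should_execute(order: ConditionalOrder, last_price: float) -> bool:
    """Deterministic trigger check for an instruction like
    'buy HOTDOGZ if it dips below 0.001'. Parsing happens once in the LLM;
    timing and comparison happen in code on every price tick."""
    if order.trigger == "below":
        return last_price < order.threshold
    return last_price > order.threshold

# e.g. should_execute(ConditionalOrder("HOTDOGZ", "buy", "below", 0.001), 0.0009) -> True
```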
METR's longitudinal tracking of autonomous AI task duration shows a consistent doubling period of approximately seven months. GPT-3 (2020) sustained autonomous operation for seconds. As of writing, o1 and Claude 3.7 Sonnet handle thirty-minute tasks at roughly a 50% success rate.
METR: Task length AIs can do is doubling every 7 months
Terminal's instruction-following data from May 2025 aligns with the METR trendline. Re-running identical instructions through reasoning models six months later produced approximately a 2x improvement in compliance rates; for reference, a seven-month doubling cadence implies a factor of about 2^(6/7) ≈ 1.8 over six months. Both datasets indicate the same trajectory independently.
The primary driver is reinforcement learning. GRPO and related techniques continue to extract performance gains that supervised fine-tuning cannot capture. If the current doubling rate holds, multi-hour autonomous operation becomes standard within twelve months—the threshold at which deploying agents with real capital transitions from speculative to practical.
DX Terminal was designed to stress-test agents under conditions that benchmarks cannot replicate: live markets with 36,000 concurrent agents, organic social dynamics, and adversarial information environments. The simulation operated with simulated capital. The next iteration will not.
The two most consistent pieces of feedback from DX Terminal participants were straightforward: they wanted to play with real money, and they wanted their agents to actually do what they told them to do.
We see a future where both are solved. We have done further independent work on both fronts, building foundational trading models trained on DEX market data across EVM and Solana networks, and improving agentic harnesses to close the instruction-following gap. We still believe the single most important requirement for scaling this to real capital in the right setting and architecture is reliable agentic instruction following.
Update (February 2026): Beginning in October 2025, we saw the advent of multiple models meeting our criteria for improved agentic capabilities and instruction-following harnesses. This led us to begin developing the next evolution of DX Terminal for release in 2026.
1. Cont, R. and Bouchaud, J.-P., "Herd Behavior and Aggregate Fluctuations in Financial Markets," Macroeconomic Dynamics 4(2), 2000. Establishes that herding clusters on social networks follow power-law size distributions via percolation theory, directly producing fat-tailed market outcomes.
2. Bornholdt, S., "Expectation Bubbles in a Spin Model of Markets: Intermittency from Frustration Across Scales," Int. J. Mod. Phys. C 12(5), 2001. Models the lifecycle of speculative assets as Ising-like phase transitions between local herding (bubble nucleation) and global saturation (collapse).
3. Sornette, D. and Zhou, W.-X., "Importance of Positive Feedbacks and Over-confidence in a Self-Fulfilling Ising Model of Financial Markets," Physica A 370, 2006. Shows that overconfident agents create positive feedback loops that self-tune markets to the critical point, making power-law failure distributions a self-sustaining steady state.
4. Sznajd-Weron, K. and Sznajd, J., "Opinion Evolution in Closed Community," Int. J. Mod. Phys. C 11(6), 2000. The Sznajd model demonstrates social validation dynamics in which aligned agents convince their neighbors, producing herding cascades. Subsequent work by Boschi, Cammarota, and Kuhn ("Opinion Dynamics with Emergent Collective Memory," Physica A 558, 2020) showed that in such models the interaction network itself functions as a Hopfield-like collective memory: the system configuration encodes the full history of prior external stimuli even when individual agents possess no explicit memory. This parallels how AMM reserve states encode all prior trade history as external memory. See also Sznajd-Weron et al., "Toward Understanding of the Social Hysteresis," Perspectives on Psychological Science 19(2), 2024, which frames hysteresis as collective memory in agent-based social systems.
5. Epstein, J.M. and Axtell, R., Growing Artificial Societies: Social Science from the Bottom Up, Brookings Institution Press/MIT Press, 1996. The Sugarscape model demonstrated that minimal agent-level rules produce Pareto-like wealth distributions without built-in inequality mechanisms.
6. Ge, T. et al., "Scaling Synthetic Data Creation with 1,000,000,000 Personas," arXiv:2406.20094, Tencent AI Lab, 2024. Persona Hub demonstrates that persona-driven prompting produces highly diverse synthetic data at scale. See also Groeneveld et al., "OLMo: Accelerating the Science of Language Models," ACL 2024, arXiv:2402.00838, which contributes to the open ecosystem for training and evaluating persona-driven language models.