© 2026 NervNow™. All rights reserved.

Why Does Every AI Chatbot Seem to Give the Same Advice? The Artificial Hivemind Effect, Explained
When different AI models, trained by different companies, keep giving you the same answer, that is not a coincidence. In this piece, Chetanya Puri, Senior Machine Learning Engineer at CluePoints, explains why researchers now have a name for it, a dataset to measure it, and evidence that the problem runs deeper than anyone's system prompt.

Ask two different AI chatbots to write a metaphor about time. Both will almost certainly reach for the same handful of images: time as a river, time as a thief, time as a weaver pulling threads. The phrasing shifts slightly between them, but the underlying idea rarely does. Most people chalk this up to coincidence or confirmation bias, a pattern their brain is inventing. A NeurIPS 2025 Best Paper argues it is something more systematic, and considerably more consequential.
The paper introduces a term for the pattern: the Artificial Hivemind effect. On prompts where dozens of valid answers exist, language models tend to collapse into unusually narrow output distributions. The uncomfortable finding is that this holds across models built by different companies, trained on different data, and evaluated on different benchmarks. Switching to a different AI turns out to be far less of an escape hatch than most users assume.
The researchers draw a clean distinction between two related phenomena. Intra-model repetition describes a single model that, when sampled many times on the same prompt, keeps circling the same semantic territory. Inter-model homogeneity describes different models independently landing on strikingly similar answers. The first is a nuisance. The second is the one worth worrying about, because it suggests the problem is structural rather than something a better model will simply fix.
How researchers measured the collapse
Most standard benchmarks are designed around prompts with one correct answer: trivia, mathematics, code, factual retrieval. The Artificial Hivemind paper focuses instead on the messier prompts that dominate real usage: brainstorming, rewriting, advice, creative help, personal decisions. To study these at scale, the authors constructed Infinity-Chat, a dataset of roughly 26,000 real-world open-ended queries drawn from actual user interactions, annotated with 31,250 human ratings and pairwise preferences. The NeurIPS committee summary notes evaluation across more than 70 models, and the dataset is taxonomized into six top-level categories and 17 subcategories, which matters because open-ended is not one thing. A brainstorm prompt and a personal advice prompt fail in different ways when models converge.
The core methodology samples many responses per prompt, embeds them, and measures average pairwise semantic similarity within the response set. A diverse model would show low similarity across 50 samples of the same prompt. What the researchers find instead: in 79 percent of prompts, average similarity across sampled responses exceeds 0.8, even under sampling settings specifically designed to increase variety. Cross-model analysis finds substantial overlap between outputs from different models, including noticeable phrase-level repetition on prompts that should theoretically support enormous answer diversity.
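The similarity metric at the heart of this methodology is simple enough to sketch. Below is a minimal version, assuming the sampled responses have already been embedded into vectors; the function name and the toy vectors are illustrative, not taken from the paper's code:

```python
import numpy as np

def mean_pairwise_similarity(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all unordered pairs of response embeddings."""
    # Normalise each embedding to unit length so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                 # full n-by-n similarity matrix
    iu = np.triu_indices(len(embeddings), k=1)  # each pair counted once, diagonal excluded
    return float(sims[iu].mean())

# Toy check: near-identical vectors score close to 1.0, orthogonal ones score 0.
tight = np.array([[1.0, 0.01], [1.0, 0.02], [0.99, 0.0]])
spread = np.array([[1.0, 0.0], [0.0, 1.0]])
print(mean_pairwise_similarity(tight))   # close to 1.0
print(mean_pairwise_similarity(spread))  # 0.0
```

In the paper's framing, a score above 0.8 on a set of 50 samples is the signature of collapse: the model keeps returning semantic near-duplicates even when decoding settings are tuned for variety.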
Why the convergence happens without any coordination
The effect does not require any coordination; it emerges from several ordinary incentives stacking on top of each other. Training data clusters around mainstream phrasing because the internet has a centre of gravity: repeated advice, conventional wisdom, average-shaped sentences. Gradient updates reward patterns seen often, so models absorb the mean of the distribution and systematically under-learn the edges.
Preference tuning compounds this further. Reinforcement Learning from Human Feedback rewards outputs that most annotators will not actively dislike, which means outputs carrying strong opinions, unusual framings, or sharp humour get deprioritised during alignment because they generate rater disagreement. An ICLR 2024 analysis of RLHF explicitly documents this tradeoff: alignment improves generalisation but measurably reduces output diversity. A 2025 paper on what researchers call typicality bias goes further, arguing that annotators systematically prefer familiar, conventional text, and that this preference is a direct driver of mode collapse after fine-tuning.
Product defaults do the rest. Low temperature settings, conservative decoding, and safety filters all suppress long-tail outputs, pushing multiple runs of the same model, across multiple products built on the same underlying weights, into the same narrow probability basin. The final layer is competitive: everyone is optimising for the same scorecard. Benchmarks reward coherence, harmlessness, and fluency more than novelty, so teams build toward what gets measured, and the polished-assistant voice becomes a gravitational attractor across the entire industry.
What it looks like in practice
The effect is easy to demonstrate with everyday prompts. Ask for ten creative weekend date ideas and almost every model returns the same list: picnic, museum, hike, cooking class, movie night, board games, coffee shop, local market, sunset walk, try a new restaurant. The outputs are competent and inoffensive, but also indistinguishable from one another.
Reframe the same request with hard constraints (under €20, works in bad weather, no food involved, doable in a small apartment, six ideas genuinely different from each other, each with an explanation of why an assistant would not suggest it first) and the default list collapses entirely. The constraints eliminate the basin the model was defaulting to, and the instruction to explain its own avoidance forces something closer to genuine reasoning.
The same dynamic plays out in advice prompts. “I feel stuck in my career. What should I do?” reliably produces: reflect on your goals, learn new skills, update your resume, network, set short-term milestones. Every word of that list is defensible and very little of it is useful. Demanding a different structure (five diagnostic questions first, then two 30-day plans with named failure modes) produces something with actual edges to it. Generic advice avoids sharp edges by design, and the only way past that is to make the prompt demand them explicitly.
What you can do about it
For anyone running AI in a production or research workflow, treating diversity as an explicit metric rather than a vague aspiration is the practical takeaway from this paper. The methodology is straightforward to replicate at small scale. Pick a prompt that repeats in your domain, sample it multiple times across different models, embed the outputs, and measure average pairwise similarity. Heavy overlap means you are sampling the centre of the distribution. That is not always a problem, but it explains why model comparisons so often feel like evaluating near-identical products despite very different marketing.
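Those steps can be wired into a small end-to-end sketch. Everything below is a toy stand-in: the model names, the canned responses, and the character-trigram "embedding" are illustrative only, and in practice you would replace the two stub functions with calls to your chat provider and a real embedding model:

```python
import hashlib
import numpy as np

def sample_model(model: str, prompt: str, n: int = 5) -> list[str]:
    # Toy stub: a collapsed model repeats near-identical answers; a varied one does not.
    if model == "collapsed-model":
        return ["Time is a river that carries us along."] * n
    return [f"variant {i}: {hashlib.md5(f'{model}{i}'.encode()).hexdigest()}"
            for i in range(n)]

def embed(texts: list[str]) -> np.ndarray:
    # Toy stub: crude character-trigram hashing, for demonstration only.
    vecs = np.zeros((len(texts), 64))
    for row, text in enumerate(texts):
        for i in range(len(text) - 2):
            vecs[row, hash(text[i:i + 3]) % 64] += 1.0
    return vecs

def mean_similarity(model: str, prompt: str, n: int = 5) -> float:
    # Sample n responses, embed them, average cosine similarity over all pairs.
    vecs = embed(sample_model(model, prompt, n))
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    iu = np.triu_indices(n, k=1)
    return float((vecs @ vecs.T)[iu].mean())

for model in ["collapsed-model", "varied-model"]:
    score = mean_similarity(model, "Write a metaphor about time.")
    print(f"{model}: mean pairwise similarity {score:.3f}")
```

Run against real endpoints, the same loop gives you a per-prompt, per-model collapse score you can track over time, rather than an impression that "everything sounds the same."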
Beyond measurement, the interventions that tend to help share a common logic: push the model away from its default basin without sacrificing coherence. Ask for strategies that cannot all be true at once. Request anti-examples alongside recommendations. Force constraints that change the shape of the answer rather than just the surface of it. Separate ideation from verification so the model can be expansive first and precise later.
There is a prompt that cuts to the heart of all of this. Ask any model: give me a genuinely unusual idea that still works in the real world, then explain why assistants tend to avoid suggesting it. If it cannot articulate the trap, it is probably still inside it.
Chetanya Puri is a Senior Machine Learning Engineer at CluePoints in Brussels, Belgium, where his work spans machine learning and natural language processing. Previously, he was an Early Stage Researcher and PhD candidate at KU Leuven, focused on scalable machine learning under limited data conditions and time-series analysis. He has held industry research roles at Philips Research and TCS, with a technical background spanning Bayesian methods, NLP, and time-series analysis.