AI Chatbots: Lost in Translation and Stuck on Retrieval

In a world increasingly shaped by AI, chatbots are becoming a major conduit for news delivery. But are they as reliable as we think? A 14-day evaluation from February 2026 sheds light on the capabilities and shortcomings of six AI chatbots. The six bots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, and GPT-4o mini) handled 2,100 questions based on same-day BBC News coverage across six regions. The results are a mixed bag.

High Accuracy, Hidden Bias

At first glance, the bots seem to perform well. Over 90% accuracy on multiple-choice questions about events reported just hours earlier sounds impressive. But look closer. When the evaluation shifts to free-response questions, accuracy drops by 11-13%, and across the cohort the drop is steeper still, at 16-17%. That's a substantial dip.

The real question is, who benefits from these systems? They're falling short in areas that matter. For instance, every model struggles most with Hindi queries: accuracy drops to 79%, compared with 89-91% for the other languages. Why? Anglophone retrieval bias. Even when answering Hindi questions, these models cite English Wikipedia more than any Hindi source. Whose data? Whose labor? Whose benefit?

Retrieval Missteps: The Real Culprit

The study reveals that retrieval failures drive over 70% of all errors. When these chatbots stumble, it's often because they can't find the right source, not because they can't reason through the facts. It's a bit like being a good student who can't find the textbook before the exam.
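To make the idea of an error breakdown concrete, here is a minimal sketch of how wrong answers might be tallied by the stage that failed. The category labels and counts are invented for illustration and are not the study's data or method; `failure_shares` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical error log: each wrong answer is tagged with the stage that failed.
# These labels and counts are illustrative only, not figures from the study.
errors = ["retrieval"] * 15 + ["reasoning"] * 4 + ["other"] * 1


def failure_shares(labels):
    """Return each failure category's share of all errors."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


shares = failure_shares(errors)
print(shares["retrieval"])  # fraction of errors traced to retrieval
```

A breakdown like this only means something if each error is labeled consistently, which is why the split between "couldn't find the source" and "couldn't reason about it" matters so much.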

And let's talk about subtle false premises. The chatbots achieved 88-96% accuracy on well-formed questions but fell to 19-70% on questions built around a misleading premise. The most vulnerable model accepted fabricated facts 64% of the time. This is more than a tech problem. It's a story about power, not just performance.

A Paradox of Detection

There's an interesting twist here: the detection-accuracy paradox. The best false-premise detector ranks second in adversarial accuracy, while a weaker detector ranks first. This shows that detecting false premises and recovering accurate answers aren't the same skill. It raises a critical question: Are we prioritizing the wrong capabilities in AI development?
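The paradox is easier to see with a toy ranking. The sketch below uses made-up model names and scores (none of them come from the study) to show how ordering the same models by detection rate and by adversarial accuracy can produce different leaders.

```python
# Hypothetical scores, invented purely for illustration (not the study's figures).
scores = {
    "model_a": {"detection": 0.90, "adversarial_accuracy": 0.65},
    "model_b": {"detection": 0.75, "adversarial_accuracy": 0.70},
    "model_c": {"detection": 0.60, "adversarial_accuracy": 0.40},
}


def rank_by(metric):
    """Return model names ordered best-first on the given metric."""
    return sorted(scores, key=lambda name: scores[name][metric], reverse=True)


detection_rank = rank_by("detection")
accuracy_rank = rank_by("adversarial_accuracy")
print(detection_rank)
print(accuracy_rank)
```

In this toy case the strongest detector is not the most accurate model under adversarial questions, which is the same shape of result the evaluation describes.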

Ultimately, the study suggests that high accuracy masks deeper issues. These chatbots rely heavily on retrieval infrastructure, and their performance can be uneven, especially for non-English queries. As we increasingly depend on AI for information, it's essential to address these gaps. Whose needs are we really catering to? The benchmark doesn't capture what matters most.