-snip-
In any article that comes across WIRED's fact-checking desk, there's usually a decent amount of b-matter: statistics, news events, quotes, anything that helps contextualize the topic. Fact-checkers tend to Google this basic information, and that process, in the form of the search engine's dreaded AI Overviews, constitutes my main interaction with AI. In my professional opinion, it's unusable (wrong) about a third of the time.
This might be a generous assessment, though. A March 2025 study from the Tow Center for Digital Journalism found that more than 60 percent of responses from AI-powered search engines were inaccurate. A BBC study puts the wrongness of chatbots closer to 45 percent, the number I see cited more often. Because percentages are distancing, let me put this more plainly: AI could be wrong about half the time.
Does it matter which model? Elon Musk has said Grok is the smartest, but I haven't seen much research that agrees. Claude led the pack in RealFactBench, a fact-checking-focused benchmark test developed by computer scientists in China and the UK last year. It scored 73 percent accuracy across all metrics. (To be fair, Grok was not assessed.) Another benchmark, SimpleQA, developed by OpenAI in October 2024, posed more than 4,000 single-answer questions to models from OpenAI and Anthropic. None of the models exceeded 50 percent accuracy. Google updated the benchmark earlier this year, winnowing the question set to 1,000. Gemini 2.5 Pro came out on top, with 55.6 percent accuracy.
Then there's the models' own assessments. When I asked ChatGPT how accurate the major LLMs are, it told me that most models had 90 to 96 percent accuracy on some professional-style tests. It then offered a link, confusingly, to a paper on a sleep medicine certification exam. On general real-world questions, it simply offered me the rate at which models like it have been shown to hallucinate: 1 to 2 percent, apparently, though when I tried to click through to that referenced source, it didn't exist.
-snip-