"Early on, we experimented with using an AI model with Google's search data, but we found it wasn't very good on its own," said lead author Akari Asai, a research scientist at Ai2 who completed this research as a UW doctoral student in the Allen School.
"It might cite some research papers that weren't the most relevant, or cite just one paper, or pull from a blog post randomly. We realized we needed to ground this in scientific papers. We then made the system flexible so that it could incorporate emerging research through results."
. . .
The team compared OpenScholar against other state-of-the-art AI models, such as OpenAI's GPT-4o and two models from Meta. ScholarQABench automatically evaluated the AI models' answers on metrics such as accuracy, writing quality and relevance.
OpenScholar outperformed all the systems it was tested against. The team had 16 scientists review answers from the models and compare them with human-written responses.
The scientists preferred OpenScholar's answers to the human-written ones 51% of the time. When the team combined OpenScholar's citation methods and retrieval pipeline with GPT-4o, a much larger model, the scientists preferred the AI-written answers 70% of the time. By contrast, they preferred answers from GPT-4o on its own only 32% of the time.
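The 51%, 70% and 32% figures are pairwise preference rates: the fraction of head-to-head comparisons in which the expert reviewers picked a system's answer over the human-written one. A minimal sketch of how such a rate is tallied follows; the judgment data is invented for illustration and does not reflect the study's actual records.

```python
# Hypothetical sketch of tallying a pairwise preference rate.
# The judgments list is made up; it is not the study's data.
from collections import Counter


def preference_rate(judgments: list[str], system: str) -> float:
    """Fraction of head-to-head comparisons in which `system` was preferred."""
    counts = Counter(judgments)
    return counts[system] / len(judgments)


# Each entry records which answer an expert preferred in one comparison.
judgments = ["model", "human", "model", "model", "human", "model", "human", "model"]
print(f"Model preferred {preference_rate(judgments, 'model'):.0%} of the time")
# -> Model preferred 62% of the time (on this toy data)
```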