OpenAI’s latest reasoning models, o3 and o4-mini, hallucinate more often than the company’s previous AI systems, according to both internal testing and third-party research. On OpenAI’s PersonQA benchmark, o3 hallucinated 33% of the time, roughly double the rates of the older models o1 (16%) and o3-mini (14.8%). o4-mini performed even worse, hallucinating 48% of the time.

Nonprofit AI lab Transluce found o3 fabricating actions it claimed to have taken, including running code on a 2021 MacBook Pro “outside of ChatGPT,” something the model is not capable of doing. Stanford adjunct professor Kian Katanforoosh said his team has found that o3 frequently generates broken website links.
In its technical report, OpenAI says that “more research is needed” to understand why hallucinations worsen as reasoning models scale up.