Amid today’s AI boom, it’s disconcerting that we still don’t know how to measure how smart, creative, or empathetic these systems are. Our tests for these traits, never great in the first place, were made for humans, not AI. Plus, our recent paper testing prompting techniques finds that AI test scores can change dramatically based simply on how questions are phrased. Even famous challenges like the Turing Test, where humans try to differentiate between an AI and another person in a text conversation, were designed as thought experiments at a time when such tasks seemed impossible. But now that a new paper shows that AI passes the Turing Test, we have to admit that we don’t really know what that means.
So, it should come as little surprise that one of the most important milestones in AI development, Artificial General Intelligence, or AGI, is badly defined and much debated. Everyone agrees that it has something to do with the ability of AIs to perform human-level tasks, though no one agrees whether this means expert or average human performance, or how many tasks and which kinds an AI would need to master to qualify. Given the definitional morass surrounding AGI, illustrating its nuances and history, from its precursors to its initial coining by Shane Legg, Ben Goertzel, and Peter Voss to today, is challenging. As an experiment in both substance and form (and speaking of potentially intelligent machines), I delegated the work entirely to AI. I had Google Deep Research put together a really solid 26-page summary on the topic. I then had HeyGen turn it into a video podcast discussion between a twitchy AI-generated version of me and an AI-generated host. It’s not actually a bad discussion (though I don’t fully agree with AI-me), but every part of it, from the research to the video to the voices, is 100% AI-generated.
Given all this, it was interesting to see this post by influential economist and close AI observer Tyler Cowen declaring that o3 is AGI. Why might he think that?
First, a little context. Over the past couple of weeks, two new AI models, Gemini 2.5 Pro from Google and o3 from OpenAI, were released. These models, along with a set of slightly less capable but faster and cheaper models (Gemini 2.5 Flash, o4-mini, and Grok-3-mini), represent a pretty large leap in benchmarks. But benchmarks aren’t everything, as Tyler pointed out. For a real-world example of how much better these models have gotten, we can turn to my book. To illustrate a chapter on how AIs can generate ideas, a little over a year ago I asked ChatGPT-4 to come up with marketing slogans for a new cheese shop:
Today I gave the latest successor to GPT-4, o3, an ever so slightly more involved version of the same prompt: “Come up with 20 clever ideas for marketing slogans for a new mail-order cheese shop. Develop criteria and select the best one. Then build a financial and marketing plan for the shop, revising as needed and analyzing competition. Then generate an appropriate logo using image generator and build a website for the shop as a mockup, making sure to carry 5-10 cheeses that fit the marketing plan.” With that single prompt, in less than two minutes, the AI not only provided a list of slogans, but ranked and selected an option, did web research, developed a logo, built marketing and financial plans, and launched a demo website for me to react to. The fact that my instructions were vague, and that common sense was required to make decisions about how to address them, was not a barrier.
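If you want to experiment with the same idea from code rather than the ChatGPT interface, here is a minimal sketch using the OpenAI Python SDK. It assumes you have API access to an “o3” model; note that the agentic extras you see in ChatGPT (web search, image generation, the demo website) are part of the product experience and are not reproduced by a bare API call like this one.

```python
# A minimal sketch of sending a similar single prompt to o3 through the
# OpenAI Python SDK. Assumes API access to the "o3" model; the tool use
# shown in ChatGPT (web search, image generation, building a demo site)
# is not reproduced by this bare API call.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Come up with 20 clever ideas for marketing slogans for a new "
    "mail-order cheese shop. Develop criteria and select the best one. "
    "Then build a financial and marketing plan for the shop, revising as "
    "needed and analyzing competition."
)

response = client.chat.completions.create(
    model="o3",
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```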
In addition to being, presumably, a larger model than GPT-4, o3 also works as a Reasoner – you can see its “thinking” in the initial response. It is also an agentic model, one that can use tools and decide how to accomplish complex goals. You can see how it took multiple actions with multiple tools, including web searches and coding, to come up with the extensive results that it did.
And this isn’t the only extraordinary example: o3 can also do an impressive job guessing locations from photos if you just give it an image and the prompt “be a geo-guesser” (with some quite profound privacy implications). Again, you can see the agentic nature of this model at work, as it zooms into parts of the picture, adds web searches, and does multi-step processes to get the right answer.
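If you want to try a version of the geo-guessing experiment programmatically, here is a rough sketch, assuming the model accepts images through the standard chat-completions vision format; the photo path is a placeholder, and of course you should only test this on images you have the right to use.

```python
# A rough sketch of the "be a geo-guesser" experiment via the API,
# assuming o3 accepts image input through the standard chat-completions
# vision format. "street_scene.jpg" is a placeholder path.
import base64
from openai import OpenAI

client = OpenAI()

with open("street_scene.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="o3",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Be a geo-guesser: where was this photo taken, and why do you think so?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```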
In another test, I gave o3 a spreadsheet containing a large dataset of historical machine learning systems and asked it to “figure out what this is and generate a report examining the implications statistically and give me a well-formatted PDF with graphs and details,” and got a full analysis from that single prompt. (I did give it some feedback to make the PDF better, though, as you can see).
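You can approximate this workflow through the API as well, though only loosely: in ChatGPT the file is simply attached and o3 writes and runs its own analysis code, while the sketch below just passes a sample of the data inline as text. The filename is a placeholder.

```python
# A loose approximation of the spreadsheet experiment. In ChatGPT the file
# is attached directly and the model runs its own code; here we just load
# the data ourselves and pass a sample inline. "ml_systems.csv" is a
# placeholder filename.
import pandas as pd
from openai import OpenAI

client = OpenAI()

df = pd.read_csv("ml_systems.csv")
sample = df.head(50).to_csv(index=False)  # keep the prompt a manageable size

response = client.chat.completions.create(
    model="o3",
    messages=[{
        "role": "user",
        "content": (
            "Here is a sample of a dataset:\n\n" + sample +
            "\n\nFigure out what this is and write a report examining the "
            "implications statistically, noting which graphs you would draw."
        ),
    }],
)

print(response.choices[0].message.content)
```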
This is all pretty impressive stuff and you should experiment with these models on your own. Gemini 2.5 Pro is free to use and as “smart” as o3, though it lacks the same full agentic ability. If you haven’t tried it or o3, take a few minutes to do it now. Try giving Gemini an academic paper and asking it to turn the paper into a game, have it brainstorm startup ideas with you, or just ask the AI to impress you (and then keep saying “more impressive”). Ask the Deep Research option to do a research report on your industry, or to research a purchase you are considering, or to develop a marketing plan for a new product.
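And if you would rather poke at Gemini from code than from the app, here is a minimal sketch using the google-generativeai Python package; the exact model id string (“gemini-2.5-pro”) is an assumption, so check Google’s current model list.

```python
# A minimal sketch of trying Gemini programmatically with the
# google-generativeai package. The model id "gemini-2.5-pro" is an
# assumption; check the currently published model names.
import google.generativeai as genai

genai.configure(api_key="YOUR_GOOGLE_API_KEY")  # placeholder key

model = genai.GenerativeModel("gemini-2.5-pro")
response = model.generate_content(
    "Brainstorm ten startup ideas with me, then pick the most promising "
    "one and explain why. Then impress me."
)

print(response.text)
```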
You might find yourself “feeling the AGI” as well. Or maybe not. Maybe the AI failed you, even when you gave it the exact same prompt I used. If so, you just encountered the jagged frontier.
My co-authors and I coined the term “Jagged Frontier” to describe the fact that AI has surprisingly uneven abilities. An AI may succeed at a task that would challenge a human expert but fail at something incredibly mundane. For example, consider this puzzle, a variation on a classic brainteaser (an approach first explored by Colin Fraser and expanded by Riley Goodside): “A young boy who has been in a car accident is rushed to the emergency room. Upon seeing him, the surgeon says, ‘I can operate on this boy!’ How is this possible?”
o3 insists the answer is “the surgeon is the boy’s mother,” which is wrong, as a careful reading of the brainteaser will show. Why does the AI come up with this incorrect answer? Because that is the answer to the classic version of the riddle, meant to expose unconscious bias: “A father and son are in a car crash, the father dies, and the son is rushed to the hospital. The surgeon says, ‘I can’t operate, that boy is my son.’ Who is the surgeon?” The AI has “seen” this riddle in its training data so much that even the smart o3 model fails to generalize to the new problem, at least initially. And this is just one example of the kinds of issues and hallucinations that even advanced AIs can fall prey to, showing how jagged the frontier can be.
But the AI’s frequent stumbles on this particular brainteaser do not take away from its ability to solve much harder brainteasers, or to perform the other impressive feats I have demonstrated above. That is the nature of the Jagged Frontier. In some tasks, AI is unreliable. In others, it is superhuman. You could, of course, say the same thing about calculators, but it is also clear that AI is different. It is already demonstrating general capabilities and performing a wide range of intellectual tasks, including those that it is not specifically trained on. Does that mean that o3 and Gemini 2.5 are AGI? Given the definitional problems, I really don’t know, but I do think they can be credibly seen as a form of “Jagged AGI” – superhuman in enough areas to result in real changes to how we work and live, but also unreliable enough that human expertise is often needed to figure out where AI works and where it doesn’t. Of course, models are likely to become smarter, and a good enough Jagged AGI may still beat humans at every task, including the ones where it is currently weak.
Returning to Tyler’s post: although he thinks we have achieved AGI, he doesn’t think that threshold matters much to our lives in the near term. That is because, as many people have pointed out, technologies do not instantly change the world, no matter how compelling or powerful they are. Social and organizational structures change much more slowly than technology, and technology itself takes time to diffuse. Even if we have AGI today, we have years of trying to figure out how to integrate it into our existing human world.
Of course, that assumes AI acts like a normal technology, and that its jaggedness will never be completely solved. Neither of those things may be true. The agentic capabilities we’re seeing in models like o3, such as the ability to decompose complex goals, use tools, and execute multi-step plans independently, might actually accelerate diffusion dramatically compared to previous technologies. If and when AI can effectively navigate human systems on its own, rather than requiring integration, we might hit adoption thresholds much faster than historical precedent would suggest.
And there’s a deeper uncertainty here: are there capability thresholds that, once crossed, fundamentally change how these systems integrate into society? Or is it all just gradual improvement? Or will models stop improving in the future as LLMs hit a wall? The honest answer is we don’t know.
What’s clear is that we continue to be in uncharted territory. The latest models represent something qualitatively different from what came before, whether or not we call it AGI. Their agentic properties, combined with their jagged capabilities, create a genuinely novel situation with few clear analogues. It may be that history continues to be the best guide, and that figuring out how to successfully apply AI in a way that shows up in the economic statistics will be a process measured in decades. Or it might be that we are on the edge of some sort of faster take-off, where AI-driven change sweeps our world suddenly. Either way, those who learn to navigate this jagged landscape now will be best positioned for what comes next… whatever that is.