
OpenAI announces new o3 models

OpenAI saved its biggest announcement for the last day of its 12-day “shipmas” event.

On Friday, the company unveiled o3, the successor to the o1 “reasoning” model it released earlier in the year. o3 is a model family, to be more precise — as was the case with o1. There’s o3 and o3-mini, a smaller, distilled model fine-tuned for particular tasks.

OpenAI makes the remarkable claim that o3, at least in certain conditions, approaches AGI — with significant caveats. More on that below.

Why call the new model o3, not o2? Well, trademarks may be to blame. According to The Information, OpenAI skipped o2 to avoid a potential conflict with British telecom provider O2. CEO Sam Altman somewhat confirmed this during a livestream this morning. Strange world we live in, isn’t it?

Neither o3 nor o3-mini is widely available yet, but safety researchers can sign up for a preview of o3-mini starting today. An o3 preview will arrive sometime after; OpenAI didn’t specify when. Altman said the plan is to launch o3-mini toward the end of January and follow with o3.

That conflicts a bit with his recent statements. In an interview this week, Altman said that, before OpenAI releases new reasoning models, he’d prefer to see a federal testing framework to guide the monitoring and mitigation of those models’ risks.

And there are risks. AI safety testers have found that o1’s reasoning abilities make it try to deceive human users at a higher rate than conventional, “non-reasoning” models — or, for that matter, leading AI models from Meta, Anthropic, and Google. It’s possible that o3 attempts to deceive at an even higher rate than its predecessor; we’ll find out once OpenAI’s red-team partners release their testing results.

For what it’s worth, OpenAI says that it’s using a new technique, “deliberative alignment,” to align models like o3 with its safety principles. (o1 was aligned the same way.) The company has detailed its work in a new study.

Reasoning steps

Unlike most AI models, reasoning models such as o3 effectively fact-check themselves, which helps them avoid some of the pitfalls that normally trip up models.

This fact-checking process incurs some latency. o3, like o1 before it, takes a little longer — usually seconds to minutes longer — to arrive at solutions compared to a typical non-reasoning model. The upside? It tends to be more reliable in technical domains such as physics and mathematics.

o3 was trained via reinforcement learning to “think” before responding via what OpenAI describes as a “private chain of thought.” The model can reason through a task and plan ahead, performing a series of actions over an extended period that help it figure out a solution.

In practice, given a prompt, o3 pauses before responding, considering a number of related prompts and “explaining” its reasoning along the way. After a while, the model summarizes what it considers to be the most accurate response.
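OpenAI hasn’t published the mechanics of this process, but the behavior described above is loosely analogous to a known research technique called self-consistency: sample several independent reasoning paths, then keep the answer most of them converge on. The toy sketch below is purely illustrative — the `sample_answer` stand-in is an assumption, not OpenAI’s method:

```python
import random
from collections import Counter

def sample_answer(question: str, seed: int) -> str:
    """Stand-in for one sampled reasoning path. A real model would
    generate a chain of thought here and extract a final answer."""
    random.seed(seed)
    # Toy behavior: most paths reach the right answer, a few go astray.
    return "4" if random.random() < 0.8 else "5"

def self_consistent_answer(question: str, n_paths: int = 9) -> str:
    """Sample several reasoning paths and return the majority answer."""
    votes = Counter(sample_answer(question, seed=i) for i in range(n_paths))
    return votes.most_common(1)[0][0]

print(self_consistent_answer("What is 2 + 2?"))
```

The key design idea — spending extra compute on multiple reasoning attempts and aggregating them — mirrors why reasoning models trade latency for reliability.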

New with o3 versus o1 is the ability to “adjust” the reasoning time. The models can be set to low, medium, or high compute (i.e. thinking time). The higher the compute, the better o3 performs on a task.

No matter how much compute they have at their disposal, reasoning models such as o3 aren’t flawless, however. While the reasoning component can reduce hallucinations and errors, it doesn’t eliminate them. o1 trips up on games of tic-tac-toe, for instance.

Benchmarks and AGI

One big question leading up to today was whether OpenAI might claim that its newest models are approaching AGI.

AGI, short for “artificial general intelligence,” broadly refers to AI that can perform any task a human can. OpenAI has its own definition: “highly autonomous systems that outperform humans at most economically valuable work.”

Achieving AGI would be a bold declaration. And it carries contractual weight for OpenAI, as well. According to the terms of its deal with close partner and investor Microsoft, once OpenAI reaches AGI, it’s no longer obligated to give Microsoft access to its most advanced technologies (those that meet OpenAI’s AGI definition, that is).

Going by one benchmark, OpenAI is slowly inching closer to AGI. On ARC-AGI, a test designed to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on, o3 achieved an 87.5% score on the high compute setting. At its worst (on the low compute setting), the model tripled the performance of o1.

Granted, the high compute setting was exceedingly expensive — in the order of thousands of dollars per challenge, according to ARC-AGI co-creator François Chollet.

Chollet also pointed out that o3 fails on “very easy tasks” in ARC-AGI, indicating — in his opinion — that the model exhibits “fundamental differences” from human intelligence. He has previously noted the evaluation’s limitations, and cautioned against using it as a measure of AI superintelligence.

“[E]arly data points suggest that the upcoming [successor to the ARC-AGI] benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training),” Chollet continued in a statement. “You’ll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”

Incidentally, OpenAI says that it’ll partner with the foundation behind ARC-AGI to help it build the next generation of its AI benchmark, ARC-AGI 2.

On other tests, o3 blows away the competition.

The model outperforms o1 by 22.8 percentage points on SWE-Bench Verified, a benchmark focused on programming tasks, and achieves a Codeforces rating — another measure of coding skills — of 2727. (A rating of 2400 places an engineer at the 99.2nd percentile.) o3 scores 96.7% on the 2024 American Invitational Mathematics Exam, missing just one question, and achieves 87.7% on GPQA Diamond, a set of graduate-level biology, physics, and chemistry questions. Finally, o3 sets a new record on EpochAI’s Frontier Math benchmark, solving 25.2% of problems; no other model exceeds 2%.

These claims have to be taken with a grain of salt, of course. They come from OpenAI’s internal evaluations. We’ll need to wait and see how the model holds up under benchmarking by outside customers and organizations.

A trend

In the wake of the release of OpenAI’s first series of reasoning models, there’s been an explosion of reasoning models from rival AI companies — including Google. In early November, DeepSeek, an AI research firm funded by quant traders, launched a preview of its first reasoning model, DeepSeek-R1. That same month, Alibaba’s Qwen team unveiled what it claimed was the first “open” challenger to o1 (in the sense that it could be downloaded, fine-tuned, and run locally).

What opened the reasoning model floodgates? Well, for one, the search for novel approaches to refine generative AI. As TechCrunch recently reported, “brute force” techniques to scale up models are no longer yielding the improvements they once did.

Not everyone’s convinced that reasoning models are the best path forward. They tend to be expensive, for one, thanks to the large amount of computing power required to run them. And while they’ve performed well on benchmarks so far, it’s not clear whether reasoning models can maintain this rate of progress.

Interestingly, the release of o3 comes as one of OpenAI’s most accomplished scientists departs. Alec Radford, the lead author of the academic paper that kicked off OpenAI’s “GPT series” of generative AI models (that is, GPT-3, GPT-4, and so on), announced this week that he’s leaving to pursue independent research.
