So the big news this week is that o3, OpenAI’s new language model, got 25% on FrontierMath. Let’s start by explaining what this means.
What is o3? What is FrontierMath?
A language model, as probably most people know, is one of these things like ChatGPT where you can ask it a question and it will write some sentences which are an attempt to give you an answer. There were language models before ChatGPT, and on the whole they couldn’t even write coherent sentences and paragraphs. ChatGPT was really the first public model which was coherent. There have been many other models since. Right now they’re still getting better really fast. How much longer this will go on for nobody knows, but there are lots of people pouring lots of money into this game so it would be a fool who bets on progress slowing down any time soon. o3 is a new language model.
FrontierMath is a secret dataset of “hundreds” of hard maths questions, curated by Epoch AI and announced last month. “Hundreds” is a quote from the paper (first line of the abstract), but I’ve heard a rumour that when the paper came out there were under 200 questions, and another rumour that more have been added since. As an academic mathematician who has spent their entire career collaborating openly on research problems and sharing ideas with other people, it frustrates me a little that already in this paragraph we have seen more questions than answers: I am not even able to give you a coherent description of some basic facts about this dataset, for example its size. However there is a good reason for the secrecy. Language models train on large databases of knowledge, so the moment you make a database of maths questions public, the language models will train on it. And then if you ask such a model a question from the database, it will probably just rattle off the answer which it has already seen.
How hard is the FrontierMath dataset?
So what are the questions in the FrontierMath dataset like? Here’s what we know. They’re not “prove this theorem!” questions, they’re “find this number!” questions. More precisely, the paper says “Problems had to possess definitive, computable answers that could be automatically verified”, and in the five sample problems which were made public from the dataset (Appendix A of the paper, pages 14 to 23) the solutions are all positive whole numbers (one answer is 9811, another is 367707, and the final three solutions are even larger; clearly these questions are designed in such a way that random guesswork is extremely unlikely to succeed). The sample questions are nontrivial, even for a research mathematician. I understood the statements of all five questions. I could do the third one relatively quickly (I had seen the trick before: the function mapping a natural number n to alpha^n is p-adically continuous in n if and only if the p-adic valuation of alpha - 1 is positive), and I knew exactly how to do the fifth one (it’s a standard trick involving the Weil conjectures for curves), but I didn’t bother doing the algebra to work out the exact 13-digit answer. The first and second questions I knew I couldn’t do, and I figured I might be able to make progress on the fourth one if I put some real effort in, but ultimately I didn’t attempt it, I just read the solution. I suspect that a typical smart mathematics undergraduate would struggle to do even one of these questions. To do the first one you would, I imagine, have to be at least a PhD student in analytic number theory. The FrontierMath paper contains some quotes from mathematicians about the difficulty level of the problems. Tao (Fields Medal) says “These are extremely challenging” and suggests that they can only be tackled by a domain expert (and indeed the two sample questions which I could solve are in arithmetic, my area of expertise; I could not do any of the ones outside my area). Borcherds (also Fields Medal) however is quoted in the paper as saying that machines producing numerical answers “aren’t quite the same as coming up with original proofs”.
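To spell out the two facts alluded to above, here is my own paraphrase in the notation I would use (this is purely my rendering, and has nothing to do with the secret wording of the actual problems):

```latex
% My paraphrase of the trick for the third sample problem: the map
% n -> alpha^n on the natural numbers is p-adically continuous precisely
% when the p-adic valuation of alpha - 1 is positive.
\[
  n \mapsto \alpha^n \ \text{is $p$-adically continuous} \iff v_p(\alpha - 1) > 0.
\]
% The input to the fifth sample problem, the Weil conjectures for curves in
% the form of the Hasse--Weil bound: for a smooth projective curve $C$ of
% genus $g$ over the finite field $\mathbb{F}_q$,
\[
  \bigl|\,\#C(\mathbb{F}_q) - (q + 1)\,\bigr| \le 2g\sqrt{q}.
\]
```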
So why make such a dataset? The problem is that grading “hundreds” of attempted solutions to “prove this theorem!” questions is expensive (one would not trust a machine to do grading at this level, at least in 2024, so one would have to pay human experts), whereas checking whether hundreds of numbers in one list match hundreds of numbers in another list can be done in a fraction of a second by a computer. As Borcherds pointed out, mathematics researchers spend most of their time trying to come up with proofs or ideas, rather than numbers; however the FrontierMath dataset is still extremely valuable, because the area of AI for mathematics is desperately short of hard datasets, and creating a dataset such as this is very hard work (or equivalently very expensive).
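Just to make the cost difference vivid, here is a minimal sketch of what automatic grading of “find this number!” answers could look like. This is purely illustrative: the file names and answer format are invented, and I have no idea how Epoch AI actually implement their grading.

```python
# Hypothetical illustration of automatic grading of "find this number!"
# answers: compare submitted integers against a secret answer key,
# exact match only. File names and JSON format are made up.
import json

def grade(submissions_path: str, answer_key_path: str) -> float:
    """Return the fraction of problems answered exactly correctly."""
    with open(submissions_path) as f:
        submitted = json.load(f)   # e.g. {"problem_1": 9811, ...}
    with open(answer_key_path) as f:
        key = json.load(f)         # e.g. {"problem_1": 9811, ...}
    correct = sum(1 for pid, answer in key.items()
                  if submitted.get(pid) == answer)
    return correct / len(key)

# print(grade("model_answers.json", "frontiermath_key.json"))
```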
So there was an article about the dataset in Science and I was quoted in it as saying “If you have a system that can ace that database, then it’s game over for mathematicians.” Just to be clear: I had nothing to do with the dataset, I’ve only seen the five public questions, and was basing my comments on those. I also said “In my opinion, currently, AI is a long way away from being able to do those questions … but I’ve been wrong before”. And then this week there’s an announcement that the language model o3 got a score of 25 percent on the dataset. I was shocked.
What exactly has happened here?
Why was I shocked? Because my mental model of where “AI” is currently, when it comes to doing mathematics, is “undergrad or pre-undergrad”. It’s getting very good at “Olympiad-style” problems of the sort given to bright high-schoolers. Within a year it’s absolutely clear that AI systems will be passing undergraduate mathematics exams (not least because when you’re setting an undergraduate mathematics exam you ideally need to make sure that you don’t fail 50 percent of the class, so you throw in a couple of standard questions which are very similar to questions that the students have seen already, to ensure that those with a basic understanding of the course will pass; machines will easily be able to ace such questions). But the jump from that to having innovative ideas at advanced undergrad/early PhD level, beyond recycling standard ideas, seems to me to be quite a big one. For example I was very unimpressed by the ChatGPT answers to the recent Putnam exam posted here: as far as I can see only question B4 was answered adequately by the machine, and most of the other answers are worth at most one or two marks out of ten. So I was expecting this dataset to remain pretty unattackable for a couple of years.
My initial excitement was tempered however by a post from Elliot Glazer from Epoch AI on Reddit, where he claimed that in fact 25 percent of the problems in the dataset are “IMO/undergrad style problems”. This claim is a little confusing, because I would be hard pressed to apply such adjectives to any of the five publicly released problems in the dataset; even the simplest one used the Weil conjectures for curves (or a brute force argument which is probably just about possible but would be extremely painful, as it involves factoring 10^12 degree 3 polynomials over a finite field, although this could certainly be parallelised). This of course raises questions in my mind about what the actual level of the problems in this secret dataset is (or equivalently whether the five public questions are actually a representative sample), but this is not knowledge which we’re likely to have access to. Given this new piece of information that 25 percent of the problems are undergraduate level, perhaps I will revert to being unsurprised again, but I will look forward to being surprised when AI gets nearer to 50 percent on the dataset, because performance at “qual level” (as Elliot describes the next 50 percent of the questions) is exactly what I’m waiting to see from these systems; for me this would represent a big breakthrough.
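As an aside on why the brute force route, although painful, parallelises so well: at degree 3, “factoring” over a prime field reduces to checking for a root, so each individual polynomial is cheap to handle and independent of all the others. Here is a toy sketch of the kind of per-polynomial check I have in mind (my own illustration, over a prime field for simplicity; the actual problem may well involve a more general finite field):

```python
# Toy illustration (not the actual FrontierMath computation): a cubic over a
# field is reducible iff it has a linear factor, i.e. a root, so "factoring"
# a degree-3 polynomial over F_p just means scanning for a root. Each check
# is cheap and independent, hence embarrassingly parallel.
def cubic_is_irreducible_mod_p(a: int, b: int, c: int, d: int, p: int) -> bool:
    """Is a*x^3 + b*x^2 + c*x + d irreducible over F_p (assuming p does not divide a)?"""
    return all((a * x**3 + b * x**2 + c * x + d) % p != 0 for x in range(p))

# Example: x^3 + x + 1 is irreducible over F_2.
# print(cubic_is_irreducible_mod_p(1, 0, 1, 1, 2))  # True
```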
Prove this theorem!
However, as Borcherds points out, even if we ended up with a machine which was super-human at “find this number!” questions, it would still have limited applicability in many areas of research mathematics, where the key question of interest is usually how to “prove this theorem!”. In my mind, the biggest success story in 2024 is DeepMind’s AlphaProof, which solved four out of the six 2024 IMO (International Mathematics Olympiad) problems. These were either “prove this theorem!” or “find a number and furthermore prove that it’s the right number” questions, and for three of them the output of the machine was a fully formalized Lean proof. Lean is an interactive theorem prover with a solid mathematics library mathlib containing many of the techniques needed to solve IMO problems and a lot more besides; the solutions produced by DeepMind’s system were human-checked and verified to be “full marks” solutions. However, we are back at high school level again; whilst the questions are extremely hard, the solutions use only school-level techniques. In 2025 I’m sure we’ll see machines performing at gold medal standard in the IMO. However this now forces us to open up the “grading” can of worms which I’ve already mentioned once, and I’ll finish this post by talking a little more about it.
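For readers who have never seen Lean, here is a toy example (entirely made up by me, and nothing like AlphaProof’s actual output) of what a “find a number and prove that it’s the right number” statement looks like once formalized; it assumes a recent version of Lean 4, where the omega tactic for linear arithmetic is available.

```lean
-- Toy example (not an IMO problem): the answer, 3, is baked into the
-- statement itself, and the proof certifies that it is the only answer.
-- If Lean accepts this file then a marker has nothing left to check,
-- beyond confirming that the statement matches the original question.
theorem toy_answer (n : Nat) : 2 * n + 1 = 7 ↔ n = 3 := by
  constructor <;> intro h <;> omega
```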
Who is marking the machines?
July 2025. I can envisage the following situation. As well as hundreds of the world’s smartest schoolchildren entering the IMO, there will be machines entering. Hopefully not too many, though. The systems will be of two types: there will be systems submitting answers in the language of a computer proof checker like Lean (or Rocq, Isabelle, or many others), and there will be language models submitting answers in human language. The big difference between these two kinds of submission is this: if a marker verifies that the statement of the question has been correctly translated into the language of the computer proof checker, then all they need to do is check that the proof compiles, and then they basically know that it is a “full marks” solution. For the language models we will have a situation like the poor Putnam solutions above: the computer will write something, it will look convincing, but a human is going to have to read it carefully and grade it, and there is certainly no guarantee that it will be a “full marks” solution. Borcherds is right to remind the AI community that “prove this theorem!” is what we really want to see as mathematicians, and language models are currently at least an order of magnitude less accurate than expert humans when it comes to logical reasoning. I am dreading the inevitable onslaught, in a year or two, of language model “proofs” of the Riemann hypothesis which contain vague or inaccurate claims buried in the middle of ten pages of correct mathematics, which a human will have to wade through to find the line that doesn’t hold up. On the other hand, theorem provers are at least an order of magnitude more accurate than expert humans: every time I’ve seen Lean not accept a human argument in the mathematical literature, the human has been wrong.
In fact, as mathematicians, we would like to see more than “prove this theorem!”. We would like to see “prove this theorem, correctly, and explain what makes the proof work in a way which we humans understand”. With the language model approach I worry (a lot) about “correctly” and with the theorem prover approach I worry about “in a way which we humans understand”. There is still a huge amount to be done. Progress is currently happening really quickly. But we are a long way away. When will we “beat the undergraduate barrier”? Nobody knows.