I want to comment on the recent results achieved by OpenAI’s new o1 model family. Specifically, I will be discussing formal verification, which can be applied to mathematics and computer programming.

The name of this post is inspired by their September 12 2024 blog post titled “Learning to Reason with LLMs”. The abstract reads:

“We are introducing OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers—it can produce a long internal chain of thought before responding to the user.”

From the information that is publicly available, it is likely that o1 was trained on datasets that were generated by humans. My justification for this includes the following:

  1. Formal verification languages are not explicitly mentioned anywhere in the articles.
  2. The o1 models are applied to numerous tasks, not just mathematics and programming. For instance, they also saw quite significant improvements in biology, professional law, and global facts, which are not associated with formal verification languages.
  3. The examples of raw thought process provided for mathematics and programming show no signs of formal languages being generated or executed. Instead, the model simulates the verification process by generating a long chain of thought in natural language.

This leads me to the conclusion that the o1 family was trained using reinforcement learning on large human-generated datasets, and that OpenAI almost certainly did not yet make use of any formal verification process to guarantee the correctness of the model’s outputs at a large scale. (Note that I could be very wrong about this.)

Interpretation of the Reinforcement Learning Process

The reinforcement learning process is a way to train a model to perform a task by providing it with feedback in the form of rewards or punishments. The model learns to maximize the cumulative reward over time by interacting with the environment.

Thinking mathematically, we can view reinforcement learning as an optimization process based on trial and error. The main computation in the model then happens within the weights of the attention mechanism and the feedforward neural network layers.
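
To make this intuition concrete, here is a minimal toy sketch of a reward-driven update in the spirit of REINFORCE, applied to a three-way multiple-choice “answer” rather than full text generation; the reward, setup, and numbers are invented for illustration and are not a description of OpenAI’s actual training pipeline.

```python
import numpy as np

# Toy REINFORCE-style loop: a softmax policy over 3 candidate "answers",
# rewarded only when it picks the (hidden) correct one. This is a sketch
# of trial-and-error optimization, not o1's actual training setup.
rng = np.random.default_rng(0)
logits = np.zeros(3)          # policy parameters
correct_answer = 2            # hypothetical ground truth
learning_rate = 0.1

for step in range(500):
    probs = np.exp(logits) / np.exp(logits).sum()
    action = rng.choice(3, p=probs)
    reward = 1.0 if action == correct_answer else 0.0

    # Policy-gradient update: increase the log-probability of rewarded actions.
    grad_log_prob = -probs
    grad_log_prob[action] += 1.0
    logits += learning_rate * reward * grad_log_prob

# Probability mass concentrates on the rewarded answer over time.
print(np.round(np.exp(logits) / np.exp(logits).sum(), 3))
```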

Intuitively speaking, there are two approaches I see to the self-correction process of a model:

  1. The attention mechanism is trained to focus on the correct answer and ignore the incorrect ones. In other words, it “forgets” the incorrect parts of information.
  2. The attention mechanism focuses on the mistakes and expects the neural network to correct them.

From the example outputs, it’s quite clear that the second approach prevailed in the case of o1. We are relying on the feedforward neural network to learn an internal representation of correct versus incorrect outputs and use it to correct itself.

Interestingly, the model tends to output the tokens “wait” and “but” at each reasoning step, much like a human. Some excerpts from the math example:

But wait, perhaps it’s better to rearrange. … Wait, but from the definition of $s(x)$, we have: … Wait, actually, this may not help us directly without specific terms. … But this seems to contradict our prior assertion that $s(x)=-x^{2n+2}+a_0 x^{2n}+\dots$. Wait, perhaps we need to be careful here. … Wait, but that seems messy. … But after simplification, the leading term is $cx^{2n}$. Wait, I think this is getting too messy.

There are a lot of filler tokens, which can crowd the model’s attention and derail its consistency; this tends to happen a lot in LLMs. Figuring out how to apply approach #1 (attention masks, gating, or similar) appropriately could mitigate this issue, as in the sketch below.
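
To illustrate what approach #1 could look like mechanically, here is a minimal NumPy sketch of zeroing out the attention weight of a “derailed” span of tokens; the positions, dimensions, and mask policy are invented for illustration and are not how o1 is known to work.

```python
import numpy as np

def attention(q, k, v, mask=None):
    """Single-head scaled dot-product attention with an optional boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d = 8, 16
q = rng.normal(size=(seq_len, d))
k = rng.normal(size=(seq_len, d))
v = rng.normal(size=(seq_len, d))

# Suppose positions 3-5 were a derailed ("this is getting too messy") span.
# Approach #1 would effectively zero their attention weight so that later
# tokens stop conditioning on them.
mask = np.ones((seq_len, seq_len), dtype=bool)
mask[:, 3:6] = False

out_full = attention(q, k, v)
out_masked = attention(q, k, v, mask)
print(np.abs(out_full - out_masked).max())  # outputs change once the span is ignored
```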

In the math example, this is shown by GPT-4o getting the correct answer but o1 failing. Here is the given problem:

Let $n$ be an even positive integer. Let $p$ be a monic, real polynomial of degree $2n$; that is to say,

$$p(x)=x^{2n}+a_{2n-1}x^{2n-1}+\dots+a_1x+a_0$$

for some real coefficients $a_0,\dots,a_{2n-1}$. Suppose that $p(\frac{1}{k})=k^2$ for all integers $k$ such that $1\leq|k|\leq n$.

Find all other real numbers $x$ for which $p(\frac{1}{x})=x^2$.

Here is my solution:


The set of integers $k$ such that $1\leq|k|\leq n$ is $\{-n,\dots,-2,-1,1,2,\dots,n\}$. There are $2n$ such integers.

Consider the rational function $q(x)$ defined as:

$$q(x):=p\left(\frac{1}{x}\right)-x^2$$

Based on the given assumptions, it has zeroes at each integer $k\in\{-n,\dots,-2,-1,1,2,\dots,n\}$.

Now, we can construct the polynomial $s(x)$ as:

$$s(x):=x^{2n}q(x)$$

This polynomial has degree $2n+2$: expanding, $s(x)=1+a_{2n-1}x+\dots+a_0x^{2n}-x^{2n+2}$, with constant term $1$ and leading coefficient $-1$. Because $s(k)=k^{2n}q(k)=0$, it inherits the $2n$ integer roots $k\in\{-n,\dots,-2,-1,1,2,\dots,n\}$. By the fundamental theorem of algebra, a polynomial of degree $d$ has at most $d$ real roots, so at most two further roots remain. Writing $s(x)=\left(\prod_{k=1}^{n}(x^2-k^2)\right)(-x^2+bx+c)$, the missing $x^{2n+1}$ term forces $b=0$, and $s(0)=1$ together with $n$ being even gives $c=\frac{1}{(n!)^2}$. The two remaining roots are therefore $x=\pm\frac{1}{n!}$, and these are the only other real numbers satisfying $p(\frac{1}{x})=x^2$.
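
As a sanity check on this argument, here is a small SymPy computation of the $n=2$ case; it recovers the unique polynomial satisfying the constraints and confirms that the only extra solutions are $x=\pm\frac{1}{2!}=\pm\frac{1}{2}$.

```python
import sympy as sp

x = sp.symbols('x')
a0, a1, a2, a3 = sp.symbols('a0 a1 a2 a3')

# Smallest even case: n = 2, so p is monic of degree 4.
p = x**4 + a3*x**3 + a2*x**2 + a1*x + a0

# Impose p(1/k) = k^2 for k = -2, -1, 1, 2 and solve for the coefficients.
constraints = [sp.Eq(p.subs(x, sp.Rational(1, k)), k**2) for k in (-2, -1, 1, 2)]
coeffs = sp.solve(constraints, [a0, a1, a2, a3])
p_solved = p.subs(coeffs)

# Find all real x with p(1/x) = x^2.
solutions = sp.solve(sp.Eq(p_solved.subs(x, 1/x), x**2), x)
print(p_solved)            # x**4 - 21*x**2/4 + 21/4
print(sorted(solutions))   # [-2, -1, -1/2, 1/2, 1, 2]
```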

Both models started off strongly, but both soon derailed into incorrect reasoning. GPT-4o was able to find the correct answer, while o1 was not. This is a clear indication that an excessively long chain of thought saturating the context window can degrade a model’s performance.

Integrating Formal Verification

Several works have shown that LLMs can be used to generate formal mathematical proofs in languages such as Lean, for instance work from OpenAI, Google DeepMind’s AlphaProof, and DeepSeek’s DeepSeek-Prover.
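
For context, a formal proof in Lean 4 is just code that the proof checker either accepts or rejects. Here is a trivial, hand-written example (not model output) of the kind of artifact these systems produce:

```lean
-- A trivial Lean 4 theorem: addition of natural numbers is commutative.
-- A prover LLM emits terms or tactic scripts like this, and the Lean
-- kernel either accepts or rejects them.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```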

In particular, the DeepSeek-Prover-V1.5 paper scales inference-time compute to achieve better results. Notably, it uses Monte Carlo Tree Search on top of training with reinforcement learning.

In the paper, they use the Lean 4 theorem prover to check proofs. Their approach is to trim from the model’s context everything generated after a detected error, forcing it to backtrack and try a different path. This is therefore one possible form of the previously mentioned “forgetting wrong parts” approach, sketched below.
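
A minimal sketch of such a truncate-and-resample loop could look like the following; `generate_continuation` and `first_error` are hypothetical stand-ins for the prover model and the Lean 4 checker, not DeepSeek’s actual interface.

```python
import random

def generate_continuation(prefix: list[str]) -> list[str]:
    """Hypothetical stand-in for the LLM proposing the rest of the proof."""
    pool = ["intro h", "rw [h]", "simp", "apply bogus_lemma", "exact rfl"]
    return [random.choice(pool) for _ in range(4)]

def first_error(steps: list[str]) -> int | None:
    """Hypothetical stand-in for Lean 4 reporting the first failing step."""
    for i, step in enumerate(steps):
        if "bogus" in step:
            return i
    return None

def prove(max_attempts: int = 50) -> list[str] | None:
    prefix: list[str] = []
    for _ in range(max_attempts):
        candidate = prefix + generate_continuation(prefix)
        err = first_error(candidate)
        if err is None:
            return candidate          # every step verified: done
        prefix = candidate[:err]      # trim everything from the error onward and retry
    return None                       # gave up

print(prove())
```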

However, in my view, the results were rather disappointing relative to the increase in scale. I believe this is because the model used is too small, at only 7B parameters, so its performance essentially saturates.

The new results from OpenAI’s o1 model family show that this increase in compute is extremely effective when scaled up. Even a simple chain-of-thought process after training with reinforcement learning leads to incredible increases in performance without all the Monte Carlo Tree Search approaches.

Therefore, it remains to scale up the model with a methodology similar to the one used in the DeepSeek-Prover-V1.5 paper. The introduction of a formal verification process allows the model to generate unlimited amounts of synthetic data, which can be used to train the model further and improve its performance.

Indeed, a verifier such as Lean 4 provides immediate and precise feedback on the correctness of each step of the proof. This removes the issue of the model making mistakes when generating new data; a sketch of such a loop is below.
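
Concretely, the data-generation loop this enables might look like the following sketch, where `sample_proof` and `lean_verify` are hypothetical placeholders for the model and the Lean 4 checker; only verified (theorem, proof) pairs are kept, so incorrect generations never enter the training set.

```python
import random

def sample_proof(theorem: str) -> str:
    """Hypothetical stand-in for sampling a candidate Lean proof from the model."""
    return random.choice(["exact rfl", "simp", "sorry"])

def lean_verify(theorem: str, proof: str) -> bool:
    """Hypothetical stand-in for the Lean 4 checker; rejects incomplete proofs."""
    return "sorry" not in proof

def generate_training_data(theorems: list[str], samples_per_theorem: int = 8) -> list[tuple[str, str]]:
    dataset = []
    for thm in theorems:
        for _ in range(samples_per_theorem):
            proof = sample_proof(thm)
            if lean_verify(thm, proof):
                # Only verified (theorem, proof) pairs enter the training set,
                # so incorrect generations never contaminate further training.
                dataset.append((thm, proof))
    return dataset

print(len(generate_training_data(["theorem t : 1 + 1 = 2"])))
```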

I’m very excited to see what will be achieved in the future with this approach.