Zhepei Wei†, Xinyu Zhu†, Wei-Lin Chen†, Chengsong Huang‡, Jiaxin Huang‡, Yu Meng†

†University of Virginia ‡Washington University in St. Louis


<aside> 💡 TL;DR: We find that RLVR weight trajectories are extremely low-rank and highly predictable: (1) the majority of RLVR gains are captured by a rank-1 approximation of the parameter deltas, and (2) the magnitude of this rank-1 projection evolves near-linearly with training steps.

To exploit this structure, we propose RELEX (REinforcement Learning EXtrapolation), which first estimates the rank-1 subspace from a short observation window of RLVR training and then predicts future checkpoints via linear regression, with no learned model required (a code sketch follows Figure 1 below).

This simple method shows promising potential: using only 15–20% of RLVR training as the observed prefix, RELEX matches or even surpasses full RLVR on both in-domain and out-of-domain evaluations across Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base.

</aside>

fig_teaser.png

Figure 1. RELEX produces checkpoints that match full RLVR performance based only on early training dynamics, without further training. RELEX first estimates the rank-1 update subspace from the observed RLVR prefix (up to $T_{cut}$), then extrapolates future checkpoints at no training cost, matching or exceeding the RLVR checkpoints on the MATH test set across three models.
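For concreteness, the two-stage recipe can be written in a few lines of NumPy. The snippet below is an illustrative sketch rather than our training code: it assumes each checkpoint is flattened into a single vector, and names such as `relex_extrapolate` and `prefix_steps` are invented for exposition.

```python
import numpy as np

def relex_extrapolate(theta_0, prefix_checkpoints, prefix_steps, target_step):
    """Illustrative RELEX sketch: estimate the rank-1 update subspace from an
    observed RLVR prefix, fit a linear trend to the projection magnitude,
    and extrapolate a future checkpoint without further training.

    theta_0: flattened initial weights, shape (d,)
    prefix_checkpoints: observed weights theta_t over the prefix, each shape (d,)
    prefix_steps: training step index of each observed checkpoint
    target_step: future step to predict (beyond the observed prefix)
    """
    # Stage 1: rank-1 subspace of the weight deltas over the observed prefix.
    D = np.stack([theta_t - theta_0 for theta_t in prefix_checkpoints])
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    v1 = Vt[0]                # dominant update direction, shape (d,)
    coeffs = U[:, 0] * s[0]   # projection magnitude along v1 at each step

    # Stage 2: the magnitude evolves near-linearly with steps, so ordinary
    # least squares over (step, magnitude) pairs suffices; no learned model.
    slope, intercept = np.polyfit(prefix_steps, coeffs, deg=1)

    # Predicted checkpoint at the future step.
    return theta_0 + (slope * target_step + intercept) * v1
```

Because the trend is fit with plain least squares, extrapolating any number of future checkpoints costs essentially nothing beyond storing the observed prefix.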


Motivation

RLVR is powerful but expensive: a typical training run requires a massive number of optimization steps and days of GPU time, even for moderately sized models. This raises a natural question: can we predict where RLVR training is heading from its early dynamics? If weight updates follow a structured, predictable pattern, we could extrapolate future checkpoints from a short prefix of training, potentially matching the fully trained model's performance while saving most of the compute.

We start with two concrete questions:

1. Do RLVR weight updates exhibit a simple low-rank structure?
2. If so, does their evolution along that structure follow a predictable trend?

The following two findings across three backbone models directly answer these questions.


Finding 1: RLVR weight updates are extremely low-rank

We train three models (Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base) with GRPO for 500 steps on MATH, saving checkpoints throughout training. For each checkpoint at step $t$, we compute the weight delta $\Delta\theta_t = \theta_t - \theta_0$, then perform SVD across the delta trajectory to obtain low-rank approximations of $\Delta\theta_t$, which we use to reconstruct the model weights at each step. The result is striking: a simple rank-1 approximation closely recovers RLVR performance across training steps, as demonstrated in Figure 2.
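A minimal NumPy sketch of this reconstruction, under the simplifying assumption that each checkpoint fits in memory as one flattened vector (in practice the deltas can be processed per weight matrix); the helper name `rank1_reconstruct` is illustrative:

```python
import numpy as np

def rank1_reconstruct(theta_0, checkpoints):
    """Rank-1 approximation of RLVR checkpoints via SVD of the delta trajectory.

    theta_0: flattened initial weights, shape (d,)
    checkpoints: flattened weights theta_t at each saved step, each shape (d,)
    Returns the rank-1 reconstructed weights for every saved step.
    """
    # Row t holds the flattened weight delta at step t: D has shape (T, d).
    D = np.stack([theta_t - theta_0 for theta_t in checkpoints])

    # Thin SVD of the trajectory matrix; with T << d this is cheap
    # (U: (T, T), s: (T,), Vt: (T, d)).
    U, s, Vt = np.linalg.svd(D, full_matrices=False)

    # Keep only the top singular direction v1; the rank-1 delta at step t is
    # a_t * v1 with coefficient a_t = U[t, 0] * s[0].
    v1 = Vt[0]
    coeffs = U[:, 0] * s[0]

    return [theta_0 + a * v1 for a in coeffs]
```

Evaluating these reconstructed checkpoints on MATH yields the curves in Figure 2.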

qwen25_svd_reconstruct.png

Figure 2. Rank-1 SVD reconstruction closely recovers RLVR checkpoints across models. The rank-1 reconstructed checkpoints preserve most downstream performance on MATH, suggesting that a single dominant direction captures the task-relevant component of RLVR updates.

<aside> 💡 Finding 1 suggests that the task-relevant component of RLVR weight updates is highly concentrated. This low-rank structure is the first ingredient that makes RLVR checkpoint extrapolation possible.

</aside>