LoRA's Limitations: Head-to-Head with Full RL

Published on May 20, 2025


Low-Rank Adaptation (LoRA) has recently gained popularity due to its significantly lower computational cost compared to full-model post-training, while still achieving comparable performance on supervised fine-tuning workloads. However, our experiments indicate that LoRA is less effective in reinforcement learning (RL) approaches such as GRPO. We observed that LoRA does not substantially improve real task performance, and that its efficiency gains are offset by the additional training required to elicit behaviors that emerge quickly with full RL.

As a result, applying LoRA during RL can actually be slower, worse, AND more expensive than achieving similar performance with full-parameter training.

To illustrate this difference, we conducted a comparative study using two versions of the Qwen3-4B model, trained on math problems from the DAPO-Math-17k-Processed dataset. We benchmarked their performance on a pre-split validation set and on the widely recognized GSM8K dataset as an out-of-distribution (OOD) test, since GSM8K was not seen during post-training. Both models were trained with GRPO - one with LoRA and one without.

To ensure an accurate comparison, we closely aligned experimental conditions using a modified version of the Unsloth LoRA GRPO notebook. Critical aspects, such as the reward policy implementation, learning rates, batch sizes, and the number of training steps, were kept largely similar.
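For orientation, here's a minimal sketch of how a LoRA GRPO run like ours can be wired up with Unsloth and TRL. The hyperparameter values below (LoRA rank, learning rate, batch size, step count) and the dataset id are illustrative assumptions rather than our exact settings - those live in the linked notebook and scripts.

```python
# Minimal sketch of a LoRA GRPO setup with Unsloth + TRL.
# Values below are placeholders, NOT our exact hyperparameters.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters: only these low-rank matrices receive gradient updates.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,  # LoRA rank (assumed)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Dataset id/config assumed; our preprocessing pipeline is in the linked scripts.
dataset = load_dataset("open-r1/DAPO-Math-17k-Processed", "en", split="train")

def check_format(completions, **kwargs):
    """Placeholder reward; fuller sketches of our reward functions appear below."""
    return [0.0 for _ in completions]

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[check_format],  # our run also used a correctness reward
    args=GRPOConfig(
        learning_rate=5e-6,
        per_device_train_batch_size=8,
        num_generations=8,  # GRPO group size
        max_steps=250,
    ),
    train_dataset=dataset,
)
trainer.train()
```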

The scripts and data pre-processing pipelines are also available here:

With LoRA

Without LoRA

During LoRA-based training, the model had difficulty converging despite using the same training parameters as the fully trained model. Structured formatting behavior failed to emerge, rendering the training ineffective.

In addition, we noticed that LoRA GRPO does not yield improvements in accuracy. We first rewarded the model for outputting the correct format - i.e., if we ask the model to output its answer after "####", as in "#### 42", can it follow the instruction:

[Figure: format reward vs. training steps for the LoRA GRPO run; the reward hovers between roughly -0.01 and 0.15 across 250 steps.]

If the model responds in the correct format, we give it a score of 3. The model was unable to adhere to or learn the new format, so we terminated the run after an extended period of time.
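As a rough illustration, a format reward along these lines can be written as a simple regex check. The regex and the score for a missing format are assumptions on our part; the exact implementation is in the linked scripts.

```python
import re

# Matches an answer line such as "#### 42" anywhere in the completion.
ANSWER_PATTERN = re.compile(r"####\s*-?\d+")

def check_format(completions, **kwargs):
    """Reward 3.0 when the completion contains a '#### <number>' answer line.

    The 0.0 score for a missing format is an assumption here; the exact
    value used in our runs is in the linked training scripts.
    """
    return [3.0 if ANSWER_PATTERN.search(c) else 0.0 for c in completions]
```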

[Figure: check_answer reward vs. training steps for the LoRA GRPO run; the reward stays between roughly -3.4 and -0.2 across 250 steps.]

The check_answer method rewards the model based on output accuracy: a correct answer receives a score of 5, an incorrect answer receives -2.5, and a nonsensical answer receives -4.5. The reward clearly fails to improve over time, showing that the model is unable to learn the presented math questions.
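In code, check_answer behaves roughly like the sketch below. The answer-extraction logic and the assumption that the dataset exposes the reference answer as an answer column are ours; the scoring (5 / -2.5 / -4.5) matches the description above.

```python
import re

# Captures the number following "####" in the completion.
ANSWER_PATTERN = re.compile(r"####\s*(-?\d+)")

def check_answer(completions, answer, **kwargs):
    """Score each completion against the reference answer.

    +5.0  correct numeric answer
    -2.5  parsable but incorrect answer
    -4.5  no parsable answer at all ("nonsensical" output)
    Extraction details are a simplified assumption; see the linked scripts.
    """
    rewards = []
    for completion, reference in zip(completions, answer):
        match = ANSWER_PATTERN.search(completion)
        if match is None:
            rewards.append(-4.5)
        elif match.group(1).strip() == str(reference).strip():
            rewards.append(5.0)
        else:
            rewards.append(-2.5)
    return rewards
```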

The implementation for the full RL setup didn't log the individual reward functions as separate metrics, but we can still see the general trajectory of the training run:

[Figure: total reward vs. training steps for the full GRPO run; the reward rises steadily from roughly -0.5 toward 3.5 over 80 steps.]

If the model fails to respond with the correct format, it receives a score of 0. We get consistent format behavior by step 9 - less than 10% of the training the LoRA run received. The model's steady improvement indicates a gradual learning process, in contrast to the stagnant reward in the LoRA training run.

We then benchmarked both models (as well as the base Qwen3-4B model) on the pre-split validation set, evaluating their performance on both instruction following and arriving at the correct answer.
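Concretely, a response only counts as correct if the number after "####" matches the reference answer. A minimal sketch of this check (the helper names and parsing details are ours, not from the released scripts):

```python
import re

# Captures the final answer after "####", allowing commas and decimals.
FINAL_ANSWER = re.compile(r"####\s*(-?[\d,\.]+)")

def is_correct(response: str, reference: str) -> bool:
    """True only if the response is well-formatted AND the answer matches."""
    match = FINAL_ANSWER.search(response)
    if match is None:
        return False  # wrong format counts as incorrect
    predicted = match.group(1).replace(",", "").strip()
    return predicted == str(reference).replace(",", "").strip()

def accuracy(responses, references) -> float:
    """Fraction of responses that are both well-formatted and correct."""
    hits = sum(is_correct(r, a) for r, a in zip(responses, references))
    return hits / len(references)
```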

Baseline/LoRA GRPO/GRPO performance on Open-R1 DAPO-17k

[Figure: accuracy (%) of the base model, LoRA GRPO model, and full RL (GRPO) model on the DAPO-17k validation set.]

While we see a significant improvement in the fully RL-trained model, the LoRA model performs essentially the same as the base model - no improvement.

We also benchmarked the two models on OOD data from the GSM8K dataset, with the same formatting expectation. Again, no improvement:

Baseline/LoRA GRPO/GRPO performance on OpenAI GSM8K

[Figure: accuracy (%) of the base model, LoRA GRPO model, and full RL (GRPO) model on OpenAI GSM8K.]

We conclude that LoRA's performance makes it difficult to justify in production environments: full RL delivers better performance faster, and would also likely be cheaper given the longer training runs LoRA needs to achieve similar behavior.

LoRA is generally thought to be the cheaper alternative. For this experiment we used Unsloth (which currently supports only a single GPU), and training the LoRA model took over 30 hours. For the full RL model we used 8xH100s with VeRL and SGLang, and the total training time was around 140 minutes - i.e. roughly 19 GPU-hours.
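For completeness, the GPU-time arithmetic behind those numbers (rounded figures from above):

```python
# GPU-hour comparison using the wall-clock times reported above.
lora_gpu_hours = 30 * 1              # >30 h on a single GPU (Unsloth)
full_rl_gpu_hours = (140 / 60) * 8   # ~140 min on 8xH100s

print(lora_gpu_hours)                # 30 GPU-hours
print(round(full_rl_gpu_hours, 1))   # ~18.7 GPU-hours, i.e. roughly 40% fewer
print(round(30 / (140 / 60), 1))     # ~12.9x faster in wall-clock time
```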

It's also worth noting that we ran the original Unsloth notebook as-is to confirm the hyperparameters worked (i.e. to verify that the model is capable of learning with SFT). This run is also available in our run log. While the model was able to learn the output format with SFT, the reward on output correctness showed that it was still unable to improve its accuracy.

Full GRPO was 40% cheaper, 12X faster, AND 50% better than LoRA GRPO on OOD data.

This is why we're building Osmosis, a platform that enables AI self-improvement through real-time reinforcement learning. Reinforcement learning can be daunting. We make the most advanced training techniques (including online learning!) accessible to everyone.

If you're interested in learning more, we'd love to chat!

Full training logs and detailed analysis are publicly accessible via Weights & Biases.