Training Thousands of LoRA Adapters at Once

TL;DR: We extended Miles with a multi-adapter LoRA training path that lets us train thousands of LoRA adapters concurrently and asynchronously. The core changes are a modification to Megatron which allows us to load multiple LoRA adapters as a single matrix and a "paged adapter table" which allows us to schedule new training runs without interrupting existing ones. On our Qwen3.5-35B-A3B + GSM8K stress test, we ran 4,096 LoRA adapter instances concurrently with step time under 5 minutes.

Overview

LoRA-based post-training decomposes a model into two components: a shared base model and a lightweight low-rank adapter that captures task-specific updates. This setup works well when training a single policy, since only the adapter parameters need to be optimized. However, it fails to scale when many policies are fine-tuned in parallel. Each LoRA training run still requires deployment of the full base model alongside its adapter, causing the same underlying model weights to be replicated across concurrent runs. As the number of policies scales, this duplication introduces substantial wasted VRAM.

[Figure: Concurrent users by training method]

What if we could share the same base model between policies, and just fine-tune different LoRA adapters in a single batch? This is cleaner and solves scalability: we can keep one base model, route tokens to different LoRA adapters, and have the training/inference stack treat LoRA adapters as cheap concurrent policies rather than separate model replicas.

We built on our prior work of supporting LoRA RL for the Qwen3.5 model family, with the goal of extending our training stack from "one LoRA policy" to "many LoRA policies, in the same training step."

The motivation is simple: base models are large, LoRA adapters are small. If we want to run thousands of RL experiments (e.g. prompt/harness, reward design, and curriculum ablations), we can't replicate the full base model for every individual training run.

We built our multi-LoRA framework on top of Miles, RadixArk's continuously evolving open-source RL post-training framework. Miles already provides the pieces we need for large-scale RL, such as:

  • Megatron-based training with support for flexible modifications
  • SGLang-based rollout with support for scaling to thousands of LoRA adapters
  • Unified FP8 training support

We added multi-LoRA training by:

  • Deploying one shared Qwen3.5-35B-A3B base model
  • Supporting many adapter slots (hundreds per replica, thousands per cluster) through online loading and unloading
  • Building a dataloader that routes examples to different LoRA adapters within a batch
  • Serving multiple LoRA adapters using SGLang's native multi-LoRA interface

Implementation

Multi-Adapter LoRA in Megatron-Bridge

We began by adding a new multi-LoRA transform within Megatron-Bridge's PEFT module. The new transform, MultiLoRA, behaves similarly to a normal PEFT transform by attaching the LoRA weights and freezing the layer. However, instead of wrapping a target module with one adapter, it wraps it with N adapters, which we refer to as "slots."

The MultiLoRA transform targets most transformer linear modules (linear_qkv, linear_proj, linear_fc1, and linear_fc2) but deliberately skips MoE expert linear and router layers.

The expert-layer limitation comes from the nature of MoE routing. While training with multiple LoRAs, we keep sequences grouped by LoRA within a batch. However, when we reach an expert layer, we must break this grouping to route each token to its expert and then reconstruct the grouping for subsequent layers. Reconstructing the grouped sequences requires extra patching and bookkeeping in Megatron-Bridge that we plan to work on in the future.

This keeps the first version focused on dense transformer projections, where LoRA has the cleanest tensor-parallel semantics.

At each matched module, we replace the base linear with a MultiLoRALinear. Internally, each wrapped layer owns:

  • The original Megatron parallel linear
  • An nn.ModuleList of LoRA adapter slots
  • Per-slot rank and alpha metadata
  • A tokens_per_adapter_slot routing tensor
  • Compatibility helpers for exposing a single slot as .adapter for checkpoint export/load

The key idea is that sequences generated by an adapter must be processed by that same adapter during training. Before the forward pass, we pre-process the incoming micro-batch by grouping sequences by adapter slot and sorting the groups by slot index. Each MultiLoRALinear then receives a tokens_per_adapter_slot vector indicating how many contiguous tokens belong to each slot:

x = [slot0 tokens | slot1 tokens | slot2 tokens | ... | slotN tokens ]
tokens_per_adapter_slot = [n0, n1, n2, ..., nN]
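
As a minimal sketch of that pre-processing step, the snippet below sorts sequences by slot and counts the tokens in each slot. The function name pack_by_slot and the (slot, token_ids) input format are assumptions made for illustration, not the trainer's actual dataloader API.

```python
def pack_by_slot(sequences, num_slots):
    """Sort (slot, token_ids) pairs by slot and count tokens per slot.

    Hypothetical helper: the real trainer operates on tensors, not lists.
    """
    # sorted() is stable, so sequences keep their order within a slot.
    ordered = sorted(sequences, key=lambda item: item[0])
    packed_tokens = [tok for _, toks in ordered for tok in toks]
    tokens_per_adapter_slot = [0] * num_slots
    for slot, toks in ordered:
        tokens_per_adapter_slot[slot] += len(toks)
    return packed_tokens, tokens_per_adapter_slot


# Four sequences owned by slots 2, 0, 2, and 1; slot 3 is idle.
micro_batch = [(2, [7, 8]), (0, [1, 2, 3]), (2, [9]), (1, [4])]
packed, counts = pack_by_slot(micro_batch, num_slots=4)
print(packed)  # [1, 2, 3, 4, 7, 8, 9]
print(counts)  # [3, 1, 3, 0]
```

Idle slots simply report zero tokens, so the vector always has one entry per slot regardless of which adapters appear in the micro-batch.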

Since the tokens are grouped together with their offsets known, we can use a grouped matrix multiply for both the LoRA A and LoRA B matrices. Our modified forward pass looks like this:

[Figure: Multi-LoRA linear forward pass]

Rank Masking and Slot Lifecycle

Our multi-LoRA trainer also supports training adapters of different ranks, up to a maximum rank (R) that is set when MultiLoRA is initialized. For ranks lower than the maximum, we zero out the unused rows of A and columns of B. This simple implementation plays well with other moving parts of the framework such as the optimizer, but likely has room for further optimization.
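
The effect of rank masking can be checked numerically. The NumPy sketch below assumes A is stored as (R, d_in) and B as (d_out, R), and that masking is a plain multiply by a 0/1 vector; the helper rank_mask and all shapes are illustrative, not the trainer's code.

```python
import numpy as np


def rank_mask(max_rank, rank):
    """0/1 vector that keeps the first `rank` of `max_rank` components."""
    mask = np.zeros(max_rank)
    mask[:rank] = 1.0
    return mask


rng = np.random.default_rng(0)
R, d_in, d_out, r = 8, 16, 12, 4      # slot allocated at rank 8, used at rank 4
A = rng.standard_normal((R, d_in))    # LoRA A, assumed layout (R, d_in)
B = rng.standard_normal((d_out, R))   # LoRA B, assumed layout (d_out, R)

mask = rank_mask(R, r)
A_masked = A * mask[:, None]          # zero the unused rows of A
B_masked = B * mask[None, :]          # zero the unused columns of B

x = rng.standard_normal((5, d_in))
full = x @ A_masked.T @ B_masked.T    # max-rank matmul on masked weights
small = x @ A[:r].T @ B[:, :r].T      # true rank-r matmul
print(np.allclose(full, small))
```

The masked rank-R product matches the rank-r product, which is why every slot can share one tensor shape at the cost of some wasted compute on the zeroed components.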

Slot lifecycle is controlled with small model-wide helpers:

  • init_adapter_slot(model, idx, rank, alpha)
  • clear_adapter_slot(model, idx)
  • load_adapter(model, idx, state_dict)
  • set_tokens_per_adapter_slot(model, tokens_per_adapter)

The goal of the slot lifecycle helpers is to provide slot orchestration as a user API while keeping the layer responsible for the local invariants.

Grouped GEMM Forward

The base model runs normally for each target linear. We add the LoRA path on top:

$$ y = W x + \Delta_a x $$

Where each LoRA adapter still computes:

$$ \Delta_a = \frac{\alpha_a}{r_a} B_a A_a $$

A naive way to compute the per-adapter output is to iterate over each adapter:

for adapter in adapters:
    out[adapter_tokens] = B_a(A_a(x[adapter_tokens]))

However, processing the LoRA adapters sequentially scales poorly, because each adapter launches its own kernels for a pair of small GEMMs.

Instead, MultiLoRALinear stacks the raw LoRA weights:

stacked_A = torch.stack([
    a.linear_in.weight
    for a in adapters
])
stacked_B = torch.stack([
    a.linear_out.weight
    for a in adapters
])

And uses torch._grouped_mm with offsets derived from tokens_per_adapter_slot.cumsum(), like so:

$$ m_a = x_a A_a^\top $$

$$ o_a = m_a B_a^\top $$

In other words, all adapters are processed as one grouped operation over packed token ranges.
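
To pin down what the grouped operation computes, here is a NumPy reference that compares the per-adapter loop with a single batched expression over the packed tokens. It does not call torch._grouped_mm, the function names are hypothetical, and the alpha/r scaling is omitted. A real grouped GEMM kernel also avoids materializing per-token weights the way this sketch does.

```python
import numpy as np


def lora_loop(x, stacked_A, stacked_B, tokens_per_adapter_slot):
    """Naive reference: one pair of small matmuls per adapter slot."""
    ends = np.cumsum(tokens_per_adapter_slot)
    starts = ends - np.asarray(tokens_per_adapter_slot)
    out = np.zeros((x.shape[0], stacked_B.shape[1]))
    for a, (s, e) in enumerate(zip(starts, ends)):
        out[s:e] = (x[s:e] @ stacked_A[a].T) @ stacked_B[a].T
    return out


def lora_grouped_reference(x, stacked_A, stacked_B, tokens_per_adapter_slot):
    """Same math as one batched op: each token uses its own slot's weights."""
    num_slots = len(tokens_per_adapter_slot)
    slot_ids = np.repeat(np.arange(num_slots), tokens_per_adapter_slot)
    m = np.einsum("ti,tri->tr", x, stacked_A[slot_ids])     # m_a = x_a A_a^T
    return np.einsum("tr,tor->to", m, stacked_B[slot_ids])  # o_a = m_a B_a^T


rng = np.random.default_rng(0)
num_slots, R, d_in, d_out = 3, 4, 8, 6
stacked_A = rng.standard_normal((num_slots, R, d_in))
stacked_B = rng.standard_normal((num_slots, d_out, R))
tokens_per_adapter_slot = [3, 0, 2]   # slot 1 has no tokens in this batch
x = rng.standard_normal((sum(tokens_per_adapter_slot), d_in))

loop_out = lora_loop(x, stacked_A, stacked_B, tokens_per_adapter_slot)
grouped_out = lora_grouped_reference(x, stacked_A, stacked_B, tokens_per_adapter_slot)
print(np.allclose(loop_out, grouped_out))
```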

The tensor-parallel and sequence-parallel collectives are issued once for the grouped GEMMs, matching the base linear layout:

  • Column-parallel layers gather sequence-parallel input before LoRA matmul
  • Row-parallel layers reduce partial LoRA hidden states between A and B
  • Output layout is gathered or scattered to match the wrapped Megatron linear

This matters because we need to preserve Megatron's sharding semantics in order to properly use multi-LoRA. The wrapper should be invisible to the rest of the model: the caller still sees the same (linear_output, bias) structure.
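
The row-parallel case can be illustrated without any distributed runtime. The NumPy sketch below simulates two tensor-parallel ranks by splitting the input features and the matching columns of A, then uses a plain sum as a stand-in for the all-reduce. The shapes and the choice of two ranks are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d_in, rank, d_out, tp = 6, 8, 4, 10, 2
x = rng.standard_normal((tokens, d_in))
A = rng.standard_normal((rank, d_in))
B = rng.standard_normal((d_out, rank))

# Row-parallel input: each simulated TP rank holds a slice of the input
# features and the matching columns of A.
x_shards = np.split(x, tp, axis=1)
A_shards = np.split(A, tp, axis=1)

# Each rank computes a partial LoRA hidden state of shape (tokens, rank).
partials = [xs @ As.T for xs, As in zip(x_shards, A_shards)]

# Stand-in for the all-reduce between A and B.
m = sum(partials)
out = m @ B.T

print(np.allclose(out, x @ A.T @ B.T))          # matches the unsharded result
print(np.allclose(partials[0], x @ A.T))        # a single partial does not
```

Reducing between A and B means the collective moves a tensor of width rank rather than width d_out, which keeps the added communication small.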

Checkpoint Compatibility

One awkward systems detail is that existing Megatron-Bridge export paths expect a single adapter named .adapter, while our multi-adapter layer stores multiple adapters under .adapters.

Rather than implement our own export logic, we add a context manager expose_adapter_slot(model, idx) that temporarily exposes a target slot as .adapter. This allows us to reuse existing save/load/export code.

module.adapters[idx] -> module.adapter
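
A toy version of this pattern is sketched below. ToyMultiLoRALinear and expose_adapter_slot_sketch are hypothetical stand-ins that operate on a plain list of objects; the real helper walks a Megatron model and may differ in detail.

```python
from contextlib import contextmanager


class ToyMultiLoRALinear:
    """Stand-in for a wrapped layer; only the attribute layout matters."""

    def __init__(self, adapters):
        self.adapters = list(adapters)


@contextmanager
def expose_adapter_slot_sketch(modules, idx):
    """Temporarily alias adapters[idx] as .adapter on every module."""
    for module in modules:
        module.adapter = module.adapters[idx]
    try:
        yield modules
    finally:
        # Always remove the alias, even if the export code raises.
        for module in modules:
            del module.adapter


layers = [ToyMultiLoRALinear(["slot0-weights", "slot1-weights"])]
with expose_adapter_slot_sketch(layers, idx=1):
    # Code that only knows about `.adapter` runs unchanged in here.
    exported = [layer.adapter for layer in layers]
leaked = hasattr(layers[0], "adapter")
print(exported, leaked)  # ['slot1-weights'] False
```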

We also added hide_adapters(model) for base checkpoint loading, so Megatron-Bridge does not try to map adapter parameters to Hugging Face base model weights.

Why This Works

GRPO-style RL objectives already separate the trainable policy from a frozen reference model. In the DeepSeekMath/GRPO objective, the policy update is regularized by a KL term against $\pi_{\text{ref}}$:

$$ \mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min(\cdots) - \beta D_{\text{KL}} \left[ \pi_\theta \Vert \pi_{\text{ref}} \right] \right) \right] $$

For LoRA policy $a$, the model parameters are:

$$ \theta_a = \theta_0 + \Delta_a $$

where $\theta_0$ is the shared base model and $\Delta_a$ is the adapter-specific LoRA delta. If every LoRA adapter starts from the same base model, then the correct frozen reference for every one is the same model:

$$ \pi_{\text{ref}, a} = \pi_{\theta_0} $$

The policy logprob depends on the adapter:

$$ \log \pi_a(y_t \mid s_t) = \log \pi_{\theta_0 + \Delta_a}(y_t \mid s_t) $$

But the reference logprob does not:

$$ \log \pi_{\text{ref}}(y_t \mid s_t) = \log \pi_{\theta_0}(y_t \mid s_t) $$

Using the common sampled KL estimator from GRPO implementations, define:

$$ r_{a,t} = \frac{ \pi_{\text{ref}}(y_t \mid s_t) }{ \pi_a(y_t \mid s_t) } = \exp\left( \log \pi_{\text{ref}}(y_t \mid s_t) - \log \pi_a(y_t \mid s_t) \right) $$

Then the token-level KL penalty is:

$$ \widehat{D}_{\text{KL}, a,t} = r_{a,t} - \log r_{a,t} - 1 $$

Only the policy logprob changes with adapter $a$. The reference logprob is shared across all adapters for the same token state. Therefore, one frozen reference model can score the packed batch once, and each adapter's policy loss can reuse those reference logprobs.
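
The estimator above is easy to work through on a small packed batch. In the sketch below, the reference logprobs come from one hypothetical pass of the frozen base model, while each policy logprob comes from the adapter that owns the token. All numbers and function names are made up for illustration.

```python
import math
from collections import defaultdict


def k3_kl(ref_logprob, policy_logprob):
    """Token-level estimator r - log r - 1, with r = pi_ref / pi_a."""
    log_r = ref_logprob - policy_logprob
    return math.exp(log_r) - log_r - 1.0


def mean_kl_per_slot(slot_ids, ref_logprobs, policy_logprobs):
    """Average the token-level KL penalty separately for each adapter slot."""
    totals, counts = defaultdict(float), defaultdict(int)
    for slot, ref_lp, pol_lp in zip(slot_ids, ref_logprobs, policy_logprobs):
        totals[slot] += k3_kl(ref_lp, pol_lp)
        counts[slot] += 1
    return {slot: totals[slot] / counts[slot] for slot in totals}


slot_ids = [0, 0, 1, 1, 1]                        # packed batch, two adapters
ref_logprobs = [-1.2, -0.7, -2.3, -0.4, -1.5]     # one shared reference pass
policy_logprobs = [-1.2, -0.7, -2.0, -0.5, -1.5]  # from each token's adapter

kl = mean_kl_per_slot(slot_ids, ref_logprobs, policy_logprobs)
print(kl)
```

Slot 0 has not moved from the base model, so its penalty is exactly zero; slot 1 has drifted on two tokens and picks up a small positive penalty.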

Crucially, this shortcut holds only when all adapters share the same reference checkpoint. If adapter slots are warm-started from different base models, or if each adapter has its own nonzero reference adapter, then the reference pass must be grouped by reference identity too.

Inference Path

SGLang's LoRA serving design already allows for optimized multi-LoRA serving by implementing concepts from Punica and S-LoRA. Punica batches multi-tenant LoRA requests against a shared base model using segmented matrix-vector kernels. S-LoRA extends this with unified paging for adapter weights and KV cache, along with tensor-parallel strategies that enable scaling to thousands of adapters with low overhead.

For the most part, multi-LoRA serving works out of the box for RL post-training. SGLang already supports:

  • Multiple adapters in the same batch
  • A configurable number of adapters per batch via max_loras_per_batch
  • CPU-side loaded adapter limits
  • Both triton and csgmv LoRA back-ends

So the inference-side work became mostly scaling and integration by:

  • Running more SGLang replicas
  • Sharding LoRA adapters across replicas
  • Routing rollout requests to the replica that owns the LoRA adapter
  • Keeping replica size at four GPUs per inference engine
  • Keeping trainer size at four GPUs per trainer
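
The sharding and routing steps in this list can be sketched as follows. The modulo placement rule, the replica count of 16, and the request format are all assumptions chosen for illustration; the post does not state how adapters are actually assigned to replicas.

```python
from collections import defaultdict


def shard_adapters(adapter_ids, num_replicas):
    """Assign each adapter to one replica (assumed rule: id modulo replicas)."""
    shards = defaultdict(list)
    for adapter_id in adapter_ids:
        shards[adapter_id % num_replicas].append(adapter_id)
    return dict(shards)


def route(request, num_replicas):
    """Send a rollout request to the replica that owns its adapter."""
    return request["adapter_id"] % num_replicas


num_replicas = 16  # hypothetical replica count
shards = shard_adapters(range(4096), num_replicas)
print(len(shards), len(shards[0]))  # 16 256

target = route({"adapter_id": 4095, "prompt": "..."}, num_replicas)
print(target)  # 15
```

With a deterministic placement rule, the router needs no lookup table, and each replica only has to keep its own shard of adapters loaded.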

Results

To stress test our implementation, we used GSM8K as our training dataset to fine-tune Qwen3.5-35B-A3B LoRA policies. We tested up to 4,096 concurrent LoRA adapter instances live in the system. The training step time was under 5 minutes, roughly even between training and inference (~2.5 minutes each).

Our reference model and policy model show virtually no divergence in this setup. That is consistent with the structure of the experiment: every policy is a small LoRA delta off the same base model, and the shared reference remains the zero-adapter/base policy. The main value we need to report in the final version is the exact KL curve: mean token KL, p95 token KL, and ref-policy logprob delta over training.

The bottom line is that just training one LoRA adapter isn't enough. We wanted to ensure that thousands of policies can be trained as first-class RL actors without multiplying the base model footprint.

Challenges

The first challenge was lining up Megatron tensor-parallel semantics cleanly. LoRA is simple in single-GPU PyTorch, but Megatron linears are not all the same. Column-parallel and row-parallel layers need different collectives between the A and B projections. Sequence parallelism also changes whether the LoRA path sees full sequence tokens or a shard. The wrapper has to match the base linear exactly, or the residual addition becomes shape-correct but semantically wrong.

The second challenge was checkpoint compatibility. Our multi-LoRA implementation expects .adapters[idx]; existing Megatron-Bridge tooling expects .adapter. The temporary exposure context solved this without forcing a larger checkpoint format rewrite.

The third challenge was shape handling for the inference back-end. SGLang already had the right high-level support, but high-concurrency LoRA adapter serving creates back-end shape paths that small examples do not. We had to fix shape mismatches in both triton and csgmv paths before large packed batches worked reliably.

The final challenge was deciding what we shouldn't build. Punica and S-LoRA already solved a large part of the multi-LoRA adapter inference problem. Our leverage was higher on the training side: supporting concurrent training of many LoRA adapters on Megatron, then using SGLang's existing multi-LoRA serving design at a larger replica count.

Conclusion

Multi-LoRA training turns large-scale RL from "train many separate models" into "train many deltas around one base model." Miles gives us the distributed RL skeleton: Megatron for training, SGLang for rollout, and synchronization between them. By adding a grouped-GEMM multi-LoRA path to Megatron-Bridge and leaning on SGLang's Punica/S-LoRA-inspired serving support, we can run thousands of Qwen3.5-35B-A3B LoRA policies concurrently.

We also see additional benefits from multi-LoRA training aside from the resource efficiency gains:

  • Online loading and unloading with multi-LoRA bypasses the cold-start time for spinning up large base models for fine-tuning on a cluster, allowing for faster iteration
  • Multi-LoRA training can quickly scale up the number of post-trained expert models available for on-policy distillation, an increasingly popular post-training technique

At Osmosis, we enable developers to build task-specific models with RL that beat foundation models. Our multi-LoRA trainer is the backbone of our post-training platform, which we plan to make generally available later this year. If you're exploring post-training, reach out for access to our research preview.