Refactor `rl_offload_kv_cache_during_training` to offload KV cache to CPU while retaining fixed virtual address #3048

mathemakitten · 2026-01-22T19:58:19Z

What does this PR do ?

Previously, setting args.rl_offload_kv_cache_during_training offloaded the KV cache to CPU to save memory when switching to training mode, then reloaded it onto the device when switching back to inference. The reloaded KV cache tensor was instantiated at a new memory address, which required rebuilding the inference cudagraphs on every step.

Orthogonally, training cudagraphs previously could only be used when the KV cache, inference activations, and training activations could all be co-located within HBM.

Now, we suspend the physical memory underlying the KV cache tensor and offload the tensor contents to CPU when transitioning from inference to training, using torch_memory_saver. Notably, the virtual memory address of the KV cache tensor is kept consistent. Training cudagraphs can now be captured while accounting for this offload: since the KV cache tensor has been offloaded with torch_memory_saver.pause(), its CUDA allocation is released back to PyTorch’s caching allocator, so it is safe to reuse this memory for other training-side tensors.

When transitioning from training to inference, physical memory is simply re-bound to the same virtual address. This means that cudagraphs no longer need to be disabled between phases, since the KV cache has a consistent virtual address despite being offloaded/onloaded. This is compliant with partial rollouts.
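To make the address argument concrete, here is a toy, pure-Python model of the idea. It does not call torch_memory_saver or CUDA; `ToyPausableBuffer`, `ToyCapturedGraph`, and the fake address counter are hypothetical names invented for illustration. It only shows why a graph that recorded an address stays usable across pause/resume, but not across a fresh allocation.

```python
from itertools import count

# Fake "virtual address" generator; real addresses come from the CUDA driver.
_next_address = count(start=0x1000, step=0x1000)


class ToyPausableBuffer:
    """Toy KV cache: one fixed virtual address, detachable physical storage."""

    def __init__(self, size: int) -> None:
        self.virtual_address = next(_next_address)  # reserved once, never changes
        self._device = bytearray(size)              # stands in for device memory
        self._cpu_backup = None

    @property
    def resident(self) -> bool:
        return self._device is not None

    def pause(self) -> None:
        # Back the contents up on the host, then drop the device-side storage.
        self._cpu_backup = bytes(self._device)
        self._device = None

    def resume(self) -> None:
        # Re-create storage behind the same address and restore the contents.
        self._device = bytearray(self._cpu_backup)
        self._cpu_backup = None

    def write(self, offset: int, value: int) -> None:
        self._device[offset] = value

    def read(self, offset: int) -> int:
        return self._device[offset]


class ToyCapturedGraph:
    """Toy captured graph: it remembers the address it was captured with."""

    def __init__(self, buffer: ToyPausableBuffer) -> None:
        self.captured_address = buffer.virtual_address

    def is_valid_for(self, buffer: ToyPausableBuffer) -> bool:
        return buffer.resident and buffer.virtual_address == self.captured_address


kv_cache = ToyPausableBuffer(size=8)
kv_cache.write(0, 42)
graph = ToyCapturedGraph(kv_cache)

kv_cache.pause()                               # inference -> training
paused_valid = graph.is_valid_for(kv_cache)    # not replayable while paused
kv_cache.resume()                              # training -> inference
resumed_valid = graph.is_valid_for(kv_cache)   # same address, same contents

# Old behaviour, for contrast: a freshly allocated cache gets a new address.
reallocated = ToyPausableBuffer(size=8)
stale = graph.is_valid_for(reallocated)
```

In this model the graph only needs recapturing in the `reallocated` case, which mirrors the per-step rebuild that the previous implementation required.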

The result is a minor increase in max reserved memory, with lower peak allocated memory and a consistent memory usage pattern over many iterations. This replaces the old behaviour of args.rl_offload_kv_cache_during_training.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label `Expert Review`

(Step 2): Collect the expert reviewers' reviews

Attach the Expert Review label when your PR is ready for review.
GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

Add Final Review label
GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

megatron/core/inference/engines/dynamic_engine.py

tdene · 2026-01-22T21:17:30Z

For now, can we preserve both the UVM and torch memory saver functionality, since it seems like very minimal support to keep both?

You'd have to make sure the torch memory saver context is not used when UVM is turned on, by directly changing this block in dynamic_context.py:

        ctx_manager = (
            torch.cuda.use_mem_pool(self.unified_memory_mempool)
            if self.unified_memory_level > 0
            else nullcontext()
        )

so that the else branch becomes torch_memory_saver.region(tag="inference_context", enable_cpu_backup=True).
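One possible way to express that selection is sketched below. It is an illustration, not the code that was merged. `select_allocation_context` and its factory parameters are hypothetical, and the real contexts are replaced with recording fakes so the sketch runs without a GPU. In real use the two factories would wrap the `torch.cuda.use_mem_pool(...)` and `torch_memory_saver.region(...)` calls quoted above.

```python
from contextlib import contextmanager, nullcontext


def select_allocation_context(
    unified_memory_level, offload_kv_cache, make_uvm_context, make_offload_context
):
    """Pick the context under which the KV cache is allocated (hypothetical helper)."""
    if unified_memory_level > 0:
        return make_uvm_context()      # UVM takes priority when enabled
    if offload_kv_cache:
        return make_offload_context()  # pausable region with CPU backup
    return nullcontext()               # plain allocation, nothing special


events = []


@contextmanager
def fake_context(name):
    # Records enter/exit so the chosen branch is visible without any GPU.
    events.append(f"enter {name}")
    yield
    events.append(f"exit {name}")


with select_allocation_context(
    unified_memory_level=0,
    offload_kv_cache=True,
    make_uvm_context=lambda: fake_context("uvm"),
    make_offload_context=lambda: fake_context("offload"),
):
    events.append("allocate kv cache")
```

Passing factories rather than ready-made contexts means the unused branch is never constructed, so torch_memory_saver is only touched when offloading is actually requested.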

And you'd have to add to the if statement in the argument validation:

    if args.rl_offload_kv_cache_during_training and not args.inference_dynamic_batching_unified_memory_level:
        try:
            from torch_memory_saver import torch_memory_saver
        except ImportError:
            raise AssertionError("To use offload-kv-cache-during-training, `torch_memory_saver` must be installed. See https://github.com/fzyzcjy/torch_memory_saver.")
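The same guard can be written without importing the package at validation time. The sketch below is a variant under assumptions: `validate_kv_cache_offload` is a hypothetical helper, the `module_name` parameter exists only so the check can be exercised with a package that is certainly missing, and `SimpleNamespace` stands in for the parsed arguments. It uses `importlib.util.find_spec` from the standard library and keeps the `AssertionError` from the snippet above.

```python
from importlib.util import find_spec
from types import SimpleNamespace


def validate_kv_cache_offload(args, module_name="torch_memory_saver"):
    """Return True if the offload path needs `module_name`; raise if it is missing."""
    needs_saver = bool(
        args.rl_offload_kv_cache_during_training
        and not args.inference_dynamic_batching_unified_memory_level
    )
    # find_spec returns None for a top-level package that cannot be found.
    if needs_saver and find_spec(module_name) is None:
        raise AssertionError(
            f"To use offload-kv-cache-during-training, `{module_name}` must be "
            "installed. See https://github.com/fzyzcjy/torch_memory_saver."
        )
    return needs_saver


# With UVM enabled, the package is not required, so the check is skipped.
uvm_args = SimpleNamespace(
    rl_offload_kv_cache_during_training=True,
    inference_dynamic_batching_unified_memory_level=1,
)
uvm_needs_saver = validate_kv_cache_offload(uvm_args, module_name="surely_missing_pkg_xyz")
```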

EDIT: Mainly because I think we should run perf benchmarks with both options to check there's no unexpected behavior.

wdykas · 2026-01-26T20:41:42Z

megatron/training/arguments.py

    args.data_parallel_size = args.world_size // total_model_size

+    # Assert that `torch_memory_saver` is installed if offloading KV cache during RL.
+    if args.rl_offload_kv_cache_during_training:


Probably want to add an import check somewhere in core as well, since it's used there in the context.

mathemakitten · 2026-01-26

We also do a check in megatron/core/inference/contexts/dynamic_context.py!

https://github.com/NVIDIA/Megatron-LM/pull/3048/files#diff-aa4eebdd4ac24bc9e7937775652f49028a0c648aab781dcc8ff0f243a4abaf1cR61

wdykas · 2026-01-26

Yeah, but shouldn't we raise an error if someone thinks we are using it? Seems like secretly moving to a null context if it's not available will give users unexpected behavior.

mathemakitten · 2026-01-26

I've added an assert not self.offload_kv_cache and self.unified_memory_level in megatron/core/inference/contexts/dynamic_context.py for core. The check for self.offload_kv_cache is now sufficient; we need to keep the nullcontext path because we use it in the unified memory case.
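A small caution on the wording of that assert: the comment describes it informally, and the merged code may well be parenthesised. Read literally as Python, though, `not a and b` parses as `(not a) and b`, which is not the same as "the two options must not both be set". The sketch below compares the two readings; the function names are hypothetical.

```python
def literal_reading(offload_kv_cache, unified_memory_level):
    # Python binds `not` tighter than `and`: this is (not offload) and level.
    return not offload_kv_cache and unified_memory_level


def mutually_exclusive(offload_kv_cache, unified_memory_level):
    # Only reject the combination where both options are enabled.
    return not (offload_kv_cache and unified_memory_level)


# Each row: (offload, uvm level, literal reading passes?, exclusive reading passes?)
truth_table = [
    (offload, level, bool(literal_reading(offload, level)), mutually_exclusive(offload, level))
    for offload in (False, True)
    for level in (0, 1)
]

for row in truth_table:
    print(row)
```

Under the literal reading, an assert would also fail whenever unified memory is off, including the offload-only configuration this PR targets, so the parenthesised form is the one that matches the stated intent.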

wdykas

small comments but LGTM

mathemakitten added 3 commits January 22, 2026 11:56

Record inference allocator saved state

0044f2f

this runs with reasonable batch sizes

e964af9

Cleanup, guard import, replace rl_offload_kv_cache_during_training

f521e2d

mathemakitten requested review from a team as code owners January 22, 2026 19:58

copy-pr-bot bot temporarily deployed to nemo-ci January 22, 2026 19:58 Inactive

ko3n1g added this to the Core 0.16 milestone Jan 22, 2026

ko3n1g requested a review from a team January 22, 2026 19:58

copy-pr-bot bot had a problem deploying to nemo-ci January 22, 2026 19:58 Failure

copy-pr-bot bot temporarily deployed to nemo-ci January 22, 2026 19:58 Inactive

copy-pr-bot bot had a problem deploying to test January 22, 2026 19:59 Error

Remove extra arg

5ac0b44

copy-pr-bot bot temporarily deployed to nemo-ci January 22, 2026 20:06 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci January 22, 2026 20:07 Failure

copy-pr-bot bot temporarily deployed to nemo-ci January 22, 2026 20:07 Inactive

copy-pr-bot bot temporarily deployed to test January 22, 2026 20:07 Inactive

wdykas reviewed Jan 22, 2026

megatron/core/inference/engines/dynamic_engine.py

Add offload_kv_cache to DynamicInferenceContext to flip UVM on/off and add guards

f9632f7

copy-pr-bot bot temporarily deployed to nemo-ci January 22, 2026 21:34 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 22, 2026 21:35 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci January 22, 2026 21:35 Failure

copy-pr-bot bot temporarily deployed to test January 22, 2026 21:35 Inactive

tdene approved these changes Jan 23, 2026

wdykas reviewed Jan 26, 2026

wdykas approved these changes Jan 26, 2026

Assert not self.offload_kv_cache and self.unified_memory_level

ecf31ce

copy-pr-bot bot temporarily deployed to nemo-ci January 26, 2026 21:34 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci January 26, 2026 21:34 Failure

copy-pr-bot bot temporarily deployed to nemo-ci January 26, 2026 21:34 Inactive

copy-pr-bot bot temporarily deployed to test January 26, 2026 21:35 Inactive

shanmugamr1992 approved these changes Jan 26, 2026
