Conversation

kainzhong (Collaborator) commented on Jan 14, 2026

Description

Previously, if you used NVTE_BUILD_DEBUG=1 to get a debug build of TE, it took forever to build cutlass_grouped_gemm.cu and hadamard_transform_cast_fusion.cu: they use CUTLASS, and without compiler optimization the generated .ptx files end up being hundreds of MB, so ptxas takes a very long time to assemble them into .cubin files.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

For source files that use CUTLASS, pass -dopt=on to enable device code optimizations. This significantly reduces the build time.
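
As a rough illustration of the kind of per-source override this enables, here is a minimal CMake sketch. The variable name, the file list, and the surrounding structure are assumptions for illustration only; the actual CMakeLists.txt change is not reproduced in this conversation, and only the flag names (-g0, -dopt=on) come from the PR discussion.

# Minimal sketch, not the actual diff: collect the CUTLASS-heavy sources
# (paths omitted) so their compile flags can be overridden per file.
set(NVTE_CUTLASS_SOURCES
    cutlass_grouped_gemm.cu
    hadamard_transform_cast_fusion.cu)  # plus the other Hadamard fusion kernels

# Per this PR's discussion: -g0 disables debug info and -dopt=on enables
# device code optimization for just these files, so the rest of the library
# still gets a normal debug build under NVTE_BUILD_DEBUG=1.
set_source_files_properties(${NVTE_CUTLASS_SOURCES} PROPERTIES
                            COMPILE_OPTIONS "-g0;-dopt=on")

Per the review summary below, the SM90a --generate-code flag for cutlass_grouped_gemm.cu is applied separately from these debug/optimization flags.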

I tested this by building on an L40 server with 32 AMD cores. TE builds successfully, with Total time for bdist_wheel: 399.87 seconds.

And I can confirm the result is a debug build:

$ file build/lib.linux-x86_64-cpython-312/transformer_engine/libtransformer_engine.so
build/lib.linux-x86_64-cpython-312/transformer_engine/libtransformer_engine.so: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, BuildID[sha1]=0fc65e16d808a5f807dc02b0ebc970019b1cc8db, with debug_info, not stripped

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

kainzhong force-pushed the fix/enable_opt_cutlass branch from 0e5eb5f to 02eb82c on January 14, 2026 01:06
kainzhong marked this pull request as ready for review on January 14, 2026 01:19
greptile-apps bot commented Jan 14, 2026

Greptile Summary

Added the -dopt=on flag to CUTLASS kernel sources to enable device code optimizations during debug builds, avoiding extremely long compile times.

Key changes:

  • Expanded optimization flags from just cutlass_grouped_gemm.cu to all 4 CUTLASS kernel files
  • Added -dopt=on flag alongside existing -g0 to enable optimizations while disabling debug symbols
  • Refactored flag application: SM90a architecture flag now separate from debug/optimization flags for cleaner organization
  • Files affected: cutlass_grouped_gemm.cu and 3 hadamard transform fusion kernels

Rationale:
CUTLASS templates generate massive PTX files (hundreds of MB) without optimization, making ptxas assembly take an extremely long time. The -dopt=on flag enables device optimizations even in debug mode (NVTE_BUILD_DEBUG=1), reducing the PTX to a manageable size. The author confirmed a successful build in ~400 seconds with debug symbols preserved in the rest of the codebase.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - it's a targeted build optimization fix
  • The change is narrowly scoped to compilation flags for specific CUTLASS kernel files, addresses a real build time issue (infinite compile time in debug mode), uses valid NVCC flags, and has been tested successfully by the author. The flags only affect 4 specific source files and don't alter runtime behavior.
  • No files require special attention

Important Files Changed

Filename: transformer_engine/common/CMakeLists.txt
Overview: Adds the -dopt=on flag to CUTLASS kernel sources to enable optimizations in debug builds, preventing extremely long compile times, and refactors the compilation flags for better organization.

Sequence Diagram

sequenceDiagram
    participant CMake as CMake Build System
    participant NVCC as NVCC Compiler
    participant CUTLASS as CUTLASS Kernels
    
    Note over CMake: Set CUTLASS_KERNEL_SOURCES list
    Note over CMake: - cutlass_grouped_gemm.cu<br/>- hadamard_transform kernels (3 files)
    
    CMake->>NVCC: Compile cutlass_grouped_gemm.cu
    Note over NVCC: Apply flags:<br/>--generate-code=arch=compute_90a,code=sm_90a<br/>-g0 (disable debug info)<br/>-dopt=on (enable optimizations)
    
    NVCC->>CUTLASS: Process CUTLASS templates with optimization
    Note over CUTLASS: Optimizations prevent<br/>excessive compile time
    CUTLASS-->>NVCC: Optimized PTX (manageable size)
    NVCC-->>CMake: Compiled CUBIN
    
    loop For each Hadamard transform kernel
        CMake->>NVCC: Compile hadamard_transform_*.cu
        Note over NVCC: Apply flags:<br/>-g0<br/>-dopt=on
        NVCC->>CUTLASS: Process with optimization
        CUTLASS-->>NVCC: Optimized PTX
        NVCC-->>CMake: Compiled CUBIN
    end
    
    Note over CMake: All CUTLASS kernels<br/>compiled successfully<br/>in reasonable time

greptile-apps bot commented Jan 14, 2026

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

timmoon10 self-requested a review on January 14, 2026 02:03
timmoon10 added the build (Build system) label on Jan 14, 2026
kainzhong force-pushed the fix/enable_opt_cutlass branch 3 times, most recently from c73a260 to 116bf35 on January 15, 2026 01:30
kainzhong force-pushed the fix/enable_opt_cutlass branch from 116bf35 to 27ef4d9 on January 15, 2026 01:32
ptrendx (Member) commented on Jan 15, 2026

/te-ci

timmoon10 merged commit 6a34b65 into NVIDIA:main on Jan 15, 2026
39 of 42 checks passed