Conversation

kainzhong (Collaborator) commented on Jan 14, 2026

Description

Previously, if you used NVTE_BUILD_DEBUG=1 to get a debug build of TE, it took forever to build cutlass_grouped_gemm.cu and hadamard_transform_cast_fusion.cu: they use CUTLASS, and without compiler optimization the generated .ptx files end up being hundreds of MB, so ptxas takes a very long time to assemble them into .cubin files.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

For source files that use CUTLASS, pass -dopt=on to enable device code optimizations. This significantly reduces the build time.
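
As a rough illustration of the kind of per-source override this enables, here is a minimal CMake sketch. The variable name, the file list, and the surrounding structure are assumptions for illustration only; the actual CMakeLists.txt change is not reproduced in this conversation, and only the flag names (-g0, -dopt=on) come from the PR discussion.

# Minimal sketch, not the actual diff: collect the CUTLASS-heavy sources
# (paths omitted) so their compile flags can be overridden per file.
set(NVTE_CUTLASS_SOURCES
    cutlass_grouped_gemm.cu
    hadamard_transform_cast_fusion.cu)  # plus the other Hadamard fusion kernels

# Per this PR's discussion: -g0 disables debug info and -dopt=on enables
# device code optimization for just these files, so the rest of the library
# still gets a normal debug build under NVTE_BUILD_DEBUG=1.
set_source_files_properties(${NVTE_CUTLASS_SOURCES} PROPERTIES
                            COMPILE_OPTIONS "-g0;-dopt=on")

Per the review summary below, the SM90a --generate-code flag for cutlass_grouped_gemm.cu is applied separately from these debug/optimization flags.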

I tested this by building on an L40 server with 32 AMD cores. TE builds successfully, with Total time for bdist_wheel: 399.87 seconds.

And I can confirm the result is a debug build:

$ file build/lib.linux-x86_64-cpython-312/transformer_engine/libtransformer_engine.so
build/lib.linux-x86_64-cpython-312/transformer_engine/libtransformer_engine.so: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, BuildID[sha1]=0fc65e16d808a5f807dc02b0ebc970019b1cc8db, with debug_info, not stripped

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

kainzhong force-pushed the fix/enable_opt_cutlass branch from 0e5eb5f to 02eb82c on January 14, 2026 01:06
kainzhong marked this pull request as ready for review on January 14, 2026 01:19
greptile-apps bot commented Jan 14, 2026

Greptile Summary

Added the -dopt=on flag to CUTLASS kernel sources to enable device code optimizations during debug builds, avoiding extremely long compile times.

Key changes:

  • Expanded optimization flags from just cutlass_grouped_gemm.cu to all 4 CUTLASS kernel files
  • Added -dopt=on flag alongside existing -g0 to enable optimizations while disabling debug symbols
  • Refactored flag application: SM90a architecture flag now separate from debug/optimization flags for cleaner organization
  • Files affected: cutlass_grouped_gemm.cu and 3 hadamard transform fusion kernels

Rationale:
CUTLASS templates generate massive PTX files (hundreds of MB) without optimization, making ptxas assembly take an extremely long time. The -dopt=on flag enables device optimizations even in debug mode (NVTE_BUILD_DEBUG=1), reducing the PTX to a manageable size. The author confirmed a successful build in ~400 seconds with debug symbols preserved in the rest of the codebase.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - it's a targeted build optimization fix
  • The change is narrowly scoped to compilation flags for specific CUTLASS kernel files, addresses a real build time issue (infinite compile time in debug mode), uses valid NVCC flags, and has been tested successfully by the author. The flags only affect 4 specific source files and don't alter runtime behavior.
  • No files require special attention

Important Files Changed

Filename: transformer_engine/common/CMakeLists.txt
Overview: Adds the -dopt=on flag to CUTLASS kernel sources to enable optimizations in debug builds, preventing extremely long compile times, and refactors the compilation flags for better organization.

Sequence Diagram

sequenceDiagram
    participant CMake as CMake Build System
    participant NVCC as NVCC Compiler
    participant CUTLASS as CUTLASS Kernels
    
    Note over CMake: Set CUTLASS_KERNEL_SOURCES list
    Note over CMake: - cutlass_grouped_gemm.cu<br/>- hadamard_transform kernels (3 files)
    
    CMake->>NVCC: Compile cutlass_grouped_gemm.cu
    Note over NVCC: Apply flags:<br/>--generate-code=arch=compute_90a,code=sm_90a<br/>-g0 (disable debug info)<br/>-dopt=on (enable optimizations)
    
    NVCC->>CUTLASS: Process CUTLASS templates with optimization
    Note over CUTLASS: Optimizations prevent<br/>excessive compile time
    CUTLASS-->>NVCC: Optimized PTX (manageable size)
    NVCC-->>CMake: Compiled CUBIN
    
    loop For each Hadamard transform kernel
        CMake->>NVCC: Compile hadamard_transform_*.cu
        Note over NVCC: Apply flags:<br/>-g0<br/>-dopt=on
        NVCC->>CUTLASS: Process with optimization
        CUTLASS-->>NVCC: Optimized PTX
        NVCC-->>CMake: Compiled CUBIN
    end
    
    Note over CMake: All CUTLASS kernels<br/>compiled successfully<br/>in reasonable time

greptile-apps bot commented Jan 14, 2026

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

timmoon10 self-requested a review on January 14, 2026 02:03
timmoon10 added the build (Build system) label on Jan 14, 2026
kainzhong force-pushed the fix/enable_opt_cutlass branch 3 times, most recently from c73a260 to 116bf35 on January 15, 2026 01:30
kainzhong force-pushed the fix/enable_opt_cutlass branch from 116bf35 to 27ef4d9 on January 15, 2026 01:32
ptrendx (Member) commented on Jan 15, 2026

/te-ci

timmoon10 merged commit 6a34b65 into NVIDIA:main on Jan 15, 2026
39 of 42 checks passed