fix: enable opt for cutlass sources to avoid infinite compile time #2595
Greptile Summary
Confidence Score: 5/5
Sequence Diagram
sequenceDiagram
participant CMake as CMake Build System
participant NVCC as NVCC Compiler
participant CUTLASS as CUTLASS Kernels
Note over CMake: Set CUTLASS_KERNEL_SOURCES list
Note over CMake: - cutlass_grouped_gemm.cu<br/>- hadamard_transform kernels (3 files)
CMake->>NVCC: Compile cutlass_grouped_gemm.cu
Note over NVCC: Apply flags:<br/>--generate-code=arch=compute_90a,code=sm_90a<br/>-g0 (disable debug info)<br/>-dopt=on (enable optimizations)
NVCC->>CUTLASS: Process CUTLASS templates with optimization
Note over CUTLASS: Optimizations prevent<br/>infinite compile time
CUTLASS-->>NVCC: Optimized PTX (manageable size)
NVCC-->>CMake: Compiled CUBIN
loop For each Hadamard transform kernel
CMake->>NVCC: Compile hadamard_transform_*.cu
Note over NVCC: Apply flags:<br/>-g0<br/>-dopt=on
NVCC->>CUTLASS: Process with optimization
CUTLASS-->>NVCC: Optimized PTX
NVCC-->>CMake: Compiled CUBIN
end
Note over CMake: All CUTLASS kernels<br/>compiled successfully<br/>in reasonable time
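The flow in the diagram can be sketched in CMake roughly as follows. This is a minimal sketch, not the PR's actual diff: the variable name CUTLASS_KERNEL_SOURCES and the flags come from the diagram, while the bare file names (without directories) and the use of set_property with the COMPILE_OPTIONS source property are assumptions made for illustration. The diagram mentions three Hadamard transform files; only the one named in the PR description is listed here.

```cmake
# Sketch only: paths are assumed to be relative to the current CMakeLists.txt.
# The real list contains additional hadamard_transform kernels not named here.
set(CUTLASS_KERNEL_SOURCES
    cutlass_grouped_gemm.cu
    hadamard_transform_cast_fusion.cu)

# Per the diagram, the grouped GEMM kernel is additionally built for sm_90a.
set_property(
  SOURCE cutlass_grouped_gemm.cu
  APPEND PROPERTY COMPILE_OPTIONS
  "--generate-code=arch=compute_90a,code=sm_90a")

# Drop debug info (-g0) and keep device code optimization on (-dopt=on)
# for every CUTLASS-based source, even in a debug build.
set_property(
  SOURCE ${CUTLASS_KERNEL_SOURCES}
  APPEND PROPERTY COMPILE_OPTIONS "-g0;-dopt=on")
```

Scoping the options to individual source files, rather than to the whole target, leaves the rest of the library at the usual debug settings.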
Greptile found no issues!
Signed-off-by: Kaining Zhong <[email protected]>
/te-ci
Description
Previously, if you used NVTE_BUILD_DEBUG=1 to get a debug build of TE, it took forever to build cutlass_grouped_gemm.cu and hadamard_transform_cast_fusion.cu. Both files use cutlass, and without compiler optimization the generated .ptx files end up being hundreds of MBs, so ptxas spends a very long time assembling them into .cubin files.
Fixes # (issue)
Type of change
Changes
For source files that use cutlass, use -dopt=on to enable optimizations. This significantly reduces the build time.
I tested this by building on an L40 server with 32 AMD cores. TE builds successfully, and the build reports:
Total time for bdist_wheel: 399.87 seconds
I can also confirm that it is a debug build:
Checklist: