
Training hangs on H20 GPU after 1000+ iterations (A100 used in original setup) #7


Description

@gulizhoutao

Hi, thank you for the great work and the open-sourced code.

I'm currently trying to reproduce your baseline results using an H20 GPU. However, I noticed that the original paper used an A100 GPU, and due to driver and compatibility constraints on my side, I’m using a slightly different version of PyTorch than the one specified in your environment.

The issue I'm encountering is that training proceeds normally for the first few hundred iterations, but then hangs around iteration 1000–2000 with no error message or crash: GPU utilization flatlines and no further forward/backward steps complete. I've verified that this is not due to out-of-memory issues or a data loading bottleneck.
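In case it helps with triage, here is a minimal sketch of the watchdog I can add around the training loop to capture where the process is stuck when the hang occurs. It is generic, not tied to this repo; `step_fn` is a hypothetical per-iteration callable standing in for one forward/backward/optimizer step:

```python
import faulthandler
import sys

HANG_TIMEOUT_S = 300  # arbitrary; just needs to exceed any normal iteration time


def run_with_watchdog(step_fn, num_iterations):
    """Run step_fn(it) once per iteration; if a single iteration exceeds
    HANG_TIMEOUT_S, dump all Python thread stacks to stderr so the hang
    location (dataloader, collective, blocking CUDA call) becomes visible."""
    for it in range(num_iterations):
        faulthandler.dump_traceback_later(HANG_TIMEOUT_S, exit=False, file=sys.stderr)
        step_fn(it)  # hypothetical: one forward/backward/optimizer step
        faulthandler.cancel_dump_traceback_later()
```

Alternatively, running `py-spy dump --pid <PID>` against the stuck process gives a similar stack snapshot without modifying the training code. I can attach the resulting traces here if that would be useful.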

I suspect this may be due to hardware or framework-level differences (e.g., CUDA kernel incompatibility between A100 and H20), but I wanted to check whether anyone else has encountered similar behavior or if you have any suggestions (e.g., known incompatibilities, or workaround options).
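To narrow down whether the stall happens inside a CUDA kernel (which would point at an A100-vs-H20 compatibility issue) rather than in the data pipeline, I was planning to try something along these lines. This is only a sketch: the environment variable has to be set before CUDA is initialized, and the explicit synchronize makes a stall surface at the iteration that caused it; `step_fn` is again a hypothetical per-iteration callable:

```python
import os

# Must be set before torch initializes CUDA: forces every kernel launch to be
# synchronous, so a hanging or failing kernel is attributed to the exact op
# that launched it instead of a later synchronization point.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch


def synced_step(step_fn, it):
    """Run one training iteration and then block until the GPU has finished
    all queued work, so a kernel-level stall shows up at this iteration."""
    step_fn(it)
    torch.cuda.synchronize()
```

If you have any other diagnostics you'd like me to run, or known workarounds for this hardware/framework combination, please let me know.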

Here’s a brief summary of my setup:

GPU: NVIDIA H20

PyTorch: 2.6.0+cu122

CUDA: 12.2

Thanks in advance!
