
Training hangs on H20 GPU after 1000+ iterations (A100 used in original setup) #7


Description

@gulizhoutao

Hi, thank you for the great work and the open-sourced code.

I'm currently trying to reproduce your baseline results using an H20 GPU. However, I noticed that the original paper used an A100 GPU, and due to driver and compatibility constraints on my side, I’m using a slightly different version of PyTorch than the one specified in your environment.

The issue I'm encountering is that training proceeds normally for the first few hundred iterations, but then hangs around iteration 1000–2000 with no error message or crash: GPU utilization flatlines and no further forward/backward steps complete. I've verified that this is not due to out-of-memory issues or a data loading bottleneck.
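In case it helps with triage, here is a minimal sketch of the watchdog I can add around the training loop to capture where the process is stuck when the hang occurs. It is generic, not tied to this repo; `step_fn` is a hypothetical per-iteration callable standing in for one forward/backward/optimizer step:

```python
import faulthandler
import sys

HANG_TIMEOUT_S = 300  # arbitrary; just needs to exceed any normal iteration time


def run_with_watchdog(step_fn, num_iterations):
    """Run step_fn(it) once per iteration; if a single iteration exceeds
    HANG_TIMEOUT_S, dump all Python thread stacks to stderr so the hang
    location (dataloader, collective, blocking CUDA call) becomes visible."""
    for it in range(num_iterations):
        faulthandler.dump_traceback_later(HANG_TIMEOUT_S, exit=False, file=sys.stderr)
        step_fn(it)  # hypothetical: one forward/backward/optimizer step
        faulthandler.cancel_dump_traceback_later()
```

Alternatively, running `py-spy dump --pid <PID>` against the stuck process gives a similar stack snapshot without modifying the training code. I can attach the resulting traces here if that would be useful.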

I suspect this may be due to hardware or framework-level differences (e.g., CUDA kernel incompatibility between A100 and H20), but I wanted to check whether anyone else has encountered similar behavior or if you have any suggestions (e.g., known incompatibilities, or workaround options).
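To narrow down whether the stall happens inside a CUDA kernel (which would point at an A100-vs-H20 compatibility issue) rather than in the data pipeline, I was planning to try something along these lines. This is only a sketch: the environment variable has to be set before CUDA is initialized, and the explicit synchronize makes a stall surface at the iteration that caused it; `step_fn` is again a hypothetical per-iteration callable:

```python
import os

# Must be set before torch initializes CUDA: forces every kernel launch to be
# synchronous, so a hanging or failing kernel is attributed to the exact op
# that launched it instead of a later synchronization point.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch


def synced_step(step_fn, it):
    """Run one training iteration and then block until the GPU has finished
    all queued work, so a kernel-level stall shows up at this iteration."""
    step_fn(it)
    torch.cuda.synchronize()
```

If you have any other diagnostics you'd like me to run, or known workarounds for this hardware/framework combination, please let me know.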

Here’s a brief summary of my setup:

GPU: NVIDIA H20

PyTorch: 2.6.0+cu122

CUDA: 12.2

Thanks in advance!
