
Conversation

Contributor
@yukirora yukirora commented Sep 8, 2025

Description
gpu burn: collect per-snapshot per-GPU flops/temp and add summary metrics

Major Revision

  • Parse all performance snapshot lines containing "Gflop/s" and record per-snapshot, per-GPU metrics: gpu_<snap_idx>_gflops:<gpu_index> and gpu_<snap_idx>_temp:<gpu_index>
  • Aggregate per-GPU statistics across snapshots:
    • Per-GPU average flops: gpu_avg_gflops:<gpu_index>
    • Per-GPU flops variability metric: gpu_var_gflops:<gpu_index> (a simple max-min based metric)
    • Per-GPU max temperature: gpu_max_temp:<gpu_index>
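The aggregation described above can be sketched as follows. This is a hypothetical helper, not the PR's actual code: the metric names mirror the description (gpu_avg_gflops:<gpu_index>, gpu_var_gflops:<gpu_index>, gpu_max_temp:<gpu_index>), but the function name and dict-based return value are assumptions for illustration.

```python
def summarize(per_gpu_flops, per_gpu_temps):
    """Aggregate per-snapshot samples into per-GPU summary metrics.

    per_gpu_flops / per_gpu_temps: dict mapping GPU index -> list of samples
    collected across snapshots. Returns a dict of summary metrics keyed with
    the naming scheme from the PR description.
    """
    results = {}
    for i, flops in per_gpu_flops.items():
        if flops:
            # Per-GPU average flops across all snapshots.
            results[f'gpu_avg_gflops:{i}'] = sum(flops) / len(flops)
            # Simple max-min based variability metric.
            results[f'gpu_var_gflops:{i}'] = max(flops) - min(flops)
    for i, temps in per_gpu_temps.items():
        if temps:
            # Per-GPU maximum observed temperature.
            results[f'gpu_max_temp:{i}'] = max(temps)
    return results
```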

@yukirora yukirora requested a review from a team as a code owner September 8, 2025 11:16
@yukirora yukirora added the benchmarks SuperBench Benchmarks label Sep 8, 2025
codecov bot commented Sep 9, 2025

Codecov Report

❌ Patch coverage is 95.23810% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.88%. Comparing base (0b4311c) to head (80eecae).
⚠️ Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
...bench/benchmarks/micro_benchmarks/gpu_burn_test.py 95.23% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #735      +/-   ##
==========================================
- Coverage   86.47%   85.88%   -0.59%     
==========================================
  Files         102      102              
  Lines        7541     8204     +663     
==========================================
+ Hits         6521     7046     +525     
- Misses       1020     1158     +138     
Flag Coverage Δ
cpu-python3.10-unit-test 71.09% <95.23%> (-0.51%) ⬇️
cpu-python3.12-unit-test 71.09% <95.23%> (-0.51%) ⬇️
cpu-python3.7-unit-test 70.54% <95.23%> (-0.12%) ⬇️
cuda-unit-test 83.67% <95.23%> (-0.32%) ⬇️

Flags with carried forward coverage won't be shown.


@cp5555 cp5555 added the micro-benchmarks Micro Benchmark Test for SuperBench Benchmarks label Sep 18, 2025
    per_gpu_flops[i] = []
if i not in per_gpu_temps:
    per_gpu_temps[i] = []
if i < len(gflops) and gflops[i] > 0:
Contributor
Why do we need to check i < len(gflops)? I think it's not your goal.

Contributor Author
This parses one result line, which looks like:

50.0% proc'd: 2261 (7150 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s) errors: 0 - 0 - 0 - 0 - 0 - 0 - 0 - 0 temps: 59 C - 56 C - 55 C - 57 C - 56 C - 37 C - 38 C - 39 C

The check (i < len(gflops)) handles the case where a line is missing the value for some GPU, so that len(gflops) < num_gpus.

Or do you think that when num_gpus > len(gflops), I should skip this line and set an error return code?
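A snapshot line of the shape quoted above can be parsed with two simple regexes, one per sequence. This is an illustrative sketch of the parsing step, not the PR's exact implementation; the function name and regex patterns are my own assumptions.

```python
import re


def parse_snapshot(line):
    """Extract per-GPU Gflop/s and temperature samples from one gpu-burn line.

    Gflop/s values appear as '(NNNN Gflop/s)' and temperatures as 'NN C';
    each list may be shorter than the GPU count if a value is missing,
    which is why callers guard with i < len(gflops).
    """
    gflops = [float(m) for m in re.findall(r'\(([\d.]+) Gflop/s\)', line)]
    temps = [int(m) for m in re.findall(r'(\d+) C', line)]
    return gflops, temps
```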

    per_gpu_flops[i].append(gflops[i])
else:
    self._result.add_result(f'gpu_{snap_idx}_gflops:{i}', 0.0)
if i < len(temps):
Contributor
It's the same question as previous one.

assert (benchmark._process_raw_result(0, results))
assert (benchmark.result['return_code'][0] == 0)
assert (benchmark.result['time'][0] == time)
for device in range(8):
Contributor
Should we add some correctness check?

Contributor Author
Updated the correctness check according to the current static test data file.

@guoshzhao guoshzhao mentioned this pull request Oct 2, 2025
30 tasks
