Collaborator

@rickgrubin-noaa rickgrubin-noaa commented Dec 30, 2025

Description

Updates to EPIC hosts (RDHPCS on premises, not yet CSPs) for spack-stack/release/2.0

Dependencies

None

Issues addressed

Working towards #1835

Applications affected

None

Systems affected

  • derecho
  • gaea-c6
  • hercules
  • orion
  • ursa

N.B. Once environments are installed on EPIC hosts, the following modulefiles require manual editing:

  • <env dir>/modules/Core/stack-<compiler>/<version>.lua
  • <env dir>/modules/<compiler>/stack-<mpi>/<version>.lua

In each module file, reverse the order of the following two stanzas:

  • -- spack compiler module hierarchy
  • -- prerequisite modules
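
The manual edit above could be scripted. The following is a minimal sketch, not part of this PR: `swap_stanzas` is a hypothetical helper, and it assumes that each stanza is a block separated from its neighbours by blank lines and that the block's first line is exactly the comment shown above. If the generated modulefiles are laid out differently, the stanza order is left as it was and the script needs adapting.

```shell
#!/usr/bin/env bash
# swap_stanzas FILE
# Print FILE with the "-- spack compiler module hierarchy" and
# "-- prerequisite modules" stanzas exchanged.
# Assumption: stanzas are blank-line-delimited blocks whose first line is
# the comment itself. Runs of blank lines are collapsed to a single one.
swap_stanzas() {
    awk -v first='-- spack compiler module hierarchy' \
        -v second='-- prerequisite modules' '
        BEGIN { RS = "" }                       # paragraph mode
        {
            block[NR] = $0
            if (index($0, first) == 1)  a = NR  # block starts with marker
            if (index($0, second) == 1) b = NR
        }
        END {
            if (a && b) { tmp = block[a]; block[a] = block[b]; block[b] = tmp }
            for (i = 1; i <= NR; i++)
                printf "%s%s\n", (i > 1 ? "\n" : ""), block[i]
        }' "$1"
}
```

Writing the result to a new file, for example `swap_stanzas <version>.lua > <version>.lua.new`, and diffing it against the original before replacing anything keeps the edit easy to review.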

RDHPCS hosts often (always?) provide modules for commonly used packages (e.g. hdf5, netcdf-c) that are also built within the stack. Loading system compiler / mpi modules after loading core stack components leads to confusion later on, as MODULEPATH will then necessarily prefer system-provided package modules rather than stack-provided package modules.
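
One way to see the effect described above is to list every MODULEPATH directory that offers a given package, in search order. This is a minimal sketch: `list_module_providers` is a hypothetical helper, and it only checks for a `<dir>/<name>` subdirectory. It does not reproduce Lmod's own selection rules, so the output is a list of candidates rather than the module Lmod will actually load.

```shell
#!/usr/bin/env bash
# list_module_providers NAME [PATH_LIST]
# Print, in search order, each directory of a MODULEPATH-style
# colon-separated list that contains a NAME/ modulefile directory.
# PATH_LIST defaults to the current $MODULEPATH.
list_module_providers() {
    local name="$1" path_list="${2:-${MODULEPATH:-}}" dir
    local IFS=':'
    for dir in $path_list; do
        [ -n "$dir" ] || continue          # skip empty entries ("a::b")
        if [ -d "$dir/$name" ]; then
            printf '%s\n' "$dir"
        fi
    done
    return 0
}
```

Running `list_module_providers hdf5` after loading the stack meta-modules shows whether a system directory is listed ahead of the stack's own directory for that package.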

Testing

  • CI: Note whether the automatic tests (GitHub actions tests that run automatically for every commit) pass or not
    • GitHub actions CI tests pass
    • GitHub actions CI tests do not pass (provide explanation)
    • GitHub actions CI tests skipped (provide explanation if necessary)
  • New tests added: List and describe any new tests added to GitHub actions
    • ...
  • Additional testing: Add information on any additional tests conducted
    • ...

Checklist

  • This PR addresses one issue/problem/enhancement or has a very good reason for not doing so.
  • These changes have been tested on the affected systems and applications.
  • All dependency PRs/issues have been resolved and this PR can be merged.
  • All necessary updates to the documentation on readthedocs are included in this PR
    • For site config updates, check in particular doc/source/PreConfiguredSites.rst and doc/source/MaintainersSection.rst
  • All necessary updates to the spack-stack wiki will be made when this PR is merged

climbfuji and others added 30 commits December 11, 2025 17:53
Collaborator

@climbfuji climbfuji left a comment

It's a bit unsatisfactory and confusing why these manual modifications to the modulepath / load order in the meta-modules are required for these platforms, but for none of the others (Acorn, NRL systems).

But ok, we can hopefully fix that for 2.1.0 so that no manual modifications are required.

Collaborator

Any particular reason this has the suffix DO NOT USE, whereas the one on Orion doesn't?

Collaborator Author

[email protected] is not yet installed on hercules; that is expected to happen after the new year.

The pathing is all wrong (it's orion's pathing) but having the yaml file in place makes it easier to edit paths when [email protected] is installed on hercules.

Collaborator

Why do you need to maintain two different gcc versions for Gaea C6 (and Derecho)?

Collaborator Author

We don't need to maintain two different versions on either host.

Because building envs with GNU was difficult on those hosts (GNU is not a requirement on gaea-c6; I was only trying it out to confirm configurations), I kept two versions in the overly hopeful expectation that one of them would work.

Not a problem to pick one version and remove the extraneous one.

Collaborator

Can you switch between this and the non-hpcx version with --compiler=oneapi-2025.2.1-hpcx and --compiler=oneapi-2025.2.1?

Collaborator Author

Indeed you can. 😄

@AlexanderRichert-NOAA
Collaborator

> It's a bit unsatisfactory and confusing why these manual modifications to the modulepath / load order in the meta-modules are required for these platforms, but for none of the others (Acorn, NRL systems).
>
> But ok, we can hopefully fix that for 2.1.0 so that no manual modifications are required.

Not sure I 100% understand the issue, but we've run into similar issues to what @rickgrubin-noaa described and resolved them at least in part by setting LMOD_TMOD_FIND_FIRST. On Acorn/WCOSS2 we also unset some paths to be sure to avoid loading the NCO-installed modules. Nobody (at EMC/NCO) ever agrees with me about adding hashes to the module versions to avoid this and other issues, but it's an option :)

@climbfuji
Collaborator

>> It's a bit unsatisfactory and confusing why these manual modifications to the modulepath / load order in the meta-modules are required for these platforms, but for none of the others (Acorn, NRL systems).
>> But ok, we can hopefully fix that for 2.1.0 so that no manual modifications are required.
>
> Not sure I 100% understand the issue, but we've run into similar issues to what @rickgrubin-noaa described and resolved them at least in part by setting LMOD_TMOD_FIND_FIRST. On Acorn/WCOSS2 we also unset some paths to be sure to avoid loading the NCO-installed modules. Nobody (at EMC/NCO) ever agrees with me about adding hashes to the module versions to avoid this and other issues, but it's an option :)

We used to have LMOD_TMOD_FIND_FIRST for Derecho in the past, too, and I did suggest that as a possible solution when the issue was first mentioned. But maybe that doesn't work in this case.

@rickgrubin-noaa
Collaborator Author

>>> It's a bit unsatisfactory and confusing why these manual modifications to the modulepath / load order in the meta-modules are required for these platforms, but for none of the others (Acorn, NRL systems).
>>> But ok, we can hopefully fix that for 2.1.0 so that no manual modifications are required.
>>
>> Not sure I 100% understand the issue, but we've run into similar issues to what @rickgrubin-noaa described and resolved them at least in part by setting LMOD_TMOD_FIND_FIRST. On Acorn/WCOSS2 we also unset some paths to be sure to avoid loading the NCO-installed modules. Nobody (at EMC/NCO) ever agrees with me about adding hashes to the module versions to avoid this and other issues, but it's an option :)
>
> We used to have LMOD_TMOD_FIND_FIRST for Derecho in the past, too, and I did suggest that as a possible solution when the issue was first mentioned. But maybe that doesn't work in this case.

I don't recall using LMOD_TMOD_FIND_FIRST being suggested (but that's a user error) -- I will give it a try.

@climbfuji climbfuji mentioned this pull request Dec 31, 2025
55 tasks
@climbfuji
Collaborator

@rickgrubin-noaa Is this ready to merge?

@rickgrubin-noaa
Collaborator Author

>>> Not sure I 100% understand the issue, but we've run into similar issues to what @rickgrubin-noaa described and resolved them at least in part by setting LMOD_TMOD_FIND_FIRST. On Acorn/WCOSS2 we also unset some paths to be sure to avoid loading the NCO-installed modules. Nobody (at EMC/NCO) ever agrees with me about adding hashes to the module versions to avoid this and other issues, but it's an option :)
>>
>> We used to have LMOD_TMOD_FIND_FIRST for Derecho in the past, too, and I did suggest that as a possible solution when the issue was first mentioned. But maybe that doesn't work in this case.
>
> I don't recall using LMOD_TMOD_FIND_FIRST being suggested (but that's a user error) -- I will give it a try.

Setting

LMOD_TMOD_FIND_FIRST=yes

per Lmod docs did not have the desired effect. I'll spend a bit of time reading "N/V/V: Picking modules when there are multiple directories in MODULEPATH" and hopefully come up with something.

@rickgrubin-noaa
Collaborator Author

> @rickgrubin-noaa Is this ready to merge?

Other than deciding whether or not to choose one version of GNU compilers for derecho and gaea-c6, it's ready to merge.

@rickgrubin-noaa
Collaborator Author

>> @rickgrubin-noaa Is this ready to merge?
>
> Other than deciding whether or not to choose one version of GNU compilers for derecho and gaea-c6, it's ready to merge.

@climbfuji if this can wait a bit while I configure hercules for the latest oneAPI compiler + MPI -- MSU folks just sent notification of having installed it -- I can update that config and have it merged.

@climbfuji
Collaborator

>>> @rickgrubin-noaa Is this ready to merge?
>>
>> Other than deciding whether or not to choose one version of GNU compilers for derecho and gaea-c6, it's ready to merge.
>
> @climbfuji if this can wait a bit while I configure hercules for the latest oneAPI compiler + MPI -- MSU folks just sent notification of having installed it -- I can update that config and have it merged.

Ok, let's wait for the Hercules update, then merge, then package up the release. Thanks!

@natalie-perlin
Collaborator

@climbfuji @rickgrubin-noaa - Please note that a couple more packages were needed to build ufs-weather-model using the test installation ue-oneapi-2025.2.1.

See my comment on using [email protected] to resolve the previously reported build issue with MOM6; it built successfully once the spack-stack environment and modules had been built:
ufs-community/ufs-weather-model#2860 (comment)

A chained environment has been built with Rick's test install location as the upstream env, using the same repository [email protected]:rickgrubin-noaa/spack-stack.git and feature/derecho-release-2.0.
Packages added to the environment

Upstream environment:
/glade/derecho/scratch/grubin/spack-stack/envs/ue-oneapi-2025.2.1/install
Chained environment:
/glade/derecho/scratch/nperlin/spack-stack-2.0.0/envs/ue-oneapi-2025.2.1/install

@climbfuji
Collaborator

@natalie-perlin sp will not be supported by spack-stack-2.0.0. The UFS will have to update its codebase to use ip instead. nccmp is already part of spack-stack, maybe it's just missing in ufs-weather-model-env? Or is it built with gcc and something didn't work as intended for setting up the module paths?

@rickgrubin-noaa
Collaborator Author

> @natalie-perlin sp will not be supported by spack-stack-2.0.0. The UFS will have to update its codebase to use ip instead. nccmp is already part of spack-stack, maybe it's just missing in ufs-weather-model-env? Or is it built with gcc and something didn't work as intended for setting up the module paths?

nccmp is not specified in the ufs-weather-model-env module file; it is built into the stack at version 1.9.0.1 and can be loaded separately.

@climbfuji
Collaborator

>> @natalie-perlin sp will not be supported by spack-stack-2.0.0. The UFS will have to update its codebase to use ip instead. nccmp is already part of spack-stack, maybe it's just missing in ufs-weather-model-env? Or is it built with gcc and something didn't work as intended for setting up the module paths?
>
> nccmp is not specified in the ufs-weather-model-env module file; it is built into the stack at version 1.9.0.1 and can be loaded separately.

I think it's fine to load it explicitly for now, and then add it to ufs-weather-model-env in spack-stack-2.1.0 (as long as someone remembers or creates an issue for that).

@natalie-perlin
Collaborator

>> nccmp is not specified in the ufs-weather-model-env module file; it is built into the stack at version 1.9.0.1 and can be loaded separately.
>
> I think it's fine to load it explicitly for now, and then add it to ufs-weather-model-env in spack-stack-2.1.0 (as long as someone remembers or creates an issue for that).

Oh, OK, thank you. Since nccmp is not part of ufs-weather-model-env, does that mean that for the default ue-oneapi-2025.2.1 to contain nccmp, it needs to be added to spack.yaml, or rather to the corresponding scripts that generate spack.yaml when the environment is created?

When I built my chained environment using all the defaults and the upstream install tree /glade/derecho/scratch/grubin/spack-stack/envs/ue-oneapi-2025.2.1/install, nccmp did not appear as built (spack find nccmp returned nothing). If it is not part of the ufs-weather-model-env dependency tree and not explicitly specified in spack.yaml, this is expected, as I understand it.
Please correct me if I'm wrong!

@rickgrubin-noaa
Collaborator Author

>>> nccmp is not specified in the ufs-weather-model-env module file; it is built into the stack at version 1.9.0.1 and can be loaded separately.
>>
>> I think it's fine to load it explicitly for now, and then add it to ufs-weather-model-env in spack-stack-2.1.0 (as long as someone remembers or creates an issue for that).
>
> Oh, OK, thank you. Since nccmp is not part of ufs-weather-model-env, does that mean that for the default ue-oneapi-2025.2.1 to contain nccmp, it needs to be added to spack.yaml, or rather to the corresponding scripts that generate spack.yaml when the environment is created?
>
> When I built my chained environment using all the defaults and the upstream install tree /glade/derecho/scratch/grubin/spack-stack/envs/ue-oneapi-2025.2.1/install, nccmp did not appear as built (spack find nccmp returned nothing). If it is not part of the ufs-weather-model-env dependency tree and not explicitly specified in spack.yaml, this is expected, as I understand it. Please correct me if I'm wrong!

[email protected] is built into the stack by default; it is specified in <env>/common/packages.yaml. When running UFS WM, the corresponding module must be loaded.

Relative to the test stacks built on derecho:

From a user's perspective:

module use /glade/derecho/scratch/grubin/spack-stack/envs/ue-oneapi-2025.2.1/modules/Core
module load stack-intel-oneapi-compilers/2025.2.1
module load stack-cray-mpich/8.1.32
module load nccmp/1.9.0.1
module list

Currently Loaded Modules:
  1) intel/2025.2.1                              6) libfabric/1.22.0         11) zstd/1.5.7      16) netcdf-c/4.9.2
  2) stack-intel-oneapi-compilers/2025.2.1       7) stack-cray-mpich/8.1.32  12) c-blosc/1.21.6  17) nccmp/1.9.0.1
  3) ncarenv/25.10                         (S)   8) glibc/2.38               13) nghttp2/1.65.0
  4) craype/2.7.34                               9) snappy/1.2.1             14) curl/8.11.1
  5) cray-mpich/8.1.32                          10) zlib/1.2.11              15) hdf5/1.14.5

From a spack perspective:

cd /glade/derecho/scratch/grubin/spack-stack
. ./setup.sh
[...]
spack env activate -p envs/ue-oneapi-2025.2.1
spack find nccmp
==> In environment /glade/derecho/scratch/grubin/spack-stack/envs/ue-oneapi-2025.2.1
[...]
-- linux-sles15-zen3 / %c,[email protected] -------------------
[email protected]
==> 1 installed package
==> 0 concretized packages to be installed (show with `spack find -c`)

@rickgrubin-noaa
Collaborator Author

>>>> @rickgrubin-noaa Is this ready to merge?
>>>
>>> Other than deciding whether or not to choose one version of GNU compilers for derecho and gaea-c6, it's ready to merge.
>>
>> @climbfuji if this can wait a bit while I configure hercules for the latest oneAPI compiler + MPI -- MSU folks just sent notification of having installed it -- I can update that config and have it merged.
>
> Ok, let's wait for the Hercules update, then merge, then package up the release. Thanks!

@climbfuji -- all set for hercules.

@climbfuji
Collaborator

>>>>> @rickgrubin-noaa Is this ready to merge?
>>>>
>>>> Other than deciding whether or not to choose one version of GNU compilers for derecho and gaea-c6, it's ready to merge.
>>>
>>> @climbfuji if this can wait a bit while I configure hercules for the latest oneAPI compiler + MPI -- MSU folks just sent notification of having installed it -- I can update that config and have it merged.
>>
>> Ok, let's wait for the Hercules update, then merge, then package up the release. Thanks!
>
> @climbfuji -- all set for hercules.

Perfect, thanks! I'll merge this, tag the submodules, and create a PR for spack-stack to use those tags.

I will also share draft release notes momentarily.

@climbfuji climbfuji merged commit 5a3f82f into JCSDA:release/2.0 Jan 5, 2026
6 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in spack-stack-2.0.x (2025 Q4) Jan 5, 2026
@natalie-perlin
Collaborator

>>>>> @rickgrubin-noaa Is this ready to merge?
>>>>
>>>> Other than deciding whether or not to choose one version of GNU compilers for derecho and gaea-c6, it's ready to merge.
>>>
>>> @climbfuji if this can wait a bit while I configure hercules for the latest oneAPI compiler + MPI -- MSU folks just sent notification of having installed it -- I can update that config and have it merged.
>>
>> Ok, let's wait for the Hercules update, then merge, then package up the release. Thanks!
>
> @climbfuji -- all set for hercules.

@rickgrubin-noaa @climbfuji - Thank you for confirming!
I repeated the steps to build the chained environment in /glade/derecho/scratch/nperlin/spack-stack-2.0.0/envs/ue-test-oneapi-2025.2.1/, and this time nccmp did show up as built properly in the upstream:

-- linux-sles15-zen3 / %c,[email protected] -------------------
[email protected]
==> 1 installed package
