
Conversation

@hydazz

@hydazz hydazz commented Oct 19, 2025

This is a starter PR to add support for Talos Linux's non-standard NVIDIA driver paths.

Tested the gpu component with the changes here in my environment and it works.

Feedback is needed, as I'm unsure how to add the usr/local/glibc path to CDI nicely; I don't believe getTalosLibrarySearchPaths will cut it globally...

@copy-pr-bot

copy-pr-bot bot commented Oct 19, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@asymingt

I can independently confirm that this works on my Talos cluster. After installing with helm I finally see resource slices being made available on a machine with five RTX A4000 GPUs. Thank you, @hydazz!

$ kubectl get resourceslices
NAME                                 NODE            DRIVER           POOL            AGE
talos-pxs-ia1-gpu.nvidia.com-gsvft   talos-pxs-ia1   gpu.nvidia.com   talos-pxs-ia1   11m

For repeatability, you will need a container image to be built. I have pushed one to asymingt/k8s-dra-driver-gpu. You will need to modify this line to asymingt/k8s-dra-driver-gpu:v25.8.0-dev before installing the chart this way:

helm upgrade -i nvidia-dra-driver-gpu ./k8s-dra-driver-gpu/deployments/helm/nvidia-dra-driver-gpu   \
   --create-namespace  --namespace drivers  \
   --set gpuResourcesEnabledOverride=true   \
   --set resources.gpus.enabled=true  \
   --set resources.computeDomains.enabled=false \
   --wait

To optionally rebuild the container image, install docker + qemu-binfmt + buildx, check out this code, and run:

export IMAGE_NAME=<your_docker_hub>/k8s-dra-driver-gpu
export VERSION=v25.8.0-dev
export PUSH_ON_BUILD=true
export BUILD_MULTI_ARCH_IMAGES=true

make -f deployments/container/Makefile build

@hydazz
Author

hydazz commented Nov 15, 2025

I believe this is set to be fixed on the Talos side: they would install the NVIDIA components where this driver already expects them, rather than the other way around.

@asymingt

asymingt commented Nov 16, 2025

While we wait for Talos to update its driver install location, I've been trying to get MPS working on Talos using this PR branch and the following helm values.

gpuResourcesEnabledOverride: true
resources:
  gpus:
    enabled: true
  computeDomains:
    enabled: false
featureGates:
  MPSSupport: true

Looks like the mps-control-daemon keeps restarting with the following error:

$ k logs mps-control-daemon-49e0f7b0-e884-4ab0-ac35-b50bca50f681-e4dlqgl -n drivers
chroot: can't execute 'sh': No such file or directory

It's probably related to this issue: #469

I've opened a PR to fix it on your branch: hydazz#1

@klueska klueska added the feature issue/PR that proposes a new feature or functionality label Nov 24, 2025
@klueska klueska added this to the unscheduled milestone Nov 24, 2025
@klueska
Collaborator

klueska commented Nov 24, 2025

@hydazz given your comment about Talos adjusting themselves to accommodate the existing search paths, how would you propose moving forward with this PR?

@hydazz
Author

hydazz commented Nov 25, 2025

@hydazz given your comment about Talos adjusting themselves to accommodate the existing search paths, how would you propose moving forward with this PR?

@klueska I don't have definitive knowledge; I just inferred that conclusion based on:
https://discord.com/channels/673534664354430999/942576972943491113/1434096797562703983

If we could move this to a GitHub discussion under the extensions repo, we could collaborate more. (I believe /usr/local/glibc/usr/lib was the wrong choice of path in the first place, thinking of the merged /usr/ layout, though it was largely done to fix issues with musl libs co-existing.) But let's have a discussion on a good path moving forward, trying to use the operator and the DRA plugins as much as possible without platform-specific hacks.

(I could not find the referenced discussion.)

siderolabs/extensions#836
siderolabs/extensions#476
#605

I don't know if there are talks between NVIDIA and Talos beyond what's linked above, but it could easily be fixed here with something better than getTalosLibrarySearchPaths (pretty easily?), or on the extension side, though that's probably a larger change.

Perhaps @frezbo would have more insight?

Comment on lines 49 to 55
func getTalosLibrarySearchPaths() []string {
	return []string{
		"/driver-root/usr/local/glibc/usr/lib",
		"/driver-root/usr/local/glibc/lib",
		"/driver-root/usr/local/glibc/lib64",
	}
}
Collaborator


@elezar is this something we would want to add directly to nvcdi in the nvidia-container-toolkit as a standard search path?

Member


I don't think there's a problem in adding this to the toolkit. At the moment the defaults are defined at quite a low level (which is where @hydazz has added them in NVIDIA/nvidia-container-toolkit#1621) and we may want to consider making these easier to specify at a higher level.

@frezbo

frezbo commented Dec 2, 2025


It would be nice if these paths were supported on the NVIDIA side. We (Sidero Labs) are open to using better paths, but we have a constraint: it cannot be the standard /usr/lib, since the Talos base is musl, so glibc and musl libraries need to live at different paths. It could just as well be /usr/local/nvidia/lib, for example.

@jgehrcke
Collaborator

@klueska what do you think: can we still do something here for the next release? I feel like we should. But now it's rather tight again.

Signed-off-by: hydazz <[email protected]>
@hydazz
Author

hydazz commented Jan 30, 2026

Opened a PR in the toolkit to remove the overwrite here:
NVIDIA/nvidia-container-toolkit#1621

Please review and let me know if any changes are needed; keen to jump onto the DRA train 🙂

@elezar
Member

elezar commented Jan 30, 2026

@hydazz @jgehrcke if we're expecting the toolkit to be updated to include this change, I don't think that's something that can be done for the upcoming release.

Would a middle ground be adding a configurable option (an env var or config-file option) that allows a user to specify the paths in the container to be searched for libraries explicitly? This could be passed to the CDI library on construction and used in the prestart scripts.

Note that although NVIDIA/nvidia-container-toolkit#1621 gets us some of the way there, we may need to update the detection logic to also locate nvidia-smi at the non-standard location.

