[CUB][device_reduce] Add a version for CUB device reduce primitives with inputs temp_storage and execution env #6497

rbourgeois33 · 2025-11-05T14:21:34Z

rbourgeois33
Nov 5, 2025

Hello ! I hope this is the right discussion section to ask this. Let me know if it fits better in CUB/ and I'll move it.

I just noticed something in CUB's device reduce (device_reduce.cuh):
There are two versions of the Sum method:

The "simple" Sum. It takes temporary storage as input: if the storage is not allocated, it only queries the required size; if it is allocated, it performs the reduction using that storage. More importantly, this Sum does not take a cuda::std::execution::env as input, so the user cannot specify details such as the desired determinism level.
The "env-based" Sum. This self-allocates its own temporary storage and takes an "environment" as input.

As suggested by the comment // TODO(gevtushenko): use uninitialized buffer when it's available, it would be nice to be able to provide a temporary storage to the "env-based" one, with inputs d_temp_storage, temp_storage_bytes.
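For readers less familiar with the d_temp_storage, temp_storage_bytes convention, here is a minimal host-only sketch of the two-phase call pattern. It is not CUB code: model_sum and sum_all are hypothetical names, the scratch layout (four partial sums) is invented, and the required size is assumed not to depend on the input size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only model of the two-phase convention (not the CUB implementation).
// Phase 1: temp_storage == nullptr -> only report the required size.
// Phase 2: temp_storage != nullptr -> do the work in the caller's buffer.
inline bool model_sum(void* temp_storage, std::size_t& temp_storage_bytes,
                      const int* in, long long* out, std::size_t num_items)
{
  constexpr std::size_t num_partials = 4;  // invented scratch layout
  const std::size_t required = num_partials * sizeof(long long);
  if (temp_storage == nullptr) {
    temp_storage_bytes = required;  // size query, no work done
    return true;
  }
  if (temp_storage_bytes < required) {
    return false;  // caller's buffer is too small
  }
  assert(out != nullptr);
  auto* partial = static_cast<long long*>(temp_storage);
  for (std::size_t i = 0; i < num_partials; ++i) partial[i] = 0;
  for (std::size_t i = 0; i < num_items; ++i) partial[i % num_partials] += in[i];
  *out = 0;
  for (std::size_t i = 0; i < num_partials; ++i) *out += partial[i];
  return true;
}

// Usage: query once, allocate once, then reuse the same buffer for every call.
inline long long sum_all(const std::vector<std::vector<int>>& batches)
{
  std::size_t bytes = 0;
  long long out = 0;
  model_sum(nullptr, bytes, nullptr, &out, 0);  // phase 1
  std::vector<long long> scratch(bytes / sizeof(long long));
  long long total = 0;
  for (const auto& batch : batches) {
    model_sum(scratch.data(), bytes, batch.data(), &out, batch.size());  // phase 2
    total += out;
  }
  return total;
}
```

The point of the pattern is that the allocation happens outside the call, so a caller running many reductions pays for it once.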

I'd love to contribute, but I do not want to overstep. Let me know if:

I'm understanding the situation/the TODO correctly ?
I'm allowed to try to contribute to this ? (pinging @gevtushenko )
The correct approach to this is:
1. to create a third Sum that takes both a temporary storage and a cuda::std::execution::env? This could get quite redundant. There might be some logic to add if the cuda::std::execution::env provides memory resources, as deduced here at compile time.
2. assuming the memory resource from the cuda::std::execution::env can contain a buffer, to detect it and use it?

Note: this also applies to Min, Max, ArgMin, ArgMax and Reduce.

Pardon my perhaps naive questions.

Related PR: [EPIC]: Track env-based overloads implementation for all CUB device primitives #5635

Cheers !

[EDIT] Added option 2. that feels better, but I'm still not familiar enough with cuda::std::execution::env to tell if it makes sense.

bernhardmgruber · 2025-11-06T07:55:41Z

bernhardmgruber
Nov 6, 2025
Collaborator

Hi! You are definitely asking the question in the right place.

The environment parameter can contain a memory resource which is used for the allocation and deallocation of temporary storage. This does not mean that you cannot cache this allocation or provide it from anywhere else (like a pre-allocated region) etc. Just use a memory resource that does what you want.
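As a rough illustration of "a memory resource that does what you want", here is a host-only sketch of a caching resource that keeps its block alive between calls. Everything in it is an assumption made for illustration: caching_resource is a hypothetical class, the allocate/deallocate signatures are simplified and are not the CCCL memory resource interface, and std::malloc stands in for a device allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical caching resource: keeps one block and hands it out again
// as long as the request fits. std::malloc stands in for device allocation.
class caching_resource
{
public:
  caching_resource() = default;
  caching_resource(const caching_resource&) = delete;
  caching_resource& operator=(const caching_resource&) = delete;
  ~caching_resource() { std::free(block_); }

  void* allocate(std::size_t bytes)
  {
    assert(!in_use_ && "this sketch supports one outstanding allocation");
    if (bytes > capacity_) {
      // Cached block is too small: replace it with a larger one.
      std::free(block_);
      block_ = nullptr;
      capacity_ = 0;
      block_ = std::malloc(bytes);
      if (block_ == nullptr) throw std::bad_alloc{};
      capacity_ = bytes;
      ++upstream_calls_;
    }
    in_use_ = true;
    return block_;
  }

  // Keeps the block cached instead of releasing it.
  void deallocate(void*) noexcept { in_use_ = false; }

  // Number of times the underlying allocator was actually called.
  int upstream_calls() const noexcept { return upstream_calls_; }

private:
  void* block_ = nullptr;
  std::size_t capacity_ = 0;
  bool in_use_ = false;
  int upstream_calls_ = 0;
};
```

With a resource like this, repeated calls that request the same or a smaller amount of scratch memory reach the underlying allocator only once.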

It seems that we don't have API examples showing memory resources, or maybe I didn't find them. @gevtushenko do we have examples for memory resources as environments?

Also, @rbourgeois33 if you could give us more details on the use case you are trying to solve, we may be able to provide a better answer.

0 replies

gevtushenko · 2025-11-06T08:26:32Z

gevtushenko
Nov 6, 2025
Maintainer

Hello @rbourgeois33 and thank you for starting the discussion!

I'm understanding the situation/the TODO correctly ?

The comment you are referring to is somewhat orthogonal to the overload question. The TODO is about using cudax::uninitialized_buffer instead of directly calling the allocate / deallocate member functions on the memory resource. This would simplify the implementation of env-based overloads with RAII. The problem is that the uninitialized buffer has to be taken out of the experimental namespace (it should be cuda::uninitialized_buffer) before we can depend on it in CUB.

The correct approach to this is:

This is something that we debate ourselves. On the one hand, we are trying to keep the API surface minimal to reduce maintenance cost. This is why we hope that the correct approach can be based purely on the memory resource concept. If the allocation cost is a problem, a caching memory resource could help. If even this is too much overhead, you could, say, store the number of bytes in the allocate member function and throw an exception to return from the CUB env-based interface early. After that, you could pre-allocate the memory and invoke the env-based interface with a memory resource that simply returns a pointer to the pre-allocated memory. But the UX of juggling these memory resources to get the functionality of d_temp_storage, temp_storage_bytes is questionable.
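The "record the size, throw, then pre-allocate" idea can be sketched on the host as follows. This is only a model under stated assumptions: env_style_sum is a hypothetical stand-in for an env-based CUB call, the two resource types and their simplified allocate/deallocate signatures are invented and are not the CCCL memory resource interface, and a std::vector plays the role of the pre-allocated device buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

struct size_query_done {};  // thrown to leave the algorithm right after the size is known

// Resource 1 (hypothetical): records the requested size, then aborts the call.
struct size_query_resource
{
  std::size_t requested = 0;
  void* allocate(std::size_t bytes) { requested = bytes; throw size_query_done{}; }
  void deallocate(void*, std::size_t) noexcept {}
};

// Resource 2 (hypothetical): hands out a buffer owned by the caller.
struct preallocated_resource
{
  void* buffer;
  std::size_t capacity;
  void* allocate(std::size_t bytes)
  {
    if (bytes > capacity) throw std::bad_alloc{};
    return buffer;
  }
  void deallocate(void*, std::size_t) noexcept {}  // caller keeps ownership
};

// Stand-in for an env-based algorithm: it asks the resource for scratch memory itself.
template <class Resource>
long long env_style_sum(Resource& mr, const std::vector<int>& in)
{
  constexpr std::size_t num_partials = 4;  // invented scratch layout
  const std::size_t bytes = num_partials * sizeof(long long);
  auto* partial = static_cast<long long*>(mr.allocate(bytes));
  for (std::size_t i = 0; i < num_partials; ++i) partial[i] = 0;
  for (std::size_t i = 0; i < in.size(); ++i) partial[i % num_partials] += in[i];
  long long total = 0;
  for (std::size_t i = 0; i < num_partials; ++i) total += partial[i];
  mr.deallocate(partial, bytes);
  return total;
}

inline long long sum_with_preallocated_scratch(const std::vector<int>& in)
{
  size_query_resource query;
  try { env_style_sum(query, in); } catch (const size_query_done&) {}  // step 1: size query
  assert(query.requested > 0);
  const std::size_t words = (query.requested + sizeof(long long) - 1) / sizeof(long long);
  std::vector<long long> backing(words);  // step 2: pre-allocate once
  preallocated_resource mr{backing.data(), backing.size() * sizeof(long long)};
  return env_style_sum(mr, in);  // step 3: real call, no allocation inside
}
```

Even in this small model the caller needs two resource types, an exception type and a try/catch to recover what the two-pointer interface gives directly, which is the UX concern raised above.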

I'm allowed to try to contribute to this ?

CCCL is an open-source project, so contributions are most welcome! I can see a few ways you could contribute:

If you try the suggestion above and realize that memory resources are not enough / convenient in your use case, please, let us know. This would help us decide on providing env-based overload taking d_temp_storage, temp_storage_bytes.
If you don't mind contributing on a related topic, we are actually lacking an env-based overload of the device segmented reduce taking offset iterators. Having an environment-based version there would unblock significant optimization opportunities. Analogous to the requirements API, we considered adding a guarantees API with which the user would be able to tell us about, say, the max segment size. On the implementation side, we'd be able to choose the optimal kernel implementation statically. A similar idea applies to segmented sort.
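To make "segmented reduce taking offset iterators" concrete, here is a host-only sketch of the semantics: segment s covers the half-open index range [begin_offsets[s], end_offsets[s]). The function segmented_sum_model is a hypothetical name and a sequential model, not the CUB kernel, and it leaves out the environment and the guarantees idea entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential model of a segmented sum driven by two offset iterators.
// Segment s sums values[begin_offsets[s]] .. values[end_offsets[s] - 1];
// an empty segment (begin == end) yields 0.
template <class ValueIt, class OutIt, class BeginOffsetIt, class EndOffsetIt>
void segmented_sum_model(ValueIt values, OutIt out, std::size_t num_segments,
                         BeginOffsetIt begin_offsets, EndOffsetIt end_offsets)
{
  for (std::size_t s = 0; s < num_segments; ++s) {
    assert(begin_offsets[s] <= end_offsets[s]);
    long long acc = 0;
    for (auto i = begin_offsets[s]; i < end_offsets[s]; ++i) {
      acc += values[i];
    }
    out[s] = acc;
  }
}
```

For back-to-back segments, a single offsets array of num_segments + 1 entries can serve as both iterators, passed as offsets and offsets + 1.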

do we have examples for memory resources as environments?

The problem is that we don't have non-experimental memory resources yet, so the only example of using the env-based interface with memory resources is in cudax: https://github.com/NVIDIA/cccl/blob/main/cudax/examples/cub_reduce.cu.

0 replies

rbourgeois33 · 2025-11-07T11:30:16Z

rbourgeois33
Nov 7, 2025
Author

Hello @gevtushenko @bernhardmgruber, thanks a lot for the information!

Also, @rbourgeois33 if you could give us more details on the use case you are trying to solve, we may be able to provide a better answer.
If you try the suggestion above and realize that memory resources are not enough / convenient in your use case, please, let us know. This would help us decide on providing env-based overload taking d_temp_storage, temp_storage_bytes.

I was simply taking a closer look at CUB to explore a different implementation of the reduction (using parallel lookback) and came across this topic first.

I don’t have a specific use case for this functionality right now. My curiosity came from benchmarking reductions a few days ago, comparing Kokkos::parallel_reduce and thrust::reduce. Thrust/CUB need to reuse a pre-allocated buffer to achieve SoL performance (otherwise the buffer is allocated and deallocated at each call). It seems Kokkos pre-allocates that buffer at initialization (see this Slack discussion).

I understand that this should only be addressed if it solves a real user problem. Still, it’s possible that someone benchmarking the env-based reduce might not see SoL performance simply due to that extra allocation, which would be a shame :).

To avoid modifying the existing reduce primitives, we would probably need a memory resource whose allocate function is memory-pool aware. But as you say, until one is provided by cudax/CUB, it's a lot of UX juggling.

If you don't mind contributing on a related topic, we are actually lacking env-based overload on the device segmented reduce taking offset iterators.

Yes, I’d be happy to contribute to that. I’ll definitely take a look!

0 replies

rbourgeois33 · 2025-11-10T15:17:48Z

rbourgeois33
Nov 10, 2025
Author

Hello again @gevtushenko !

Feel free to take a look at my attempt at implementing an env-based device segmented Sum, as well as this sample code that uses it.

Some remarks:

I left the determinism deduction commented out, as CUB's segmented reductions do not seem to support non-default determinism at the moment.
Same thing for the env's tuning.

Is that what you had in mind ? I can make corrections or changes if necessary. If it's all good, I can go ahead and write proper tests and generalize it to the other segmented reductions, depending on what you deem most useful.

Guarantee API in environments seems like a great idea. But this would imply work in the cudax/ section of cccl, right ?

Moreover, I can close this discussion and move to e.g. a pull-request as you answered my questions :)

cheers !

2 replies

gevtushenko Nov 13, 2025
Maintainer

Is that what you had in mind ?

@rbourgeois33 this looks right!

I left the determinism deduction commented out, as CUB's segmented reductions do not seem to support non-default determinism at the moment.

Consider adding a static assert that, if determinism is specified, it is run-to-run or not-guaranteed.
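A compile-time check of this kind could look roughly like the sketch below. It is an assumption-laden model: determinism_level, segmented_reduce_supports and check_determinism are hypothetical names, and the enum stands in for whatever determinism requirement the environment actually carries. How that requirement is extracted from the env is not shown.

```cpp
// Hypothetical enum standing in for the determinism requirement found in the env.
enum class determinism_level { not_guaranteed, run_to_run, gpu_to_gpu };

// The two levels named in the suggestion above are accepted; anything else is not.
constexpr bool segmented_reduce_supports(determinism_level d)
{
  return d == determinism_level::not_guaranteed || d == determinism_level::run_to_run;
}

// Instantiating this with an unsupported level fails to compile,
// so the error shows up at the call site instead of at run time.
template <determinism_level Requested>
constexpr bool check_determinism()
{
  static_assert(segmented_reduce_supports(Requested),
                "segmented reduce: determinism must be run-to-run or not-guaranteed");
  return true;
}
```

An env-based overload could call check_determinism<...>() first, with the level deduced from the environment.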

Same thing for the env's tuning.

Tuning API is still work-in-progress, so it's fine to omit it at this time.

Moreover, I can close this discussion and move to e.g. a pull-request as you answered my questions

That'd be great! Feel free to open a draft PR as is and we'll provide all the necessary context there.

Answer selected by rbourgeois33

rbourgeois33 Nov 18, 2025
Author

Sounds good ! Just opened an issue and a PR.

#6673
#6674
