[CUB][device_reduce] Add a version for CUB device reduce primitives with inputs temp_storage and execution env #6497
Hello! I hope this is the right discussion section to ask this. Let me know if it fits better in CUB/ and I'll move it. I just noticed something in CUB's device reduce (
As suggested by the comment I'd love to contribute, but I do not want to overstep. Let me know if:
Note: (This also applies to Pardon my perhaps naive questions. Related PR: [EPIC]: Track env-based overloads implementation for all CUB device primitives #5635 Cheers! [EDIT] Added option 2, which feels better, but I'm still not familiar enough with
Hi! You are definitely asking the question in the right place. The environment parameter can contain a memory resource, which is used to allocate and deallocate the temporary storage. This does not mean that you cannot cache this allocation or provide it from somewhere else (such as a pre-allocated region). Just use a memory resource that does what you want. It seems that we don't have API examples showing memory resources, or maybe I didn't find them. @gevtushenko do we have examples for memory resources as environments? Also, @rbourgeois33, if you could give us more details on the use case you are trying to solve, we may be able to provide a better answer.
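To make the caching idea above concrete, here is a minimal host-only sketch. It is not CCCL code: `caching_resource`, its `allocate`/`deallocate` signatures, and `sum_with_scratch` are hypothetical stand-ins, and the actual memory resource interface expected by the env-based CUB overloads may differ (for example, by taking a stream and an alignment).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical stand-in for a memory resource (NOT the CCCL interface).
// It keeps one buffer alive and reuses it whenever the request fits.
class caching_resource {
public:
  caching_resource() = default;
  caching_resource(const caching_resource&) = delete;
  caching_resource& operator=(const caching_resource&) = delete;
  ~caching_resource() { std::free(buffer_); }

  void* allocate(std::size_t bytes) {
    if (bytes > capacity_) {
      // Only grow (and pay for a real allocation) when the cache is too small.
      std::free(buffer_);
      buffer_ = std::malloc(bytes);
      capacity_ = buffer_ ? bytes : 0;
      if (!buffer_) throw std::bad_alloc{};
      ++real_allocations_;
    }
    return buffer_;
  }

  // Intentionally a no-op: the buffer is kept for the next call.
  void deallocate(void*, std::size_t) noexcept {}

  std::size_t real_allocations() const { return real_allocations_; }

private:
  void* buffer_ = nullptr;
  std::size_t capacity_ = 0;
  std::size_t real_allocations_ = 0;
};

// Hypothetical algorithm that requests temporary storage from the resource,
// playing the role of an env-based primitive.
template <class Resource>
long long sum_with_scratch(Resource& mr, const int* in, std::size_t n) {
  const std::size_t bytes = n * sizeof(long long);
  auto* scratch = static_cast<long long*>(mr.allocate(bytes));
  assert(n == 0 || scratch != nullptr);
  long long total = 0;
  for (std::size_t i = 0; i < n; ++i) {
    scratch[i] = in[i];  // widen into the temporary storage
    total += scratch[i];
  }
  mr.deallocate(scratch, bytes);
  return total;
}
```

Calling `sum_with_scratch` repeatedly with the same `caching_resource` pays for the underlying allocation only on the first call, which is the effect a caching memory resource would be expected to have.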
Hello @rbourgeois33 and thank you for starting the discussion!
The comment you are referring to is a bit orthogonal to the overload question. The TODO is about using
This is something that we debate ourselves. On one hand, we are trying to keep the API surface minimal to reduce maintenance cost. This is why our hope is that the correct approach would be based purely on the memory resource concept. If the allocation cost is a problem, a caching memory resource could help. If even this is too much overhead, you could, say, store the number of bytes in the allocate member function and throw an exception to return from the CUB env-based interface early. After that, you could pre-allocate the memory and invoke the env-based interface with a memory resource that simply returns a pointer to the pre-allocated memory. But the UX of juggling these memory resources to get functionality of
CCCL is an open-source project, so contributions are most welcome! I can see a few ways you could contribute:
The problem is that we don't have non-experimental memory resources yet, so the only example of using the env-based interface with memory resources is in
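The "record the size, throw, then pre-allocate" workaround described in this comment can be sketched in plain C++ as follows. Everything here is a hypothetical stand-in: `probing_resource`, `preallocated_resource`, `reduce_like`, and `query_temp_bytes` are illustrative names, and the single-argument `allocate` is an assumption rather than the signature CUB expects.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Thrown by the probing resource to leave the algorithm early.
struct size_probe_done { std::size_t bytes; };

// Phase 1 resource: reports the requested size by throwing, never allocates.
struct probing_resource {
  void* allocate(std::size_t bytes) { throw size_probe_done{bytes}; }
  void deallocate(void*, std::size_t) noexcept {}
};

// Phase 2 resource: hands out a caller-owned buffer, never allocates.
struct preallocated_resource {
  void* ptr;
  std::size_t capacity;
  void* allocate(std::size_t bytes) {
    if (bytes > capacity) throw std::bad_alloc{};
    return ptr;
  }
  void deallocate(void*, std::size_t) noexcept {}
};

// Hypothetical stand-in for an env-based primitive that needs scratch space.
template <class Resource>
long long reduce_like(Resource& mr, const int* in, std::size_t n) {
  const std::size_t bytes = n * sizeof(long long);
  auto* scratch = static_cast<long long*>(mr.allocate(bytes));
  assert(n == 0 || scratch != nullptr);
  long long total = 0;
  for (std::size_t i = 0; i < n; ++i) {
    scratch[i] = in[i];
    total += scratch[i];
  }
  mr.deallocate(scratch, bytes);
  return total;
}

// Runs `call` with a probing resource and returns the size it asked for.
template <class F>
std::size_t query_temp_bytes(F&& call) {
  probing_resource probe;
  try {
    call(probe);
  } catch (const size_probe_done& e) {
    return e.bytes;
  }
  return 0;  // the algorithm never requested temporary storage
}
```

The two phases mirror the classic two-call pattern: the first call only discovers the size, and the second runs the work on storage the caller owns. The sketch also shows why the UX is awkward, since the user has to write two resources and a try/catch wrapper to get there.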
Hello @gevtushenko @bernhardmgruber, thanks a lot for the information!
I was simply taking a closer look at CUB to explore a different implementation of the reduction (using parallel lookback) and came across this topic first. I don’t have a specific use case for this functionality right now. My curiosity came from benchmarking reductions when comparing I understand that this should only be addressed if it solves a real user problem. Still, it’s possible that someone benchmarking the env-based reduce might not see SoL performance simply due to that extra allocation, which would be a shame :). To avoid modifying the existing reduce primitives, we would maybe need a
Yes, I’d be happy to contribute to that. I’ll definitely take a look!
Hello again @gevtushenko! Feel free to take a look at my attempt at implementing an env-based device segmented Sum, as well as this sample code that uses it. Some remarks:
Is that what you had in mind? I can make corrections or changes if necessary. If it's all good, I can go ahead and write proper tests or generalize it to other segmented reductions, depending on what you deem most useful. The guarantee API in environments seems like a great idea, but this would imply work in the cudax/ section of cccl, right? Moreover, I can close this discussion and move to e.g. a pull request, as you answered my questions :) Cheers!
@rbourgeois33 this looks right!
Consider adding a static assert that, if determinism is specified, it is either run-to-run or not-guaranteed.
Tuning API is still work-in-progress, so it's fine to omit it at this time.
That'd be great! Feel free to open a draft PR as is and we'll provide all the necessary context there.
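The static assert suggested in this reply could take roughly the following shape. This is only a sketch with hypothetical tag types (`not_guaranteed_t`, `run_to_run_t`, `gpu_to_gpu_t`, `unspecified_t`) and a hypothetical `sum_sketch` function; the real implementation would query the determinism requirement from the environment rather than take a tag argument.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical tags standing in for determinism requirements.
struct not_guaranteed_t {};
struct run_to_run_t {};
struct gpu_to_gpu_t {};
struct unspecified_t {};  // the caller did not specify determinism

// True when the requested level is one this sketch claims to support.
template <class Determinism>
inline constexpr bool is_supported_determinism_v =
  std::is_same_v<Determinism, unspecified_t> ||
  std::is_same_v<Determinism, run_to_run_t> ||
  std::is_same_v<Determinism, not_guaranteed_t>;

// Hypothetical entry point: rejects unsupported levels at compile time.
template <class Determinism = unspecified_t>
int sum_sketch(const int* in, int n, Determinism = {}) {
  static_assert(is_supported_determinism_v<Determinism>,
                "determinism, if specified, must be run-to-run or not-guaranteed");
  int total = 0;
  for (int i = 0; i < n; ++i) {
    total += in[i];
  }
  return total;
}
```

With this in place, passing `gpu_to_gpu_t{}` fails to compile with a readable message instead of silently running with a weaker guarantee.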