Sparse histograms and map-based storage #389

[This post](https://root-forum.cern.ch/t/boost-1-70-released-new-library-boost-histogram/33595/4) from some years ago seems to imply that sparse histograms should be feasible with boost-histogram. Is a map-based storage supported out of the box that we could use, or would one be easy to implement and expose through the scikit-hep Python bindings?

For some context, we implemented some histograms using scikit-hep hist. These histograms have a handful of categorical axes, and ~300 regular/variable axes. The histograms are pretty sparse: about 97% of their roughly one billion bins are zero. In our HEP analysis, adding the results becomes impractical because they need > 120 GB of memory. Using a subclass of hist that internally uses a dictionary of histograms [categorical keys -> dense hist], we were able to reduce the memory usage to about 20 GB, which made the approach practical again in terms of memory and runtime. This subclass tries to mimic the h[...] access of the original hist, and returns awkward arrays where the original hist would return numpy arrays, etc. However, it is somewhat inefficient, because the storage (in the sense of the function that decides which bin to update) is implemented entirely in Python and has to go through Python dictionaries.
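To make the idea concrete, here is a minimal sketch of the [categorical keys -> dense hist] layout. It is not our actual subclass and it does not use hist or boost-histogram at all: the class name `DictOfDenseHist`, its methods, and the single dense axis are all hypothetical simplifications, using only the standard library.

```python
from bisect import bisect_right


class DictOfDenseHist:
    """Hypothetical sketch: tuple of categories -> dense 1D block of counts."""

    def __init__(self, edges):
        self.edges = list(edges)  # bin edges of the single dense axis
        self.nbins = len(self.edges) - 1
        # A dense block is only allocated for category tuples that get filled.
        self.blocks = {}

    def fill(self, cats, x, weight=1.0):
        # Find the bin index; values outside [edges[0], edges[-1]) are dropped
        # (no underflow/overflow bins in this simplified sketch).
        i = bisect_right(self.edges, x) - 1
        if not 0 <= i < self.nbins:
            return
        block = self.blocks.setdefault(tuple(cats), [0.0] * self.nbins)
        block[i] += weight

    def __getitem__(self, cats):
        # Unfilled categories read as all zeros without allocating a block.
        return self.blocks.get(tuple(cats), [0.0] * self.nbins)

    def __iadd__(self, other):
        # Merging partial results only touches blocks that exist in `other`.
        if other.edges != self.edges:
            raise ValueError("incompatible dense axes")
        for cats, block in other.blocks.items():
            mine = self.blocks.setdefault(cats, [0.0] * self.nbins)
            for i, value in enumerate(block):
                mine[i] += value
        return self


h = DictOfDenseHist([0.0, 1.0, 2.0, 3.0])
h.fill(("ttbar", "nominal"), 0.5)
h.fill(("ttbar", "nominal"), 2.5, weight=2.0)
print(h[("ttbar", "nominal")])
print(len(h.blocks))
```

The memory saving comes from never allocating blocks for category combinations that stay empty. The cost is that the bin lookup runs in Python, which is the inefficiency described above.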

Ideally, we are looking for a storage that would enable sparse histograms that can then be used in scikit-hep's boost-histogram and hist. We would appreciate any guidance on this topic.


