Skip to content

Sparse histograms and maped-based storage #389

@btovar

Description

@btovar

From this post from some years ago, there seems to be the implication that having a sparse histogram should be feasible using boost-histograms. Is there a map-based storage supported out of the box that we could use, or that would be easy to implement to target the scikit-hep python bindings?

For some context, we implemented some histograms using scikit-hep hist. These histograms have a handful of categorical axes, and ~300 regular/variable axes. The histograms are pretty sparse, with 97% of zeros of about a billion bins. In our HEP analysis, adding the results becomes impractical as they need > 120GB of memory. Using a subclass of hist that internally uses a dictionary of histograms [categorical keys -> dense hist], we were able to reduce the memory usage to about 20GB, which made the approach practical again in terms of memory and runtime. This subclass tries to mimic the h[...] access of the original hist, and uses awkward arrays where the original hist would return numpy arrays, etc. However, it is somewhat inefficient as the storage (in the sense of the function that tells which bin to update) is completely in python and has to use python dictionaries.

Ideally, we are looking for a storage that would enable spare histograms that then can be used in scikit-hep's boost-histograms and hist. We would appreciate any guidance on this topic.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions