Adding histograms comparatively slow with growing category axes #719

jrueb · 2022-03-14T17:48:55Z


I am currently migrating from Coffea histograms to Boost histograms. As expected, Boost histograms generally turned out to be faster. In one case, however, they were much slower: adding two histograms where one has many categories in its growing category axes and the other has a different number of categories. See the script below for an example. This is a very common situation for me because it is what happens when accumulating histograms: the small histogram is the result of one particular chunk processed on the cluster, and the big one holds the histograms accumulated so far.

In the example below, Coffea adds the histograms around 30 times faster than boost_histogram. Coffea handles category axes by storing the counts in dicts. Is this something one generally has to be aware of when using growing axes, or could it be improved?

Python version 3.8.6
boost_histogram version 1.3.1
Coffea version 0.7.13

import numpy as np
import coffea.hist
import boost_histogram as bh
from time import time

coffea_hist1 = coffea.hist.Hist(
    "Events",
    coffea.hist.Cat(name="dataset", label="dataset"),
    coffea.hist.Cat(name="channel", label="channel"),
    coffea.hist.Cat(name="systematic", label="systematic"),
    coffea.hist.Bin("x", "x", 10, 0, 1)
)
coffea_hist2 = coffea_hist1.copy()
hist1 = bh.Histogram(
    bh.axis.StrCategory([], growth=True),
    bh.axis.StrCategory([], growth=True),
    bh.axis.StrCategory([], growth=True),
    bh.axis.Regular(10, 0, 1)
)
hist2 = hist1.copy()

for dataset in (str(i) for i in range(15)):
    for channel in (str(i) for i in range(3)):
        for systematic in (str(i) for i in range(100)):
            coffea_hist1.fill(dataset=dataset, channel=channel, systematic=systematic, x=np.array([0.1, 0.2, 0.3]))
            hist1.fill(dataset, channel, systematic, np.array([0.1, 0.2, 0.3]))
coffea_hist2.fill(dataset="0", channel="0", systematic="0", x=np.array([0.1, 0.2, 0.3]))
hist2.fill("0", "0", "0", np.array([0.1, 0.2, 0.3]))

t = time()
coffea_hist1.add(coffea_hist2)
print("Coffea", time() - t)

t = time()
hist1 += hist2
print("Boost", time() - t)
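A single `time()` difference can be noisy, and `+=` modifies its left operand, so repeated measurements need a fresh copy each time. A minimal sketch of a more robust timing helper follows; `best_of`, `make_lists` and `extend_in_place` are hypothetical names introduced here, and the list workload is only a stand-in so that the snippet runs without coffea or boost_histogram installed.

```python
from time import perf_counter


def best_of(setup, action, repeat=5):
    """Return the fastest wall-clock time of action(state) over several runs.

    setup() is called before each run, outside the timed region, so an
    in-place action such as `h1 += h2` always starts from a fresh state.
    """
    timings = []
    for _ in range(repeat):
        state = setup()          # untimed: build fresh inputs
        start = perf_counter()   # monotonic, high-resolution clock
        action(state)
        timings.append(perf_counter() - start)
    return min(timings)


# Stand-in workload (an assumption for illustration, not the real histograms).
def make_lists():
    return list(range(1000)), [1, 2, 3]


def extend_in_place(state):
    big, small = state
    big += small


elapsed = best_of(make_lists, extend_in_place)
print(f"best of 5: {elapsed:.2e} s")
```

With the objects from the script above, `setup` could return `(hist1.copy(), hist2)` and `action` could add the second element to the first in place, which keeps the copy out of the measured time.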

HDembinski · 2022-04-09T11:42:10Z

Maintainer

For now, you have to be aware of this. There are ways to accelerate filling of histograms with growing axes in Boost.Histogram at the C++ level, but it is a complicated patch. Since boost-histogram does not use a dict to store values, it is slower than coffea at filling but faster at operations performed on the filled histogram. Ideally, you want both to be as efficient as possible, but here one has to decide on a trade-off. coffea made one choice and we made another.

boostorg/histogram#278
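The trade-off described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `DictHist` and `DenseHist` are hypothetical classes written for this example, they do not mirror the actual coffea or boost-histogram code, and the dense merge is assumed to rebuild its storage, which may differ from what the libraries really do. The point is only that a dict-based add visits the keys of the small histogram, while a dense merge of this kind visits every row.

```python
class DictHist:
    """Toy sparse storage: one list of bin counts per category key."""

    def __init__(self, nbins):
        self.nbins = nbins
        self.counts = {}  # category tuple -> list of bin counts

    def fill(self, key, bin_index, weight=1.0):
        row = self.counts.setdefault(key, [0.0] * self.nbins)
        row[bin_index] += weight

    def add(self, other):
        """Add other in place; return the number of rows visited."""
        touched = 0
        for key, row in other.counts.items():  # only keys present in other
            mine = self.counts.setdefault(key, [0.0] * self.nbins)
            for i, value in enumerate(row):
                mine[i] += value
            touched += 1
        return touched


class DenseHist:
    """Toy dense storage: an axis index plus one row per category."""

    def __init__(self, nbins):
        self.nbins = nbins
        self.index = {}  # category tuple -> row number (the "axis")
        self.rows = []   # dense storage, one row per category

    def fill(self, key, bin_index, weight=1.0):
        if key not in self.index:  # growth: extend axis and storage
            self.index[key] = len(self.rows)
            self.rows.append([0.0] * self.nbins)
        self.rows[self.index[key]][bin_index] += weight

    def add(self, other):
        """Merge axes, rebuild storage; return the number of rows visited."""
        merged = dict(self.index)
        for key in other.index:
            merged.setdefault(key, len(merged))
        new_rows = [[0.0] * self.nbins for _ in merged]
        touched = 0
        for hist in (self, other):  # assumption: every row is copied over
            for key, row_number in hist.index.items():
                target = new_rows[merged[key]]
                for i, value in enumerate(hist.rows[row_number]):
                    target[i] += value
                touched += 1
        self.index, self.rows = merged, new_rows
        return touched


# Same category layout as the script above: 15 * 3 * 100 combinations.
keys = [(str(d), str(c), str(s))
        for d in range(15) for c in range(3) for s in range(100)]
big_dict, small_dict = DictHist(10), DictHist(10)
big_dense, small_dense = DenseHist(10), DenseHist(10)
for key in keys:
    big_dict.fill(key, 1)
    big_dense.fill(key, 1)
small_dict.fill(keys[0], 1)
small_dense.fill(keys[0], 1)

dict_touched = big_dict.add(small_dict)
dense_touched = big_dense.add(small_dense)
print(dict_touched, dense_touched)  # 1 4501
```

In this model the dict-based add scales with the size of the small histogram, whereas the dense merge scales with the size of the accumulated one. The dense layout pays off afterwards, since contiguous rows are cheaper to slice, sum, or project than a dict of separate entries.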
