Document a policy regarding memory allocation #638

lemire · 2024-06-26T16:43:48Z

In C, the malloc function (and similar memory allocation functions) may 'fail' by returning a NULL pointer, or by some other means. Currently, CRoaring does not have a policy regarding failures in case of memory allocation.

We have some checks in place (bailing out from function when the malloc failed), but there is no testing to see what happens when memory allocations can fail. So it is entirely possible that there are bugs. I would hope that static analysis could reveal such bugs, but I am not sure how to go about it.

We recently add 64-bit roaring bitmaps to C and the implementation did not check for NULLs. I have proposed to add checks at #633

How do we handle this?

That's not as clear cut as it seems. On many systems, malloc effectively always succeeds, and the failure comes when you try to access the memory. So checking for a NULL return is less useful than it seems. Furthermore, we specifically allow people to provide their own memory allocators.

The proposal currently is to document our policy (the policy of the CRoaring project). Of course, we first need to agree on the policy. Thus I am inviting comments from anyone who might use CRoaring.

To be clear, I am not looking for code changes (though these would fine) but rather on a policy or agreed-upon convention. Code changes could then be allowed or rejected based on this policy.

Calling on @alexey-milovidov @BenPope @xiaoyong-z @madscientist @SLieve and others...

Credit to @Oppen for pointing out the issue.

The text was updated successfully, but these errors were encountered:

madscientist · 2024-06-26T17:52:02Z

Note that it's not just adding lots of checks: there are some API methods in CRoaring which allocate memory but do not return that memory (directly) and so how would we have to change the API to provide some kind of "error flag" that the user could check to determine whether allocation failed, in situations where we can't just return NULL to the user?

My opinion as I wrote in #633 : I don't think we should try to detect memory allocation failures. We should document that we consider it illegal for memory allocation methods to fail / return NULL, and we should modify the built-in allocator implementation to call abort() or _exit() or similar if NULL is returned from an allocation method. And we shouldn't add checking code for all the allocations done throughout the code.

In addition to considering how much work it would be for CRoaring to add all the memory checks and cleanups throughout the code, we have to realistically consider what the CRoaring user would do if they got back NULL: they would also have to commit to checking all the returned values... and if they get back NULL what would they do, other than die themselves?

We could consider how users for whom this behavior is insufficient or undesirable could address it. For C++ users they could register memory allocators which throw an exception on failure, maybe: it wouldn't clean up (would be a memory leak and possibly even leave the roaring object in an invalid state, but it wouldn't crash).

We could consider how difficult it would be to guarantee that a throw from an allocator would NOT leave the roaring object in an invalid state. This would require careful code review, to be sure.

Oppen · 2024-06-26T20:11:19Z

Note that it's not just adding lots of checks: there are some API methods in CRoaring which allocate memory but do not return that memory (directly) and so how would we have to change the API to provide some kind of "error flag" that the user could check to determine whether allocation failed, in situations where we can't just return NULL to the user?

My opinion as I wrote in #633 : I don't think we should try to detect memory allocation failures. We should document that we consider it illegal for memory allocation methods to fail / return NULL, and we should modify the built-in allocator implementation to call abort() or _exit() or similar if NULL is returned from an allocation method. And we shouldn't add checking code for all the allocations done throughout the code.

Rather than calling abort, maybe add a callback to the allocator so derived libraries like python-croaring can react in a context-aware way, e.g. triggering a MemoryError exception. By default it would be a call to abort, but users would have a say if they want other behavior (as long as it's still abort-ish).

In addition to considering how much work it would be for CRoaring to add all the memory checks and cleanups throughout the code, we have to realistically consider what the CRoaring user would do if they got back NULL: they would also have to commit to checking all the returned values... and if they get back NULL what would they do, other than die themselves?

Another consideration is whether we realistically expect to run on one of the platforms where dereferencing NULL means anything other than SIGSEGV. While UB is not happy, in most cases it ends up being just a crash, which is the least worse scenario. Do we expect CRoaring to work on, e.g., microcontrollers? I'm inclined to say no... Bare metal in general is unlikely.

We could consider how users for whom this behavior is insufficient or undesirable could address it. For C++ users they could register memory allocators which throw an exception on failure, maybe: it wouldn't clean up (would be a memory leak and possibly even leave the roaring object in an invalid state, but it wouldn't crash).

We could consider how difficult it would be to guarantee that a throw from an allocator would NOT leave the roaring object in an invalid state. This would require careful code review, to be sure.

I think we can provide a best-effort to clean a broken object if we want. Not repair, not validate, but free as many of the resources it holds after the error is detected as it can safely recognize. I don't think we can guarantee not to leave the roaring object in an invalid state after throwing, and I think it's OK to document it as such. The cleanup function would be a nice-to-have.

Re: possibility of failure: I can think of one scenario outside of embedded which is 32-bits processes. Those can realistically exhaust the address space in the kind of application you'd use CRoaring for, and that is pretty much the only failure condition in modern Linux (no idea about other OSes).

I will abstain of voting since my stakes aren't as high as they used to, I switched projects about two years ago and I haven't found an excuse to use bitmaps in the later projects I took on. I still feel OK giving opinions and debating tho.

lemire · 2024-06-26T20:14:39Z

@Oppen For these sorts of things, there is no voting I think. What I hope is that we can arrive at a consensus.

madscientist · 2024-06-26T21:35:59Z

Rather than calling abort, maybe add a callback to the allocator so derived libraries like python-croaring can react in a context-aware way,

We already have the ability for the CRoaring user to provide alternate allocator methods. Isn't that sufficient to allow different behaviors on failure? Do we really need a separate callback, beyond that?

I think we can provide a best-effort to clean a broken object if we want. Not repair, not validate, but free as many of the resources it holds after the error is detected as it can safely recognize. I don't think we can guarantee not to leave the roaring object in an invalid state after throwing, and I think it's OK to document it as such.

That's exactly what the (32bit) code does now: an incomplete "best effort attempt" to free memory. But speaking for myself that's what I don't want. I don't see the value in this partial implementation: for people who really want to use CRoaring in environments where memory is strictly limited and they have a way of dealing with an error due to out of memory, a partial implementation is not sufficient for them. They need guarantees, presumably with testing proving that it works. And for the rest of us it doesn't matter since we'll just die anyway.

Either we should commit to this and finish the implementation so it's really correct, or abandon it as too much overhead/etc. and instead say that we expect that whatever allocator function we called it would either (a) return the requested memory, or (b) not return at all.

What I was referring to by "not leaving the object in an invalid state" was not cleaning up after failure (allocators cannot fail!). Instead, I meant re-order the code so that we avoid making state changes that invalidate the CRoaring object, until after we've allocated all the memory we need to. This way if the allocator does something like throw an exception we have a reasonable expectation that the CRoaring object is still valid, in any context where the exception is caught.

Maybe that is too hard or not worth it. Certainly it will only help C++ programs since there are no exceptions in C.

Dr-Emann · 2024-06-27T02:45:02Z

I'm really of two minds about this: It's really hard to get right (error paths that are almost never hit are really hard to always be perfect at). But between windows (no overcommit), 32 bit systems, and custom allocators (setting up a limited custom allocator which ensures we never allocate over 1GiB or something seems like a possibly worthwhile thing to support), it would be nice to handle OOM gracefully.

Also, C has longjmp, a callback could theoretically longjmp out to recover the same way exceptions could be used in C++

madscientist · 2024-06-27T12:00:52Z

Yes I thought of longjmp but it's difficult to see how a user would make use of it, other than to provide a clean exit path. That is, it's hard to imagine using longjmp around each call to CRoaring to detect if that particular call failed and handle it.

The question is what do you mean by "handle OOM gracefully". What operations would be done in such a handler, that couldn't be accomplished by replacing the allocator functions?

Basically I think we have three options:

Keep the current behavior which is a kind of ad hoc situation where some allocations are checked and some are not
Remove all checks and define allocators such that they never return NULL
Fix the implementation so that it really handles OOM properly

In order to work on option 3 the very first thing we have to do is examine the public API and determine which methods could allocate memory. Then we have to decide how to communicate back to the user that memory allocation failed. This will definitely result in an API change. Some changes can be "quiet"; for example:

/**
 * Add all values in range [min, max]
 */
void roaring_bitmap_add_range_closed(roaring_bitmap_t *r, uint32_t min,
                                     uint32_t max);

could be changed to return a bool and return false if memory allocation failed. But what do we do about things like:

/**
 * Add value x
 * Returns true if a new value was added, false if the value already existed.
 */
bool roaring_bitmap_add_checked(roaring_bitmap_t *r, uint32_t x);

which already uses its return value for something else? How does the caller determine whether this call failed due to an OOM issue?

I suppose the only option is to create a separate function that can be called to determine whether there was an OOM error, that users who care can query after each CRoaring call to check for this problem.

lemire added question documentation labels Jun 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Document a policy regarding memory allocation #638

Document a policy regarding memory allocation #638

lemire commented Jun 26, 2024

madscientist commented Jun 26, 2024

Oppen commented Jun 26, 2024

lemire commented Jun 26, 2024

madscientist commented Jun 26, 2024

Dr-Emann commented Jun 27, 2024 •

edited

Loading

madscientist commented Jun 27, 2024

Document a policy regarding memory allocation #638

Document a policy regarding memory allocation #638

Comments

lemire commented Jun 26, 2024

madscientist commented Jun 26, 2024

Oppen commented Jun 26, 2024

lemire commented Jun 26, 2024

madscientist commented Jun 26, 2024

Dr-Emann commented Jun 27, 2024 • edited Loading

madscientist commented Jun 27, 2024

Dr-Emann commented Jun 27, 2024 •

edited

Loading