Getting "malloc: Double free" error when overwriting keys too often #8
Hey, I'm really sorry to have taken so long to notice this. I don't know why I didn't get the notification; I just saw it right now.

My recollection of that API is that when you pass a value into it, it takes ownership of the memory cell passed in by default (unless you provide an ejection callback), and will free it when the item is ejected. If you're putting in the same value multiple times, you need to stick each copy of the value in a different memory cell. Looking at the example, I would absolutely expect the above loop to lead to a crash even in a single-threaded program, because when you insert the memory cell the second time, it marks the first for deletion once it can 'prove' that no other threads might be holding a reference to it.

Hope that's helpful. I'll leave this open for a while and will spend more time on it soon just to double-check. And I'm going to look at my settings right now to make sure I get any tickets right away; really sorry about that!
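A hedged sketch of that ownership rule (the `hatrack_dict_put` call matches the examples later in this thread; the helper function itself is illustrative, not part of hatrack):

```c
#include <hatrack.h>
#include <string.h>

/* Illustrative only: re-inserting under the same key. If an ejection
 * handler frees values, each put must hand the table its own heap
 * cell; reusing one cell means it gets freed once per ejection. */
void
update_repeatedly(hatrack_dict_t *d, char *key, const char *val)
{
    for (int i = 0; i < 100; i++) {
        /* Unsafe: hatrack_dict_put(d, key, (void *)val) every time
         * would mark the SAME cell for deletion on each overwrite. */
        hatrack_dict_put(d, key, strdup(val)); /* fresh cell per put */
    }
}
```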
Hi John. No worries. I'm glad to hear from you. I thought the project might be abandoned, and it looks like great work.

Would it be an error to reuse the same key to set different values? I'm not sure the fact that the values are the same in my code sample is germane to my original issue.

I set hatrack aside when I hit this stumbling block. Returning to my test case, I'm no longer getting the "double free" error. Can you try running the example? I can see that one thing that's changed on my side since I reported this is that I updated my compiler.
I just tried again with clang 14.0.3, and still don't get an error. Hmmmm....
Absolutely. I just ran the example with the extra calls to re-add the same environment variable. That does cause double frees, as I would expect.
I don't have an environment variable with that name.

If you look at the ejection handler in that file, it frees the cell that gets ejected. So if you insert the same memory cell multiple times, it'll get ejected multiple times, and it'll be freed multiple times. The handler getting called indicates that the ejected cell isn't being referenced, but it doesn't know if you've reused it. So if you are re-using values, then freeing when the handler gets called is the wrong behavior; you might use an atomic reference count in this situation (see the sketch below). Or, just wait till the table is torn down: if you don't register an ejection handler, it won't try to free the cells.

As for reusing keys, it's the exact same mechanism -- if you want to be able to reuse them, you can. If you don't want to ever free anything during the lifetime of the table, don't worry about registering a handler. But if you do want to free things, then you need to manage how to deal with items that might be duplicates.
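A minimal sketch of that reference-count idea, assuming C11 atomics and the handler signature from the code sample below (it also assumes `hatrack_dict_item_t` exposes the value alongside the key; `rc_val_t` and its helpers are hypothetical, not hatrack API):

```c
#include <hatrack.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical refcounted wrapper -- not a hatrack type. */
typedef struct {
    atomic_int refs;
    void      *payload;
} rc_val_t;

static rc_val_t *
rc_retain(rc_val_t *v)
{
    atomic_fetch_add(&v->refs, 1); /* call once per insertion of the cell */
    return v;
}

static void
rc_release(rc_val_t *v)
{
    if (atomic_fetch_sub(&v->refs, 1) == 1) {
        free(v->payload); /* last reference gone; freeing is now safe */
        free(v);
    }
}

/* Ejection handler: drop one reference instead of freeing outright,
 * so inserting the same cell N times (with N retains) is safe. */
static void
rc_eject(hatrack_dict_t *unused, hatrack_dict_item_t *item)
{
    rc_release((rc_val_t *)item->value);
}
```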
Somehow I didn't notice the registration of the ejection handler. This makes sense now. Obviously, this is not the exact code that led me to file this bug in the first place. If I come across the code that I was having trouble with, I will re-examine it. It's possible I was misusing the API. Thanks for your help.
BTW, I found out why I was unable to reproduce the issue yesterday. I had commented out the code in hatrack that frees the memory. After undoing that change, I was able to reliably get the double free error.
Thanks, if you do have any issues, just let me know. I have made sure this repo gets flagged, so I shouldn't miss it in the future.
Hi John. I found the original code that led me to file this issue, and it's still crashing even though I don't think it should be. I'm going to re-open this ticket, though we may end up with a different one. I will look at how I can provide a test case that more accurately reflects the problem I'm still having. In the meantime, though, I'll mention that the exact place where my code is crashing is at line 625 in `mmm.c` (lines 617 to 628 at commit 1c66f8b).
The problem is that I would have to dig deeper to figure out what's actually going on at that point.
Sounds good, I will take a look (tho maybe not till tonight or in the morning); I assume if you do register a callback it's fine? The callback doesn't need to free anything... I thought I'd made the callback optional, but my memory could also be hazy, I'm getting older :)
I am setting a callback, yes. As I mentioned though, in the function I referenced above, `dict` is not the same as the dict my code inits, and `dict->free_handler` is `NULL` for some reason... 🤔
Join the club. (BTW, I appreciate the reference to Zork.)

Here's a complete code sample that crashes. It's based on `basic.c`, but pared down:

```c
#include <hatrack.h>
#include <string.h>
#include <stdio.h>
#include <stdbool.h>

static void
envp_free_handler(hatrack_dict_t *unused, hatrack_dict_item_t *item)
{
    void *key = item->key;

    return;
}

int
main(int argc, char *argv[], char *envp[])
{
    hatrack_dict_t *envp_dict;
    uint64_t        i;
    char           *env_eq; // Pointer to the equals mark.
    char           *env_key;
    char           *env_val;
    char           *p;

    envp_dict = hatrack_dict_new(HATRACK_DICT_KEY_TYPE_CSTR);
    hatrack_dict_set_free_handler(envp_dict,
                                  (hatrack_mem_hook_t)envp_free_handler);

    i = 0;
    while (envp[i]) {
        p       = envp[i];
        env_eq  = strchr(p, '=');
        env_key = strndup(p, env_eq - p);
        env_val = strdup(env_eq + 1);
        hatrack_dict_put(envp_dict, env_key, env_val);
        hatrack_dict_remove(envp_dict, env_key);
        i++;
    }

    hatrack_dict_delete(envp_dict);

    return 0;
}
```

There are two things in particular I want to point out in this example. First, I'm setting a free handler, but it doesn't actually free anything. However, it does access `item`. Second, I'm removing the key after I put it.

From what I understand about the API, this should be safe. My code isn't freeing `env_key` anywhere, including in the free handler. So my expectation is that hatrack shouldn't free `env_key` either. Further, I'm assuming that it's safe to reference `item` in the free handler.

(Some detailed API docs would be very welcome, BTW. 😄)

I appreciate your assistance.
This is useful, thank you! I'll get to it by morning.

I definitely didn't quite finish things up before going back to work, and that includes documentation :( In addition to docs, I really should strip everything down to what the core library should be, as opposed to the comparative algorithms, and rm the stuff that never got finished.

I'll try to get to that stuff too. I probably will be using this library in a new project again soon, so I'll keep this in mind.
Thanks. It's no rush, really. I'm just on a roll now with this hatrack stuff, but I need to put my attention elsewhere, too.
Okay, I might take another couple days then, it's turning out to be a busy week :)
Hi @viega. I thought I'd just follow up with you to see how you're feeling about this project. From my experience, hatrack is wonderful software. Its lack of API documentation makes its adoption more challenging than it needs to be. Adopters like me will struggle with problems that may come down to misunderstandings about how to use the API. Can you devote some time to shoring up the documentation? It would help me debug my own issue here and decide whether this is a matter of misusing the API or an actual bug in hatrack. Cheers. (BTW, I'm using hatrack from a Swift project. Bet you never expected a Swift programmer to use hatrack.)
@mallman: Thanks for reaching out again. We've been making good use of hatrack ourselves, and I've done a little bit of work on the library, but have done it all in a whole new repo, because of the other things I'm building around it. To my recollection, I was unable to reproduce your problem, and was using it in a way pretty consistent with what you said.

My biggest problem in this repo had been fighting to build reliably across all the architectures we're using. In the new repo, I migrated to Meson, which has been much easier than my lifelong fights with CMake and Autotools. If you have meson installed, I believe you should be able to check out the new repo and just build the hatrack target.

A tiny bit of the other stuff in that repo that isn't built with that target may or may not be interesting to you. Primarily, there's a simple interface for garbage collection. Right now it's fairly straightforward: it basically gives each thread its own heap, and isn't yet handling cross-heap pointers. But it is meant to help remove the burden of cross-language memory management... just allocate through it and let the collector handle the freeing.

I'm not sure yet when I'm going to back-port the hatrack bits to here. I think it depends on where the other project ends up, because there are some advantages to having them together while developing, and I don't want to have to worry about syncing two versions of it.

And on documentation, I 100% agree, though do note that, even though we are using it all, I still haven't really announced any of the work. What exactly would help you documentation-wise? I can try to prioritize it.
Hi @viega. I'm getting back to my hatrack code, using the libcon4m codebase. I'm running into some trouble with double-frees, but before I dive deeper into that, I'd like to report a separate issue I've found.
Hi @viega. I have another issue to report for hatrack. Should I report it in this repo or in the libcon4m repo? Basically, it appears that distinct keys with identical hash values overwrite each other in a dict.
Hey there! First, let me say that it's an intentional (and very common) choice with dictionaries to treat the hash value as a unique stand-in for the key, meaning that if there is a collision, the two keys will be treated as identical. This is incredibly common with production hash tables because, if you select the right hash algorithm, the odds of it happening are infinitesimal. For instance, using SHA-256 and truncating the result to a 128-bit key would result in a collision on average once every 2^64 hashes.

However, for hash tables, it's common to skip the cryptographic hash function to save cycles. I selected XXH128 to package in because it's incredibly fast and seems to be collision resistant given non-malicious inputs. The hash speed should be about 10x faster than using SHA-256. But SHA-256 isn't actually particularly expensive. So I guess I'd ask: how are you actually running into collisions in practice?
Switching to SHA-256 would definitely guarantee that you wouldn't have collisions for all practical purposes (it would essentially require a break of SHA-256 to raise the odds of a collision to any meaningful value).

As for the repo, either place is fine. I have set both up to send me notifications :)
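For concreteness, the usual birthday-bound approximation behind that 2^64 figure, for n hashes drawn from a 128-bit space:

$$p_{\text{collision}}(n) \;\approx\; 1 - e^{-n^2/2^{129}}, \qquad p = \tfrac{1}{2} \;\Rightarrow\; n \approx 1.18 \cdot 2^{64}$$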
Interesting. In the cases of the hash tables I've seen, they either claim that distinct keys map to distinct values, or I've made that assumption. At least, in the theory of hash tables I've seen, there are two common approaches to handling hash collisions: chaining and open addressing. Both ensure that distinct inputs are always treated as distinct keys. If implementations break that API contract, I wonder why they don't just change the API documentation to specify that keys with equal hash values are treated as equal keys. Then there would be no confusion around why two different keys with the same hash value are treated as the same key.

In my particular case of hash collisions, I'm using a custom hash that returns 0 for every input. This isn't part of any kind of real code. It's part of a unit test to validate assumptions. The assumption I'm trying to validate is that distinct values with identical hash codes are treated as distinct keys in a hatrack dict. Actually, this wasn't one of my original unit tests, because I usually don't test a library API. I caught this serendipitously when I read a line in the source code about keys being compared by hash. That made me want to test this particular assumption of mine.

My unit tests are for my Swift wrapper around hatrack. I'm still getting double-free related crashes for one specific test case; I'll come back to that separately.

Thanks for clearing this up, though this does make me a little nervous about using hatrack with custom hashes. I've seen one hash table implementation which avoided the issue of resolving hash collisions by restricting hash keys to 64-bit integers and guaranteeing that those keys never collide.

One question... if I use the built-in hash functions, what exactly am I getting?

Thank you.
For custom keys, I'm using Swift's standard hashing algorithm. That algorithm is not documented by the API, so clients can't know how likely it is for two different keys to have identical hashes. The trouble with requiring clients to implement a high-quality custom hash is that I don't want to make that part of my API. I want clients of my API to be able to use any hash algorithm for Swift data structures, so long as identical keys have identical hashes. To put it another way, I want to make it as easy as possible for clients to use my hatrack wrapper. For custom key types, the easiest thing for clients to do is use the built-in hash value algorithm.
Well, the default Hatrack hash table uses open addressing. There's a sophisticated caching mechanism for linear probing that I based somewhat loosely on hopscotch hashing (hopscotch won't work directly for something lock-free); the details are documented in the source comments.

I do understand the confusion now, though, and I should be able to explain it. But first, let me try to level-set (I'm sure you already understand the basics; I just want to be clear and build up to the communication issue).
So the major confusion we've had here is that really, we can see TWO types of collisions:

1. Table collisions: two keys whose hash values map to the same bucket (index) in the table.
2. Hash collisions: two distinct keys that produce the exact same hash value.
#1 is dealt with via the probing strategy (or chaining, but that generally has a big negative impact on performance). As for #2, I'm sure there are hash tables somewhere that always do a deep compare of objects, but that's not going to be fast (especially for large objects), and in parallelized environments it's somewhat impractical anyway. Of course, a type 2 collision is always a type 1 collision too, but type 1 collisions are almost never type 2 collisions in practice. Hope that clears things up.

In terms of using int values directly to guarantee no type 2 collisions... you COULD do that, but I wouldn't recommend it, because you'll definitely pay a price. Most hash tables I've seen that started out that way and then saw use at scale ended up changing, because it had a noticeable, measurable impact. The place where linear probing can fail is if your hash values are NOT uniformly distributed. For instance, if you are hashing integers, these tend to NOT be particularly evenly distributed in real-world scenarios, leading to lots of collisions and thus reduced performance. There are alternative strategies. You could have the upper 64 bits be the integer directly, and have the other 64 bits come from a hash if you like (I have tried that, and the performance is fine as long as you pick the upper bits; if you replace the lower ones, you're back to abnormally high collision rates). Or, you could take the number and AES-encrypt it, which guarantees no collisions. But I'd still expect that to be slower than XXH-128.

But the world does generally feel comfortable with the probabilities. And I know them well from the cryptography world -- for instance, you'll notice that the most common AES mode today, AES-GCM, has a probabilistic bound... and also has my name as a co-author. For instance, let's assume I'm adding 1M hash keys a second, continuously, for a year. That's 31,536,000 * 1,000,000 hashes. Clearly, that's an absolutely unrealistically high load by a huge margin. Yet it would take that workload more than half a million years to get to the point where there was even a 50% chance that two DIFFERENT values had ever yielded the same hash key. In practice, nobody should make a tradeoff based on worrying about collisions on 128-bit values, as long as the hash is trusted enough. There, SHA-256 definitely fits the bill. And there are plenty of non-cryptographic hashes that are widely deployed for this use case, XXH high among them. But I will admit that, XXH being non-cryptographic, it's more reasonable to think the uniformity assumption might not apply. If you really wanted to be conservative, paying the price to use SHA-256 would be worth it, especially if you memoize, and this is something I've definitely considered.

Now, as for trying to provide a simple interface to API clients: I have wrapped hatrack in languages with strong type systems and been able to do that. The rest of libcon4m also does the same thing. Basically, even though it looks like the pointer hash and the int hash are different, on 64-bit machines they are 100% the same. In most contexts, either you have a primitive value like an int or float, or else the memory address of storage is generally going to be a fine stand-in for equality. That breaks down when you're in an environment where different instantiations need to test as equal when they are structurally equivalent, and when it's not unlikely to have multiple objects that are equal. The place where that tends to be important is with strings.
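A sketch of the "integer in the upper 64 bits" layout described above. The 128-bit container type and the 64-bit mixer are assumptions here (hatrack's real hash type may differ), with a splitmix64-style finalizer standing in for XXH:

```c
#include <stdint.h>

/* Stand-in for a 128-bit hash value; hatrack's actual type may differ. */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} key128_t;

/* splitmix64-style finalizer: a cheap, well-distributed 64-bit mix. */
static uint64_t
mix64(uint64_t x)
{
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* The integer goes in the UPPER 64 bits, so two distinct ints can never
 * produce the same 128-bit key; the LOWER 64 bits come from a hash, so
 * the bits that drive linear probing stay uniformly distributed. */
static key128_t
int_key(uint64_t n)
{
    return (key128_t){ .hi = n, .lo = mix64(n) };
}
```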
But whatever the case, it's better to hash the whole object than to compare copies. That's particularly true because you can memoize the hash (see the sketch below).

So when you use the off-the-shelf options, you are getting one of two things: a hash of the key's pointer, or a hash of the key's contents.

Now, it is on my (very long) TODO list to allow the length of the key to be a second parameter. For instance, UTF-32 strings currently need either the custom hash option or the by-reference compare, because they almost always contain lots of null bytes, so a C-string hash ignores most of the string.

But my advice is to automatically select the hash function based on Swift's type system; that's essentially what I did when wrapping hatrack in other languages.

As for the double-free, I'd be happy to find a time to pair with you on it if you did want to look at it together. And I hope this is all helpful!
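And a sketch of the memoization point, under the same assumptions (`key128_t` comes from the previous sketch; `hash_bytes()` is a hypothetical length-aware hash such as XXH128 over a buffer):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical length-aware hash (e.g., XXH128 over a buffer). */
extern key128_t hash_bytes(const char *data, size_t len);

/* Hypothetical key wrapper that caches its hash: the (possibly
 * expensive) hash runs once per object, not once per table operation.
 * Single-threaded sketch; a concurrent version would need an atomic
 * flag or an idempotent compute-and-store. */
typedef struct {
    key128_t hash;   /* cached digest */
    bool     hashed; /* computed yet? */
    char    *bytes;  /* key material, e.g. a UTF-32 string */
    size_t   len;    /* explicit length, so embedded NUL bytes are fine */
} keyed_obj_t;

static key128_t
obj_hash(keyed_obj_t *o)
{
    if (!o->hashed) {
        o->hash   = hash_bytes(o->bytes, o->len);
        o->hashed = true;
    }
    return o->hash;
}
```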
Is this work public? I'd like to see it.
In my own use case, I'm using a hash table to canonicalize database entities by primary key (also called "uniq-ing"). The "primary key" in the codebase is actually a pair consisting of a library identifier and a database primary key, because the app can work concurrently with entities from more than one library database with the same schema. These identifiers are frequently constructed outside the context of the persistence environment itself. Comparison by value is essential. I'm using the standard Swift mechanism for computing their hash, but I could easily customize that. Also, at the moment I am not memoizing the hash, even though I could. (The objects are immutable, after all.) I don't recall if I've tested various schemes for performance. I may revisit that topic.

Regarding your advice for selecting a hash function:
I'm doing that for integer dictionaries, using the built-in integer key type.
I'm converting Swift strings (which are Unicode) to NULL-terminated UTF-8 C strings and using `HATRACK_DICT_KEY_TYPE_CSTR`.
The Swift wrapper doesn't hash anything itself. For custom key types, it requires the client to provide their own hash value. It then passes that hash value directly to a dict configured for caller-supplied hashes.
I'm going to take a closer look at my problem with the double-free crashes.
That would be fantastic, though I want to spend a little more time doing my own legwork.
This is proving to be very fruitful for me, thank you. I hope you find this helpful as well.
Sure, here's a link to a wrapping for Nim: https://github.com/crashappsec/nimutils/blob/dev/nimutils/crownhash.nim. It was a bit quick and dirty, but it is working well for us.

Many programming languages have an extensible facility for custom types to provide hashes so they can be used as dictionary keys (Python is one where I used to be very familiar with the details of the implementation). While they often use fine algorithms, they do tend to produce 64-bit hash values. Unfortunately, Swift falls in that category, per my googling. But those languages still do use the hash for equality testing (as Swift seems to).

This is not within my margin of safety, personally. With 64-bit hashes, no matter how good the function is, you're down to the equivalent of only 32 bits of security, meaning a collision is expected somewhere near 4B hashes or so. For most apps the odds are going to be 1 in many millions, but if I can imagine there might be cases that use that many keys, then I am not personally comfortable. But again, it does seem like Swift's team itself is fine with it, and you're probably fine to use that as the hash key, as they seem to do internally.

Anyway, I'm definitely enjoying the discussion, and happy to find time to help whenever. Just let me know when you want to engage.
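The square-root (birthday) arithmetic behind that "near 4B hashes" figure:

$$\sqrt{2^{64}} = 2^{32} \approx 4.29 \times 10^{9} \text{ hashes for an expected collision}$$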
I haven't quite nailed down the exact circumstances in which this bug occurs, but I think you can reliably reproduce it by modifying `examples/basic.c` to add multiple calls to `hatrack_dict_put(envp_dict, env_key, env_val)` in the same loop iteration when populating `envp_dict`, like so:
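(A sketch of the kind of modification meant here; the loop mirrors the env-parsing loop from `examples/basic.c` as shown in the pared-down sample earlier in this thread, and the number of repeated puts is arbitrary.)

```c
while (envp[i]) {
    p       = envp[i];
    env_eq  = strchr(p, '=');
    env_key = strndup(p, env_eq - p);
    env_val = strdup(env_eq + 1);
    /* Re-adding the same key/value cells several times per iteration;
     * each put after the first marks the previous cell for ejection. */
    hatrack_dict_put(envp_dict, env_key, env_val);
    hatrack_dict_put(envp_dict, env_key, env_val);
    hatrack_dict_put(envp_dict, env_key, env_val);
    hatrack_dict_put(envp_dict, env_key, env_val);
    i++;
}
```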
Obviously, this depends on the number of environment variables. I have about 40. Just add more calls to `hatrack_dict_put(envp_dict, env_key, env_val)` if you still don't trigger the bug.

I think this is related to the value of `HATRACK_RETIRE_FREQ`, because the bug occurs in the call to `mmm_empty()` in `mmm_retire()` in `mmm.c`.

I'm running this on an M1 Mac, macOS 13.4, with Xcode 14.3.1 clang. I configured hatrack from `scripts/config-debug`.

Thank you!