-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Default field splitting behaving inconsistently #60
Comments
Thanks again for taking the time to create a small test case. I am having some trouble reproducing this at head (which I think should be at the same commit as 0.4.1 on crates.io).
As for why you had trouble shrinking this case further: frawk determines field and line boundaries for the default field splitter in large batches, sometimes for multiple lines in parallel. Bugs in the code that does the splitting might only show up when you feed it multiple lines of sufficient length. This is on a linux machine. Note that I've only tested this on x86; I plan to get an M1 machine eventually but haven't gotten around to it, and some of the whitespace-splitting code is pretty architecture-specific. |
Oh ok. Is there something I can do to provide further debug details? This is what I used to install sudo apt install rustc # version 1.47.0
cargo install frawk --no-default-features --features use_jemalloc,allow_avx2 I'm on Ubuntu 20.04.1 LTS, x86_64. |
Gotcha. I confirmed that that feature combination still works for me. I'll see about setting up a VM tonight or tomorrow with your exact OS and rustc version (I use rustup.rs to get more recent versions of rustc/cargo). I'd honestly be pretty surprised if this was causing an issue, but it shouldn't be too difficult to rule it out. |
Hmmm, yeah. I get the exact same output I did before running a fresh VM with Ubuntu 20.04 with rustc 1.47 (albeit without jemalloc, but again I doubt that's the issue here). The only other thing I can think of is if I'm not pasting the data correctly. Maybe post a hash of the data file, or put it in a gist? |
I purged $ md5sum fields.txt
418e730b6b83f9400ea813042f9003a0 fields.txt
# not sure if this will help
$ md5sum frawk
02efa88db6aea820be1635b98640fc61 frawk If it is possible, could you release a generic Linux binary or |
Sure thing. I should be able to get to that in the next day or two. Thanks for bearing with me on this. |
This is relevant to the case mentioned in #60.
I got a docker setup to reliably produce binaries for ubuntu 20.04, posted here. Let me know if this helps. |
I tried both the 20.04 variations, I get this error for anything I try: $ ./frawk 'BEGIN{print "hi";}'
Illegal instruction (core dumped) Also, I don't remember the option, but I thought there is a way to create a Here's one thread I found: https://www.reddit.com/r/rust/comments/eaa8f7/building_compatible_linux_binaries/ |
Thanks for the tips! I haven't been using |
I'm still getting |
I have a theory about what might be going on. I should be able to test it out on my own later, but in the meantime could you confirm if your CPU has AVX2 support (see "CPUs with AVX2" here) ? I'm thinking that the runtime fallback to SSE2 may not be working. That would explain why you don't get SIGILL when compiling from source; I should be able to confirm later by reproducing the initial bug while compiling without AVX2. |
I checked |
Okay, I have reproduced the initial issue, there is a bug in the non-AVX2 implementation of whitespace splitting. I'm going to focus on that bug first. There's a further bug, though, which is why you got SIGILL. I just re-checked my my code and there isn't anything obviously wrong with the runtime feature detection (in that, I'd expect that you would have gotten the same, buggy, output when running the pre-built binary). That issue could take longer to track down, and it's possibly an issue with one of frawk's dependencies. |
This ended up being the root cause of #60. Most of this change is refactoring to ensure that all available implementations run during testing, not just AVX2. That'll ensure we can avoid regressions in the future.
Update: I have a fix tested for the initial bug on the |
I haven't been able to compile from the I'll wait for the crate to be updated, that works for me. |
quick update: I've made progress on the To your point on not being able to build from git directly, I suspect that is related to that "second problem" I mentioned above. I have some preliminary changes on the same branch that I think will help, but I'll only be able to verify once I get a QEMU setup without AVX2 (I'm afraid my last computer without AVX2 isn't functioning). |
Had to take more time away than expected. I started up work again on the |
Cool, thanks for the update. Take your time though :) I don't know Rust, otherwise I'd have tried to help fix issues. This is a cool project and I'm interested in writing blog post/book about it. Some times, I see users on stackoverflow working with GBs of data, they'll certainly benefit from a faster tool. |
After some more reading, I think this is mostly a matter of passing the right compiler flags. I'll follow up and try and post binaries with minimal dependencies, but for now I'm fairly confident that building from source as you have been doing so far should work, once the aforementioned bug fixes have been merged. Thanks again for your patience. I'll plan to close this issue out in the next few days and file other issues for follow-up items. |
* Fix non-AVX2 whitespace splitting. This ended up being the root cause of #60. Most of this change is refactoring to ensure that all available implementations run during testing, not just AVX2. That'll ensure we can avoid regressions in the future. * Add coverage, start debugging generic CSV * fix generic impl for csv quote masks * refactor, test bytes splitter * improve practices around runtime detection of cpu features * different take on dynamic cpu feature selection * add native back in to rust flags
Alright. I've bumped the version to 0.4.2; I think if you cargo install as you did initially, things should work. I've opened #62 to track the remaining work. |
I'm getting this error:
|
some brief searching around suggests that this requires a more recent version of cargo. Possibly due to the recent version bump for the cranelift crates. I'll try and confirm in a bit in a vm. |
I've confirmed this is due to the cranelift version bump. It's not clear to me if there are any workarounds easier than installing a more recent version of cargo via rustup. |
I installed the newer version using rustup and it works 👍 |
The failure also seems to depend on the length of the input lines or something like that, which is why I have those long
x
andy
in the input, couldn't find a simpler failing case.The text was updated successfully, but these errors were encountered: