Problems with large cdx files #13
Comments
That's odd. It's supposed to read the input incrementally. That said, I think most people have been using it in an incremental fashion, with one POST per WARC processed, so loading enormous numbers of records in a single request hasn't been tested much. Do you get any sort of error message or stack trace when it crashes? It might be a limitation of RocksDB: I had assumed write batches don't require much memory to track, but that could be a false assumption, and we might need to break up very large data loads into multiple batches.
Also, maybe check on curl's memory usage. I seem to recall curl buffering the input request in memory rather than streaming it, in which case we might also need to find something other than curl to load the files.
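A quick way to check that on Linux (the process names below are simply what ps usually reports for the client and the server) is to sample resident set sizes while a load is in progress:

```sh
# Sample the resident set size (RSS, in KB) of the curl client and the java
# process hosting the server every few seconds during a bulk load, to see
# which side is actually growing.
while true; do
    ps -C curl,java -o pid,rss,comm
    sleep 5
done
```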
So this is the output it gives after it crashes:
As you can see, I've been messing about with trying to restrict memory use on the JVM, to no avail; the error message is the same if it's just run with the normal command. (I have extremely limited knowledge of Java, so I can't provide much else without direction.) As for curl, it does crash if you use the
Ah. Yeah, that doesn't look like Java memory exhaustion (which would be some variant of "OutOfMemoryError") but rather a C++ allocation failing, which definitely hints at RocksDB as the culprit. Since RocksDB allocates outside the Java heap, the Java heap options will probably have no effect on it either. This problem probably needs some more thought, but I've pushed a branch. Of course, the easy workaround is breaking the big file up into pieces (e.g. with
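As a sketch of that workaround (chunk size, host, port and index name here are illustrative, not taken from this thread), splitting the file by line count and posting each piece separately keeps any single request, and therefore any single write batch, small:

```sh
# Split the big CDX file into 1,000,000-line chunks (one CDX record per line)
# and load each chunk with its own POST, mirroring the one-POST-per-WARC
# pattern that is known to work.
split -l 1000000 big.cdx chunk-
for f in chunk-*; do
    curl -X POST --data-binary "@$f" http://localhost:8080/myindex
done
```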
So I've just tested the new branch, and it must be something else, as it still crashes out after filling the 32GB of system memory. Is there anything I can pass to java to get it to dump out some more useful information for you?
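One option worth trying (the flags and jar name below are assumptions, not something prescribed in this thread) is the JVM's Native Memory Tracking, which at least separates the Java heap from the JVM's own off-heap allocations; memory that RocksDB mallocs directly in C++ may still not be itemised, so comparing the report against the process RSS is also useful:

```sh
# Start the server with Native Memory Tracking enabled (jar name assumed).
java -XX:NativeMemoryTracking=summary -jar outbackcdx.jar &
JVM_PID=$!

# While a load is running, dump a summary of the JVM's native allocations.
# A large gap between this total and the process RSS points at the native
# library rather than the JVM itself.
jcmd "$JVM_PID" VM.native_memory summary
```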
So I know I'm most likely misusing OutbackCDX, but I have a couple of very large CDX files I'd like to move over to it. However, posting them causes it to slowly consume more and more memory until it runs out and crashes. The command I'm using to post the data is:
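For reference, a typical curl invocation for posting a CDX file to an index in a single request looks roughly like this (host, port and index name are placeholders); note that curl's @file form reads the whole file before sending it:

```sh
# Post an entire CDX file to the index in one request; --data-binary @file
# loads the complete file into curl's memory before the upload starts.
curl -X POST --data-binary @big.cdx http://localhost:8080/myindex
```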