Problems with large cdx files #13

Open
thomaspreeceBBC opened this issue Dec 15, 2017 · 5 comments

@thomaspreeceBBC

So I know I'm most likely misusing OutbackCDX, but I have a couple of very large CDX files I'd like to move over to it. However, posting them causes OutbackCDX to slowly consume more and more memory until it runs out and crashes. The command I'm using to post the data is:

curl -o upload.txt --progress-bar -X POST -T records.cdx http://localhost:8080/myindex
@ato
Member

ato commented Dec 15, 2017

That's odd. It's supposed to read the input incrementally. That said, I think most people have been using it incrementally, with one POST per WARC processed, so loading an enormous number of records in a single request hasn't been tested much. Do you get any sort of error message or stack trace when it crashes?

It might be a limitation of RocksDB. I had assumed write batches don't require much memory to track, but that could be a false assumption, and we might need to break very large data loads up into multiple batches.

@ato
Member

ato commented Dec 15, 2017

Maybe also check curl's memory usage. I seem to recall curl buffering the request body in memory rather than streaming it, in which case we might also need to find something other than curl to load the files.

@thomaspreeceBBC
Author

So this is the output it gives after it crashes:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
./run_outback.sh: line 22: 17432 Aborted                 (core dumped) java -XX:MaxMetaspaceSize=1G -XX:MaxDirectMemorySize=1G -Xmx1G -Xss512M -jar outbackcdx/target/outbackcdx*.jar -p 10002 -d /data/webarchives/outbackcdx-index/

As you can see, I've been messing about with trying to restrict the JVM's memory use, to no avail; the error message is the same if it's just run with the normal command. (I have extremely limited knowledge of Java, so I can't provide much else without direction.)

As for curl, it does crash if you use the --data-binary flag, as that makes curl try to copy the whole file into memory before posting it to the server. Using the -T flag instead avoids this, and monitoring its memory usage, it's insignificant.

@ato
Member

ato commented Dec 15, 2017

Ah. Yeah, that doesn't look like Java memory exhaustion (which would be some variant of OutOfMemoryError) but rather a C++ allocation failing, which definitely hints at RocksDB as the culprit. Since RocksDB allocates outside the Java heap, the Java heap options will probably have no effect on it.

This problem probably needs some more thought, but I've pushed a branch batch-size-limit: acc58e0
which just does the obvious workaround of committing and starting a new RocksDB WriteBatch every 32k records (I just plucked that number out of the air; it's almost certainly not optimal). Maybe give that a try? Unfortunately doing so breaks the atomicity of the requests, but that probably doesn't matter for bulk data loads.
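
To make the idea concrete, here's a minimal sketch of that workaround, not the actual code in the branch: the class name, the record iteration and the key/value encoding are placeholders, and only the RocksDB WriteBatch/WriteOptions calls are the real API.

```java
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

class BatchedLoader {
    // The 32k figure from the branch; almost certainly worth tuning.
    private static final int COMMIT_EVERY = 32 * 1024;

    // Writes key/value pairs through a WriteBatch, but commits and starts a
    // fresh batch every COMMIT_EVERY records so the native batch buffer
    // can't grow without bound for a huge POST.
    static void load(RocksDB db, Iterable<byte[][]> records) throws RocksDBException {
        try (WriteOptions opts = new WriteOptions()) {
            WriteBatch batch = new WriteBatch();
            try {
                int count = 0;
                for (byte[][] kv : records) {
                    batch.put(kv[0], kv[1]);
                    if (++count % COMMIT_EVERY == 0) {
                        db.write(opts, batch);    // flush this batch to the database
                        batch.close();            // free the native buffer
                        batch = new WriteBatch(); // start a fresh batch
                    }
                }
                db.write(opts, batch); // commit the remainder
            } finally {
                batch.close();
            }
        }
    }
}
```

The trade-off is the one mentioned above: each intermediate write is committed on its own, so a failed request can leave a partially-loaded index.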

Of course the easy workaround is breaking the big file up into pieces (e.g. with split -l 1000000, or POSTing it in chunks from a custom script like the sketch below) and loading them one by one, but it would be nice if OutbackCDX could just cope with the large POST.
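
If the script route helps in the meantime, something like this rough Java sketch would do it. It's just an illustration, not part of OutbackCDX: the endpoint, the records.cdx filename and the chunk size are taken from the commands earlier in this thread, and you'd probably want to tune them.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ChunkedCdxLoader {
    public static void main(String[] args) throws Exception {
        String endpoint = "http://localhost:8080/myindex"; // from the curl command above
        int linesPerPost = 1_000_000;                       // mirrors split -l 1000000

        try (BufferedReader in = new BufferedReader(new FileReader("records.cdx"))) {
            StringBuilder chunk = new StringBuilder();
            int lines = 0;
            String line;
            while ((line = in.readLine()) != null) {
                chunk.append(line).append('\n');
                if (++lines == linesPerPost) {
                    post(endpoint, chunk.toString());
                    chunk.setLength(0);
                    lines = 0;
                }
            }
            if (chunk.length() > 0) {
                post(endpoint, chunk.toString()); // send the final partial chunk
            }
        }
    }

    // POST one chunk of CDX lines; chunked streaming avoids buffering the
    // whole request body in the client.
    static void post(String endpoint, String body) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setChunkedStreamingMode(0);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        if (conn.getResponseCode() >= 400) {
            throw new RuntimeException("POST failed with HTTP " + conn.getResponseCode());
        }
        conn.disconnect();
    }
}
```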

@thomaspreeceBBC
Author

So I've just tested the new branch, and it must be something else, as it still crashes after filling the 32 GB of system memory. Is there anything I can pass to Java to get it to dump out some more useful information for you?
