Problems with large cdx files #13

Open
thomaspreeceBBC opened this issue Dec 15, 2017 · 5 comments

@thomaspreeceBBC

So I know I'm most likely misusing OutbackCDX, but I have a couple of very large CDX files I'd like to move over to it. However, posting them causes OutbackCDX to slowly consume more and more memory until it runs out and crashes. The command I'm using to post the data is:

curl -o upload.txt --progress-bar -X POST -T records.cdx http://localhost:8080/myindex
@ato
Member

ato commented Dec 15, 2017

That's odd. It's supposed to read the input incrementally. That said, I think most people have been using it incrementally, with one POST per WARC processed, so loading an enormous number of records in a single request hasn't been tested much. Do you get any sort of error message or stack trace when it crashes?

It might be a limitation of RocksDB. I had assumed write batches don't require much memory to track, but that could be a false assumption, and we might need to break very large data loads up into multiple batches.

@ato
Member

ato commented Dec 15, 2017

Maybe also check curl's memory usage. I seem to recall curl buffering the request body in memory rather than streaming it, in which case we might also need to find something other than curl to load the files.

@thomaspreeceBBC
Author

So this is the output it gives after it crashes:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
./run_outback.sh: line 22: 17432 Aborted                 (core dumped) java -XX:MaxMetaspaceSize=1G -XX:MaxDirectMemorySize=1G -Xmx1G -Xss512M -jar outbackcdx/target/outbackcdx*.jar -p 10002 -d /data/webarchives/outbackcdx-index/

As you can see, I've been messing about with trying to restrict the JVM's memory use, to no avail; the error message is the same if it's just run with the normal command. (I have extremely limited knowledge of Java, so I can't provide much else without direction.)

As for curl, it does crash if you use the --data-binary flag, as that makes curl try to copy the whole file into memory before posting it to the server. Using the -T flag instead avoids this, and monitoring its memory usage, it's insignificant.

@ato
Member

ato commented Dec 15, 2017

Ah. Yeah, that doesn't look like Java memory exhaustion (which would be some variant of OutOfMemoryError) but rather a C++ allocation failing, which definitely hints at RocksDB as the culprit. Since RocksDB allocates outside the Java heap, the Java heap options will probably have no effect on it.

This problem probably needs some more thought, but I've pushed a branch batch-size-limit: acc58e0
which just does the obvious workaround of committing and starting a new RocksDB WriteBatch every 32k records (I just plucked that number out of the air; it's almost certainly not optimal). Maybe give that a try? Unfortunately doing so breaks the atomicity of the requests, but that probably doesn't matter for bulk data loads.
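
To make the idea concrete, here's a minimal sketch of that workaround, not the actual code in the branch: the class name, the record iteration and the key/value encoding are placeholders, and only the RocksDB WriteBatch/WriteOptions calls are the real API.

```java
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

class BatchedLoader {
    // The 32k figure from the branch; almost certainly worth tuning.
    private static final int COMMIT_EVERY = 32 * 1024;

    // Writes key/value pairs through a WriteBatch, but commits and starts a
    // fresh batch every COMMIT_EVERY records so the native batch buffer
    // can't grow without bound for a huge POST.
    static void load(RocksDB db, Iterable<byte[][]> records) throws RocksDBException {
        try (WriteOptions opts = new WriteOptions()) {
            WriteBatch batch = new WriteBatch();
            try {
                int count = 0;
                for (byte[][] kv : records) {
                    batch.put(kv[0], kv[1]);
                    if (++count % COMMIT_EVERY == 0) {
                        db.write(opts, batch);    // flush this batch to the database
                        batch.close();            // free the native buffer
                        batch = new WriteBatch(); // start a fresh batch
                    }
                }
                db.write(opts, batch); // commit the remainder
            } finally {
                batch.close();
            }
        }
    }
}
```

The trade-off is the one mentioned above: each intermediate write is committed on its own, so a failed request can leave a partially-loaded index.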

Of course the easy workaround is breaking the big file up into pieces (e.g. with split -l 1000000, or POSTing it in chunks from a custom script like the sketch below) and loading them one by one, but it would be nice if OutbackCDX could just cope with the large POST.
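
If the script route helps in the meantime, something like this rough Java sketch would do it. It's just an illustration, not part of OutbackCDX: the endpoint, the records.cdx filename and the chunk size are taken from the commands earlier in this thread, and you'd probably want to tune them.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ChunkedCdxLoader {
    public static void main(String[] args) throws Exception {
        String endpoint = "http://localhost:8080/myindex"; // from the curl command above
        int linesPerPost = 1_000_000;                       // mirrors split -l 1000000

        try (BufferedReader in = new BufferedReader(new FileReader("records.cdx"))) {
            StringBuilder chunk = new StringBuilder();
            int lines = 0;
            String line;
            while ((line = in.readLine()) != null) {
                chunk.append(line).append('\n');
                if (++lines == linesPerPost) {
                    post(endpoint, chunk.toString());
                    chunk.setLength(0);
                    lines = 0;
                }
            }
            if (chunk.length() > 0) {
                post(endpoint, chunk.toString()); // send the final partial chunk
            }
        }
    }

    // POST one chunk of CDX lines; chunked streaming avoids buffering the
    // whole request body in the client.
    static void post(String endpoint, String body) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setChunkedStreamingMode(0);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        if (conn.getResponseCode() >= 400) {
            throw new RuntimeException("POST failed with HTTP " + conn.getResponseCode());
        }
        conn.disconnect();
    }
}
```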

@thomaspreeceBBC
Author

So I've just tested the new branch, and it must be something else, as it still crashes after filling the 32 GB of system memory. Is there anything I can pass to Java to get it to dump out some more useful information for you?
