Memory Leak when restoring a Backup #4890
Comments
Are the .xconf files available for reference? There are various options in eXist's Lucene-based full text and range index configuration files. It might be helpful to know which aspects were in play in this case. |
@adamretter If you need any additional info, just let me know. |
Concerned that this issue might affect the release of a forthcoming update to my application, which depends upon the facets and fields feature and fixes included in eXist 6.2.0, I performed some tests to assess whether my application is affected. Starting from a fresh, stock instance of eXist 6.2.0 (with all defaults, including 2GB Xmx), I carried out these steps:
In analyzing the heap dump, we determined that after indexing completed, the instance retained about 40 MB of memory for Lucene indexes, out of a total of 392 MB of retained memory - about 10%. Based on these results, it appears that Lucene is holding onto memory that it should release once indexing is complete. While the retained memory did not cause eXist to crash, the findings appear consistent with Dario's and may demonstrate that a leak is present. I would note that @dariok's dataset is about 8x larger than mine (mine weighs in at 4.37 GB across 37k XML documents, whereas Dario's is 35 GB with 40k documents). His also has a more extensive and granular set of index definitions, including |
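For anyone repeating this kind of comparison on a healthy (non-crashing) instance, a heap dump can also be requested on demand instead of waiting for -XX:+HeapDumpOnOutOfMemoryError to fire. Below is a minimal sketch using the HotSpotDiagnostic MXBean over JMX; the JMX endpoint and output path are placeholder assumptions and need adjusting to however the eXist-db JVM is actually started:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DumpExistHeap {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; adjust host/port to match the JMX settings of the eXist-db JVM.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            HotSpotDiagnosticMXBean diagnostics = ManagementFactory.newPlatformMXBeanProxy(
                    connection, "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            // live=true triggers a full GC first, so the dump contains only objects that
            // are still strongly reachable - the ones of interest when measuring retention.
            // Note: the .hprof file is written on the machine running eXist-db, not locally.
            diagnostics.dumpHeap("/tmp/exist-after-indexing.hprof", true);
        }
    }
}
```

The resulting .hprof file can then be opened in YourKit (or a similar analyzer) to compare the retained sizes of the index-related classes, as described above.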
I did some more heap dump digging and I saw a lot of objects held in the @adamretter Is there a way to access this cache using JMX and try to flush this one? |
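As a general aside (not specific advice from the thread): one exploratory way to answer the JMX question is to connect to the running instance and list whatever MBeans and operations eXist-db exposes, to see whether any cache bean offers a flush/clear operation at all. A hedged sketch, assuming a remote JMX port is enabled and that eXist's MBeans live under an org.exist* domain:

```java
import java.util.Set;
import javax.management.MBeanInfo;
import javax.management.MBeanOperationInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ListExistCacheMBeans {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; adjust host/port to the eXist-db JVM's JMX configuration.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            // Query every MBean whose domain starts with "org.exist" and print its
            // operations, so any cache-related bean (and a possible flush/clear
            // operation, if one exists) can be spotted before invoking anything.
            Set<ObjectName> names = connection.queryNames(new ObjectName("org.exist*:*"), null);
            for (ObjectName name : names) {
                System.out.println(name.getCanonicalName());
                MBeanInfo info = connection.getMBeanInfo(name);
                for (MBeanOperationInfo operation : info.getOperations()) {
                    System.out.println("  operation: " + operation.getName());
                }
            }
        }
    }
}
```

Whether the cache seen in the heap dump is actually reachable this way, and whether flushing it would be safe, would still need to be confirmed against the eXist-db sources.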
@reinhapa As per my screenshot at the top, I also saw objects in the BTreeCache of about ~1GB retained size; however, I believe that to be correct and in line with the sizing of the caches in the I don't know of a way to clear the |
I was able to reproduce the memory leak using the original dataset provided by @dariok. With enough resources, about 10 GB of RAM, the process did finish successfully. After restoring the entire dataset of 35 GB, containing 26874 XML files (including .html, .xsl, .svg), almost 5 GB of RAM remained retained. There are ~59000 instances of I was then wondering whether I was missing +3000 documents until I discovered that The original
<triggers>
<trigger event="update" class="org.exist.collections.triggers.XQueryTrigger">
<parameter name="url" value="xmldb:exist:///db/apps/path/to/module.xql"/>
</trigger>
</triggers>
With the My analysis indicates that TriggerStates are held in memory indefinitely and are the cause of the leak. This is especially interesting as the events that were triggered are not the ones that are defined, and thus one would not expect them to be fired at all. |
Just to be clear: based on my findings, it is unlikely that the changes to fields are related to this issue. |
@joewiz does your dataset define triggers? |
@line-o Not in production, but yes in the version that is being prepared to deploy to production: https://github.com/HistoryAtState/frus/blame/master/volumes.xconf#L2-L9. This is the version that I used in the results I shared above. |
@line-o @adamretter when would/should such a trigger event be processed/completed? |
@reinhapa I don't know / have yet to familiarise myself with the Trigger implementation. |
These are ThreadLocal instances and the intention is that they should be cleared when each Trigger completes - see |
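To make the pattern concrete, here is a hypothetical sketch (the class and method names are invented for illustration and are not eXist-db's actual trigger code): per-thread state kept in a ThreadLocal has to be explicitly popped and removed in a finally block once the trigger completes, otherwise it stays reachable from the pooled thread and shows up as retained memory in a heap dump.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical example of the ThreadLocal-based trigger state pattern;
// names are illustrative and do not correspond to eXist-db's real classes.
public final class TriggerStateExample {

    // Each (pooled) worker thread keeps its own stack of in-flight trigger states.
    private static final ThreadLocal<Deque<Object>> STATES =
            ThreadLocal.withInitial(ArrayDeque::new);

    static void fire(Object state, Runnable triggerBody) {
        STATES.get().push(state);
        try {
            triggerBody.run();
        } finally {
            // Without this cleanup the state remains referenced by the thread for the
            // thread's entire lifetime - which is exactly what an accumulation of
            // TriggerState instances would look like in a heap dump.
            final Deque<Object> stack = STATES.get();
            stack.pop();
            if (stack.isEmpty()) {
                STATES.remove();
            }
        }
    }

    public static void main(String[] args) {
        fire("example-state", () -> System.out.println("trigger body runs here"));
    }
}
```

If some code path fires trigger events without going through an equivalent cleanup step, the corresponding states would accumulate in exactly the way described earlier in this thread.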
Hmm, if I understand @line-o correctly, the import has so far succeeded when using enough memory, and the Range and Lucene indexes free up their memory after the restore, right? If the memory at the end is back at "normal" levels, I would not directly call that part a memory leak in the classic sense, but an uncontrolled temporary overshoot (which we definitely need to look into). @adamretter those triggers should be processed and removed afterwards, right? As for the full-text and Range index: can it be that those are not "fast" enough in processing and we experience some sort of memory back pressure there? |
@reinhapa correct, with enough resources restoring the backup succeeds with or without triggers configured; only the TriggerStates are not removed from memory afterwards. |
Please note that after removing the trigger – while leaving the rest of the configuration untouched – the restore process went smoothly and memory consumption stayed within lower bounds, as far as I could see in @line-o's reports during testing. The Lucene-based indexes thus seem to behave as expected under normal conditions, i.e. if memory is not under pressure due to other memory leaks. |
It is very strange that we are seeing different things from the same process. I will attempt to reproduce my findings again when I have a moment. If I can reproduce them, perhaps I should post a video, as presumably there must be something different between our approaches to reproducing this. |
Maybe you get different results because you used |
@wolfgangmm That's an interesting idea! I didn't understand from @line-o's report that that was what he had done. I will try with and without |
It has been 4 months since I reported my findings. Was anybody able to falsify or reproduce them in the meantime? |
ping... |
I think there are clearly two different things being reported in this issue: (1) the memory retained by the Lucene-backed Full-Text and Range indexes when restoring a backup, which is what this issue was opened about; and (2) the TriggerStates that @line-o reports are not removed from memory after a restore.
I will try and find the Heap Dump that @dariok sent me previously and take a look into (1) as soon as I have some free time. The first thing I will do is provide a checksum of the heap dump file, and clearer steps to reproduce the issue. For (1) if we assume that @line-o used the same Heap Dump from @dariok as I did, then I think he must have taken different steps to me as we got quite different results. If @line-o could provide a checksum of his heap dump file (so we can ensure we are using the same heap dumps), and exact steps to reproduce what he is seeing, I could then look into that also... |
Hi Adam, would you be available to discuss this subject in an upcoming community call? |
I am not using a heap dump, but reproduced the issue by restoring the backup into an instance with 16 GB of RAM. |
@adamretter in my tests using the backup file given to me by @line-o and @dariok, I could only observe memory issues when there were triggers activated with the backup file. I would very much like to discuss potential further tests and possible solutions in an upcoming community call, as suggested by @dizzzz. I find it hard to discuss it here. |
I have added the exact details now at the top of this thread in the description of the issue.
@dizzzz No, I will not attend the community calls. I stopped attending them as it became a toxic environment where I no longer felt welcome; I raised that issue with several people, and nothing has been done about it. If you want to discuss this issue with me, you are welcome to contact me directly. @reinhapa You are also welcome to contact me directly. I think as I stated above, there are clearly two different technical issues. The issue that I opened is about the memory leak with Lucene in eXist-db versions after 6.0.1. The two heap dump files that I was provided that clearly show this are detailed at the top of this issue. I think the potential other issue that @line-o identified should be moved into its own new issue as it is different to this one, and conflating them in this single issue thread just leads to confusion. |
I am all for it @adamretter, will do. For the record: I can not reproduce the issue you describe. |
Yes, I was already clear on that. I am not sure why, as I and others can reproduce it. Anyway, I will work on some way of presenting further evidence... perhaps a screen recording showing the issue |
I haven't seen anyone else commenting here that they were able to reproduce the issue. |
Well, there are clearly two people reporting this here - me and Dario. In addition, there are some customers of ours (Evolved Binary) who won't comment on GitHub issues... |
The issue as described here did not lead to a memory leak in my testing. The trigger definition needs to be added, which is the case in @dariok's original backup. |
It does lead to a memory leak in my and others' testing, even without the triggers being defined. I don't see how repeating that you can't see the issue helps anyone at all. Let's move forward please - as requested, please feel free to open a separate issue for the separate thing you are seeing @line-o |
@dariok since you were the first person to notice a problem when you restored your backup: did you retest without triggers? |
@adamretter It is correct that I am on holidays and I won't be able to open a separate issue this week. I believe I will find time to do this next week, however. If you want to publish your PR in the meantime, @dizzzz and @reinhapa may already be able to review the fix, and the issue number can then be added afterwards. |
@line-o we will await your issue next week |
@line-o If you are back from your holidays now, would you be able to open an issue please? |
@line-o Hope you are ok... can we get an update please? |
@line-o but that's just not true... two separate groups have been able to reproduce this! (see: #4890 (comment)) |
@dariok Please give feedback if you were able to reproduce the issue with current develop-6.x.x |
I have executed 4 restores into clean DBs, especially with the biggest files from my dataset. As far as I can currently tell (restriction: see #5567 which meant that I could not restore the full 23G of data I currently have in one go), restores run through without a problem. |
@dariok That was done with develop-6.x.x, not 6.3.0, correct? |
The |
@line-o What about all of the time and effort myself and my colleagues put into reproducing this and confirming it was an issue... Does that count for nothing? Why are our reports not considered valid? |
When restoring a clean instance of eXist-db 6.2.0 from a backup we are able to observe what we believe to be a memory leak in the Full-Text and Range indexes of eXist-db; to be clear I am referring to the implementations of both of those indexes that are backed by Lucene.
The database backup is approximately 40,000 XML documents (35GB).
Having configured the JVM for eXist-db to -XX:+HeapDumpOnOutOfMemoryError and -XX:+ExitOnOutOfMemoryError, we have captured the heap dumps from the JVM and analyzed them with YourKit Java Profiler.

Heap Dump 1
@dariok provided me with the file: java_pid236689.hprof (md5sum: 54c669ddba7532bee2449922f2fba1d5). This shows that of the 2.9GB available to eXist-db, 1.1GB is retained (at the time that eXist-db ran out of memory) by the RangeIndex.

Heap Dump 2
@dariok provided me with the file: java_pid318495.hprof (md5sum: 9a99f30986fe2579b8c54efe6547e05b). Re-running with a clean eXist-db instance and restoring the same backup again, but with 6GB of memory available to eXist-db, we see that 2.5GB and 1.1GB are retained (at the time that eXist-db ran out of memory) by the LuceneIndex and RangeIndex respectively.

Observations
For reference, the collection.xconf in use looks like: