
Memory Leak when restoring a Backup #4890

Open
adamretter opened this issue Apr 27, 2023 · 49 comments

@adamretter
Contributor

adamretter commented Apr 27, 2023

When restoring a clean instance of eXist-db 6.2.0 from a backup we are able to observe what we believe to be a memory leak in the Full-Text and Range indexes of eXist-db; to be clear I am referring to the implementations of both of those indexes that are backed by Lucene.

The database backup is approximately 40,000 XML documents (35GB).

Having configured the JVM for eXist-db to -XX:+HeapDumpOnOutOfMemoryError and -XX:+ExitOnOutOfMemoryError, we have captured the heap dumps from the JVM and analyzed them with YourKit Java Profiler.
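For reference, the same kind of dump can also be captured on demand (independent of an OOM event) from inside the running JVM via the JDK's HotSpotDiagnosticMXBean; a minimal sketch, with an illustrative output path:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumpOnDemand {

    public static void main(final String[] args) throws Exception {
        final HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);

        // confirm the OOM heap-dump flag is actually set on the running JVM
        System.out.println(diag.getVMOption("HeapDumpOnOutOfMemoryError"));

        // write a dump of live (reachable) objects only; the path is illustrative
        // and must not already exist
        diag.dumpHeap("/tmp/exist-restore.hprof", true);
    }
}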

Heap Dump 1

@dariok provided me with the file: java_pid236689.hprof (md5sum: 54c669ddba7532bee2449922f2fba1d5)

dump1-part1
dump1-part2

This shows that of the 2.9 GB available to eXist-db, 1.1 GB is retained (at the time eXist-db ran out of memory) by the RangeIndex.

Heap Dump 2

@dariok provided me with the file: java_pid318495.hprof (md5sum: 9a99f30986fe2579b8c54efe6547e05b)

dump2
Re-running with a clean eXist-db instance and restoring the same backup again but with 6GB of memory available to eXist-db, we see that 2.5GB and 1.1GB are retained (at the time that eXist-db ran out of memory) by the LuceneIndex and RangeIndex respectively.

Observations

  1. The problem is not reproducible with eXist-db 6.0.1.
  2. The Git history shows that the only changes to those indexes came from the PR Field perf revised #4541, which first shipped in eXist-db 6.1.0.

For reference, the collection.xconf in use looks like:

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index xmlns:mets="http://www.loc.gov/METS/" xmlns:meta="https://github.com/dariok/wdbplus/wdbmeta" xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <lucene>
      <analyzer id="std" class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>

      <text qname="tei:persName"/>
      <text qname="tei:placeName"/>
      <text qname="tei:orgName"/>
      <text qname="tei:title"/>
      <text qname="tei:text"/>
      <text qname="tei:note"/>
      <text qname="tei:p"/>
      <text qname="tei:ab"/>
      <text qname="tei:opener"/>
      <text qname="tei:closer"/>
      <text qname="tei:postscript"/>
      <text qname="tei:cell"/>
      <text qname="tei:label"/>
      <text qname="tei:item"/>
      <text qname="tei:quote"/>
      <text qname="tei:head"/>
      <text qname="tei:l"/>
      <text qname="tei:w"/>

      <inline qname="tei:unclear"/>
      <inline qname="tei:supplied"/>
      <inline qname="tei:gap"/>
      <inline qname="tei:ex"/>
      <inline qname="tei:hi"/>
      
      <ignore qname="tei:listPerson"/>
      <ignore qname="tei:listPlace"/>
      <ignore qname="tei:listOrg"/>
      <ignore qname="tei:listBibl"/>
      <ignore qname="tei:note"/>

      <ignore qname="tei:pb"/>
      <ignore qname="tei:fw"/>
      <ignore qname="tei:lb"/>
      <ignore qname="tei:pb" /> 
      <ignore qname="tei:pc"/>
    </lucene>
    <range>
      <create qname="mets:file">
        <field name="mets-file-id" match="@ID" type="xs:string"/>
      </create>
      <create qname="meta:file">
        <field name="file-path" match="@path" type="xs:string"/>
      </create>
      <create qname="@ref" type="xs:string"/>
      <create qname="@cRef" type="xs:string"/>
      <create qname="@type" type="xs:string"/>
      <create qname="@place" type="xs:string"/>
      <create qname="tei:rs">
        <field name="rs-type" match="@type" type="xs:string"/>
        <field name="rs-ref" match="@ref" type="xs:string"/>
      </create>
      <create qname="@when" type="xs:date"/>
      <create qname="@notBefore" type="xs:date"/>
      <create qname="@notAfter" type="xs:date"/>
      <create qname="tei:w" type="xs:string"/>
    </range>
  </index>
</collection>
@adamretter adamretter changed the title from Memory Leak in eXist-db (Lucene) Full-Text and Range Lucene Index to Memory Leak in eXist-db's (Lucene) Full-Text and Range Indexes Apr 27, 2023
@joewiz
Member

joewiz commented Apr 27, 2023

Are the .xconf files available for reference? There are various options in eXist's Lucene-based full text and range index configuration files. It might be helpful to know which aspects were in play in this case.

@dariok

dariok commented Apr 28, 2023

@adamretter added the collection.xconf above. For one of my test runs, I reduced that one by removing some (more or less random) parts from both the full text and the range index config but that did not seem to change anything.

If you need any additional info, just let me know.

@joewiz
Member

joewiz commented May 13, 2023

Concerned that this issue might affect the release of a forthcoming update to my application, which depends upon the facets and fields feature and fixes included in eXist 6.2.0, I performed some tests to try to assess if my application is affected.

Starting from a fresh, stock instance of eXist 6.2.0 (with all defaults, including 2GB Xmx), I carried out these steps:

  1. Stored all history.state.gov collection configuration files (captured from /db/system/config on a complete hsg-project system, as shown in the attached zip file) to ensure that all indexes are configured before any data is stored.
  2. Uploaded all hsg-project repositories via WebDAV to store and index the data. (I then confirmed via monex's indexing pane that a sample facet index was successfully created.)
  3. Triggered a java heap dump via jmap -dump:format=b,file=/tmp/heapdump.bin <pid>.

In analyzing the heap dump, we determined that after indexing completed, the instance retained about 40 MB of memory for Lucene indexes, out of 392 MB of total retained memory - about 10%.

Based on these results, it appears that Lucene is holding onto memory that it should release after indexing is complete. While the retained memory did not cause eXist to crash, the findings are consistent with Dario's and may indicate that a leak is present.

I would note that @dariok's dataset is about 8x larger than mine. (Mine weighs in at 4.37 GB in 37k XML documents, whereas Dario's is 35 GB with 40k documents). His also has a more extensive and granular set of index definitions, including <tei:w> - i.e., every individual word - whereas mine is at the <tei:div> (article) level.

@reinhapa reinhapa self-assigned this Jun 23, 2023
@reinhapa
Member

I did some more heap dump digging and saw a lot of objects held in the BTreeCache, with references to it in a ThreadLocal:
[screenshot]

Here are some more details:
[screenshot]

@adamretter Is there a way to access this cache using JMX and try to flush it?

@adamretter
Contributor Author

adamretter commented Jun 28, 2023

@reinhapa As per my screenshot at the top, I also saw objects in the BTreeCache with about ~1GB retained size; however, I believe that to be correct and in line with the sizing of the caches in the conf.xml files.

I don't know of a way to clear the BTreeCache via JMX (see the: org.exist.management.impl package), however it should not be too hard to expose something if you need it.
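A minimal sketch of how such a flush operation could be exposed over standard JMX, assuming it delegates to the real cache; the MBean and operation names here are hypothetical and are not part of org.exist.management.impl today:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical example only - this MBean does not currently exist in eXist-db;
// the names are illustrative.
public class CacheControl implements CacheControlMBean {

    @Override
    public void flushBTreeCache() {
        // would delegate to the BTreeCache instance owned by the BrokerPool
    }

    // register once at startup so the operation shows up in JConsole/VisualVM
    public static void register() throws Exception {
        final MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        final ObjectName name = new ObjectName("org.exist.management:type=CacheControl");
        server.registerMBean(new CacheControl(), name);
    }
}

// standard JMX naming convention: management interface = class name + "MBean"
interface CacheControlMBean {
    void flushBTreeCache();
}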

@line-o
Member

line-o commented Jul 5, 2023

I was able to reproduce the memory leak using the original dataset provided by @dariok. With enough resources, about 10 GB of RAM, the process did finish successfully. After restoring the entire dataset of 35 GB, containing 26874 XML files (as well as .html, .xsl and .svg files), almost 5 GB of RAM remained retained.

Screenshot 2023-07-04 at 14 24 36

There are ~59000 instances of TriggerStatePerThread$TriggerState.
Since there are two instances for each restored item, that corresponds to ~29500 CREATE_DOCUMENT triggers - one for the BEFORE and one for the AFTER phase.

Screenshot 2023-07-05 at 11 32 00

I was then wondering whether I was missing some 3000 documents, until I discovered that CREATE_COLLECTION was also invoked:

Screenshot 2023-07-05 at 12 08 21

The original collection.xconf defines an update trigger as follows:

<triggers>
    <trigger event="update" class="org.exist.collections.triggers.XQueryTrigger">
        <parameter name="url" value="xmldb:exist:///db/apps/path/to/module.xql"/>
    </trigger>
</triggers>

With the collection.xconf as it is in the issue description (without any triggers configured) the outcome was very different.
RAM usage stayed within reasonable bounds (<2 GB) and all memory was freed afterwards (see screenshot).

Screenshot 2023-07-05 at 09 26 17

My analysis indicates that TriggerStates are held in memory indefinitely and are the cause of the leak. This is especially interesting as the events that were triggered are not the ones that are defined and thus one would not expect them to be fired at all.

@line-o
Member

line-o commented Jul 5, 2023

Just to be clear: based on my findings, it is unlikely that the changes to fields are related to this issue.

@line-o
Member

line-o commented Jul 5, 2023

@joewiz does your dataset define triggers?

@joewiz
Member

joewiz commented Jul 5, 2023

@line-o Not in production, but yes in the version that is being prepared to deploy to production: https://github.com/HistoryAtState/frus/blame/master/volumes.xconf#L2-L9. This is the version that I used in the results I shared above.

@reinhapa
Member

reinhapa commented Jul 5, 2023

@line-o @adamretter when would/should such a trigger event be processed/completed?

@line-o
Member

line-o commented Jul 6, 2023

@reinhapa I don't know / have yet to familiarise myself with the Trigger implementation.

@line-o line-o changed the title from Memory Leak in eXist-db's (Lucene) Full-Text and Range Indexes to Memory Leak when restoring a Backup Jul 6, 2023
@adamretter
Contributor Author

@reinhapa @line-o This is very different to the memory profiles that I showed above. The profiles I posted clearly show the memory usage to be within the RangeIndex and LuceneIndex. Perhaps @line-o has found another different memory leak...

@adamretter
Contributor Author

TriggerStates are held in memory indefinitely

These are ThreadLocal instances, and the intention is that they should be cleared when each Trigger completes - see TriggerStatePerThread#clear().
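For illustration only (not eXist-db's actual implementation), the general pattern: state kept in a ThreadLocal outlives the operation on long-lived pooled broker threads unless remove() is guaranteed to run, e.g. from a finally block:

// Illustrative sketch of the pattern, not eXist-db's actual code.
public final class TriggerStateSketch {

    private static final ThreadLocal<Object> STATE = new ThreadLocal<>();

    public static void set(final Object state) {
        STATE.set(state);
    }

    public static void clear() {
        STATE.remove(); // without this, the state stays reachable via the thread
    }

    // guarantee cleanup even if the trigger throws
    public static void fire(final Runnable trigger, final Object state) {
        set(state);
        try {
            trigger.run();
        } finally {
            clear();
        }
    }
}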

@reinhapa
Member

reinhapa commented Jul 7, 2023

@reinhapa @line-o This is very different to the memory profiles that I showed above. The profiles I posted clearly show the memory usage to be within the RangeIndex and LuceneIndex. Perhaps @line-o has found another different memory leak...

Hmm, if I understand @line-o correctly, the import succeeded when given enough memory, and the Range and Lucene indexes free up their memory after the restore, right?

If the memory at the end is at "normal" levels again, I would not directly call that part a memory leak in the classic sense, but an uncontrolled temporary overshoot (that we definitely need to look into).

@adamretter those triggers should be processed and removed afterwards, right? As for the Full-Text and Range indexes: could it be that they are not "fast" enough in processing and we experience some sort of memory back pressure there?

@line-o
Member

line-o commented Jul 7, 2023

@reinhapa correct, with enough resources restoring the backup succeeds with or without triggers configured and only TriggerStates are not removed from memory afterwards.

@wolfgangmm
Member

Please note that after removing the trigger – while leaving the rest of the configuration untouched – the restore process went smoothly and memory consumption stayed within lower bounds, as far as I could see in @line-o's reports during testing. The Lucene-based indexes thus seem to behave as expected under normal conditions, i.e. if memory is not under pressure due to other memory leaks.

@adamretter
Contributor Author

It is very strange that we are seeing different things from the same process. I will attempt to reproduce my findings again when I have a moment. If I can reproduce them, perhaps I should post a video, as presumably there must be something different between our approaches to reproducing this.

@wolfgangmm
Member

Maybe you get different results because you used -XX:+HeapDumpOnOutOfMemoryError, which means that the system was already in a state in which it was fighting for its life, thus showing misleading results? @line-o on the other hand made sure the process would end by giving it enough memory, so this could make a difference.

@adamretter
Contributor Author

@wolfgangmm That's an interesting idea! I didn't understand from @line-o's report that that was what he had done. I will try with and without -XX:+HeapDumpOnOutOfMemoryError and see if that makes a difference and report back.

@line-o
Member

line-o commented Nov 17, 2023

It has been 4 months since I reported my findings. Was anybody able to falsify or reproduce them in the meantime?

@dizzzz
Member

dizzzz commented Apr 24, 2024

ping...

@adamretter
Contributor Author

adamretter commented Jun 6, 2024

I think there are clearly two different things being reported in this issue:

  1. The memory leak I initially reported in the Full-Text and Range indexes (implemented atop Lucene)
  2. The issue reported by @line-o with regards the TriggerStatePerThread possibly not being cleared correctly.

I will try and find the Heap Dump that @dariok sent me previously and take a look into (1) as soon as I have some free time. The first thing I will do is provide a checksum of the heap dump file, and clearer steps to reproduce the issue.

For (1) if we assume that @line-o used the same Heap Dump from @dariok as I did, then I think he must have taken different steps to me as we got quite different results. If @line-o could provide a checksum of his heap dump file (so we can ensure we are using the same heap dumps), and exact steps to reproduce what he is seeing, I could then look into that also...
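For anyone comparing dump files, the checksum can be computed with md5sum or, equivalently, with a few lines of plain Java (a minimal sketch):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class Md5Sum {

    public static void main(final String[] args) throws Exception {
        final MessageDigest md5 = MessageDigest.getInstance("MD5");
        final byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int read;
            while ((read = in.read(buf)) != -1) {
                md5.update(buf, 0, read);
            }
        }
        final StringBuilder hex = new StringBuilder();
        for (final byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex); // same value md5sum prints for the file
    }
}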

@dizzzz
Member

dizzzz commented Jun 17, 2024

Hi Adam, would you be available to discuss this subject in an upcoming community call?

@line-o
Member

line-o commented Jun 17, 2024

I am not using a heap dump, but reproduced the issue by restoring the backup into an instance with 16 GB of RAM:

  • first attempt with the backup unchanged
  • second attempt without triggers defined in the collection.xconf

@reinhapa
Member

reinhapa commented Jun 17, 2024

I think there are clearly two different things being reported in this issue:

  1. The memory leak I initially reported in the Full-Text and Range indexes (implemented atop Lucene)
  2. The issue reported by @line-o with regards the TriggerStatePerThread possibly not being cleared correctly.

@adamretter in my tests using the backup file given to me by @line-o and @dariok, I could only observe memory issues when triggers were activated with the backup file. I would very much like to discuss potential further tests and possible solutions in an upcoming community call, as suggested by @dizzzz. I find it hard to discuss it here.

@adamretter
Contributor Author

The first thing I will do is provide a checksum of the heap dump file

I have added the exact details now at the top of this thread in the description of the issue.

Hi Adam, would you be available to discuss this subject in an upcoming community call?

@dizzzz No, I will not attend the community calls. I stopped attending them as it became a toxic environment where I no longer felt welcome; I raised that issue with several people, and nothing has been done about it. If you want to discuss this issue with me, you are welcome to contact me directly.

@reinhapa You are also welcome to contact me directly.

I think as I stated above, there are clearly two different technical issues. The issue that I opened is about the memory leak with Lucene in eXist-db versions after 6.0.1. The two heap dump files that I was provided that clearly show this are detailed at the top of this issue.

I think the potential other issue that @line-o identified should be moved into its own new issue as it is different to this one, and conflating them in this single issue thread just leads to confusion.

@adamretter
Contributor Author

adamretter commented Jul 11, 2024

@line-o @reinhapa Would you be able to open a new issue for the other memory leak you believe you are seeing please, and copy all details there and add anything else that is needed to reproduce it step-by-step. I would like to see if I can also reproduce that one...

@line-o
Member

line-o commented Jul 11, 2024

I am all for it @adamretter, will do. For the record: I can not reproduce the issue you describe.

@adamretter
Contributor Author

I can not reproduce the issue you describe.

Yes, I was already clear on that. I am not sure why, as I and others can reproduce it. Anyway, I will work on some way of presenting further evidence... perhaps a screen recording showing the issue.

@line-o
Member

line-o commented Jul 12, 2024

I haven't seen anyone else commenting here that they were able to reproduce the issue.

@adamretter
Contributor Author

I haven't seen anyone else commenting here that they were able to reproduce the issue.

Well, there are clearly two people reporting this here - me and Dario. In addition there are some customers of ours (Evolved Binary) who won't comment on GitHub issues...

@line-o
Member

line-o commented Jul 15, 2024

The issue as described here did not lead to a memory leak in my testing. The trigger definition needs to be added for the leak to appear, which is the case in @dariok's original backup.

@adamretter
Contributor Author

adamretter commented Jul 15, 2024

It does lead to a memory leak in my and others' testing, even without the triggers being defined.

I don't see how repeating that you can't see the issue helps anyone at all. Let's move forward please - as requested, please feel free to open a separate issue for the separate thing you are seeing @line-o

@line-o
Member

line-o commented Jul 15, 2024

@dariok since you were the first person to notice a problem when you restored your backup: did you retest without triggers?

@adamretter
Contributor Author

adamretter commented Aug 28, 2024

@line-o Separately to this issue, we have diagnosed your Trigger issue and have a bugfix ready to send in. As I previously requested ([1], [2], and [3]), please could you open a new issue for your Trigger issue; we can then send a PR for you with the fix marked as closing that issue.

@adamretter
Contributor Author

@reinhapa @dizzzz I haven't heard back from @line-o about this; do you know if he is on holiday? We would like to get this fix sent in ASAP!

@line-o
Member

line-o commented Sep 2, 2024

@adamretter It is correct that I am on holidays and I won't be able to open a separate issue this week. I believe I will find time to do this next week, however. If you want to publish your PR in the meantime, @dizzzz and @reinhapa may already be able to review the fix, and the issue number can then be added afterwards.

@adamretter
Contributor Author

@line-o we will await your issue next week

@adamretter
Contributor Author

@line-o If you are back from your holidays now, would you be able to open an issue please?

@adamretter
Contributor Author

@line-o Hope you are ok... can we get an update please?

@adamretter
Contributor Author

The concerns of @line-o about an issue in XQuery Triggers have now been moved into a separate issue here: #5459

@line-o
Member

line-o commented Oct 31, 2024

This issue as described here cannot be reproduced. I propose to close it.
@dizzzz @reinhapa

@line-o line-o moved this to In review in v6.3.1 release Oct 31, 2024
@adamretter
Contributor Author

adamretter commented Oct 31, 2024

This issue as described here cannot be reproduced. I propose to close it. @dizzzz @reinhapa

@line-o but that's just not true... two separate groups have been able to reproduce this! (see: #4890 (comment))

@line-o line-o assigned dariok and unassigned reinhapa Nov 11, 2024
@line-o
Member

line-o commented Nov 11, 2024

@dariok Please give feedback if you were able to reproduce the issue with current develop-6.x.x

@dariok

dariok commented Nov 21, 2024

I have executed 4 restores into clean DBs, in particular with the biggest files from my dataset.

As far as I can currently tell (restriction: see #5567, which meant that I could not restore the full 23 GB of data I currently have in one go), restores run through without a problem.

@line-o
Member

line-o commented Nov 21, 2024

@dariok That was done with develop-6.x.x, not 6.3.0, correct?

@dariok

dariok commented Nov 21, 2024

@line-o
develop-6.x.x @ f2981e1, to be precise.

@line-o
Member

line-o commented Nov 23, 2024

The develop-6.x.x branch does include the fix for #5459 but does not revert the Lucene changes. The fact that restoring into a DB built from this source now succeeds is a strong indicator that this was the only memory leak, and we can close this issue as fixed as well.

@adamretter
Contributor Author

adamretter commented Nov 23, 2024

is a strong indicator that this was the only memory leak and we can close this issue as fixed as well.

@line-o What about all of the time and effort myself and my colleagues put into reproducing this and confirming it was an issue... Does that count for nothing? Why are our reports not considered valid?
