Memory Consumption of doxyindex & doxygenclass #315
Hi, you might want to try changing the cache that Breathe uses. By default it just stuffs all the results of parsing files into a dictionary which is created here: https://github.com/michaeljones/breathe/blob/master/breathe/parser/__init__.py#L99 The cache was introduced to speed things up, but realistically it might get very big. If you replace that dictionary with an object that has a dictionary interface but never stores any data, then you might have luck reducing the memory usage. It is possible that some kind of LRU cache would represent the best of both worlds, but I have no idea what the cache hit/miss pattern would be, so it is hard to say.
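The "dictionary interface but never stores any data" idea can be sketched in a few lines (illustrative only; `NullCache` is an invented name, not part of Breathe):

```python
class NullCache(dict):
    """Dict-compatible cache that silently discards everything stored in it."""

    def __setitem__(self, key, value):
        # Drop the parsed document instead of keeping it resident;
        # every subsequent lookup becomes a cache miss.
        pass

    def setdefault(self, key, default=None):
        # Never store anything, just hand back the freshly computed value.
        return default
```

Swapping something like this in for the plain `{}` trades memory for repeated re-parsing of the XML files.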
Also, thanks for using Breathe and for raising this issue! :)
Do you have any results from this at all? Happy to introduce a basic cache/no-cache flag to the main configuration options to help people deal with this.
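The LRU variant mentioned above could be prototyped with the standard library alone; a hedged sketch (the function name and `maxsize` value are illustrative, not Breathe API):

```python
import functools
from xml.dom import minidom

# Bound how many parsed documents stay resident; least-recently-used
# entries are evicted and simply re-parsed on the next access.
@functools.lru_cache(maxsize=32)
def parse_file(path):
    return minidom.parse(path)
```

`parse_file.cache_info()` would then reveal the hit/miss pattern that the comment above says is unknown.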
I did not try to modify the cache yet, but I am now at a point where even selecting a few dozen classes (out of many hundreds; the number of doxygen XML files is 2791) in a page via … Update: I tried skipping the cache build entirely, which makes the build slow, as expected...
Digging a bit deeper, the underlying problem seems to be that single entries in the cache can easily reach megabytes in size. The main consumers seem to be …
The interesting thing is that the …
I think the main issue is the memory footprint of the XML parser. I think we have two ways to speed things up while keeping a cache: A) keep everything in place, but cache the raw file and parse on every access in-RAM. I will hack around a bit and report back.
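Option (A) could look roughly like this (a sketch; `RawFileCache` and its methods are invented names, not Breathe internals):

```python
from xml.dom import minidom

class RawFileCache:
    """Keep only raw XML bytes in memory; re-parse on every access.

    Trades CPU (repeated parsing) for memory (no DOM trees kept alive).
    """

    def __init__(self):
        self._raw = {}  # path -> bytes

    def parse(self, path):
        if path not in self._raw:
            with open(path, "rb") as f:
                self._raw[path] = f.read()
        # Parse in RAM on each access; the DOM is garbage-collected
        # as soon as the caller drops its reference.
        return minidom.parseString(self._raw[path])
```

Since plain XML on disk is much smaller than its DOM, this should cap the cache size near the total size of the XML output.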
Since all the reports on … I tried starting to replace … I did not fully succeed yet, but here is the branch so far (progress logged in the commit message):
@michaeljones although I don't have the time to work on this more deeply (sorry!), I think this issue should become a priority for the project, since it needs coordination and is quite limiting for deployments on e.g. readthedocs. I think the overall project should migrate away from the memory-leaking and slow …
@svenevs since you are working on … My try to get rid of …
Hehe. LOL, well, Exhale is reliably slower for different reasons. Here's a braindump; it is probably more information than you want... I'm finishing up some touches to upload it to …

**Worth keeping in mind**

The "legacy" Exhale was done in one file, and (through dubious …

**Notes on BeautifulSoup and lxml**

**Parsing the API and memory**

In Exhale, I was very cautious not to build out the entire graph while loading in all XML documents. I know for a fact that I can get away with this, but I'm not sure that …

So basically, the question for …

**In the end**

I think I'm ready to re-upload to … Note: I don't actually have any memory benchmarks for Exhale, only timing. I only know how to do memory benchmarks in C++...
Thanks for the detailed answer!
The main problem is not that the entire XML DOM gets loaded at once. That is fine and efficient for access to it, and even large projects seldom generate more than 100 MB of plain XML (which is smaller in RAM when parsed). The main problem is that …
Ahhh Ok I see. So you have two problems:
If the API is stable enough, did you try using …? TBH, though, I tried giving …
Anywho, I guess the point is that both seem to consume a LOT of memory when parsing the Doxygen XML for your project. For Exhale, I guess I have to lay blame on …
Thx for giving it a try! Yes, I think the underlying problem is actually just … My …
@svenevs @michaeljones @SylvainCorlay did any of you, by coincidence, try to reduce the memory footprint of Breathe? On RTD, my builds regularly get killed due to the leaks in … I tried above to get rid of …
So if we're convinced that … I think you may be able to regenerate using … Myself, I would vote for replacing … (if it actually works). So to do it, I think the steps you would need to take are
Then I think all that changes is this method to use … I don't know if the …
Thanks for the information, that is super useful! I gave it a little try, and indeed in recent versions generateDS depends on lxml with a fallback to ElementTree :) @michaeljones do you recall or did you save the original call to …? I tried so far:

```shell
generateDS -f -o compound.py -s compoundsuper.py --super="compoundsuper" ~/src/picongpu/docs/xml/compound.xsd
generateDS -f -o index.py -s indexsuper.py --super="indexsuper" ~/src/picongpu/docs/xml/index.xsd
```

which at least updates the four files in breathe/parser/, but not the … But installing and trying it leads to errors of the form
I'm afraid I'm not up to date on this thread yet, though I will try to catch up soon. I don't believe I have the original generation command saved, as the generated files have long since been edited due to parts that seemed to be missing somehow. I think there are levels of indirection in the XML that weren't handled well, so manual adjustments seemed necessary; at the same time, the Doxygen XML seemed stable enough that regenerating from a new schema was not going to be a win. I believe that the generated files used to have the command at the top, or perhaps just a comment indicating that they were generated. I removed it when it became apparent that I wasn't going to regenerate them. It might still be available in the source history, though.
Thanks for catching up! :) Unfortunately, the old commands only have a …
Sweet. I believe what this means is, if this is all successful, …

```
File ".local/lib/python2.7/site-packages/breathe-4.7.3-py2.7.egg/breathe/finder/core.py", line 29, in create_finder
    return self.finders[data_object.node_type](self.project_info, data_object, self)
AttributeError: 'DoxygenType' object has no attribute 'node_type'
```

That is part of "step 2", sorry if that wasn't clear. You'll have to manually splice that all back in to the super / sub classes 😱 E.g. … That guy is the only thing that actually tells the rest of Breathe what this is (either for directives, signatures, or rendering things like …).

Also, I'm not sure if you are aware, but you can do "editable" installs with `pip`:

```shell
# In directory
# /some/path/breathe
#   - setup.py
#   - breathe/
#       parser/ ...
$ pip install -e .
```

The egg link is generated in the local directory, and you can keep updating the files here without needing to re-install.
Btw, I found the changelog of generateDS (it went from something like version 1.18 in 2009 to 2.29 today).
I have currently added all required …
I am not fully building yet, but from a first test the memory consumption is still really bad...
Are you able to quantify "really bad"? Is it still greater than 4 GB?
Also, what version of Python are you using? In …
Thanks for the hint. I am testing with Python 2.7 and 3.5 - interesting issue with the … I am not sure if I posted it already, but this is how one can reproduce it with our data:

```shell
git clone https://github.com/ComputationalRadiationPhysics/picongpu.git
cd picongpu
git checkout e3cc934c31020ef19a387a9987fac1f5b7595742
cd docs
doxygen
du -hs xml/
# 132 MB
make html
# njom njom, RAM be gone
```
Are you working with an installed copy of your …?

```
sven:~/Desktop/picongpu/docs> make html
Running Sphinx v1.6.7
making output directory...
Exception occurred:
  File "/usr/local/lib/python3.6/site-packages/breathe/parser/__init__.py", line 3, in <module>
    from . import compound
  File "/usr/local/lib/python3.6/site-packages/breathe/parser/compound.py", line 6809
    , emsp=None, thinsp=None, zwnj=None, zwj=None, lrm=None, rlm=None, ndash=None, mdash=None, lsquo=None, rsquo=None, sbquo=None, ldquo=None, rdquo=None, bdquo=None, dagger=None, Dagger=None, permil=None, lsaquo=None, rsaquo=None, euro=None, tm=None, valueOf_=None, mixedclass_=None, content_=None):
    ^
SyntaxError: more than 255 arguments
The full traceback has been saved in /var/folders/wx/j7610hj10r3_h1xs4xr41bgw0000gn/T/sphinx-err-jai8fnd_.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
make: *** [html] Error 1
```

Note: I'm not really sure that going through and fixing this will remedy the memory consumption errors. Did you ever look into the cache stuff? I feel like that may be where the memory overhead is coming from. I'm not familiar with the implementation, but presumably … I was playing around with … Maybe you can just ignore it (and keep the original classes)? I think maybe I lost motivation to pursue this, but this is how you would solve it. I think there are only two or three classes this applies to, and it's mostly just because it's representing all of the possible markup options from Doxygen. The above error comes from something like this:

```python
class docTitleType(GeneratedsSuper):
    node_type = "doctitle"
    subclass = None
    superclass = None
    def __init__(self, ulink=None, bold=None, emphasis=None, MANY_OTHERS):
        self.original_tagname_ = None
        if ulink is None:
            self.ulink = []
        else:
            self.ulink = ulink
        if bold is None:
            self.bold = []
        else:
            self.bold = bold
        if emphasis is None:
            self.emphasis = []
        else:
            self.emphasis = emphasis
```

For both …

```python
def __init__(self, **kwargs):
    self.original_tagname_ = None
    # I'm 95% sure you can still do kwargs.get(...)
    self.ulink = kwargs.get("ulink", [])
    self.bold = kwargs.get("bold", [])
    self.emphasis = kwargs.get("emphasis", [])
```

In the sub-class, instead of

```python
class docTitleTypeSub(supermod.docTitleType):
    node_type = "doctitle"
    def __init__(self, ulink=None, bold=None, emphasis=None, MANY_OTHERS):
        arglist_ = (ulink, bold, emphasis, MANY_OTHERS)
        super(docTitleTypeSub, self).__init__(*arglist_)
```

you just forward the keyword arguments (note the `**` when forwarding, so the dict is unpacked back into keyword arguments):

```python
def __init__(self, **kwargs):
    super(docTitleTypeSub, self).__init__(**kwargs)
```
@ax3l (and anyone else interested) I gave a stab at fixing … @michaeljones do you happen to recall when you did …
readthedoc is killing the build. See breathe-doc/breathe#315.
Would using the Doxygen-generated sqlite database instead of parsing XML help speed up the generation?
It would probably be quicker than the current method. A proper, efficient XML parser would be at least roughly as fast as using a sqlite database, though.
@melvinvermeeren Our codebase has around 5000 files; as of now this takes around 6-7 hours to complete, so I'm looking for ideas to speed up the process. Note: https://github.com/mosra/m.css finishes in minutes, I should take a look to see if I can get some ideas from there.
Sorry to revive a very old thread, but I'm trying to debug some Breathe performance issues. My Sphinx build goes from 20 sec to 3 min when I add Breathe (and one page referencing a few dozen …).
There's quite a bit of discussion on replacing minidom, something about some generated code (which I'm not really following), and there was one comment suggesting that disabling the cache actually might help. @ax3l, @svenevs, @michaeljones and anyone else who might be following this thread: do you think 1-2 weeks of dedicated engineering time would be sufficient to solve this problem? Or is it even bigger than that? I have some person-hours I can donate to the project, but not for too long. Where would be the best place for them to start? (`parser/__init__.py` line 99 doesn't exist anymore; I didn't see anything obviously doing caching in there.) @vermeeren are you the current maintainer? I saw that you commented on #544. Maybe you know?
Brief response from phone: the snag is that the way Breathe represents its hierarchy internally was using something to take the Doxygen XML schema and turn it into Python classes (generateDS). For as close as I got, I don't think it's a great path forward. I think there are enough test cases in place to know if things fully break, meaning changing something like the backend parsing could be achievable. But I'm not sure how much the rest of Breathe needs to change if the generateDS graph goes away (all the compound super stuff). My vote at this point would be using lxml to parse, or BeautifulSoup + lxml, but BeautifulSoup may add noticeable overhead. FWIW, I heard a tale about lxml and opening too many files at once leading to slowdowns, so if implemented you'd want to be sure to close things as soon as you're done with them. Changing the XML parsing would be hugely valuable to this project, but I'm not sure how substantive of an effort this will be (current maintainers could give a better idea). If it gets started but not finished, I'm open to trying to slowly pick up the slack; this is something we've all desired for a long time, but it's a big cookie.
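The "close things as soon as you're done with them" caution can be sketched as follows, using the stdlib `xml.etree.ElementTree`, whose API `lxml.etree` largely mirrors (the function name and the idea of extracting only member names are illustrative, not Breathe code):

```python
import xml.etree.ElementTree as ET  # drop-in: `from lxml import etree as ET`

def load_member_names(path):
    """Parse one Doxygen compound file, closing the handle immediately."""
    with open(path, "rb") as f:  # explicit scope: file handles don't pile up
        root = ET.parse(f).getroot()
    # Extract only what the renderer needs rather than keeping the
    # whole tree alive across thousands of files.
    return [m.findtext("name") for m in root.iter("memberdef")]
```

The point is the `with` block: the file is closed before any further work happens, and the tree itself becomes collectable as soon as the caller is done with the returned list.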
@cheshirekow Although I am indeed the current (primary) maintainer since 2018-ish, there are many technical parts of Breathe that I haven't worked with in-depth, which includes the XML parser. Generally speaking I have an idea of what and where, but there are many specifics that I don't yet know much of. (Often this is enough to help with issues, contributions and questions, but for plenty of complex topics it isn't.) So I'm afraid I likely know less than the other participants in this thread and have no idea how much time it would take to implement the changes desired or what you could run into. @michaeljones probably knows most, but considering the years passed since he wrote the current implementation, it might be tricky to recall the specifics.
Yes, it has been a long time and the details are quite blurry to me. I've just re-read the thread but don't have anything specific to add. I realise it is all long overdue. I would be happy enough to attempt it if properly funded, though my preferences are not really for Python any more. When I think of tackling this, I dream of re-writing some of the parsing & data handling in Rust and exposing just enough to the Python layer for it to fill out the structures that Sphinx needs. That would be silly if there was a more drop-in alternative for minidom that would solve the issue more cleanly, but I'm not active enough in the Python ecosystem to understand that route. A step forward, perhaps, would be for people to provide example code bases to work with so there is something to profile and inspect, and then for anyone that attempts it to report data and findings back here. I don't really know what the memory tracing and performance tooling is like in Python, though.
Some time ago I looked into regenerating the parser with generateDS, both because the Doxygen XML has changed with newer versions and because generateDS seems to use lxml now. So I think that may be the easiest first step. So, assuming we would have development resources, my suggestion for a "roadmap" would be: …
Thanks for sharing those results @DuTogira - definitely interesting to see.
@michaeljones / @svenevs based on this, what's the best path forward? It seems we have multiple avenues, including:
Is either of these a feasible avenue forward that we could hope for in the near future?
I'm not sure anyone is in a good position to answer that. It would take some time to assess the current state of the code and figure out the best way forward. To my knowledge, no one active on the project has a full understanding of the code base, due to either having not worked on all of it or having forgotten what they once knew. There are a few of us who could be funded to work on it, but otherwise it is for others to pick up. Speaking for myself, I'm afraid that even getting into a position where I could advise on the best way forward would take time that I don't currently have the motivation for. I'm sorry that these statements are a bit of a bummer. I'm keen to see the project properly funded by the companies that benefit from its existence, but I'm not sure how to go about making that a reality, so I'm neither moving forward with funding nor with active work in the absence of funding. We do have an open collective if you (or your company, should there be one) are in a position to offer funding.
Can you tell me more about the funding model via this collective? Can we "buy a feature", or is this a general "support us if you use us" kind of thing? I can certainly request my employer to fund the performance improvements we are looking for, and I'm happy to push the buttons to get that started. We can also fund the development with (some) eng hours to do the work, though that would go best with some partnership to guide the developer on what needs to be done where.
Thank you for asking. We don't have a well-established model yet, but at least two of our maintainers are able to work under a "buy a feature" approach, I think. I'm not experienced with how those discussions go or how to price the work, but it would definitely be something that we'd be happy to explore. We'd also be happy to have "pay it if you use it" funding and put those funds towards specific feature work, but I can understand if that seems like less of a direct return for some business interests.
If anyone is still interested in lxml, I started a PR for it: #891. It's not complete, but unit tests are passing.
I am trying to set up an automated build of our docs on RTD, which is limited to 1 GB memory usage (and a 15 min build time).
The sphinx step …
is at this point a rather large bottleneck. Is it possible to somehow limit its memory consumption? (Is this even a Breathe issue or a Sphinx issue?)