ProjectRegistryManager and caching #1032

laeubi · 2022-11-02T16:17:36Z

laeubi
Nov 2, 2022
Maintainer

I'm currently looking into the Maven lifecycle listeners and came across ProjectRegistryManager#readProjectsWithDependencies vs ProjectRegistryManager#readMavenProjectFacades. Both read the Maven model, but it seems that a read for the facade almost always also triggers a read of the project (with dependencies).
I would assume the first is actually there to be able to read the facade even if it has unresolved dependencies.

Apart from this, IMavenProjectFacade instances are retained forever, while the MavenProject cache itself is limited to 20 items by default and we just keep some basic data.

Then there is IMavenProjectFacade#getMavenProject(), which only returns the cached value, and IMavenProjectFacade#getMavenProject(IProgressMonitor), which force-loads the Maven project. And while I would assume that the first is very seldom used, it is actually used very often, so many parts seem to assume the cache is "hot".

This leads me to the question: What should we actually retain? Maybe it is even enough to only store the GAV, given that most accesses seem to assume the project is fetched anyway?

And should we use a fixed cache, or maybe better let the facade cache the MavenProject in the facade itself with a WeakReference or something letting java clean out everything if required?
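
To make the second option concrete, here is a minimal sketch of a value held through a reference the garbage collector may clear. It is not m2e code: `SoftCachedValue`, `peek()` and `get()` are hypothetical names, and the loader stands in for an expensive model read. It uses a SoftReference rather than the WeakReference mentioned above, since soft references are only cleared when memory gets tight, while a weakly referenced value can disappear at the next GC.

```java
import java.lang.ref.SoftReference;
import java.util.function.Supplier;

/** Hypothetical holder: keeps an expensive value only as long as the JVM can afford it. */
final class SoftCachedValue<T> {
    private final Supplier<T> loader; // stands in for an expensive reload
    private SoftReference<T> ref = new SoftReference<>(null);

    SoftCachedValue(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Cheap accessor: the cached value, or null if never loaded or already collected. */
    synchronized T peek() {
        return ref.get();
    }

    /** Expensive accessor: reloads the value when the reference is empty. */
    synchronized T get() {
        T value = ref.get();
        if (value == null) {
            value = loader.get();
            ref = new SoftReference<>(value);
        }
        return value;
    }
}
```

The two accessors mirror the cached-only vs force-load split described above. Note that with this design nothing bounds how many values are alive at once except the GC.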

I even did some quick profiling, and it seems the MavenProject itself is not very big but keeps a reference to the building request, which itself is much larger but seems unused after construction. Sadly, I don't know of a big test project one could use to verify this. Is there any?

laeubi · 2022-11-02T16:46:35Z

laeubi
Nov 2, 2022
Maintainer Author

It seems there is some code, IMaven.detachFromSession(MavenProject), that seems to do something in that area, but its comment is unclear because it talks about a non-existent method MavenProject.getParentProject(). Looking at the implementation of MavenProject, it might mean getParent, but I can't find any proof that this has anything to do with what the method does.

0 replies

mickaelistria · 2022-11-02T16:47:46Z

mickaelistria
Nov 2, 2022
Collaborator

FWIW, it's not at all a new topic and there were already some increments to improve that. A few years ago, m2e was incapable of processing projects with more than ~100 modules; and this got improved by smartening up the repository manager and import, to minimize the amount of MavenProject instantiated.

IMavenProjectFacade must be retained forever (at least until the pom is modified and then updated/recreated). There must be 1 MavenProjectFacade for each Maven project in the IDE (ie an IProject with m2e nature); we could even consider creating 1 MavenProjectFacade for each pom resource in the future.
The IMavenProjectFacade is a lightweight facade that is very small in memory and in CPU usage; and that contains the bare minimal information to allow cross-projects resolution, ordered build or other Maven actions in the IDE without having to load the whole MavenProject.

On the other hand, the MavenProject is a memory-expensive object, and it does not scale to keep too many of them; that's why they are managed in a cache. The current cache could probably be smarter and IMavenProjectFacade.getProject() could also be smarter (I'll come back to it later), but we'll always need to discard MavenProjects from memory to not burn RAM. For bigger projects (eg Apache Camel or Fuse) we're talking about several GB of data.

The MavenProjectFacade is created by loading the MavenProject once and keeping the interesting data to reuse often in the IDE, such as the GAV. Most other memory-consuming data (eg resolved deps) is dropped.
The MavenProjectFacades populate the "workspace" repository that is used for resolution of dependencies between IDE projects.

Then there is IMavenProjectFacade#getMavenProject(), which only returns the cached value, and IMavenProjectFacade#getMavenProject(IProgressMonitor), which force-loads the Maven project. And while I would assume that the first is very seldom used, it is actually used very often, so many parts seem to assume the cache is "hot".

They do not assume the cache is hot; I guess they intentionally avoid loading the project into the cache if there is no compelling reason to do so (eg the project is not likely to be reused soon). getMavenProject() is always to be preferred over getMavenProject(monitor), as it's a cheap operation vs a very expensive one. Each call to getMavenProject(monitor) implies a delay of several seconds, about the time it takes to run a Maven resolution for this module.

given that most accesses seem to assume the project is fetched anyway?

That's a wrong assumption. Some operations may accept the project not being available in memory and have logic to handle that.
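
As an illustration of such logic, a caller can use the full project only when it is already in memory and otherwise fall back to the lightweight data the facade keeps. This is only a sketch with stand-in types: `HeavyProject`, `ProjectFacade`, `cachedProject()`, `loadProject()` and `DisplayNames` are hypothetical and merely mimic the cheap vs expensive accessor pair discussed here.

```java
import java.util.Optional;

/** Stand-in for the heavy Maven model; not the real MavenProject class. */
final class HeavyProject {
    private final String name;

    HeavyProject(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }
}

/** Simplified stand-in for a facade with a cheap and an expensive accessor. */
interface ProjectFacade {
    String artifactId();          // lightweight data always kept by the facade
    HeavyProject cachedProject(); // cheap: may return null when not cached
    HeavyProject loadProject();   // expensive: forces a full load
}

final class DisplayNames {
    /** Uses the full project only if it is already cached; never forces a load. */
    static String displayName(ProjectFacade facade) {
        return Optional.ofNullable(facade.cachedProject())
                .map(HeavyProject::name)
                .orElse(facade.artifactId());
    }
}
```

A caller written this way tolerates a cold cache: it degrades to the artifactId instead of paying for a full load or failing on null.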

And should we use a fixed cache, or maybe better let the facade cache the MavenProject in the facade itself with a WeakReference or something letting java clean out everything if required?

A centralized cache is necessary because you need to 1. avoid loading projects too often (consumes much CPU/time) while 2. not keeping all of them in memory (too much RAM).
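
A minimal sketch of such a centralized, bounded cache, assuming a plain least-recently-used policy. This is not the actual m2e implementation: `LruProjectCache` and its methods are hypothetical names, and the loader function stands in for the expensive project read.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical bounded cache that evicts the least recently used entry. */
final class LruProjectCache<K, V> {
    private final int maxSize;
    private final Map<K, V> entries;

    LruProjectCache(int maxSize) {
        this.maxSize = maxSize;
        // accessOrder=true keeps entries ordered from least to most recently used
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > LruProjectCache.this.maxSize;
            }
        };
    }

    /** Returns the cached value, loading and caching it on a miss. */
    synchronized V get(K key, Function<K, V> loader) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key); // the expensive part, paid only on a miss
            entries.put(key, value);
        }
        return value;
    }

    /** Checks for a cached entry without changing the usage order. */
    synchronized boolean isCached(K key) {
        return entries.containsKey(key);
    }

    synchronized int cachedCount() {
        return entries.size();
    }
}
```

With a maximum of 20 this corresponds to the default limit mentioned at the top of the thread; both goals, fewer reloads and bounded memory, depend on having a single shared instance.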

The current approach isn't too bad, but can be improved:
For example, instead of loading the 20 latest requested projects, m2e could keep in memory the 20 "deepest" projects from the most distinct branches to get a good coverage of the forest. Then if you have a MavenProject in the cache, you can ask the cache for the parent MavenProject when calling MavenProjectFacade.getMavenProject() on the parent (assuming they are loaded with the same configuration such as profiles or other flags). It wouldn't be an easy task, but it would then allow keeping a cache of 20 "root" MavenProjects that could cover more than 100 MavenProjectFacades for the deepest projects.
Another possible optimization is to load related projects together, so that the parents are shared. Imagine you load 20 "sibling" MavenProjects of depth=6 (there are 6 parent projects in the IDE). If you load them separately, you have 20 * 6=120 MavenProject instances in memory reachable from the cache, but those actually cover only 20+6=26 MavenProjectFacades. But if you load all those projects together (again assuming they have the same loading options), you end up with the parent MavenProjects factorized and you get 26 MavenProjects in memory for the 26 modules; way less RAM and CPU used. So if the current state is that a cache of 20 actually holds ~100 MavenProjects in memory for 20 MavenProjectFacades, such an optimization would let us, for about the same CPU and RAM, load and retain ~100 MavenProjects covering about as many MavenProjectFacades, making it much more likely that a project is available without reloading.
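
The sibling arithmetic can be checked with a toy model. In this sketch `Node` is a hypothetical stand-in for a loaded project that only knows its parent; passing a shared map simulates loading the projects together, passing null simulates separate loads.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a loaded project with a reference to its parent. */
final class Node {
    final String id;
    final Node parent;

    Node(String id, Node parent) {
        this.id = id;
        this.parent = parent;
    }
}

final class ParentSharingDemo {
    /** Builds a module below a chain of `depth` parents; parents are reused when `shared` is given. */
    static Node load(String id, int depth, Map<String, Node> shared) {
        Node parent = null;
        for (int level = 1; level <= depth; level++) { // from the root parent downwards
            String parentId = "parent-" + level;
            Node above = parent;
            parent = shared == null
                    ? new Node(parentId, above)
                    : shared.computeIfAbsent(parentId, k -> new Node(k, above));
        }
        return new Node(id, parent);
    }

    /** Counts distinct instances (by identity) reachable from the given modules. */
    static int countInstances(List<Node> modules) {
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Node module : modules) {
            for (Node n = module; n != null; n = n.parent) {
                seen.add(n);
            }
        }
        return seen.size();
    }

    static int instancesFor(int siblings, int depth, boolean together) {
        Map<String, Node> shared = together ? new HashMap<>() : null;
        List<Node> modules = new ArrayList<>();
        for (int i = 0; i < siblings; i++) {
            modules.add(load("module-" + i, depth, shared));
        }
        return countInstances(modules);
    }
}
```

In this model `instancesFor(20, 6, false)` counts 140 instances (the 120 parent copies plus the 20 modules themselves), while `instancesFor(20, 6, true)` counts 26.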

I even did some quick profiling, and it seems the MavenProject itself is not very big but keeps a reference to the building request, which itself is much larger but seems unused after construction.

It is big. If you profile a deep module of Apache Camel, the MavenProject instance for this module retains almost 2MB.

2 replies

laeubi Nov 2, 2022
Maintainer Author

That's a wrong assumption. Some operations may accept the project not being available in memory and have logic to handle that.

If I search for references, there are several that never check for null, which leads me to assume that either something is wrong or the project is loaded anyway...

It is big. If you profile a deep module of Apache Camel, the MavenProject instance for this module retains almost 2MB.

Well, it is actually not the MavenProject but the ProjectBuildingRequest... if I clear this instance I get from 2 MB down to a few kB... I'll prepare a patch to see if this breaks anything.

mickaelistria Nov 2, 2022
Collaborator

You may want to read https://bugs.eclipse.org/bugs/show_bug.cgi?id=515668 which discussed the topic of the project sizes.

Well, it is actually not the MavenProject but the ProjectBuildingRequest... if I clear this instance I get from 2 MB down to a few kB... I'll prepare a patch to see if this breaks anything.

Please really try with deep projects and complex poms that duplicate things like the license at every stage; you'll see that a MavenProject can be a big object.
Looking at my IDE, the LSP4E parent pom for instance costs ~100kB, and it's a simple pom; it's 1/3 for the classRealm, 1/3 for the model and 1/3 for the originatedModel. There is not that much metadata in it; imagine one with a long license text or other stuff we do not care about in the IDE. For the org.eclipse.lsp4e children, the retained size is ~300kB, repeating ~100kB from the parent project....

So be careful with a patch. This is a tricky area where looking at a simple case quickly gives a false impression of performance (we load fewer projects, hurray!), while the same code on a more serious case can hit the GC limit and cripple the RAM and thus the CPU, which will spend most of its time doing garbage collection.

cpfeiffer · 2022-11-03T18:31:34Z

cpfeiffer
Nov 3, 2022

This is spot on! We had to add some hacks^Woptimizations to our fork of m2e so that it performs well enough with many projects in the workspace. In particular, we make use of the maven-tiles extension, which creates "virtual parent projects" under the hood for mixin-style reuse of Maven configuration snippets. So we have not only many projects in the workspace, but also a deep (virtual) parent hierarchy, and ran into both

memory problems
as well as too much rebuilding

FWIW, our optimizations are in this branch: https://github.com/GEBIT/m2e-core/commits/1.13.0-GEBIT
e.g.

deduplication: GEBIT/m2e-core@9bffe6d
improved refresh and build order to avoid rebuilding:

They may not be useful as is, and might make you blind if you look at them, so be careful 😉

3 replies

laeubi Nov 3, 2022
Maintainer Author

@cpfeiffer it's already a nightmare, but I'm trying to get it fixed. Also, the rebuilding is really wrecked: after half a day of debugging performance hotspots I'm now stuck in an endless build cycle, but I first have to understand what the real intent is behind all this phase 1 and phase 2 stuff... so if you already have optimizations you want to contribute, or can even just explain what could be optimized and how, that would be welcome.

My current idea is to just reuse the Maven GraphBuilder, as that is how Maven handles this, and then just refresh the projects from the roots to the leaves...

mickaelistria Nov 4, 2022
Collaborator

I think the idea is this: you have 2 modules in your IDE that are not related so far and not part of the same "tree" (so they wouldn't be built together), and one of them is updated in such a way that it starts being related to the other (eg the artifactId changes so it becomes a dependency of the other). For such a case, I believe "phase 2" is expected to detect those new relationships, which are harder to detect in "phase 1" because the projects are not all up-to-date and you may miss their current dependency state. IIRC, there is good coverage of this code and some corner cases in tests, so it's relatively safe to change it. That doesn't make it easy though, just less risky.

laeubi Nov 4, 2022
Maintainer Author

I'll give it a try... going by your explanation, phase 2 is currently triggered way too often, e.g. even if nothing in the poms actually changes.

laeubi · 2022-11-03T19:56:26Z

laeubi
Nov 3, 2022
Maintainer Author

I implemented a really "dumb" deduplication for parent projects that are already loaded, and it performed very well:

ProjectRegistryManager - There are 574 unique projects in the cache (according to GAV) and a total of 575
ProjectRegistryManager - Projects read without caching: 1

So it currently has only one cache miss and reduced the number of cached projects from 3500 to 575 in the Camel example!
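
For readers who want the general shape of a GAV-keyed deduplication, here is a rough sketch. It is not the patch described above: `Gav`, `GavDeduplicator` and `intern()` are hypothetical names, and keying on the GAV alone ignores the caveat raised earlier in the thread that projects must be loaded with the same configuration (profiles or other flags) before one instance can stand in for another.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical value type for groupId:artifactId:version coordinates. */
final class Gav {
    final String groupId;
    final String artifactId;
    final String version;

    Gav(String groupId, String artifactId, String version) {
        this.groupId = groupId;
        this.artifactId = artifactId;
        this.version = version;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Gav)) {
            return false;
        }
        Gav other = (Gav) o;
        return groupId.equals(other.groupId)
                && artifactId.equals(other.artifactId)
                && version.equals(other.version);
    }

    @Override
    public int hashCode() {
        return Objects.hash(groupId, artifactId, version);
    }
}

/** Keeps one canonical instance per GAV and counts how often it was asked. */
final class GavDeduplicator<P> {
    private final Map<Gav, P> canonical = new HashMap<>();
    private int requests;

    /** Returns the instance already registered for this GAV, else registers the candidate. */
    synchronized P intern(Gav gav, P candidate) {
        requests++;
        P existing = canonical.putIfAbsent(gav, candidate);
        return existing != null ? existing : candidate;
    }

    synchronized int uniqueCount() {
        return canonical.size();
    }

    synchronized int requestCount() {
        return requests;
    }
}
```

Comparing `uniqueCount()` with `requestCount()` gives a unique-versus-total figure similar in spirit to the log output above.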

0 replies
