Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix AnnexScalaInstance jar cache misses due to ordering and path comparison problems #64

Merged
merged 1 commit into from
Nov 12, 2024

Conversation

jjudd
Copy link

@jjudd jjudd commented Oct 31, 2024

The jar cache in AnnexScalaInstance is a ConcurrentHashMap that uses a string representation of the compiler classpath (a list of jars) as its key.

Problem is that list of jars which is used to create the key is not sorted, so you would sometimes get cache misses for the same classpath because the list had a different ordering.

At a large enough scale, this would cause the JVM to run out of CodeHeap and crash. Presumably this was because we'd create up to n! AnnexScalaInstances for every AnnexScalaInstance and then classload for every single one of those instances.

This change prevents those cache misses and thus prevents the worker from crashing.

This should also fix cache misses with path mapping by comparing the short path, so we don't include the bazel arch/config specific parts of the path.

@jjudd jjudd requested a review from jadenPete October 31, 2024 03:59
@jjudd jjudd changed the base branch from master to lucid-master October 31, 2024 03:59
* the path. This is analogous to Bazel's File.short_path function. Fair warning: this function is super good enough,
* but is likely not perfect.
*/
def bazelShortPath(path: Path, replaceExternal: Boolean = true): Path = {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put this in a function because I use this code in the path mapping PR for the deps checker and want to be able to reuse it in that PR

val mapBuilder = Map.newBuilder[Path, Path]
// We want to compare short paths to avoid the Bazel sandbox prefix and arch/config specific parts of the path
// We use a tree map because we want the keys sorted for comparison purposes
val mapBuilder = TreeMap.newBuilder[Path, Path]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just do:

val key = allJars
  .map { jar =>
    FileUtil.bazelShortPath(
      absoluteWorkDir.relativize(jar.toPath().toAbsolutePath().normalize()),
      replaceExternal = false,
    )
  }
  .toSet

?

Then, we wouldn't have to worry about sorting or order. Also, why do we construct a :-separated string instead of using a Set as the key?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use the map later on in this file as I need to copy the jar from its original location to its destination location.

I use the string for the keys because classpaths typically are jars concatenated with :.

I have to loop through the keys once anyway to make the string. I have to loop through the keys once to make the set.

Benefit of the set is that I don't have to insert things into a treemap, so that should be faster.

Overall I think using a map and the set as the cache key will be faster and safer. I'll make some updates.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even better I can make the key while making the map provided set has a builder. That would be a nice micro optimization.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, that would be a good optimization. Even if constructing a Set turns out to be slightly slower because we can't reuse the result of maping through, I think the safety benefits outweigh the performance detriments.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up using an unsorted map and a set. I use the set for the key and that seems to work great. I also tried using the .values Iterable as the key from the sorted map, but that resulted in cache misses. Didn't bother to dig in and find out why. I figure creating a set and an unsorted map is probably roughly the same cost as creating a sorted map.

*/
def bazelShortPath(path: Path, replaceExternal: Boolean = true): Path = {
val nameCount = path.getNameCount()
val pathString = path.toAbsolutePath().normalize().toString()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this ever used?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. It's a remnant of another approach. I'll remove.

}

// Handle difference between Bazel's external directory being referred to as .. in the short_path
if (replaceExternal && shortPath.startsWith("external")) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more correct way of doing this might be:

val shortPathNameCount = shortPath.getNameCount()

if (shortPathNameCount > 0 && shortPath.subpath(0, 1) == Paths.get("external")) {
  Paths.get("..", shortPath.subpath(1, shortPathNameCount))
} else {
  shortPath
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon first reading I like this.

I think shortPathNameCount > 0 && shortPath.subpath(0, 1) == Paths.get("external") is equivalent to shortPath.startsWith("external"), so I don't think I'll change that.

I'll play around with the Paths.get approach to avoid the string operations.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a technicality, but shortPath.startsWith("external") would match something like externalfoo/....

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, yes. I realized that when working on this yesterday and ended up using your suggestion.

@jjudd jjudd force-pushed the fix-instance-cache branch from 8c4b8ff to eabce1e Compare November 11, 2024 21:03
…arison problems

The jar cache in AnnexScalaInstance is a ConcurrentHashMap that uses a
string representation of the compiler classpath (a list of jars) as its
key.

Problem is that list of jars which is used to create the key is not
sorted, so you would sometimes get cache misses for the same classpath
because the list had a different ordering.

At a large enough scale, this would cause the JVM to run out of CodeHeap
and crash. Presumably this was because we'd create up to n!
AnnexScalaInstances for every AnnexScalaInstance and then classload for
every single one of those instances.

This change prevents those cache misses and thus prevents the worker
from crashing.

This should also fix cache misses with path mapping by comparing the
short path, so we don't include the bazel arch/config specific parts of
the path.
@jjudd jjudd force-pushed the fix-instance-cache branch from eabce1e to 5b052ec Compare November 11, 2024 21:30
@jjudd jjudd merged commit 34a88a1 into lucid-master Nov 12, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants