Calculating hashes vs using from maven cache #493

Open
prabhu opened this issue May 1, 2024 · 7 comments

Comments

@prabhu

prabhu commented May 1, 2024

I noticed that the hashes are calculated from the file on disk instead of reusing the hash information already present in the Maven cache.

component.setHashes(BomUtils.calculateHashes(artifact.getFile(), schemaVersion));

Further, I believe there might be a few problems with the hashing logic used.

try (InputStream fis = new BufferedInputStream(Files.newInputStream(file.toPath()), bufSize)) {
    final byte[] buf = new byte[bufSize];
    while (fis.available() > 0) {
        final int read = fis.read(buf);
        digests.stream().parallel().forEach(d -> d.update(buf, 0, read));
    }
}
  1. There is no encoding specified, so I am not sure whether this affects the data that gets read.
  2. Files.newInputStream appears to be non-locking according to the docs.
  3. The parallel() method is used, which is an intermediate operation according to the documentation.

Is it possible that incorrect data might get read if multiple build and file-copy operations happen in parallel?

I am not very knowledgeable in Java, so please correct me if I am wrong.
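For comparison, here is a minimal sketch (editorial, not the plugin's actual code) of a read loop that terminates on `read()` returning -1 rather than on `available()`, since `available()` only reports bytes readable without blocking and may legitimately return 0 before end-of-file. Class and method names here are hypothetical; only the JDK APIs are real.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

public class HashSketch {

    /** Feeds the whole file through every digest in one sequential pass. */
    static List<String> hashFile(Path file, List<String> algorithms)
            throws IOException, NoSuchAlgorithmException {
        List<MessageDigest> digests = new ArrayList<>();
        for (String algorithm : algorithms) {
            digests.add(MessageDigest.getInstance(algorithm));
        }
        byte[] buf = new byte[8192];
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int read;
            while ((read = in.read(buf)) != -1) { // -1 reliably signals EOF
                for (MessageDigest d : digests) {
                    d.update(buf, 0, read);
                }
            }
        }
        List<String> hex = new ArrayList<>();
        for (MessageDigest d : digests) {
            StringBuilder sb = new StringBuilder();
            for (byte b : d.digest()) {
                sb.append(String.format("%02x", b));
            }
            hex.add(sb.toString());
        }
        return hex;
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("hash-demo", ".bin");
        Files.write(tmp, "hello".getBytes());
        System.out.println(hashFile(tmp, List.of("MD5", "SHA-1")));
        Files.delete(tmp);
    }
}
```

Updating the digests sequentially also sidesteps the `parallel()` question entirely: for a buffer that is reused between iterations, a plain loop over two or three digests is both simpler and safe.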

@ppkarwasz
Contributor

@prabhu,

Maven Central mostly relies on MD5 and SHA-1 hashes. These are the only two algorithms required by OSSRH (cf. OSSRH Requirements), and in practice they are the only ones published to Maven Central. This is why the CycloneDX Maven plugin computes the hashes itself.

If you observe wrong results in a multi-module Maven build, these are most probably due to #410, not concurrency problems.

A long-standing problem in multi-module Maven builds is that it is really hard to execute some task at the end of the build, so usually the aggregate SBOM is generated based on the previous snapshots.

@VinodAnandan

I recommend against using the Maven Central/external hashes. Several Java frameworks modify the .class files, so to accurately identify these modified jars it is advisable to use locally computed hashes instead.

@prabhu
Author

prabhu commented May 2, 2024

@VinodAnandan, in that case the purl must be updated to reflect that such jars might differ from the ones published in Maven Central.

I have tried to handle this case using evidence by comparing the generated hash with the one in the maven cache. Could you kindly test with my PR and let me know how it looks?

#494

@VinodAnandan

VinodAnandan commented May 15, 2024

@prabhu Do you mean to capture that information similar to repository_url=localhost/.m2/? I believe the default repository for Maven is expected to be https://repo.maven.apache.org/maven2. However, in many enterprise setups the default repository will be an internal repository (Nexus or JFrog) used by the build tool. @stevespringett, @pombredanne please correct me if I misunderstood. Perhaps the tool should compare the local JAR hashes with those at the actual download location, and if they differ, update the purl's repository_url qualifier to repository_url=localhost/.m2/?
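For reference, a purl carrying a repository_url qualifier looks like this (the coordinates and host below are illustrative, not taken from any real project; per the purl spec, qualifier values are percent-encoded where needed):

```
pkg:maven/com.example/demo-lib@1.0.0?repository_url=repo.example.com%2Fmaven2
```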

@prabhu @stevespringett What are your thoughts on capturing modified change information in the pedigree part of the CycloneDX (https://cyclonedx.org/use-cases/#pedigree)? In a recent discussion with the Quarkus team (@aloubyansky), we (@nscuro and I) also discussed Quarkus modifying their JARs. I'm not sure if they have found a solution to capture these changes yet.

@aloubyansky

Yes, we have all the info, of course. It's a matter of properly manifesting it. I'll share some examples soon.

@aloubyansky

> A long-standing problem in multi-module Maven builds is that it is really hard to execute some task at the end of the build, so usually the aggregate SBOM is generated based on the previous snapshots.

This can be done by implementing https://maven.apache.org/ref/3.5.0/apidocs/org/apache/maven/AbstractMavenLifecycleParticipant.html and configuring the plugin as containing extensions.
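A hypothetical sketch of that approach, assuming the Maven core APIs are on the classpath; the class and hook names come from the linked Maven API, but wiring this into the CycloneDX plugin is illustrative, not the plugin's actual code:

```java
// Illustrative only: a lifecycle participant whose afterSessionEnd hook
// runs once, after every module in the reactor has been built, so an
// aggregate SBOM generated here would see the final build state.
import org.apache.maven.AbstractMavenLifecycleParticipant;
import org.apache.maven.execution.MavenSession;

public class EndOfBuildParticipant extends AbstractMavenLifecycleParticipant {
    @Override
    public void afterSessionEnd(MavenSession session) {
        // Generate the aggregate SBOM from the completed reactor here.
    }
}
```

Maven only loads lifecycle participants from plugins that are marked as extensions, e.g.:

```xml
<plugin>
  <groupId>org.cyclonedx</groupId>
  <artifactId>cyclonedx-maven-plugin</artifactId>
  <extensions>true</extensions>
</plugin>
```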

@hboutemy
Contributor

> There is no encoding specified, so not sure if this affects the data that gets read

A hash is always a hash of binary content: there is never any encoding involved in hashing.
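A minimal illustration of this point (editorial, using only standard JDK classes): MessageDigest consumes raw bytes, so a charset matters only if you convert text to bytes first, and different encodings of the same text produce different byte sequences and therefore different digests.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class EncodingDemo {
    // Hex-encodes the MD5 digest of a raw byte array; no charset is involved here.
    static String md5Hex(byte[] bytes) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (byte b : MessageDigest.getInstance("MD5").digest(bytes)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Same text, different byte encodings -> different digests.
        System.out.println(md5Hex("hello".getBytes(StandardCharsets.UTF_8)));
        // → 5d41402abc4b2a76b9719d911017c592
        System.out.println(md5Hex("hello".getBytes(StandardCharsets.UTF_16)));
    }
}
```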
