Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Incremental build copy the entire distribution folder from S3 without cleaning up component zips that requires rebuild #455

Closed
peterzhuamazon opened this issue Jul 12, 2024 · 12 comments
Assignees
Labels
bug Something isn't working

Comments

@peterzhuamazon
Copy link
Member

peterzhuamazon commented Jul 12, 2024

In current 3.0.0, alerting succeed on building the zip, but failed on the next step of publishing zip to maven:
opensearch-project/alerting#1599

This means alerting build failed so new plugin zip will not copy to <distribution>/builds/opensearch/plugins, even if it build successfully after all.
https://build.ci.opensearch.org/blue/organizations/jenkins/distribution-build-opensearch/detail/distribution-build-opensearch/10015/pipeline/69

However, due to incremental build, even though alerting requires a rebuild, the entire distribution folder including older alerting plugin zips are being copied from S3.

Since new alerting zip did not override the old alerting zip, the old alerting zip was pushed again to S3 and treated as new alerting zip. And recorded into the build manifest.

This ends up creating a docker image with alerting installed, and that alerting zip is old, and cause jarhell:
https://github.com/opensearch-project/security-analytics/actions/runs/9899572936/job/27348744950?pr=1156#step:6:124


#8 1.202 -> Installing file:/tmp/opensearch-security-analytics-3.0.0.0-SNAPSHOT.zip
#8 1.203 -> Downloading file:/tmp/opensearch-security-analytics-3.0.0.0-SNAPSHOT.zip
#8 2.175 -> Failed installing file:/tmp/opensearch-security-analytics-3.0.0.0-SNAPSHOT.zip
#8 2.175 -> Rolling back file:/tmp/opensearch-security-analytics-3.0.0.0-SNAPSHOT.zip
#8 2.183 -> Rolled back file:/tmp/opensearch-security-analytics-3.0.0.0-SNAPSHOT.zip
#8 2.183 Exception in thread "main" java.lang.IllegalStateException: failed to load plugin opensearch-security-analytics due to jar hell
#8 2.183 	at org.opensearch.plugins.PluginsService.checkBundleJarHell(PluginsService.java:693)
#8 2.183 	at org.opensearch.plugins.InstallPluginCommand.jarHellCheck(InstallPluginCommand.java:861)
#8 2.183 	at org.opensearch.plugins.InstallPluginCommand.loadPluginInfo(InstallPluginCommand.java:829)
#8 2.183 	at org.opensearch.plugins.InstallPluginCommand.installPlugin(InstallPluginCommand.java:874)
#8 2.184 	at org.opensearch.plugins.InstallPluginCommand.execute(InstallPluginCommand.java:276)
#8 2.184 	at org.opensearch.plugins.InstallPluginCommand.execute(InstallPluginCommand.java:250)
#8 2.184 	at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104)
#8 2.184 	at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138)
#8 2.184 	at org.opensearch.cli.MultiCommand.execute(MultiCommand.java:104)
#8 2.184 	at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138)
#8 2.184 	at org.opensearch.cli.Command.main(Command.java:101)
#8 2.184 	at org.opensearch.plugins.PluginCli.main(PluginCli.java:66)
#8 2.184 Caused by: java.lang.IllegalStateException: jar hell!
#8 2.184 class: com.google.common.annotations.Beta
#8 2.184 jar1: /usr/share/opensearch/plugins/opensearch-job-scheduler/guava-32.1.3-jre.jar
#8 2.184 jar2: /usr/share/opensearch/plugins/opensearch-alerting/guava-32.0.1-jre.jar
#8 2.184 	at org.opensearch.bootstrap.JarHell.checkClass(JarHell.java:316)
#8 2.184 	at org.opensearch.bootstrap.JarHell.checkJarHell(JarHell.java:215)
#8 2.184 	at org.opensearch.plugins.PluginsService.checkBundleJarHell(PluginsService.java:675)
#8 2.184 	... 11 more
#8 ERROR: process "/bin/sh -c /usr/share/opensearch/bin/opensearch-plugin install --batch file:/tmp/opensearch-security-analytics-3.0.0.0-SNAPSHOT.zip" did not complete successfully: exit code: 1

Even though this issue has been fixed a month ago:
opensearch-project/alerting#1571

We should just remove the rebuilding plugin's zip after the initial S3 copy to avoid this invalid recording:
https://build.ci.opensearch.org/blue/rest/organizations/jenkins/pipelines/distribution-build-opensearch/runs/10015/nodes/69/steps/484/log/?start=0

Thanks.

@peterzhuamazon
Copy link
Member Author

Hi @zelinh,
Would you mind taking a look as he implement the incremental build.

Thanks

@peterzhuamazon peterzhuamazon added the bug Something isn't working label Jul 12, 2024
@zelinh
Copy link
Member

zelinh commented Jul 12, 2024

I think according to our build workflow by default the new build plugin would override the previous artifacts. This might be a windows platform specific issue.

@peterzhuamazon
Copy link
Member Author

This is not the windows issue but rather conflict with --continue-on-error.

@zelinh
Copy link
Member

zelinh commented Jul 12, 2024

Yes. Discussed with @peterzhuamazon offline and this is more like a conflict with our 'continue-on-error' argument in build workflow. When one non-essential plugin failed, the build workflow will continue so there is no new artifact to override the old one. Therefore assemble workflow will assemble on the old artifacts and cause issue.

@prudhvigodithi
Copy link
Member

In current 3.0.0, alerting succeed on building the zip, but failed on the next step of publishing zip to maven

@zelinh @peterzhuamazon With this the alerting plugin should be treated as failure right and I can see this in the logs Error building alerting, but how come the incremental build fetching this? Even though with continue-on-error, it should continue to build the next plugin in the manifest. Th incremental should not even pull the past failed alerting plugin right ?

@zelinh
Copy link
Member

zelinh commented Jul 17, 2024

In current 3.0.0, alerting succeed on building the zip, but failed on the next step of publishing zip to maven

@zelinh @peterzhuamazon With this the alerting plugin should be treated as failure right and I can see this in the logs Error building alerting, but how come the incremental build fetching this? Even though with continue-on-error, it should continue to build the next plugin in the manifest. Th incremental should not even pull the past failed alerting plugin right ?

The first step of running incremental build is to download the previous built artifacts into local; and then start building incrementally based on the previous manifest. So I believe that's where the previous successful built artifacts was introduced.

@prudhvigodithi
Copy link
Member

The first step of running incremental build is to download the previous built artifacts into local; and then start building incrementally based on the previous manifest. So I believe that's where the previous successful built artifacts was introduced.

Right, it should not even pull here in case for alerting right Zelin? Since the alerting build failed (even though the artifacts are pushed to s3) the incremental should not even pull the alerting right ?

@peterzhuamazon
Copy link
Member Author

The first step of running incremental build is to download the previous built artifacts into local; and then start building incrementally based on the previous manifest. So I believe that's where the previous successful built artifacts was introduced.

Right, it should not even pull here in case for alerting right Zelin? Since the alerting build failed (even though the artifacts are pushed to s3) the incremental should not even pull the alerting right ?

It pulled, because of s3 copy beforehand and the cache is not removed.
So the recorder just record whatever in the tar folder.
The old cache was mistakenly treated as new because override did not happen.

@prudhvigodithi
Copy link
Member

Got it thanks @peterzhuamazon and @zelinh , I might be missing something here, but if the past build is red (failed) for a component (here in this case for alerting), it should not even look for cache/recorder etc, it should blindly start the new build next time with incremental flag.

@peterzhuamazon
Copy link
Member Author

Got it thanks @peterzhuamazon and @zelinh , I might be missing something here, but if the past build is red (failed) for a component (here in this case for alerting), it should not even look for cache/recorder etc, it should blindly start the new build next time with incremental flag.

This is more like a side effect of jenkins, as we use s3 copy function to copy the entire tar folder from s3 to agent. We need to look into cleaning up plugins that are labeled to rebuild instead of blindly copy the entire tar folder over.

Thanks.

@zelinh
Copy link
Member

zelinh commented Aug 12, 2024

I'm able to reproduce this issue when enabling both -continue-on-error and incremental. I will look into this and see how we could make them compatible in this case.

@zelinh
Copy link
Member

zelinh commented Aug 13, 2024

After discussed offline with @peterzhuamazon, I believe this behavior is by design.
When new commits coming in, if only enable --incremental, failing plugin will stop the build from creating new bundle since the build workflow will fail.
If enable both --incremental and --continue-on-error, non-essential failing plugin will be ignored and continue on building other plugins. The new bundle will be created with previous successful built plugin(s) if applicable. We will also cut a build failure ticket to respective plugin repo and mark the distribution-build job Unstable. In this situation, the plugin with new commits will still be included in the bundle if it passed before but the previous built artifacts may not include their latest commits. Plugin team will need to actively fix any of the build errors based on the GitHub issue we cut to them.

Closing this issue for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: ✅ Done
Development

No branches or pull requests

4 participants