Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd-build: Fix wrong image.yaml after buildfetch #3618

Merged
merged 1 commit into from
Oct 26, 2023

Conversation

mtalexan
Copy link
Contributor

@mtalexan mtalexan commented Sep 14, 2023

Add option to cmdlib.import_ostree_commit to conditionally exclude the image.json extraction that overwrites.
Add option to import_ostree_commit wrapper.
Use import_ostree_commit exclusion of image.json option when importing after creating image.json.

/closes #3616


The buildextend-* options don't need to add the flag when they call/use import_ostree_commit_for_build (or equivalent, some are in Python instead of bash) because they only work if you run them after the build has completed. For the buildextend-* then, it works like an enforcement that the image.json/yaml being used is the one from the original build and not one that's been modified since.


An alternative I considered was just making a copy of the tmp/image.json before calling import_ostree_commit_for_build and then restoring it afterward, but that file is also copied into tmp/override/usr/lib/coreos-assembler/image.json and also gets overwritten. So it would be trying to sync bypass logic in cmd-build with the extract_json() function in cmdlib.py.


My assumption is that it was not intentional that running a build after a buildfetch would silently ignore your config's image.yaml and use the one from the fetched build.

But if that's not the case, then we instead need to add the logic for merging the imported image.json on top of the image-default.yaml when it's imported. Otherwise it's impossible to ever change what's mandatory in the image.yaml/json without breaking the ability to use buildfetch.

Basically, it's mandatory that the image.json being run for a build always have the minimum fields defined in the current tools' image-default.yaml, regardless of where it came from.

@openshift-ci
Copy link

openshift-ci bot commented Sep 14, 2023

Hi @mtalexan. Thanks for your PR.

I'm waiting for a coreos member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mtalexan mtalexan force-pushed the fix-buildfetch-image-overwrite branch from 059425f to a230095 Compare September 14, 2023 18:55
@mtalexan mtalexan changed the title cmd-build: Fix buildfetch image.yaml overwriting configs cmd-build: Fix buildfetch image.yaml overwriting config's Sep 14, 2023
@mtalexan mtalexan changed the title cmd-build: Fix buildfetch image.yaml overwriting config's cmd-build: Fix buildfetch image.yaml overwriting config's Sep 14, 2023
@mtalexan mtalexan changed the title cmd-build: Fix buildfetch image.yaml overwriting config's cmd-build: Fix wrong image.yaml after buildfetch Sep 14, 2023
@dustymabe
Copy link
Member

dustymabe commented Sep 14, 2023

For the buildextend-* then, it works like an enforcement that the image.json/yaml being used is the one from the original build and not one that's been modified since.

I think we intentionally embed the image.json/image.yaml into the OSTree commit so that it's canonical for the later stages. i.e. the later stages are not intended to pick it up from the config but rather from the OCI Archive/OSTree commit.

If you are running cosa build ostree again in your buildfetched directory and it's telling you there is no new build then there is probably a bug in our our "no new build" detection logic.

TL;DR if you change the image.yaml, you need to cosa build ostree again to pick up the change and that's intentional.

@mtalexan
Copy link
Contributor Author

mtalexan commented Sep 14, 2023

I'm actually just calling:
1. cosa init
2. cosa buildfetch ...an older build...
3. cosa fetch
4. cosa build

The problem is that the image.json extracted from buildfetch is overwriting the image.json that was created from the config's image.yaml.

In particular this causes a problem when #3483 went in, that now has a mandatory composefs: in the image.json. If I buildfetch from a build before that went in, and try to run a build with the current tools, it aborts during the build with a fatal error because the buildfetch'd image.json doesn't have composefs: in it.

EDIT: Misunderstood what was being said

@mtalexan
Copy link
Contributor Author

mtalexan commented Sep 14, 2023

EDIT: Moved into description.

My assumption was that it was not intentional that running a build after a buildfetch would silently ignore your config's image.yaml and use the one from the fetched build.

But if that's not the case, then we instead need to add the logic for merging the imported image.json on top of the image-default.yaml when it's imported. Otherwise it's impossible to ever change what's mandatory in the image.yaml/json without breaking the ability to use buildfetch.

Basically, it's mandatory that the image.json being run for a build always have the minimum fields defined in the current tools' image-default.yaml, regardless of where it came from.

@mtalexan mtalexan force-pushed the fix-buildfetch-image-overwrite branch from a230095 to 7c057ca Compare September 14, 2023 19:42
@mtalexan
Copy link
Contributor Author

mtalexan commented Sep 14, 2023

For the buildextend-* then, it works like an enforcement that the image.json/yaml being used is the one from the original build and not one that's been modified since.

I think we intentionally embed the image.json/image.yaml into the OSTree commit so that it's canonical for the later stages. i.e. the later stages are not intended to pick it up from the config but rather from the OCI Archive/OSTree commit.

If you are running cosa build ostree again in your buildfetched directory and it's telling you there is no new build then there is probably a bug in our our "no new build" detection logic.

TL;DR if you change the image.yaml, you need to cosa build ostree again to pick up the change and that's intentional.

Yeah, sorry that's exactly what I meant by my statement. It's really not a valid case to call build and then go change your image.yaml and then call buildextend-* on it. Functionally it ends up regenerating the image.json from the image.yaml, but we really want it to keep with the same image.json. The current implemenation for buildextend-* does exactly that by silently overwriting the image.json that was regenerated from the image.yaml with the one re-extracted from the prior commit created by the build.

I was just pointing out that I did think about that case, and the calls to import_ostree_commit (or equivalent) don't need to be modified there because it's correct behavior to overwrite the image.json in those cases.

@mtalexan
Copy link
Contributor Author

Jenkins is complaining about the python test coverage on cmdlib.py falling below the 74.74% threshold.

Looking at the tests in tests/test_cosalib_cmdlib.py however, they're all unit tests of stand-alone functions that have simple setup. There's nothing like trying to test import_ostree_commit, which is what I changed and has very complicated setup in order to test it.

@mtalexan
Copy link
Contributor Author

It seems there was some confusion about this specific bug, as posted over in the Issue. The problem is that a cosa buiild after a cosa buildfetch is getting the yaml overwritten by the one from the buildfetch.

The changes here are making sure that for only that case we don't copy the image.json from the prior build over the top of the generated one.

As mentioned, it's reasonable for buildextend-* to do this overwrite, and in fact ensures the expected usage. When calling buildextend-* it's assumed you have not made any additional code changes since you ran the build, so overwriting the image.json that was re-generated from the code during the buildextend-* with the one from when you ran the build is correct and expected.

@mtalexan
Copy link
Contributor Author

Bump. Not sure how I can do anything about the reduced Python coverage, and I think we're all on the same page about this fixing something that does need fixing.

@dustymabe
Copy link
Member

@mtalexan i'm still not sure 100% on the problem. mind grabbing me in matrix? #coreos:fedoraproject.org

you can use chat.fedoraproject.org if you don't already have a setup.

@mtalexan
Copy link
Contributor Author

I did more digging on the implementation and logic behind image.json being extracted from the ostree commit as part of import_ostree_commit and found out that it was part of the original design intent to reset the current workdir to a state where a build could occur without the config being present anymore. Obviously that causes issues when it's called after some files have been generated from the config. There's a full description back on the Issue.

The proposed solutions from that comment are:

  1. Remove the extraction of the image.json from the import_ostree_commit, changing the intent of the function to only what the name suggests.
  2. Put import_ostree_commit and associated logic in prepare_build
  3. Break up the prepare_build function so it does the setup and ostree repo creation separately from the image.json (etc) generation and call the import_ostree_commit between the two.
  4. Modify import_ostree_commit so we can tell it not to overwrite for cmd-build
  5. Make a copy of the image.json before doing the import_ostree_commit in cmd-build, and then restore it afterward the import is done.
  6. Modify import_ostree_commit so it can be told not to overwrite the image.json if it's already present.

I've already implemented Option 4 here, but I wonder if Option 6 might be better.

Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another approach would be to just stop calling extract_image_json() from import_ostree_commit() and let each caller own whether they want to call it themselves or not (and then cosa build just wouldn't call that). But that's a little error-prone and less DRY.

cosa build is special in that it's the only command that creates new builds. All the other commands in which the import code is run are buildextend commands and so may need image.json recreated. So special-casing just cosa build seems reasonable to me.

Some suggestions but overall LGTM. For the commit message, maybe let's explain it a bit more clearly. E.g.:

cmd-build: always use freshly created image.json

When importing the previous commit during a cosa build, we would clobber the image.json file we just derived from the source config. Fix this by telling import_ostree_commit() to skip JSON extraction in this path.

src/cmdlib.sh Outdated Show resolved Hide resolved
src/cmdlib.sh Outdated Show resolved Hide resolved
@mtalexan mtalexan force-pushed the fix-buildfetch-image-overwrite branch from 7c057ca to d51baf0 Compare October 18, 2023 22:46
@mtalexan
Copy link
Contributor Author

Sorry @jlebon, I was pulled onto other things for the past while. I've updated with your recommendations now, including the commit message fixup.

@mtalexan
Copy link
Contributor Author

The Jenkins error is just that Python unit test coverage dropped below the limit. I don't see anywhere that additional python unit testing can be added without a huge amount of additional work though, the function I touched would be a large undertaking to add unit tests for.

jlebon added a commit to mtalexan/coreos-assembler that referenced this pull request Oct 19, 2023
We haven't invested too much in trying to unit test our Python code. It's
mostly glue code stuff that's better tested with integration tests like
we do in CI and pipelines.

Let's only lower the threshold here for now but in the future we should
probably consider just dropping a threshold entirely.

Prompted by coreos#3618 which triggered this.
jlebon
jlebon previously approved these changes Oct 19, 2023
Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreed it's probably not worth it. Honestly, I'm not sure how much value that threshold actually has here. Anyway, for now I've just lowered the threshold some more.

Thanks for the patch!

@jlebon jlebon enabled auto-merge (rebase) October 19, 2023 14:02
dustymabe
dustymabe previously approved these changes Oct 19, 2023
Copy link
Member

@dustymabe dustymabe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mtalexan
Copy link
Contributor Author

The build error appears to be the same issue I'm seeing separately on main right now. The buildextend-qemu just arbitrarily exits suddenly with no errors or output. I thought it was just me, but it appears to be the current status on main.

@jlebon
Copy link
Member

jlebon commented Oct 25, 2023

Hmm weird. We did have a few successful merges since then. Possibly some bug in a dependency that was recently fixed? Anyway, I restarted CI here since the last run was a few days ago. If it fails again, we'll dig more into it.

auto-merge was automatically disabled October 25, 2023 23:31

Head branch was pushed to by a user without write access

@mtalexan mtalexan dismissed stale reviews from dustymabe and jlebon via 5a8a78a October 25, 2023 23:31
@mtalexan mtalexan force-pushed the fix-buildfetch-image-overwrite branch from a328f2c to 5a8a78a Compare October 25, 2023 23:31
@mtalexan
Copy link
Contributor Author

Never mind, I found the issue and it was in my code. The extra shift at the end of my new line in cmdlib.sh errors out if there's no second argument specified. The buildextend-qemu is the first time the function is called without the extra argument.

I've squashed the new fix, the pytest.ini test coverage reduction, and my original changes all into 1 commit.

When importing the previous commit during `cosa build`, we would clobber
`image.json` file we just derived with the one from the imported commit.
Fix this by telling `import_ostree_commit()` to skip JSON extraction on
this execution path, but leave it as-is for other execution paths.

Also drop required pytest coverage to 70% since this causes a drop below
the old threshold but nothing touched can really be tested.
@jlebon jlebon force-pushed the fix-buildfetch-image-overwrite branch from 5a8a78a to b846989 Compare October 26, 2023 13:42
@jlebon jlebon enabled auto-merge (rebase) October 26, 2023 13:43
@jlebon
Copy link
Member

jlebon commented Oct 26, 2023

/ok-to-test

@jlebon jlebon merged commit 138e2df into coreos:main Oct 26, 2023
@mtalexan mtalexan deleted the fix-buildfetch-image-overwrite branch October 30, 2023 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

buildfetch'd image.yaml always used instead of the one from the config
3 participants