[cdk-pipelines] Infinite-loop in self-mutating pipeline #32008

Open
1 task done
BwL1289 opened this issue Nov 4, 2024 · 12 comments
Labels
@aws-cdk/pipelines CDK Pipelines library bug This issue is a bug. investigating This issue is being investigated and/or work is in progress to resolve the issue. p3 potential-regression Marking this issue as a potential regression to be checked by team member


@BwL1289

BwL1289 commented Nov 4, 2024

Describe the bug

The pipeline's 'UpdatePipeline' stage succeeds and the pipeline restarts, which is expected to happen once when the infrastructure is updated. However, after it restarts, it updates itself again. This repeats indefinitely, and the pipeline never reaches the Assets stage. It appears to happen on CodePipeline V2 with SUPERSEDED execution mode and restart_execution_on_update=True.

I could not reproduce it on V2 with PARALLEL mode and restart_execution_on_update=False.

This appears to be caused by the Docker asset hashes changing while the rest of the template stays the same. Synth is not introducing nondeterminism, and the Dockerfiles and directories are exactly the same between runs, but the hashes keep changing.
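One way to check that the Docker build contexts really are identical between runs is to fingerprint each context directory in both CodeBuild runs and compare the results. The sketch below uses a hypothetical helper, `fingerprint_directory`, which is not part of CDK and is not the algorithm CDK uses for asset hashes; it only hashes relative paths and file contents in a stable order. If the fingerprints match while the CDK asset hash still changes, the difference presumably comes from other inputs that CDK folds into the hash (for example build arguments); the DockerImageAsset `invalidation` options are the place to check which inputs those are.

```python
import hashlib
from pathlib import Path


def fingerprint_directory(root, exclude=()):
    """Return a SHA-256 hex digest over relative paths and file contents under root.

    Hypothetical diagnostic helper; NOT the algorithm CDK uses for asset hashes.
    `exclude` holds relative paths (files or directories) to skip.
    """
    root = Path(root)
    digest = hashlib.sha256()
    # Sort so the result does not depend on the order the filesystem lists entries.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        if any(rel == e or rel.startswith(e.rstrip("/") + "/") for e in exclude):
            continue
        data = path.read_bytes()
        # Length-prefix each field so different path/content splits cannot collide.
        for field in (rel.encode("utf-8"), data):
            digest.update(len(field).to_bytes(8, "big"))
            digest.update(field)
    return digest.hexdigest()
```

For example, printing `fingerprint_directory("path/to/docker/context")` (a placeholder path) in two consecutive synth runs and comparing the output separates "the files changed" from "something else changed the hash".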

I initially reported this here and here.

This may be a regression of #9766 and the fix here.

#9080 is likely also related.

Regression Issue

  • Select this option if this issue appears to be a regression.

Last Known Working CDK Version

No response

Expected Behavior

The pipeline should self-mutate once and then continue to the Assets stage on the next cycle.

Current Behavior

The pipeline self-mutates continuously in an infinite loop.

Reproduction Steps

On CodePipeline V2, use restart_execution_on_update=True in SUPERSEDED mode.

Possible Solution

No response

Additional Information/Context

I reverted from SUPERSEDED to PARALLEL mode with restart_execution_on_update=False.

It's now able to progress to the Assets stage. I'm on version 2.164.1.

For context:

  1. For months I was using SUPERSEDED mode with restart_execution_on_update=True on CodePipeline V1.
  2. I recently migrated to V2, using PARALLEL mode and restart_execution_on_update=False.
    1. This worked
  3. Then I switched to SUPERSEDED mode with restart_execution_on_update=False.
    1. This did not work as the build would get cancelled (expected) after self-mutate, and I'd have to restart it manually, and then it would get cancelled again (unexpected)
  4. Then I switched to SUPERSEDED mode with restart_execution_on_update=True.
    1. This did not work as it would enter an endless loop between synth and self-mutate
  5. Then I switched back to PARALLEL mode and restart_execution_on_update=False.
    1. This worked, again

I would like to switch back to SUPERSEDED mode with restart_execution_on_update=True, because SUPERSEDED mode supports building 50 Docker assets in parallel while PARALLEL mode supports only 5, and I'd rather not have to worry about restarting the pipeline after infrastructure changes.

CDK CLI Version

2.164.1

Framework Version

No response

Node.js Version

v20.15.1

OS

MacOS

Language

Python

Language Version

No response

Other information

No response

@BwL1289 BwL1289 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Nov 4, 2024
@github-actions github-actions bot added @aws-cdk/pipelines CDK Pipelines library potential-regression Marking this issue as a potential regression to be checked by team member labels Nov 4, 2024
@pahud pahud self-assigned this Nov 4, 2024
@pahud
Contributor

pahud commented Nov 4, 2024

Hi, thank you for your report.

This may be a regression of #9766 and the fix here.

That fix was in 2020 and I believe it was a fix for CDK v1. Are we talking about a potential regression from a PR in 2020 and you are still affected even in 2.164.1?

Reproduction Steps
On codepipeline V2, use restart_execution_on_update=True in SUPERSEDED mode.

Are you able to provide a minimal code snippet that I can paste into my IDE to reproduce it? That would help us determine the best next step.

Thank you.

@pahud pahud added the p3 label Nov 4, 2024
@pahud pahud removed their assignment Nov 4, 2024
@pahud pahud removed the needs-triage This issue or PR still needs to be triaged. label Nov 4, 2024
@BwL1289
Author

BwL1289 commented Nov 5, 2024

@pahud I wasn't aware it was for CDK v1. In that case, no, it would not be a regression.

Minimal snippet is below.

from aws_cdk import aws_codebuild as codebuild
from aws_cdk import aws_codepipeline as codepipeline
from aws_cdk import pipelines
from constructs import Construct


class PipelineReproExample(Construct):
    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        build_spec = codebuild.BuildSpec.from_object(
            {
                "version": 0.2,
                "phases": {
                    "install": {
                        "commands": [
                            "echo Starting pipeline...",
                        ],
                    },
                },
            }
        )
    
        build_env = codebuild.BuildEnvironment(
            build_image=codebuild.LinuxArmBuildImage.from_code_build_image_id(
                "aws/codebuild/amazonlinux2-aarch64-standard:3.0"
            ),
            compute_type=codebuild.ComputeType.SMALL,
            privileged=True,  # Privileged mode is required to build Docker images in CodeBuild.
        )

        codebuild_defaults = pipelines.CodeBuildOptions(
            cache=codebuild.Cache.local(codebuild.LocalCacheMode.DOCKER_LAYER),
            build_environment=build_env,
            partial_build_spec=build_spec,
        )

        l2_codepipeline_pipeline = codepipeline.Pipeline(
            scope,
            "CodePipelineL2",
            pipeline_type=codepipeline.PipelineType.V2,
            cross_account_keys=True,
            # Note: PARALLEL mode only supports 5 docker assets built in parallel, while SUPERSEDED mode supports 50.
            execution_mode=codepipeline.ExecutionMode.SUPERSEDED,
            restart_execution_on_update=True,  # Restarts the pipeline when it's updated by self-mutation.
            reuse_cross_region_support_stacks=True,
        )

        synth_step = pipelines.CodeBuildStep(
            "SynthStep",
            commands=[
                "cdk synth <YOUR_STACK> -vvvvv --debug=true --trace --validation=true --long=true",
            ],
            cache=codebuild.Cache.local(codebuild.LocalCacheMode.DOCKER_LAYER),
        )

        pipeline = pipelines.CodePipeline(
            scope,
            "CodePipelineBasePipeline",
            code_pipeline=l2_codepipeline_pipeline,
            synth=synth_step,
            code_build_defaults=codebuild_defaults,
        )

@pahud
Contributor

pahud commented Nov 12, 2024

Minimal snippet is below. I haven't tested this snippet but it should suffice.

Hi,

We need a minimal code snippet that you have confirmed reproduces the error in your environment, so we can address it properly. Can you make sure the code you provided reproduces the behavior in your environment using CDK 2.164.1?

Also, I do not have dockerd-entrypoint.sh in my environment. What is the root cause of the failure? Is it failing at dockerd-entrypoint.sh?

@BwL1289
Copy link
Author

BwL1289 commented Nov 12, 2024

@pahud I've edited the snippet slightly to make this as easy as possible. I can confirm this reproduces the error. You will need to add steps that build custom Docker images. Giving you all of that code would be prohibitive, so I haven't included it; it's straightforward.

And, as I do not have dockerd-entrypoint.sh in my environment...

As I said in the ticket, it loops infinitely; no error is thrown. I've edited the snippet to remove that line.

@pahud
Contributor

pahud commented Nov 13, 2024

@BwL1289 Thank you. I will validate this today and circle back.

@pahud pahud self-assigned this Nov 13, 2024
@pahud pahud added investigating This issue is being investigated and/or work is in progress to resolve the issue. and removed potential-regression Marking this issue as a potential regression to be checked by team member labels Nov 13, 2024
@github-actions github-actions bot added the potential-regression Marking this issue as a potential regression to be checked by team member label Nov 13, 2024
@BwL1289
Author

BwL1289 commented Nov 13, 2024

@pahud let me know how else I can help. If this is somehow user error, I want to know.

@pahud
Contributor

pahud commented Nov 13, 2024

Hi @BwL1289

I was not able to reproduce the loop issue; the pipeline ran without problems. See the screenshot below:

[Screenshot of the pipeline execution]

I did update your code, though, as it could not run in my environment. My full code is below:

from aws_cdk import CfnOutput, RemovalPolicy
from aws_cdk import aws_codebuild as codebuild
from aws_cdk import aws_codepipeline as codepipeline
from aws_cdk import aws_s3 as s3
from aws_cdk import pipelines
from constructs import Construct


class PipelineReproExample(Construct):
    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # Versioned bucket that holds the zipped source bundle (source.zip).
        source_bucket = s3.Bucket(
            self,
            "SourceBucket",
            versioned=True,
            removal_policy=RemovalPolicy.DESTROY,
            auto_delete_objects=True,
        )

        # Output the bucket name so the source bundle can be uploaded to it.
        CfnOutput(self, "SourceBucketName", value=source_bucket.bucket_name)

        build_spec = codebuild.BuildSpec.from_object(
            {
                "version": 0.2,
                "phases": {
                    "install": {
                        "commands": [
                            "echo Starting pipeline...",
                        ],
                    },
                },
            }
        )

        build_env = codebuild.BuildEnvironment(
            build_image=codebuild.LinuxArmBuildImage.from_code_build_image_id(
                "aws/codebuild/amazonlinux2-aarch64-standard:3.0"
            ),
            compute_type=codebuild.ComputeType.SMALL,
            privileged=True,
        )

        codebuild_defaults = pipelines.CodeBuildOptions(
            cache=codebuild.Cache.local(codebuild.LocalCacheMode.DOCKER_LAYER),
            build_environment=build_env,
            partial_build_spec=build_spec,
        )

        l2_codepipeline_pipeline = codepipeline.Pipeline(
            self,
            "CodePipelineL2",
            pipeline_type=codepipeline.PipelineType.V2,
            cross_account_keys=True,
            # Note: PARALLEL mode only supports 5 docker assets built in parallel, while SUPERSEDED mode supports 50.
            execution_mode=codepipeline.ExecutionMode.SUPERSEDED,
            restart_execution_on_update=True,  # Restarts the pipeline when it's updated by self-mutation.
            reuse_cross_region_support_stacks=True,
        )

        synth_step = pipelines.CodeBuildStep(
            "SynthStep",
            commands=[
                "npm install -g aws-cdk",
                "pip install -r requirements.txt",
                "cdk synth",
            ],
            # The pipeline source is the source.zip object in the bucket above.
            input=pipelines.CodePipelineSource.s3(
                bucket=source_bucket,
                object_key="source.zip",
            ),
            cache=codebuild.Cache.local(codebuild.LocalCacheMode.DOCKER_LAYER),
        )

        pipeline = pipelines.CodePipeline(
            self,
            "CodePipelineBasePipeline",
            code_pipeline=l2_codepipeline_pipeline,
            synth=synth_step,
            code_build_defaults=codebuild_defaults,
        )

And in app.py:

#!/usr/bin/env python3
import os
import aws_cdk as cdk
from issue_triage_py.issue_triage_py_stack import PipelineReproExample

app = cdk.App()

stack = cdk.Stack(app, "cdk-python-stack", 
                env=cdk.Environment(account=os.getenv('CDK_DEFAULT_ACCOUNT'), region=os.getenv('CDK_DEFAULT_REGION')),)

PipelineReproExample(stack, "PipelineReproExample")
app.synth()

Initial cdk deploy:

$ cdk deploy
(will output the s3 bucket name)

Zip up the source bundle and upload it to that S3 bucket:

$ zip -r ../source.zip . -x ".venv/*" -x "cdk.out/*" -x ".git/*"
$ aws s3 cp ../source.zip s3://<BUCKET_NAME>/source.zip
upload: ../source.zip to s3://<BUCKET_NAME>/source.zip
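For environments without the zip CLI, the packaging step can be approximated with Python's standard-library zipfile module. This is a minimal sketch, assuming the same top-level exclusions as the -x patterns in the zip command; `build_source_bundle` is a hypothetical name. The archive should be written outside the project directory (as the ../source.zip path does) so the bundle never includes itself.

```python
import zipfile
from pathlib import Path

# Top-level entries to skip, mirroring -x ".venv/*" -x "cdk.out/*" -x ".git/*".
EXCLUDED_TOP_LEVEL = {".venv", "cdk.out", ".git"}


def build_source_bundle(project_dir, output_zip):
    """Zip every file under project_dir into output_zip, skipping EXCLUDED_TOP_LEVEL.

    Returns the archive member names that were written. output_zip should live
    outside project_dir so the archive is not added to itself.
    """
    project_dir = Path(project_dir)
    written = []
    with zipfile.ZipFile(output_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(project_dir.rglob("*")):
            rel = path.relative_to(project_dir)
            # Skip excluded top-level directories and anything that is not a file.
            if rel.parts[0] in EXCLUDED_TOP_LEVEL or not path.is_file():
                continue
            name = rel.as_posix()
            zf.write(path, arcname=name)
            written.append(name)
    return written
```

Calling `build_source_bundle(".", "../source.zip")` from the project root produces a bundle that can then be uploaded with the same aws s3 cp command.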

Click the Release change button in the CodePipeline console, or just wait for polling to trigger the pipeline.

The pipeline should go through with no error. No loop happens.

Please note:

  1. In the synth_step, CodeBuild has to synthesize the CDK cloud assembly. For CDK in Python, this means it needs to a) install the CDK CLI, b) run pip install, and c) run cdk synth to successfully synthesize the templates and assets. See my sample above for details.
  2. Typically, you would call add_stage() on your pipeline so that it deploys your application stage, but I didn't see that in your code snippet. As a result, the provided snippet only runs through the pipeline and synthesizes the templates and assets; no deployment happens afterward. Still, I'm pretty sure no loop would happen, per my screenshot.

Let me know if my provided code works for you.

@pahud pahud added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Nov 13, 2024

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

@github-actions github-actions bot added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Nov 15, 2024
@BwL1289
Author

BwL1289 commented Nov 16, 2024

I am investigating, thanks.

@github-actions github-actions bot removed closing-soon This issue will automatically close in 4 days unless further comments are made. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Nov 16, 2024
@BwL1289
Author

BwL1289 commented Nov 25, 2024

Update: I've switched to restart_execution_on_update=True (still using a V2 pipeline in PARALLEL mode), and now every UpdatePipeline step triggers a new, separate pipeline execution. This results in a different (separate) infinite loop.

Execution 1 (intentional - triggered by a commit):
Pipeline triggered -> UpdatePipeline -> Execution 2: Pipeline triggered -> UpdatePipeline

Execution 2 (unintentional - triggered by StartPipelineExecution):
Pipeline triggered -> UpdatePipeline -> Execution 3: Pipeline triggered -> UpdatePipeline

Execution 3 (unintentional - triggered by StartPipelineExecution):
Pipeline triggered -> UpdatePipeline -> Execution 4: Pipeline triggered -> UpdatePipeline

...

Either this is pure user error, there's a bug in CDK somewhere, or I'm simply missing something.
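The execution chain above can be reduced to a simplified model: each execution synthesizes the pipeline, and UpdatePipeline updates and restarts the pipeline whenever the synthesized asset hash differs from the one already deployed. The sketch below illustrates that assumption only (it does not call CodePipeline or CDK, and `simulate_self_mutation` is a hypothetical name). It shows why a deterministic hash converges after one self-mutation while a hash that changes on every synth never does.

```python
def simulate_self_mutation(synth_hash, max_executions=10):
    """Simplified model of the UpdatePipeline restart loop.

    synth_hash(n) returns the asset hash produced by the synth of execution n.
    Returns the 1-based execution that gets past UpdatePipeline, or None if
    every execution up to max_executions mutated the pipeline again.
    """
    deployed_hash = None  # Hash baked into the currently deployed pipeline.
    for execution in range(1, max_executions + 1):
        new_hash = synth_hash(execution)
        if new_hash == deployed_hash:
            return execution  # Template unchanged: continue to the Assets stage.
        deployed_hash = new_hash  # Pipeline updates itself and restarts.
    return None
```

With a stable hash, `simulate_self_mutation(lambda n: "abc")` returns 2 (one self-mutation, then progress); with a per-synth hash such as `lambda n: f"hash-{n}"` it returns None, which corresponds to the loop reported in this issue.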

Additionally, @pahud, your example won't (or at least shouldn't) trigger the behavior I am seeing because you're not building any docker assets in your pipeline, and therefore, the asset hashes won't change. See my comment above:

I can confirm this reproduces the error. You will need to add building custom docker images. Giving you all of that code would be prohibitive, so I haven't included it.

@BwL1289
Author

BwL1289 commented Nov 25, 2024

Another update:

I tore down the pipeline and redeployed it as V2 in SUPERSEDED mode with restart_execution_on_update=True. It still loops infinitely, as I originally reported when I opened this ticket. Again, this is caused by the Docker asset hashes changing on every UpdatePipeline step (see #9080 (comment)):

[18:23:50] <REDACTED>: checking if we can skip deploy
[18:23:50] StackDev: template has changed
[18:23:50] StackDev: deploying...
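To see which Docker assets change between two synths, the asset manifests of two cloud assembly directories can be compared. This sketch assumes the usual cdk.out layout, in which each stack has a `<StackName>.assets.json` file (nested stage assemblies sit in `assembly-*` subdirectories) with a top-level `dockerImages` object keyed by asset hash; both function names are hypothetical. The two directories could come from running `cdk synth --output <dir>` twice on the same commit.

```python
import json
from pathlib import Path


def docker_asset_hashes(cloud_assembly_dir):
    """Map each *.assets.json path (relative) to the set of dockerImages keys it lists.

    Assumes each manifest has a top-level "dockerImages" object keyed by asset hash.
    """
    root = Path(cloud_assembly_dir)
    result = {}
    for manifest in sorted(root.rglob("*.assets.json")):
        data = json.loads(manifest.read_text())
        result[manifest.relative_to(root).as_posix()] = set(data.get("dockerImages", {}))
    return result


def diff_docker_assets(old_dir, new_dir):
    """Return {manifest: (removed_hashes, added_hashes)} for manifests whose Docker assets differ."""
    old, new = docker_asset_hashes(old_dir), docker_asset_hashes(new_dir)
    changes = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name, set()), new.get(name, set())
        if before != after:
            changes[name] = (sorted(before - after), sorted(after - before))
    return changes
```

An empty result means the Docker asset hashes were stable between the two synths; a non-empty one names the manifests (and therefore the stacks) whose image hashes moved.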

Some more context: I am using a custom docker image that I run all steps in. I don't know if or how this could be related to this issue.

Having tried everything, I can now only use CDK pipelines in V2 with PARALLEL and use restart_execution_on_update=False.

To summarize:

  1. Using V2 in PARALLEL mode and restart_execution_on_update=False.
    1. This worked
  2. Using V2 in SUPERSEDED mode with restart_execution_on_update=False.
    1. This did not work: the build would get cancelled after self-mutating (this is expected), and I'd have to restart it manually, and then it would get cancelled again (this is unexpected, as it shouldn't have to mutate the pipeline again).
  3. Using V2 in SUPERSEDED mode with restart_execution_on_update=True.
    1. This did not work as it would enter an endless loop between synth and self-mutate because the docker asset hashes change on every synth.
  4. Using V2 in PARALLEL mode and restart_execution_on_update=True.
    1. Although this technically "works," as it allows my pipeline to advance to the Assets stage, a side effect is that every UpdatePipeline stage starts yet another pipeline execution, without end. I have to stop all the new executions manually.
  5. Using V2 in PARALLEL mode and restart_execution_on_update=False.
    1. This worked, again

Again, the root cause of all of this is that the Docker asset hashes are recalculated on every synth and, for reasons I don't understand, produce different values each time.

@BwL1289
Author

BwL1289 commented Dec 13, 2024

@pahud any update on this? At the very least, were you able to test when building with docker assets?
