AWS Batch compute environment needs recreating after launch template change #15535

Closed
microbioticajon opened this issue Oct 7, 2020 · 9 comments · Fixed by #30438
Labels
service/batch Issues and PRs that pertain to the batch service.

Comments

microbioticajon commented Oct 7, 2020

Hi Guys,

I'm having trouble applying changes to my AWS Batch configuration. As part of my Batch cluster I use a custom Launch Template for the instances in the compute environment. However, when I make a change to the Launch Template, the Batch compute environment remains unmodified.

Terraform version

v0.13.3

  • provider registry.terraform.io/-/aws v3.8.0
  • provider registry.terraform.io/hashicorp/aws v3.8.0
  • provider registry.terraform.io/hashicorp/null v2.1.2

Affected Resource(s)

  • aws_batch_compute_environment
  • aws_launch_template

Expected Behaviour

According to the AWS Batch docs, if the Launch Template is updated with a new version, the entire compute environment needs to be destroyed and rebuilt:

https://docs.aws.amazon.com/batch/latest/userguide/launch-templates.html

Launch Template Support - AWS Batch
AWS Batch does not support updating a compute environment with a new launch template version. If you update your launch template, you must create a new compute environment with the new template for the changes to take effect.

Actual Behaviour

aws_batch_compute_environment remains unchanged

As a result, the only way to apply Launch Template changes is to manually destroy the compute environment before applying the plan, or to taint the resources through the command line.

I performed a quick search and I cannot find a way to trigger a forced re-creation of a resource from within the plan itself.
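For reference, the taint workaround looks something like this; the resource address is illustrative and should match whatever terraform state list reports for the compute environment:

terraform taint module.test_cluster.aws_batch_compute_environment.main
terraform apply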

Any fixes, help or work-arounds would be greatly appreciated.

Note:

My current launch template has resulted in an invalid compute environment which cannot be deleted even when tainted, which is why I need to update the launch template. See: #8549

github-actions bot added the needs-triage label Oct 7, 2020
ghost added the service/batch and service/ec2 labels Oct 7, 2020
ewbankkit removed the service/ec2 and needs-triage labels Oct 13, 2020
@ewbankkit
Contributor

@microbioticajon Thanks for raising this issue.
Could you please include a snippet of your Terraform configuration that includes the setting of launch_template.version?

@microbioticajon
Author

Hi @ewbankkit,

That was it! The compute environment was relying on the default launch template version, but Terraform was unable to detect the change because launch_template.version was not set. It looks like this is not directly a provider problem after all; apologies.

While obvious now that I think about it, a hint in the docs might help others who get stuck on the same issue.
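For anyone who hits the same thing, a minimal sketch of the fix; the instance type, IAM roles, and network references are placeholders:

resource "aws_batch_compute_environment" "main" {
  compute_environment_name = "dev-jon-tf-cluster"
  type                     = "MANAGED"
  service_role             = aws_iam_role.batch_service.arn   # assumed to exist elsewhere

  compute_resources {
    type               = "EC2"
    min_vcpus          = 0
    max_vcpus          = 16
    instance_type      = ["m5.large"]
    instance_role      = aws_iam_instance_profile.batch.arn   # assumed to exist elsewhere
    subnets            = var.subnet_ids
    security_group_ids = var.security_group_ids

    launch_template {
      launch_template_id = aws_launch_template.worker.id
      # Without an explicit version, Batch pins the default template version
      # and Terraform has nothing to diff when the template changes.
      version            = aws_launch_template.worker.latest_version
    }
  }
}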

I have reapplied the plan with a modified launch template, but unfortunately I'm now getting the following related error:

# module.test_cluster.data.aws_ebs_snapshot.static_refs will be read during apply
# module.test_cluster.aws_batch_compute_environment.main must be replaced
...
            launch_template {
                launch_template_id = "lt-0929..."
                version            = "1" -> (known after apply) # forces replacement
            }
...
# module.test_cluster.aws_batch_job_queue.general_purpose_queue will be updated in-place
# module.test_cluster.aws_launch_template.worker will be updated in-place
# module.test_cluster.aws_launch_template.worker_ebs_working_vol will be updated in-place

Plan: 1 to add, 3 to change, 1 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

module.test_cluster.data.aws_ebs_snapshot.static_refs: Refreshing state...
module.test_cluster.aws_batch_compute_environment.main: Destroying... [id=dev-jon-tf-cluster]
module.test_cluster.aws_launch_template.worker_ebs_working_vol: Modifying... [id=lt-0d61...]
module.test_cluster.aws_launch_template.worker_ebs_working_vol: Modifications complete after 0s [id=lt-0d61...]

Error: error deleting Batch Compute Environment (dev-jon-tf-cluster): : Cannot delete, found existing JobQueue relationship
	status code: 400, request id: 77a9x962-0dc8-4edc-88d3-effec3071d0d

I'm not sure how to get around this: it looks like Terraform now recognises that the compute environment needs to be replaced, but AWS won't allow the deletion while there are still job queues associated with it.

Many thanks,
Jon

vspinu commented Feb 19, 2021

I am seeing the same error even after manually destroying the Batch compute environments in the console. Any ideas on how to reset the (remote) state without re-initializing the project from scratch?

Error: error disabling Batch Compute Environment (dev-batch-cpu4-20210217103010): : arn:aws:batch:eu-central-1:1111111111111:compute-environment/xyz does not exist
	status code: 400, request id: aa1e66ad-e358-43c0-8ebb-8a3cfefc92e7

EDIT: fixed it with an explicit terraform state rm x y z
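For example, something along these lines, where the addresses are illustrative and should come from terraform state list:

terraform state list | grep aws_batch
terraform state rm module.test_cluster.aws_batch_compute_environment.main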

bhayden53 commented Mar 19, 2021

It looks like using launch_template.version = "$Latest" does not force compute environment re-creation, even when Terraform knows the launch template is being updated. Shouldn't it?

Otherwise I have to look up the current launch template version and increment it manually every time I deploy, just to get a new compute environment created correctly.

The way I understand it, any time Terraform makes a change to a launch template, it should just recreate any associated compute environments. Even if you use $Default or $Latest, Batch only takes a snapshot of them at the time of compute environment creation; it won't dynamically recognize changes to $Latest or $Default over time.

https://docs.aws.amazon.com/batch/latest/userguide/create-compute-environment.html

After the compute environment is created, the launch template version used will not be changed, even if the $Default or $Latest version for the launch template is updated. To use a new launch template version, create a new compute environment, add the new compute environment to the existing job queue, remove the old compute environment from the job queue, and delete the old compute environment.

@bhayden53

I think the only reliable solution in my situation is for my deployment to mark the compute environment as tainted on every run in order to force re-creation.
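On more recent Terraform releases, the -replace option can do the same thing in a single step instead of a separate taint; the address below is illustrative:

terraform apply -replace="module.test_cluster.aws_batch_compute_environment.main"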

@AaronNHart

@bhayden53 I have been struggling with this for about a (painful) year, but just noticed a small improvement over using $Latest. You can instead use aws_launch_template.this.latest_version, which simply replaces $Latest with the latest version number. This allows Terraform to recognize that the CE needs to be replaced. I honestly don't understand what AWS thinks $Latest (and $Default) actually do in compute environments currently. It seems completely broken to me.
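Concretely, something like this inside compute_resources (the template reference name is illustrative):

launch_template {
  launch_template_id = aws_launch_template.this.id
  # Batch resolves "$Latest" once when the CE is created and never re-reads it,
  # so reference the numeric latest_version instead; every template change then
  # shows up as a version diff that forces the CE to be replaced.
  version            = aws_launch_template.this.latest_version
}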

However, the issue that @vspinu raises I see often and do not understand the root cause of. It seems like a bug in the provider, specifically that it doesn't know the job queue must be deleted before the compute environment can be replaced. I suspect this is ultimately a limitation of the AWS API that the provider has to work around.

@ewbankkit if a small reproducible example would help I can provide one. It would be fantastic if we can find a solution.

@bhayden53

I honestly don't understand what AWS thinks $Latest (and $Default) actually do in compute environments currently. It seems completely broken to me.

AWS Support has told me that it intentionally takes a snapshot of the $Latest or $Default version at the time of CE creation. Definitely not what any reasonable user would expect it to do. I think I also got the "there is an issue in our internal tracker and I have added your voice to it" response as well.

Thanks for the other workaround.

@frosforever
Contributor

This looks like it might be related to #30438.

Since this issue was first opened, Batch behavior has changed and now allows updates to compute environment launch templates if "the service role is set to AWSServiceRoleForBatch (the default) and that the allocation strategy is BEST_FIT_PROGRESSIVE or SPOT_CAPACITY_OPTIMIZED. BEST_FIT isn't supported."

See https://docs.aws.amazon.com/batch/latest/userguide/updating-compute-environments.html
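A sketch of what that looks like on the Terraform side, trimmed to the two prerequisites; everything else in the compute environment stays as before:

resource "aws_batch_compute_environment" "main" {
  # service_role is omitted, so Batch uses the AWSServiceRoleForBatch
  # service-linked role, which is required for in-place updates.
  compute_environment_name = "dev-jon-tf-cluster"
  type                     = "MANAGED"

  compute_resources {
    # BEST_FIT does not support in-place updates.
    allocation_strategy = "BEST_FIT_PROGRESSIVE"
    # ... other compute_resources arguments unchanged ...
  }
}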

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Aug 25, 2023