Doesn't work with provided examples (examples/minimal) #82

Open
dmumpuu opened this issue Oct 25, 2023 · 1 comment

Comments


dmumpuu commented Oct 25, 2023

Steps to reproduce:

  1. Clone the repository and cd into terraform-aws-metaflow/examples/minimal
  2. Set locals.resource_prefix = "test-metaflow" in minimal_example.tf
  3. Run terraform apply and wait until it finishes
  4. Run aws apigateway get-api-key --api-key <api-key> --include-value | grep value and paste the result into the metaflow_profile.json file
  5. Import Metaflow configuration: metaflow configure import metaflow_profile.json
  6. Run python mftest.py run
mftest.py
from metaflow import FlowSpec, step, batch, resources


class MfTest(FlowSpec):
    @step
    def start(self):
        print("Started")
        self.next(self.run_batch)

    @batch
    @resources(cpu=1, memory=1_000)
    @step
    def run_batch(self):
        print("Hello from @batch")
        self.next(self.end)

    @step
    def end(self):
        print("Finished")


if __name__ == '__main__':
    MfTest()

The run never finishes; the AWS Batch job created in the Batch job queue stays in status RUNNABLE forever.
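For context (a common cause, not confirmed in this thread): Batch jobs stuck in RUNNABLE usually mean the compute environment cannot place them, e.g. the job requests more vCPU/memory than any instance provides, the subnets have no route to ECS/ECR, or an instance role is missing. The statusReason field in the DescribeJobs response is a good first clue. A small sketch of a helper that pulls it out of a describe-jobs response dict (as returned by boto3 or aws batch describe-jobs):

```python
def summarize_jobs(describe_jobs_response):
    """Summarize each job's status and statusReason from an AWS Batch
    DescribeJobs response dict; statusReason often explains RUNNABLE stalls."""
    return [
        (job["jobName"], job["status"], job.get("statusReason", "<none>"))
        for job in describe_jobs_response.get("jobs", [])
    ]
```

Feeding it the response for the stuck job shows whether Batch reports a placement reason at all; an empty statusReason usually points at capacity or networking in the compute environment rather than the job definition.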

Also tried with outerbounds/metaflow/aws version = 0.10.1 and terraform-aws-modules/vpc/aws version = 5.1.2.

Generated metaflow_profile.json
{
  "METAFLOW_BATCH_JOB_QUEUE": "arn:aws:batch:<region>:<account>:job-queue/test-metaflow-<random>",
  "METAFLOW_DATASTORE_SYSROOT_S3": "s3://test-metaflow-s3-<random>/metaflow",
  "METAFLOW_DATATOOLS_S3ROOT": "s3://test-metaflow-s3-<random>/data",
  "METAFLOW_DEFAULT_DATASTORE": "s3",
  "METAFLOW_DEFAULT_METADATA": "service",
  "METAFLOW_ECS_S3_ACCESS_IAM_ROLE": "arn:aws:iam::<account>:role/test-metaflow-batch_s3_task_role-<random>",
  "METAFLOW_EVENTS_SFN_ACCESS_IAM_ROLE": "",
  "METAFLOW_SERVICE_AUTH_KEY": <get-api-key-result>,
  "METAFLOW_SERVICE_INTERNAL_URL": "http://test-metaflow-nlb-<random>-<random>.elb.<region>.amazonaws.com/",
  "METAFLOW_SERVICE_URL": "https://<random>.execute-api.<region>.amazonaws.com/api/",
  "METAFLOW_SFN_DYNAMO_DB_TABLE": "",
  "METAFLOW_SFN_IAM_ROLE": "",
  "METAFLOW_SFN_STATE_MACHINE_PREFIX": "test-metaflow-<random>"
}
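One cheap sanity check before metaflow configure import (a sketch, not an official Metaflow tool): confirm the profile parses as JSON and contains the core keys. Note that an unquoted placeholder such as <get-api-key-result> will fail json.load, so the API key value must be pasted in as a quoted string. The key names below are taken from the profile above:

```python
import json

# Core keys from the generated profile above; extend as needed.
REQUIRED_KEYS = {
    "METAFLOW_BATCH_JOB_QUEUE",
    "METAFLOW_DATASTORE_SYSROOT_S3",
    "METAFLOW_DEFAULT_DATASTORE",
    "METAFLOW_DEFAULT_METADATA",
    "METAFLOW_SERVICE_AUTH_KEY",
    "METAFLOW_SERVICE_URL",
}


def missing_profile_keys(path):
    """Return the sorted list of required keys absent from the profile.
    Raises json.JSONDecodeError if the file is not valid JSON (for
    example, when a placeholder was left unquoted)."""
    with open(path) as f:
        profile = json.load(f)
    return sorted(REQUIRED_KEYS - profile.keys())
```

An empty return value means the profile at least has the expected shape; it says nothing about whether the URLs and ARNs in it are reachable.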

vfilter commented Mar 2, 2024

Edit: It's even worse: the whole thing now cannot be destroyed with the Terraform CLI, so I have to go in and delete the resources manually. 👎

The examples are outdated and don't work. I tried the eks_argo TF example, and it errored out with:

│ Warning: Argument is deprecated
│
│   with module.metaflow-datastore.aws_s3_bucket.this,
│   on .terraform/modules/metaflow-datastore/modules/datastore/s3.tf line 1, in resource "aws_s3_bucket" "this":
│    1: resource "aws_s3_bucket" "this" {
│
│ Use the aws_s3_bucket_server_side_encryption_configuration resource instead
│
│ (and one more similar warning elsewhere)
╵
╷
│ Error: creating Lambda Function (metaflowdb_migrateir9nhhph): operation error Lambda: CreateFunction, https response error StatusCode: 400, RequestID: XXX, InvalidParameterValueException: The runtime parameter of python3.7 is no longer supported for creating or updating AWS Lambda functions. We recommend you use the new runtime (python3.12) while creating or updating functions.
│
│   with module.metaflow-metadata-service.aws_lambda_function.db_migrate_lambda,
│   on .terraform/modules/metaflow-metadata-service/modules/metadata-service/lambda.tf line 115, in resource "aws_lambda_function" "db_migrate_lambda":
│  115: resource "aws_lambda_function" "db_migrate_lambda" {

Also, the EKS version in the examples is outdated and unsupported as of March 2024. I understand that these are simply first steps, but it's still a bit disappointing. We built our own Metaflow deployment with AWS CDK, but CDK has its own issues and AWS Step Functions is excruciatingly slow, so I was really hoping for speed improvements from Argo + k8s + TF, both for deploying the infrastructure and for deploying workflows.
