From 2066e0f0345e87e7fc3e51374dec1abf9983823c Mon Sep 17 00:00:00 2001
From: Ben Galewsky
Date: Wed, 12 Jun 2024 14:51:38 -0500
Subject: [PATCH 1/5] Add instructions on deploying MDF

---
 README.md | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/README.md b/README.md
index 41e6a14..eff6fb2 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,91 @@
 # MDF Connect
 The Materials Data Facility Connect service is the ETL flow to deeply index datasets into MDF Search. It is not intended to be run by end-users. To submit data to the MDF, visit the [Materials Data Facility](https://materialsdatafacility.org).
+# Architecture
+The MDF Connect service is a serverless REST service that is deployed on AWS.
+It consists of an AWS API Gateway that uses a Lambda function to authenticate
+requests against Globus Auth. If a request is authorized, the endpoint triggers
+an AWS Lambda function. Each endpoint is implemented as a Lambda function
+contained in a Python file in the [aws/](aws/) directory. The Lambda functions
+are deployed via GitHub Actions as described in a later section.
+
+The API endpoints are:
+* [POST /submit](aws/submit.py): Submits a dataset to the MDF Connect service. This triggers a Globus Automate flow
+* [GET /status](aws/status.py): Returns the status of a dataset submission
+* [POST /submissions](aws/submissions.py): Forms a query and returns a list of submissions
+
+# Globus Automate Flow
+The Globus Automate flow is a series of steps triggered by the POST
+/submit endpoint. The flow is defined using a Python DSL that can be found
+in [automate/minimus_mdf_flow.py](automate/minimus_mdf_flow.py). At a high
+level, the flow:
+1. Notifies the admin that a dataset has been submitted
+2. Checks whether the data files have been updated or if this is a metadata-only submission
+3. If there is a dataset, starts a Globus transfer
+4. Once the transfer is complete, may trigger a curation step if the organization is configured to do so
+5. Mints a DOI if the organization is configured to do so
+6. Indexes the dataset in MDF Search
+7. Notifies the user that the submission is complete
+
+
+# Development Workflow
+Changes should be made in a feature branch based off of the dev branch. Create
+a PR and have a colleague review your changes. Once the PR is approved, merge it
+into the dev branch. The dev branch is automatically deployed to the dev
+environment. Once the changes have been tested in the dev environment, create a
+PR from dev to main. Once that PR is approved, merge it into main. The main
+branch is automatically deployed to the prod environment.
+
+# Deployment
+The MDF Connect service is deployed on AWS into development and production
+environments. The Automate flow is deployed to the Globus Automate service via
+a second GitHub Action.
+
+## Deploy the Automate Flow
+Changes to the Automate flow are deployed via a GitHub Action that is triggered
+by publishing a new GitHub release. If the release is marked as a pre-release it
+will be deployed to the dev environment; otherwise it will be deployed to the
+prod environment.
+
+The flow IDs for dev and prod are stored in
+[automate/mdf_dev_flow_info.json](automate/mdf_dev_flow_info.json) and
+[automate/mdf_prod_flow_info.json](automate/mdf_prod_flow_info.json)
+respectively. The flow ID is stored in the `flow_id` key.
+
+### Deploy a Dev Release of the Flow
+1. Merge your changes into the `dev` branch
+2. 
On the GitHub website, click on the _Release_ link on the repo home page. +3. Click on the _Draft a new release_ button +4. Fill in the tag version as `X.Y.Z-alpha.1` where X.Y.Z is the version number. You can use subsequent alpha tags if you need to make further changes. +5. Fill in the release title and description +6. Select `dev` as the Target branch +7. Check the _Set as a pre-release_ checkbox +8. Click the _Publish release_ button + +### Deploy a Prod Release of the Flow +1. Merge your changes into the `main` branch +2. On the GitHub website, click on the _Release_ link on the repo home page. +3. Click on the _Draft a new release_ button +4. Fill in the tag version as `X.Y.Z` where X.Y.Z is the version number. +5. Fill in the release title and description +6. Select `main` as the Target branch +7. Check the _Set as the latest release_ checkbox +8. Click the _Publish release_ button + +You can verify deployment of the flows in the +[Globus Automate Console](https://app.globus.org/flows/library). + +## Deploy the MDF Connect Service +The MDF Connect service is deployed via a GitHub action. The action is triggered +by a push to the dev or main branch. The action will deploy the service to the +dev or prod environment respectively. + +## Updating Schemas +Schemas and the MDF organization database are managed in the automate branch +of the [Data Schemas Repo](https://github.com/materials-data-facility/data-schemas/tree/automate). + +The schema is deployed into the docker images used to serve up the lambda +functions. # Running Tests To run the tests first make sure that you are running python 3.7.10. Then install the dependencies: From 028dbdf4c50836f69c05c8e44994149ff2499ad5 Mon Sep 17 00:00:00 2001 From: Owen Price Skelly <21372141+OwenPriceSkelly@users.noreply.github.com> Date: Wed, 17 Jul 2024 09:41:27 -0500 Subject: [PATCH 2/5] rename domain --- aws/tests/test_automate_manager.py | 16 ++++++++-------- infra/mdf/dev/variables.tf | 2 +- infra/mdf/prod/variables.tf | 2 +- 3 files changed, 10 insertions(+), 10 deletions(-) diff --git a/aws/tests/test_automate_manager.py b/aws/tests/test_automate_manager.py index e60f161..af83e2c 100644 --- a/aws/tests/test_automate_manager.py +++ b/aws/tests/test_automate_manager.py @@ -76,7 +76,7 @@ def set_environ(self): @mock.patch('globus_automate_flow.GlobusAutomateFlow', autospec=True) def test_create_transfer_items(self, _, secrets, organization, set_environ): - os.environ['PORTAL_URL'] = "https://acdc.alcf.anl.gov/mdf/detail/" + os.environ['PORTAL_URL'] = "https://materialsdatafacility.org/detail/" manager = AutomateManager(secrets, is_test=False) data_sources = [ @@ -103,7 +103,7 @@ def test_create_transfer_items(self, _, secrets, organization, set_environ): @mock.patch('globus_automate_flow.GlobusAutomateFlow', autospec=True) def test_create_transfer_items_from_origin(self, _, secrets, organization): - os.environ['PORTAL_URL'] = "https://acdc.alcf.anl.gov/mdf/detail/" + os.environ['PORTAL_URL'] = "https://materialsdatafacility.org/detail/" manager = AutomateManager(secrets, is_test=False) data_sources = [ @@ -126,7 +126,7 @@ def test_create_transfer_items_from_origin(self, _, secrets, organization): @mock.patch('globus_automate_flow.GlobusAutomateFlow', autospec=True) def test_create_transfer_items_from_google_drive(self, _, secrets, organization): - os.environ['PORTAL_URL'] = "https://acdc.alcf.anl.gov/mdf/detail/" + os.environ['PORTAL_URL'] = "https://materialsdatafacility.org/detail/" os.environ['GDRIVE_EP'] = 
"f00dfd6c-edf4-4c8b-a4b1-be6ad92a4fbb" os.environ['GDRIVE_ROOT'] = "/Shared With Me" manager = AutomateManager(secrets, is_test=False) @@ -151,7 +151,7 @@ def test_create_transfer_items_from_google_drive(self, _, secrets, organization) @mock.patch('globus_automate_flow.GlobusAutomateFlow', autospec=True) def test_create_transfer_items_test_submit(self, _, secrets, organization, set_environ): - os.environ['PORTAL_URL'] = "https://acdc.alcf.anl.gov/mdf/detail/" + os.environ['PORTAL_URL'] = "https://materialsdatafacility.org/detail/" manager = AutomateManager(secrets, is_test=True) data_sources = [ @@ -177,7 +177,7 @@ def test_create_transfer_items_test_submit(self, _, secrets, organization, set_e def test_update_metadata_only(self, mock_automate, secrets, organization, mocker, mdf_rec): mock_flow = mocker.Mock() mock_automate.from_existing_flow = mocker.Mock(return_value=mock_flow) - os.environ['PORTAL_URL'] = "https://acdc.alcf.anl.gov/mdf/detail/" + os.environ['PORTAL_URL'] = "https://materialsdatafacility.org/detail/" manager = AutomateManager(secrets, is_test=False) data_sources = [ @@ -201,7 +201,7 @@ def test_update_metadata_only(self, mock_automate, secrets, organization, mocker def test_mint_doi(self, mock_automate, secrets, organization_mint_doi, mocker, mdf_rec, set_environ): mock_flow = mocker.Mock() mock_automate.from_existing_flow = mocker.Mock(return_value=mock_flow) - os.environ['PORTAL_URL'] = "https://acdc.alcf.anl.gov/mdf/detail/" + os.environ['PORTAL_URL'] = "https://materialsdatafacility.org/detail/" manager = AutomateManager(secrets, is_test=False) assert manager.datacite_username == "datacite_prod_usrname_1234" assert manager.datacite_password == "datacite_prod_passwrd_1234" @@ -232,7 +232,7 @@ def test_mint_doi(self, mock_automate, secrets, organization_mint_doi, mocker, m def test_mdf_portal_link(self, mock_automate, secrets, organization_mint_doi, mocker, mdf_rec, set_environ): mock_flow = mocker.Mock() mock_automate.from_existing_flow = mocker.Mock(return_value=mock_flow) - os.environ['PORTAL_URL'] = "https://acdc.alcf.anl.gov/mdf/detail/" + os.environ['PORTAL_URL'] = "https://materialsdatafacility.org/detail/" manager = AutomateManager(secrets, is_test=True) data_sources = [ @@ -249,5 +249,5 @@ def test_mdf_portal_link(self, mock_automate, secrets, organization_mint_doi, mo update_metadata_only=False) mock_flow.run_flow.assert_called() - assert(mock_flow.run_flow.call_args[0][0]['mdf_portal_link'] == "https://acdc.alcf.anl.gov/mdf/detail/123-456-7890-1.0.1") + assert(mock_flow.run_flow.call_args[0][0]['mdf_portal_link'] == "https://materialsdatafacility.org/detail/123-456-7890-1.0.1") diff --git a/infra/mdf/dev/variables.tf b/infra/mdf/dev/variables.tf index 91082f0..0595804 100644 --- a/infra/mdf/dev/variables.tf +++ b/infra/mdf/dev/variables.tf @@ -29,7 +29,7 @@ variable "env_vars" { GDRIVE_ROOT="/Shared With Me" MANAGE_FLOWS_SCOPE="https://auth.globus.org/scopes/eec9b274-0c81-4334-bdc2-54e90e689b9a/manage_flows" MONITOR_BY_GROUP="urn:globus:groups:id:5fc63928-3752-11e8-9c6f-0e00fd09bf20" - PORTAL_URL="https://acdc.alcf.anl.gov/mdf/detail/" + PORTAL_URL="https://materialsdatafacility.org/detail/" RUN_AS_SCOPE="0c7ee169-cefc-4a23-81e1-dc323307c863" SEARCH_INDEX_UUID="ab71134d-0b36-473d-aa7e-7b19b2124c88" TEST_DATA_DESTINATION="globus://f10a69a9-338c-4e5b-baa1-0dc92359ab47/mdf_testing/" diff --git a/infra/mdf/prod/variables.tf b/infra/mdf/prod/variables.tf index 7e6a8b0..bb81263 100644 --- a/infra/mdf/prod/variables.tf +++ b/infra/mdf/prod/variables.tf @@ -29,7 
+29,7 @@ variable "env_vars" { GDRIVE_ROOT="/Shared With Me" MANAGE_FLOWS_SCOPE="https://auth.globus.org/scopes/eec9b274-0c81-4334-bdc2-54e90e689b9a/manage_flows" MONITOR_BY_GROUP="urn:globus:groups:id:5fc63928-3752-11e8-9c6f-0e00fd09bf20" - PORTAL_URL="https://acdc.alcf.anl.gov/mdf/detail/" + PORTAL_URL="https://materialsdatafacility.org/detail/" RUN_AS_SCOPE="4c37a999-da4b-4969-b621-58bfb243c5bc" SEARCH_INDEX_UUID="1a57bbe5-5272-477f-9d31-343b8258b7a5" TEST_DATA_DESTINATION="globus://f10a69a9-338c-4e5b-baa1-0dc92359ab47/mdf_testing/" From 3bd43f9b69196662e5775e305328cad5a3de2527 Mon Sep 17 00:00:00 2001 From: Ben Galewsky Date: Wed, 17 Jul 2024 09:59:05 -0500 Subject: [PATCH 3/5] Update minimum terraform version --- infra/mdf/dev/main.tf | 2 +- infra/mdf/prod/main.tf | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/infra/mdf/dev/main.tf b/infra/mdf/dev/main.tf index 22de787..4c3ae04 100644 --- a/infra/mdf/dev/main.tf +++ b/infra/mdf/dev/main.tf @@ -7,7 +7,7 @@ terraform { version = "~> 4.0.0" } } - required_version = "~> 1.5.5" + required_version = "~> 1.9.2" backend "s3" { # Replace this with your bucket name! diff --git a/infra/mdf/prod/main.tf b/infra/mdf/prod/main.tf index 2c19425..d22913c 100644 --- a/infra/mdf/prod/main.tf +++ b/infra/mdf/prod/main.tf @@ -7,7 +7,7 @@ terraform { version = "~> 4.0.0" } } - required_version = "~> 1.5.5" + required_version = "~> 1.9.2" backend "s3" { # Replace this with your bucket name! From 17263295b8f5fb1fddcfd00621d594ce939ccfb9 Mon Sep 17 00:00:00 2001 From: Ben Galewsky Date: Wed, 17 Jul 2024 10:05:30 -0500 Subject: [PATCH 4/5] Pin pytest to avoid new deprication --- aws/requirements_test.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/aws/requirements_test.txt b/aws/requirements_test.txt index b13474b..096d7cd 100644 --- a/aws/requirements_test.txt +++ b/aws/requirements_test.txt @@ -1,4 +1,4 @@ -pytest +pytest<8.0 pytest-bdd pytest-mock boto3 From fe61b5029f265e472f1f22ec14f2d2aa6f3debb5 Mon Sep 17 00:00:00 2001 From: Ben Galewsky Date: Wed, 17 Jul 2024 10:47:46 -0500 Subject: [PATCH 5/5] Pin pytest in correct requirements file, delete confusing disused file --- aws/requirements_test.txt | 4 ---- aws/tests/requirements-test.txt | 2 +- 2 files changed, 1 insertion(+), 5 deletions(-) delete mode 100644 aws/requirements_test.txt diff --git a/aws/requirements_test.txt b/aws/requirements_test.txt deleted file mode 100644 index 096d7cd..0000000 --- a/aws/requirements_test.txt +++ /dev/null @@ -1,4 +0,0 @@ -pytest<8.0 -pytest-bdd -pytest-mock -boto3 diff --git a/aws/tests/requirements-test.txt b/aws/tests/requirements-test.txt index 758ce7d..7cc451b 100644 --- a/aws/tests/requirements-test.txt +++ b/aws/tests/requirements-test.txt @@ -1,4 +1,4 @@ -pytest +pytest<8.0 pytest-mock pytest-bdd==4.1.0 git+https://github.com/materials-data-facility/connect_client.git@v0.4.0-dev
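
For reference, the endpoints documented in PATCH 1/5 can be exercised directly once the API Gateway base URL and a Globus Auth token are known. The sketch below is illustrative only: the base URL, the request payload fields, and the `source_id`/status response shapes are assumptions, not part of the patches above; the supported way to submit remains the MDF `connect_client` library pinned in the test requirements.

```python
# Minimal sketch of calling the MDF Connect REST endpoints (POST /submit,
# GET /status) described in PATCH 1/5. All concrete values are placeholders:
# the base URL, the payload fields, and the "source_id" field name are
# assumptions for illustration only.
import requests

BASE_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod"  # assumed API Gateway stage URL
TOKEN = "REPLACE_WITH_GLOBUS_AUTH_TOKEN"  # bearer token obtained from Globus Auth out of band

headers = {"Authorization": f"Bearer {TOKEN}"}

# POST /submit: submit a hypothetical metadata-only dataset description.
# On the service side this is what triggers the Globus Automate flow.
submission = {
    "dc": {"titles": [{"title": "Example dataset"}]},  # placeholder metadata block
    "data_sources": [],                                # empty: metadata-only submission
    "update_metadata_only": True,
}
submit_resp = requests.post(f"{BASE_URL}/submit", json=submission, headers=headers)
submit_resp.raise_for_status()
source_id = submit_resp.json().get("source_id")  # response field name assumed

# GET /status: poll the status of the submission (path parameter assumed).
status_resp = requests.get(f"{BASE_URL}/status/{source_id}", headers=headers)
status_resp.raise_for_status()
print(status_resp.json())
```

A query for past submissions would follow the same pattern against the POST /submissions endpoint.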