Xee support for distributed worker initialization/authentication
This adds an Earth Engine initialization check to dataset operations so that remote workers can call Earth Engine, and adds docs for submitting a Dataflow job using Xee.

Dataflow jobs would fail with Xee because the remote workers did not have the EE client library initialized. This change adds a check to all calls on the `EarthEngineBackendArray` object so that any call to EE initializes the client library if it is not already initialized. It also includes a parameter the user has to set to allow the attempt at automatically initializing Earth Engine for workers; this way users are explicit about, and aware of, EE trying to initialize.

There was discussion on issue #99 regarding documentation for how to initialize/authenticate on a distributed cluster, so this also includes a Dataflow example (docs, cloud setup, script) that users can start from.

Closes #51

PiperOrigin-RevId: 596966033
Xee authors committed Feb 5, 2024
1 parent 6700ed1 · commit 1d15522
Showing 7 changed files with 278 additions and 3 deletions.
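To illustrate the mechanism the commit message describes, a guard along these lines initializes Earth Engine lazily on each worker. This is a hedged sketch, not the actual diff: the names `ee_init_if_necessary`, `_ee_is_initialized`, and `_maybe_init_ee` are illustrative stand-ins for whatever identifiers the commit really uses.

```python
# Illustrative sketch only -- names and structure are assumptions, not the
# actual Xee diff. It shows the pattern the commit message describes: every
# Earth Engine call on a worker first checks that the client is initialized.
import ee


def _ee_is_initialized() -> bool:
  # earthengine-api stores credentials on ee.data after ee.Initialize();
  # checking the private attribute is one way to detect initialization.
  return ee.data._credentials is not None  # pylint: disable=protected-access


class EarthEngineBackendArray:  # Simplified stand-in for Xee's class.

  def __init__(self, ee_init_if_necessary: bool = False):
    # Opt-in flag: users must explicitly allow workers to self-initialize.
    self.ee_init_if_necessary = ee_init_if_necessary

  def _maybe_init_ee(self) -> None:
    if _ee_is_initialized():
      return
    if not self.ee_init_if_necessary:
      raise RuntimeError(
          'Earth Engine is not initialized on this worker. Opt in to '
          'automatic initialization to let workers initialize themselves.'
      )
    ee.Initialize()

  def _get_pixels(self, params):
    self._maybe_init_ee()  # Guard every method that calls Earth Engine.
    return ee.data.computePixels(params)
```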
**`Dockerfile`** (new file)

```Dockerfile
FROM apache/beam_python3.9_sdk:2.51.0

COPY requirements.txt ./

RUN pip install -r requirements.txt
```
**`README.md`** (new file)

# Xee Dataflow Example
This example illustrates how to run an Xee Beam process using Dataflow on Google Cloud Platform.

The example requires a Google Cloud account and will incur charges!
## Cloud setup

To begin, there is a fair amount of Cloud resource setup required to execute the workflow on a Cloud Project.

This example assumes you have the [Google Cloud SDK installed](https://cloud.google.com/sdk/docs/install) and an [Earth Engine project set up with your Cloud Project](https://developers.google.com/earth-engine/cloud/earthengine_cloud_project_setup).
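Before submitting a paid Dataflow job, it can be worth verifying locally that your Earth Engine project setup works. A minimal sketch, assuming the `earthengine-api` package is installed and `YOUR_PROJECT` is replaced with your Cloud project ID:

```python
# Quick local sanity check of the Earth Engine + Cloud project setup.
import ee

ee.Authenticate()  # Opens a browser-based auth flow the first time.
ee.Initialize(project='YOUR_PROJECT')  # Your Cloud project ID.

# Should print 1 if the collection used later in this example is reachable.
print(ee.ImageCollection('NASA/GPM_L3/IMERG_V06').limit(1).size().getInfo())
```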
### Set environment variables

These environment variables are used throughout the example to make life easier when working across different Cloud environments. They capture Cloud project information and set the names used for the infrastructure created in this example.
```shell
PROJECT=$(gcloud config get-value project)

REGION=us-central1

REPO=xee-dataflow
CONTAINER=beam-runner

SA_NAME=xee-dataflow-controller
SERVICE_ACCOUNT=${SA_NAME}@${PROJECT}.iam.gserviceaccount.com
```
### Create a custom Docker container with dependencies

One of the suggested ways to handle external dependencies within a Beam pipeline is to [use a custom Docker container](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#custom-containers) with the pipeline. This is useful because each remote worker needs to install dependencies when it spins up, and having a pre-built container makes that much quicker.

To do this with Google Cloud, one must first create an [Artifact Registry](https://cloud.google.com/artifact-registry/docs/overview) repository where the Docker container can be stored, and then build and push the container to that repository.

To create an Artifact Registry repository, run the following command:
```shell
gcloud artifacts repositories create $REPO \
  --location=$REGION \
  --repository-format=docker \
  --description="Repository for hosting the Docker images to test xee with Dataflow" \
  --async
```
The next step is to build the Docker container and push it to the repository just created. This is done using [Cloud Build](https://cloud.google.com/build/docs/overview) with a configuration file. The config file defines how the image is built and where it is stored.

The `cloudbuild.yaml` file has placeholder variables that need to be replaced with information about your Cloud environment. Open the file in your favorite text editor and replace "REGION" with the Cloud region you selected, "YOUR_PROJECT" with your Cloud project ID, "REPO" with the Artifact Registry repository name, and "CONTAINER" with the container name.

Alternatively, you can replace them with the following command:
```shell
sed -i 's/REGION/'"$REGION"'/g; s/YOUR_PROJECT/'"$PROJECT"'/g; s/REPO/'"$REPO"'/g; s/CONTAINER/'"$CONTAINER"'/g' cloudbuild.yaml
```
Run the following command to build the container to use with Dataflow:

```shell
gcloud builds submit --config cloudbuild.yaml
```
### Create a Cloud Storage bucket

This example will output data to a Cloud Storage bucket, so one needs to be created for the pipeline. To do so, run the following command:
```shell
gsutil mb -l $REGION gs://xee-out-${PROJECT}
```
Cloud bucket names need to be globally unique, so this uses the Cloud project ID (also globally unique) in the name.
### Create a Service Account

Service Accounts (SAs) are used to authorize remote workers to make calls to different services. It is good practice to create a dedicated SA for a specific process, which limits the roles assigned to that one SA to only those the process requires.

To create a SA, run the following command:
```shell
gcloud iam service-accounts create ${SA_NAME} \
  --description="Controller service account for services used with Dataflow" \
  --display-name="Xee Dataflow Controller SA"
```
Next, assign the required roles to the Service Account so it can properly manage workers and read/write data.
```shell
roles=("roles/earthengine.writer" "roles/serviceusage.serviceUsageConsumer" "roles/storage.objectAdmin" "roles/artifactregistry.reader" "roles/dataflow.worker")

for role in "${roles[@]}"
do
  gcloud projects add-iam-policy-binding ${PROJECT} \
    --member=serviceAccount:${SERVICE_ACCOUNT} \
    --role=${role}
done
```
Now that all of the Cloud infrastructure is set up, it is time to run the pipeline!
## Run the pipeline

This example is focused on pulling data from Earth Engine, transforming the data into the Zarr format, and storing the results. The `ee_to_zarr_dataflow.py` script defines the pipeline, and the command-line arguments passed to it define how it is executed with Dataflow.
```shell
python ee_to_zarr_dataflow.py \
  --input NASA/GPM_L3/IMERG_V06 \
  --output gs://xee-out-${PROJECT} \
  --target_chunks='time=6' \
  --runner DataflowRunner \
  --project $PROJECT \
  --region $REGION \
  --temp_location gs://xee-out-${PROJECT}/tmp/ \
  --service_account_email $SERVICE_ACCOUNT \
  --sdk_location=container \
  --sdk_container_image=${REGION}-docker.pkg.dev/${PROJECT}/${REPO}/${CONTAINER} \
  --job_name imerg-dataflow-$(date '+%Y%m%d%H%M%S')
```
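Once the Dataflow job completes, one way to sanity-check the result is to open the Zarr store with xarray. A minimal sketch, assuming `zarr` and `gcsfs` are installed, you are authenticated to read the bucket, and `YOUR_PROJECT` is replaced with your project ID:

```python
# Open the Zarr store the pipeline wrote and inspect its chunking.
import xarray as xr

# The path must match the --output flag passed to the pipeline above.
ds = xr.open_zarr('gs://xee-out-YOUR_PROJECT')
print(ds)
print(ds.chunks)  # Time chunks should reflect --target_chunks ('time=6').
```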
**`cloudbuild.yaml`** (new file)

```yaml
steps:
- name: 'gcr.io/cloud-builders/docker'
  args: [ 'build', '-t', 'REGION-docker.pkg.dev/YOUR_PROJECT/REPO/CONTAINER', '.' ]
images: ['REGION-docker.pkg.dev/YOUR_PROJECT/REPO/CONTAINER']
```
**`ee_to_zarr_dataflow.py`** (new file)

```python
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
r"""Exports EE ImageCollections to Zarr using Xarray-Beam."""

import logging
from typing import Dict, List

from absl import app
from absl import flags
import apache_beam as beam
import xarray as xr
import xarray_beam as xbeam
import xee

import ee

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


_INPUT = flags.DEFINE_string(
    'input', '', help='The input Earth Engine ImageCollection.'
)
_CRS = flags.DEFINE_string(
    'crs',
    'EPSG:4326',
    help='Coordinate Reference System for output Zarr.',
)
_SCALE = flags.DEFINE_float('scale', 0.25, help='Scale factor for output Zarr.')
_TARGET_CHUNKS = flags.DEFINE_string(
    'target_chunks',
    '',
    help=(
        'Chunks on the input dataset to change on the outputs, in the '
        'form of comma separated dimension=size pairs, e.g., '
        "--target_chunks='x=10,y=10'. Omitted dimensions are not changed and a "
        'chunksize of -1 indicates not to chunk a dimension.'
    ),
)
_OUTPUT = flags.DEFINE_string('output', '', help='The output zarr path.')
_RUNNER = flags.DEFINE_string('runner', None, help='beam.runners.Runner')


# pylint: disable=unused-argument
def _parse_dataflow_flags(argv: List[str]) -> List[str]:
  # Anything absl does not recognize is passed through to Beam/Dataflow.
  parser = flags.argparse_flags.ArgumentParser(
      description='parser for dataflow flags',
      allow_abbrev=False,
  )
  _, dataflow_args = parser.parse_known_args()
  return dataflow_args


# Borrowed from the xbeam examples:
# https://github.com/google/xarray-beam/blob/4f4fcb965a65b5d577601af311d0e0142ee38076/examples/xbeam_rechunk.py#L41
def _parse_chunks_str(chunks_str: str) -> Dict[str, int]:
  chunks = {}
  parts = chunks_str.split(',')
  for part in parts:
    k, v = part.split('=')
    chunks[k] = int(v)
  return chunks


def main(argv: list[str]) -> None:
  assert _INPUT.value, 'Must specify --input'
  assert _OUTPUT.value, 'Must specify --output'

  source_chunks = {'time': 24}
  # Only parse --target_chunks when it is non-empty; _parse_chunks_str
  # would fail on an empty string.
  target_chunks = dict(source_chunks)
  if _TARGET_CHUNKS.value:
    target_chunks.update(_parse_chunks_str(_TARGET_CHUNKS.value))

  ee.Initialize()

  # Limit to the first 100 images (sorted by start time) of a single band.
  input_coll = (
      ee.ImageCollection(_INPUT.value)
      .limit(100, 'system:time_start', True)
      .select('precipitationCal')
  )

  ds = xr.open_dataset(
      input_coll,
      crs=_CRS.value,
      scale=_SCALE.value,
      engine=xee.EarthEngineBackendEntrypoint,
  )
  template = xbeam.make_template(ds)
  itemsize = max(variable.dtype.itemsize for variable in template.values())

  with beam.Pipeline(runner=_RUNNER.value, argv=argv) as root:
    _ = (
        root
        | xbeam.DatasetToChunks(ds, source_chunks)
        | xbeam.Rechunk(
            ds.sizes,
            source_chunks,
            target_chunks,
            itemsize=itemsize,
        )
        | xbeam.ChunksToZarr(_OUTPUT.value, template, target_chunks)
    )


if __name__ == '__main__':
  app.run(main, flags_parser=_parse_dataflow_flags)
```
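As a concrete illustration of the chunk handling in the script above, `_parse_chunks_str` turns the flag value into a dict, which is then merged over the fixed source chunks (values taken from the README's Dataflow command):

```python
# Illustration of how the script's chunk flags combine.
source_chunks = {'time': 24}  # Fixed in ee_to_zarr_dataflow.py.
target = dict(source_chunks)
target.update(_parse_chunks_str('time=6'))  # From --target_chunks='time=6'.
print(target)  # {'time': 6} -- the pipeline rechunks time=24 -> time=6.
```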
**`requirements.txt`** (new file)

```
xee[examples]
```
(Diffs for the remaining two changed files are not shown.)