Improve handling of CI image timeout when backtracking (apache#33364)
Even the latest pip can enter a long backtracking loop
when trying to find the latest "good" set of dependencies with
eager upgrade. This happened on August 10th 2023, with aiobotocore
causing the backtracking.

This PR adds a complete set of tools and instructions that can
help in such cases to figure out which newly released dependency
causes the backtracking.

The toolset consists of:

* adding a timeout on the image build, so that it can fail before
  the job timeout and provide useful instructions on what to do

* adding `ci find-backtracking-candidates`, which identifies
  the packages released after the last successful constraint update
  that could be the reason for backtracking

* running the `find-backtracking-candidates` command in the CI
  when a timeout occurs. This shows the candidates as early as
  possible, at the first build that fails with a timeout, which
  should help narrow down the root cause much faster

* adding a detailed explanation of why we have the problem and how to
  deal with it step by step, including an example based on the
  August 2023 backtracking issue with aiobotocore

* finally, removing the `--empty-image` switch and the pushing of empty
  images in CI. This was an attempt to speed up waiting for the image
  when the image build failed, but in practice it hid those failures.
  It no longer adds value: "image waiting" is now always done on small
  public runners, so waiting until the timeout is not a big issue.
potiuk authored Aug 14, 2023
1 parent 2976a56 commit 1cf960d
Showing 31 changed files with 752 additions and 296 deletions.
7 changes: 0 additions & 7 deletions .github/actions/build-ci-images/action.yml
@@ -48,13 +48,6 @@ runs:
cat "files/constraints-${PYTHON_VERSION}/*.md" >> $GITHUB_STEP_SUMMARY || true
done
if: env.UPGRADE_TO_NEWER_DEPENDENCIES != 'false'
- name: Push empty CI image ${{ env.PYTHON_MAJOR_MINOR_VERSION }}:${{ env.IMAGE }}
if: failure() || cancelled()
shell: bash
run: breeze ci-image build --push --empty-image --run-in-parallel
env:
IMAGE_TAG: ${{ env.IMAGE_TAG }}
COMMIT_SHA: ${{ github.sha }}
- name: "Fix ownership"
shell: bash
run: breeze ci fix-ownership
4 changes: 0 additions & 4 deletions .github/actions/build-prod-images/action.yml
@@ -63,10 +63,6 @@ runs:
--install-packages-from-context --upgrade-on-failure
env:
COMMIT_SHA: ${{ github.sha }}
- name: Push empty PROD images ${{ env.IMAGE_TAG }}
shell: bash
run: breeze prod-image build --cleanup-context --push --empty-image --run-in-parallel
if: failure() || cancelled()
- name: "Fix ownership"
shell: bash
run: breeze ci fix-ownership
1 change: 1 addition & 0 deletions .github/workflows/build-images.yml
@@ -194,6 +194,7 @@ jobs:
DOCKER_CACHE: ${{ needs.build-info.outputs.cache-directive }}
PYTHON_VERSIONS: ${{needs.build-info.outputs.all-python-versions-list-as-string}}
DEBUG_RESOURCES: ${{ needs.build-info.outputs.debug-resources }}
BUILD_TIMEOUT_MINUTES: 70
build-prod-images:
permissions:
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -324,6 +324,7 @@ jobs:
DOCKER_CACHE: ${{ needs.build-info.outputs.cache-directive }}
PYTHON_VERSIONS: ${{needs.build-info.outputs.all-python-versions-list-as-string}}
DEBUG_RESOURCES: ${{needs.build-info.outputs.debug-resources}}
BUILD_TIMEOUT_MINUTES: 70
build-prod-images:
timeout-minutes: 80
17 changes: 17 additions & 0 deletions BREEZE.rst
@@ -1741,6 +1741,23 @@ These are all available flags of ``get-workflow-info`` command:
:width: 100%
:alt: Breeze ci get-workflow-info
Finding backtracking candidates
...............................

Sometimes the CI build fails because ``pip`` times out when trying to resolve the latest set of dependencies.
For that we have the ``find-backtracking-candidates`` command. This command lists the newly released
dependencies that might be causing the backtracking.

The details on how to use that command are explained in
`Figuring out backtracking dependencies <dev/MANUALLY_GENERATING_IMAGE_CACHE_AND_CONSTRAINTS.md#figuring-out-backtracking-dependencies>`_.

These are all available flags of ``find-backtracking-candidates`` command:

.. image:: ./images/breeze/output_ci_find-backtracking-candidates.svg
  :target: https://raw.githubusercontent.com/apache/airflow/main/images/breeze/output_ci_find-backtracking-candidates.svg
  :width: 100%
  :alt: Breeze ci find-backtracking-candidates

Release management tasks
------------------------
12 changes: 3 additions & 9 deletions Dockerfile
@@ -594,7 +594,7 @@ function install_airflow_and_providers_from_docker_context_files(){
pip install "${pip_flags[@]}" --root-user-action ignore --upgrade --upgrade-strategy eager \
${ADDITIONAL_PIP_INSTALL_FLAGS} \
${reinstalling_apache_airflow_package} ${reinstalling_apache_airflow_providers_packages} \
${EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS}
${EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS=}
set +x

common::install_pip_version
@@ -665,7 +665,7 @@ function install_airflow() {
pip install --root-user-action ignore --upgrade --upgrade-strategy eager \
${ADDITIONAL_PIP_INSTALL_FLAGS} \
"${AIRFLOW_INSTALLATION_METHOD}[${AIRFLOW_EXTRAS}]${AIRFLOW_VERSION_SPECIFICATION}" \
${EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS}
${EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS=}
if [[ -n "${AIRFLOW_INSTALL_EDITABLE_FLAG}" ]]; then
# Remove airflow and reinstall it using editable flag
# We can only do it when we install airflow from sources
@@ -734,7 +734,7 @@ function install_additional_dependencies() {
set -x
pip install --root-user-action ignore --upgrade --upgrade-strategy eager \
${ADDITIONAL_PIP_INSTALL_FLAGS} \
${ADDITIONAL_PYTHON_DEPS} ${EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS}
${ADDITIONAL_PYTHON_DEPS} ${EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS=}
common::install_pip_version
set +x
echo
@@ -1290,17 +1290,11 @@ COPY --chown=airflow:0 ${AIRFLOW_SOURCES_FROM} ${AIRFLOW_SOURCES_TO}
# Add extra python dependencies
ARG ADDITIONAL_PYTHON_DEPS=""

# Those are additional constraints that are needed for some extras but we do not want to
# force them on the main Airflow package. Currently we need no extra limits as PIP 23.1+ has much better
# dependency resolution and we do not need to limit the versions of the dependencies
# !!! MAKE SURE YOU SYNCHRONIZE THE LIST BETWEEN: Dockerfile, Dockerfile.ci
ARG EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS=""

ARG VERSION_SUFFIX_FOR_PYPI=""

ENV ADDITIONAL_PYTHON_DEPS=${ADDITIONAL_PYTHON_DEPS} \
INSTALL_PACKAGES_FROM_CONTEXT=${INSTALL_PACKAGES_FROM_CONTEXT} \
EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS=${EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS} \
VERSION_SUFFIX_FOR_PYPI=${VERSION_SUFFIX_FOR_PYPI}

WORKDIR ${AIRFLOW_HOME}
7 changes: 4 additions & 3 deletions Dockerfile.ci
@@ -534,7 +534,7 @@ function install_airflow() {
pip install --root-user-action ignore --upgrade --upgrade-strategy eager \
${ADDITIONAL_PIP_INSTALL_FLAGS} \
"${AIRFLOW_INSTALLATION_METHOD}[${AIRFLOW_EXTRAS}]${AIRFLOW_VERSION_SPECIFICATION}" \
${EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS}
${EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS=}
if [[ -n "${AIRFLOW_INSTALL_EDITABLE_FLAG}" ]]; then
# Remove airflow and reinstall it using editable flag
# We can only do it when we install airflow from sources
@@ -603,7 +603,7 @@ function install_additional_dependencies() {
set -x
pip install --root-user-action ignore --upgrade --upgrade-strategy eager \
${ADDITIONAL_PIP_INSTALL_FLAGS} \
${ADDITIONAL_PYTHON_DEPS} ${EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS}
${ADDITIONAL_PYTHON_DEPS} ${EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS=}
common::install_pip_version
set +x
echo
@@ -1334,7 +1334,8 @@ ARG CASS_DRIVER_NO_CYTHON="1"
# Build cassandra driver on multiple CPUs
ARG CASS_DRIVER_BUILD_CONCURRENCY="8"

ARG AIRFLOW_VERSION="2.5.0.dev0"
# This value should be set by the CI image build system to the current timestamp
ARG AIRFLOW_VERSION=""

# Additional PIP flags passed to all pip install commands except reinstalling pip itself
ARG ADDITIONAL_PIP_INSTALL_FLAGS=""
197 changes: 197 additions & 0 deletions dev/MANUALLY_GENERATING_IMAGE_CACHE_AND_CONSTRAINTS.md
@@ -23,6 +23,10 @@

- [Purpose of the document](#purpose-of-the-document)
- [Automated image cache and constraints refreshing in CI](#automated-image-cache-and-constraints-refreshing-in-ci)
- [Figuring out backtracking dependencies](#figuring-out-backtracking-dependencies)
- [Why we need to figure out backtracking dependencies](#why-we-need-to-figure-out-backtracking-dependencies)
- [How to figure out backtracking dependencies](#how-to-figure-out-backtracking-dependencies)
- [Example backtracking session](#example-backtracking-session)
- [Manually refreshing the image cache](#manually-refreshing-the-image-cache)
- [Why we need to update image cache manually](#why-we-need-to-update-image-cache-manually)
- [Prerequisites](#prerequisites)
@@ -80,6 +84,199 @@ rebuilding of [Breeze](../BREEZE.rst) images for development purpose. This is al
step makes sure that constraints are committed and pushed just before the cache is refreshed, so
there is no problem with conflicting dependencies.


# Figuring out backtracking dependencies

## Why we need to figure out backtracking dependencies

Very rarely, the CI image in `canary` builds takes a very long time to build. This is usually
caused by `pip` trying to figure out the latest set of dependencies (`eager upgrade`).
Dependency resolution is a very complex problem and sometimes it takes a long time to figure out
the best set of dependencies. This is especially true when we have a lot of dependencies and they all have
to be found compatible with each other. When new dependencies are released, `pip` sometimes enters
a long loop trying to figure out whether the newly released dependency can be used. Some of our other
dependencies may make that impossible, but it takes `pip` a very long time to find that out.

This is visible in the "build output" as `pip` continuously backtracking and downloading many new
versions of various dependencies, trying to find a good match.

This is why we sometimes need to help `pip` skip newer versions of those dependencies until the
condition that caused the backtracking is resolved.

We do it by adding `dependency<=version` to the `EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS` variable in
`Dockerfile.ci`. The trick is to find the dependency that is causing the backtracking.

To find it, we bisect the candidates for triggering the backtracking: the packages
that have been released on PyPI since the last time we successfully ran
``--upgrade-to-newer-dependencies`` and committed the constraints in the `canary` build.

## How to figure out backtracking dependencies

First, we have a breeze command that can help us with that:

```bash
breeze ci find-backtracking-candidates
```

This command should be run soon after we notice that the CI build is taking a long time and failing.
It relies on the fact that eager upgrade produced valid constraints at some point in time:
it finds out which dependencies have had new releases since then and limits them to the versions that
were used in those constraints.

Instead of running the command manually, you can also rely on the failing CI builds. We run the
`find-backtracking-candidates` command in the `canary` build when it times out, so the
easiest way to find backtracking candidates is to find the first build that failed with a timeout: it
will likely have the smallest number of backtracking candidates. The command outputs limits
for those backtracking candidates that are guaranteed to work (because they are taken from the latest
constraints and they already succeeded in the past when the constraints were updated).
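
The selection described above can be pictured roughly as follows. This is a minimal sketch, not the Breeze implementation: the function name `select_backtracking_candidates` and its inputs (plain dictionaries standing in for PyPI and constraint-file lookups) are assumptions made for illustration, and `example-old-package` is a made-up entry. The aiobotocore and tqdm values are taken from the example session in this document.

```python
from datetime import datetime


def select_backtracking_candidates(last_constraint_date, latest_releases, constraints):
    """Return ``name<=pinned`` limits for packages released after the last constraint update.

    latest_releases maps package name -> (latest version, release datetime).
    constraints maps package name -> version pinned in the last good constraints.
    Both inputs are hypothetical stand-ins for real PyPI / constraint-file lookups.
    """
    limits = []
    for name in sorted(latest_releases):
        latest_version, release_date = latest_releases[name]
        pinned = constraints.get(name)
        # Skip packages we do not constrain or that have no newer release.
        if pinned is None or latest_version == pinned:
            continue
        # Only releases newer than the last good constraints can be new culprits.
        if release_date > last_constraint_date:
            limits.append(f"{name}<={pinned}")
    return limits


last_constraint_date = datetime(2023, 8, 9, 21, 48, 23)
latest_releases = {
    "aiobotocore": ("2.6.0", datetime(2023, 8, 11, 20, 43, 19)),
    "tqdm": ("4.66.1", datetime(2023, 8, 10, 11, 38, 57)),
    # Hypothetical package with no release since the last constraint update.
    "example-old-package": ("1.0.0", datetime(2023, 1, 1, 0, 0, 0)),
}
constraints = {"aiobotocore": "2.5.4", "tqdm": "4.66.0", "example-old-package": "1.0.0"}

candidates = select_backtracking_candidates(last_constraint_date, latest_releases, constraints)
print(" ".join(candidates))
```

Limiting each candidate to the version from the last good constraints is what makes the full list a safe starting point: that exact combination has already resolved once.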

Then we run ``breeze ci-image build --upgrade-to-newer-dependencies --eager-upgrade-additional-requirements "REQUIREMENTS"``
to check which of the candidates causes the long builds. Initially, pass the whole list of
candidates that you got from the `find-backtracking-candidates` command. This **should** succeed.
The next step is to narrow down the list of candidates to the one that is causing the backtracking.

We narrow down the list by bisecting it. We remove half of the dependency limits and see if the build
still works. If it works, we continue with the remaining limits. If it does not work, we restore the removed half and remove
the other half. Rinse and repeat until, hopefully, there is only one dependency left
(sometimes you will need to leave a few of them).
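
The bisecting loop can be sketched as below. This is only an illustration under stated assumptions: `bisect_limits` and `fake_build_succeeds` are hypothetical names, and the callback stands in for a real (and slow) `breeze ci-image build` run with the given limits, which in practice you would start by hand and judge by whether it finishes in reasonable time.

```python
from typing import Callable, List, Sequence


def bisect_limits(
    limits: Sequence[str], build_succeeds: Callable[[Sequence[str]], bool]
) -> List[str]:
    """Shrink a list of limits that is known to work down to the ones still needed."""
    current = list(limits)
    while len(current) > 1:
        mid = len(current) // 2
        first_half, second_half = current[:mid], current[mid:]
        if build_succeeds(first_half):
            # The second half was not needed: keep only the first one.
            current = first_half
        elif build_succeeds(second_half):
            # Restore the removed half and drop the other one instead.
            current = second_half
        else:
            # Neither half works alone, so several limits are needed together.
            break
    return current


def fake_build_succeeds(kept: Sequence[str]) -> bool:
    # Stand-in for a real image build: succeeds only while aiobotocore stays limited.
    return any(limit.startswith("aiobotocore") for limit in kept)


limits = ["aiobotocore<=2.5.4", "asana<=3.2.1", "async-timeout<=4.0.2", "aws-sam-translator<=1.72.0"]
culprits = bisect_limits(limits, fake_build_succeeds)
print(culprits)
```

Each round costs at most two builds and halves the list, so 20 candidates need about five rounds. When the loop stops with more than one limit left, that is the "leave a few of them" case.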

This way we can relatively quickly narrow down the dependency that is causing the backtracking. Once we
know which dependency it is, we can try to find out why it causes the backtracking
by specifying the latest released version of the dependency as `== <latest released version>` in
`--eager-upgrade-additional-requirements`. This should fail rather quickly and `pip` should show us what
the dependency conflicts with. There might be multiple reasons for that. Most often it is simply
a dependency with a limited requirement, and we need to wait until a new version of that
dependency is released.

Note that, surprisingly, such a build **might** even succeed. This is simply a sign that the `pip`
algorithm for eager upgrade is not perfect and that the solution could be found given sufficient time.
In such a case, removing the limit in the next few days might no longer cause the backtracking.

Finally, in order to make the change permanent in our CI builds, we should add the limit to the
`EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS` arg in `Dockerfile.ci` and commit the change. We usually commit
the limits in the `<VERSION` form, where `VERSION` is the version that causes backtracking. Usually that will
be the latest released version, unless that dependency had quick subsequent releases. You can try it before
committing by simply adding it to `EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS` in `Dockerfile.ci` and
running `breeze ci-image build --upgrade-to-newer-dependencies`. Make sure to add a comment explaining
when we should remove the limit.


Later on, we might periodically attempt to remove the limit and see if the backtracking is still
happening. If it is not, we just remove the limit from `Dockerfile.ci` and commit the change.

## Example backtracking session

This is an example backtracking session run on 13th of August 2023, after the `canary` CI image build
started to fail with a timeout a day before.

1. The `breeze ci-image build --upgrade-to-newer-dependencies` failed on CI after 80 minutes.

2. The output of the `breeze ci find-backtracking-candidates` command:

```
Last constraint date: 2023-08-09 21:48:23
Latest version aiobotocore==2.6.0 release date: 2023-08-11 20:43:19. In current constraints: 2.5.4)
Latest version asana==4.0.5 release date: 2023-08-11 18:56:04. In current constraints: 3.2.1)
Latest version async-timeout==4.0.3 release date: 2023-08-10 16:35:55. In current constraints: 4.0.2)
Latest version aws-sam-translator==1.73.0 release date: 2023-08-10 00:01:00. In current constraints: 1.72.0)
Latest version azure-core==1.29.1 release date: 2023-08-10 05:09:59. In current constraints: 1.29.0)
Latest version azure-cosmos==4.5.0 release date: 2023-08-09 23:43:07. In current constraints: 4.4.0)
Latest version boto3==1.28.25 release date: 2023-08-11 19:23:52. In current constraints: 1.28.17)
Latest version botocore==1.31.25 release date: 2023-08-11 19:23:34. In current constraints: 1.31.17)
Latest version cfgv==3.4.0 release date: 2023-08-12 20:38:16. In current constraints: 3.3.1)
Latest version coverage==7.3.0 release date: 2023-08-12 18:34:06. In current constraints: 7.2.7)
Latest version databricks-sql-connector==2.9.1 release date: 2023-08-11 17:32:12. In current constraints: 2.8.0)
Latest version google-ads==21.3.0 release date: 2023-08-10 18:10:22. In current constraints: 21.2.0)
Latest version google-cloud-aiplatform==1.30.1 release date: 2023-08-11 21:19:50. In current constraints: 1.29.0)
Latest version grpcio-status==1.57.0 release date: 2023-08-10 15:54:17. In current constraints: 1.56.2)
Latest version grpcio==1.57.0 release date: 2023-08-10 15:51:52. In current constraints: 1.56.2)
Latest version mypy==1.5.0 release date: 2023-08-10 12:46:43. In current constraints: 1.2.0)
Latest version pyzmq==25.1.1 release date: 2023-08-10 09:01:18. In current constraints: 25.1.0)
Latest version tornado==6.3.3 release date: 2023-08-11 15:21:47. In current constraints: 6.3.2)
Latest version tqdm==4.66.1 release date: 2023-08-10 11:38:57. In current constraints: 4.66.0)
Latest version virtualenv==20.24.3 release date: 2023-08-11 15:52:32. In current constraints: 20.24.1)
Found 20 candidates for backtracking
Run `breeze ci-image --upgrade-to-newer-dependencies --eager-upgrade-additional-requirements "aiobotocore<=2.5.4 asana<=3.2.1 async-timeout<=4.0.2 aws-sam-translator<=1.72.0 azure-core<=1.29.0
azure-cosmos<=4.4.0 boto3<=1.28.17 botocore<=1.31.17 cfgv<=3.3.1 coverage<=7.2.7 databricks-sql-connector<=2.8.0 google-ads<=21.2.0 google-cloud-aiplatform<=1.29.0 grpcio-status<=1.56.2 grpcio<=1.56.2
mypy<=1.2.0 pyzmq<=25.1.0 tornado<=6.3.2 tqdm<=4.66.0 virtualenv<=20.24.1"`. It should succeed.
```

3. As instructed, run:

```bash
breeze ci-image build --upgrade-to-newer-dependencies --eager-upgrade-additional-requirements "\
aiobotocore<=2.5.4 asana<=3.2.1 async-timeout<=4.0.2 aws-sam-translator<=1.72.0 \
azure-core<=1.29.0 azure-cosmos<=4.4.0 boto3<=1.28.17 botocore<=1.31.17 cfgv<=3.3.1 coverage<=7.2.7 \
databricks-sql-connector<=2.8.0 google-ads<=21.2.0 google-cloud-aiplatform<=1.29.0 \
grpcio-status<=1.56.2 grpcio<=1.56.2 mypy<=1.2.0 pyzmq<=25.1.0 tornado<=6.3.2 tqdm<=4.66.0 virtualenv<=20.24.1"
```

The build succeeded in ~ 8 minutes.

4. Removed the second half:

```
breeze ci-image build --upgrade-to-newer-dependencies --eager-upgrade-additional-requirements "\
aiobotocore<=2.5.4 asana<=3.2.1 async-timeout<=4.0.2 aws-sam-translator<=1.72.0 \
azure-core<=1.29.0 azure-cosmos<=4.4.0 boto3<=1.28.17 botocore<=1.31.17 cfgv<=3.3.1 coverage<=7.2.7"
```

The build succeeded in ~ 8 minutes.

5. Removed the second half:

```
breeze ci-image build --upgrade-to-newer-dependencies --eager-upgrade-additional-requirements "\
aiobotocore<=2.5.4 asana<=3.2.1 async-timeout<=4.0.2 aws-sam-translator<=1.72.0"
```

The build succeeded in ~ 8 minutes.

6. Removed the second half:

```
breeze ci-image build --upgrade-to-newer-dependencies \
--eager-upgrade-additional-requirements "aiobotocore<=2.5.4 asana<=3.2.1"
```

The build succeeded in ~ 8 minutes.

7. Removed aiobotocore

```
asana<=3.2.1
```

The image build kept running well past 10 minutes, downloading many versions of many dependencies.

8. Removed asana and restored aiobotocore

```
aiobotocore<=2.5.4
```

The build succeeded. Aiobotocore is our culprit.

9. Check the reason for backtracking (using latest released version of aiobotocore):

```bash
breeze ci-image build --upgrade-to-newer-dependencies --eager-upgrade-additional-requirements "aiobotocore==2.6.0"
```

Note: in this case the build succeeded, which means that this was simply a flaw in the `pip` resolution
algorithm (which is based on some heuristics) and not a real problem with the dependencies. We will
attempt to remove the limit in the next few days to see if the problem is resolved by other dependencies
released in the meantime.

10. Updated additional dependencies in `Dockerfile.ci` with appropriate comment:

```
# aiobotocore is limited temporarily until it stops backtracking pip
ARG EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS="aiobotocore<2.6.0"
```

# Manually refreshing the image cache

## Why we need to update image cache manually
10 changes: 10 additions & 0 deletions dev/breeze/src/airflow_breeze/commands/ci_commands.py
@@ -411,3 +411,13 @@ def get_workflow_info(github_context: str, github_context_input: StringIO):
sys.exit(1)
wi = workflow_info(context=context)
wi.print_all_ga_outputs()


@ci_group.command(
    name="find-backtracking-candidates",
    help="Find new releases of dependencies that could be the reason of backtracking.",
)
def find_backtracking_candidates():
    from airflow_breeze.utils.backtracking import print_backtracking_candidates

    print_backtracking_candidates()
@@ -66,4 +66,5 @@
}
],
"breeze ci resource-check": [],
"breeze ci find-backtracking-candidates": [],
}