Apply the necessary Airflow configuration on the top process itself (entrypoint.py) instead of just child processes. #99

Open
rafidka opened this issue Jun 27, 2024 · 0 comments

Overview

To configure Airflow, we use environment variables rather than editing the airflow.cfg file directly. This makes it easier to track the changes we are intentional about while otherwise leaving Airflow's defaults in place. These environment variables are defined by entrypoint.py and passed to every Airflow process we create, at the time the process is created.
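As a rough illustration of this pattern (not the actual entrypoint.py code), the sketch below builds a dictionary of Airflow settings and hands it to a child process through the `env` argument of `subprocess.run`, leaving the parent's own environment untouched. The helper names `build_airflow_env` and `run_child` and the specific settings chosen are assumptions made for the example.

```python
import os
import subprocess
import sys


def build_airflow_env() -> dict[str, str]:
    """Hypothetical helper: the settings we deliberately override.

    Airflow reads options from variables named AIRFLOW__{SECTION}__{KEY}.
    The two entries below are illustrative choices only.
    """
    return {
        "AIRFLOW__CORE__LOAD_EXAMPLES": "False",
        "AIRFLOW__METRICS__STATSD_ON": "True",
    }


def run_child(args: list[str], extra_env: dict[str, str]) -> str:
    """Run a child process with the parent's environment plus extra_env."""
    # Build a new mapping for the child only; os.environ is not modified.
    child_env = {**os.environ, **extra_env}
    result = subprocess.run(
        args, env=child_env, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # The child sees the variable even though the parent never exported it.
    code = "import os; print(os.environ['AIRFLOW__CORE__LOAD_EXAMPLES'])"
    print(run_child([sys.executable, "-c", code], build_airflow_env()))
```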

This approach is clean and explicit, and it keeps us in full control of what we pass to every sub-process, as opposed to the previous (internal) images, where we defined and exported environment variables in the entrypoint.sh file.

One downside of this approach, however, is that the entrypoint.py process itself doesn't have the required Airflow configuration, so any Airflow modules we import there will pick up the wrong configuration. One example where this might cause issues is reporting metrics using StatsD from entrypoint.py (or one of its sub-modules). If we simply import Airflow's Stats object, it will have the wrong configuration and won't report anything (see #100).
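The underlying problem can be reproduced without Airflow. The sketch below uses a stand-in module body (`FAKE_SETTINGS_SOURCE`, a hypothetical name) that reads an environment variable once when it is loaded, which is a simplification of how configuration captured at import time behaves. Airflow's real configuration loading is more involved, so treat this only as an illustration of the ordering hazard.

```python
import os
import types

# Stand-in for a settings module that reads its configuration once,
# at import time. This is NOT Airflow code, only an illustration.
FAKE_SETTINGS_SOURCE = '''
import os
STATSD_ON = os.environ.get("AIRFLOW__METRICS__STATSD_ON", "False")
'''


def load_fake_settings() -> types.ModuleType:
    """Execute the stand-in module body, as an import does on first load."""
    module = types.ModuleType("fake_settings")
    exec(FAKE_SETTINGS_SOURCE, module.__dict__)
    return module


def demo() -> tuple[str, str]:
    """Return the value seen when loading before and after the env update."""
    key = "AIRFLOW__METRICS__STATSD_ON"
    saved = os.environ.pop(key, None)
    try:
        early = load_fake_settings()  # loaded before the variable is set
        os.environ[key] = "True"
        late = load_fake_settings()   # loaded after the variable is set
        return early.STATSD_ON, late.STATSD_ON
    finally:
        # Restore whatever the process had before the demo ran.
        if saved is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = saved
```

The early load keeps the default it saw at load time, and setting the variable later does not change it, which mirrors the silent metrics failure described above.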

To solve this issue, we need to hot-swap the os.environ object after we build our environment variables here. The catch, however, is that this needs to be done before we import any Airflow module, which requires some substantial refactoring. For now I am avoiding that, as it is risky to make substantial changes to the code base right before the launch. However, we should still work on this soon after launch, with proper testing.
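One possible way to do the swap is sketched below, assuming the built environment is a complete mapping that the process should end up with. It mutates `os.environ` in place instead of rebinding the name: changes made through the `os.environ` mapping are also applied to the real process environment, whereas assigning a plain dict to `os.environ` would not be. The function name `hot_swap_environ` is hypothetical.

```python
import os


def hot_swap_environ(new_env: dict[str, str]) -> None:
    """Make os.environ match new_env exactly, mutating it in place.

    Assumption: new_env is the full desired environment, so variables
    missing from it are removed from the current process.
    """
    # Drop variables that are not part of the target environment.
    for key in list(os.environ):
        if key not in new_env:
            del os.environ[key]
    # Add or overwrite everything else.
    os.environ.update(new_env)
```

If the built environment only contains overrides, a plain `os.environ.update(overrides)` would be enough and avoids removing anything.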

Acceptance Criteria

  • Refactor the entrypoint.py code such that it doesn't import any airflow module at the top level, directly or indirectly, before we have updated the environment variables.
  • The main() method of the entrypoint should then do the following:
    • First thing, ensure that no Airflow module has been imported by checking sys.modules.
    • Update the environment variables before importing any Airflow module.
    • Set up logging. This has to come after updating the environment variables, as our logging relies on Airflow's, meaning that it will inevitably import Airflow.
    • Execute the startup script.
    • Update the environment variables (again) if the customer has exported any environment variables.
    • Execute the command (scheduler, worker, etc.).
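The ordering in the list above could be sketched roughly as follows. All four callables passed to `main()` are hypothetical placeholders for the real entrypoint steps, and the guard simply scans `sys.modules` for `airflow` or any `airflow.*` submodule. This is an outline under those assumptions, not the intended implementation.

```python
import os
import sys
from typing import Callable, Mapping


def find_imported_airflow_modules() -> list[str]:
    """Return the names of Airflow modules already present in sys.modules."""
    return sorted(
        name
        for name in sys.modules
        if name == "airflow" or name.startswith("airflow.")
    )


def assert_airflow_not_imported() -> None:
    """Fail fast if anything imported Airflow before the env was updated."""
    loaded = find_imported_airflow_modules()
    if loaded:
        raise RuntimeError(f"Airflow imported too early: {loaded[:5]}")


def main(
    build_env: Callable[[], Mapping[str, str]],           # hypothetical
    setup_logging: Callable[[], None],                    # hypothetical
    run_startup_script: Callable[[], Mapping[str, str]],  # hypothetical
    run_command: Callable[[], None],                      # hypothetical
) -> None:
    # 1. Nothing from Airflow may be loaded yet.
    assert_airflow_not_imported()
    # 2. Apply our configuration to this process.
    os.environ.update(build_env())
    # 3. Logging may import Airflow, so it comes after step 2.
    setup_logging()
    # 4. Run the startup script and collect anything it exported.
    customer_env = run_startup_script()
    # 5. Apply the customer's variables on top.
    os.environ.update(customer_env)
    # 6. Run the requested command (scheduler, worker, etc.).
    run_command()
```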

Additional Info

Things to keep in mind:

  • We need to update our logging config to make sure it doesn't import any Airflow module. This, unfortunately, means that we will have to create our own copy of the DEFAULT_LOGGING_CONFIG dictionary and maintain it across Airflow versions. We should therefore find a way to automate this, as otherwise the copy will become outdated.
  • Notice that our TaskLogHandler makes use of Airflow's CloudwatchTaskHandler. Hence, even if we remove all Airflow imports from our logging config module, they will still be triggered when we call logging.config.dictConfig, because Python will then import the cloudwatch_handlers.py module, which in turn imports Airflow modules. This means that we CANNOT set up logging (via the logging.config.dictConfig method) before we have set the required environment variables. Thus, the logging setup should be moved under the main() method (as mentioned above in the Acceptance Criteria).
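The second point can be demonstrated in isolation. In the sketch below, a throwaway module named `demo_cloudwatch_handlers` (a hypothetical stand-in for cloudwatch_handlers.py) is written to a temporary directory, and the logging config refers to its handler class only by dotted string. The module is not imported when the config dictionary is defined, only when `logging.config.dictConfig` resolves the class name.

```python
import logging
import logging.config
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for the module that defines the task log handler.
HANDLER_MODULE = "demo_cloudwatch_handlers"
HANDLER_SOURCE = '''
import logging

class DemoTaskLogHandler(logging.Handler):
    def emit(self, record):
        pass
'''


def demo() -> tuple[bool, bool]:
    """Return whether the handler module was loaded before/after dictConfig."""
    config = {
        "version": 1,
        "disable_existing_loggers": False,
        # The handler is referenced by string, so nothing is imported yet.
        "handlers": {"task": {"class": f"{HANDLER_MODULE}.DemoTaskLogHandler"}},
        "loggers": {"demo.task": {"handlers": ["task"], "level": "INFO"}},
    }
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, f"{HANDLER_MODULE}.py").write_text(HANDLER_SOURCE)
        sys.path.insert(0, tmp)
        try:
            before = HANDLER_MODULE in sys.modules
            # dictConfig resolves the dotted name, importing the module.
            logging.config.dictConfig(config)
            after = HANDLER_MODULE in sys.modules
        finally:
            sys.path.remove(tmp)
    return before, after
```

Because the import happens inside `dictConfig`, calling it from `main()` after the environment update is what keeps Airflow modules from loading with the wrong configuration.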
@rafidka rafidka changed the title Applies the necessary Airflow configuration on the top process itself (entrypoint.py) instead of just child processes. Apply the necessary Airflow configuration on the top process itself (entrypoint.py) instead of just child processes. Jun 27, 2024