To configure Airflow, we use environment variables rather than editing the airflow.cfg file directly, as this makes it easier to track the changes we make intentionally while otherwise leaving Airflow's defaults in place. Those environment variables are defined by entrypoint.py and passed down to every Airflow process at the time we create it.
This has the advantage of being clean and explicit, and it also keeps us in full control of what we pass to every sub-process, as opposed to the previous (internal) images, where we defined and exported environment variables in the entrypoint.sh file.
One downside of this approach, however, is that the entrypoint.py process itself doesn't have the required Airflow configuration, meaning that any airflow modules we import there will have the wrong configuration. One example where this might cause an issue is if we want to report metrics using StatsD from entrypoint.py (or one of its sub-modules). If we simply import Airflow's Stats object, it will have the wrong configuration and won't report anything (see #100).
To solve this issue, we need to hot-swap the os.environ object after we build our environment variables here. The catch, however, is that this needs to be done before we import any Airflow module, which requires some substantial refactoring. For now I am avoiding that, as it is risky to make substantial changes to the code base right before the launch. However, we should still work on this soon after launch, with proper testing.
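To illustrate why the ordering matters, here is a minimal sketch that does not use Airflow at all. It simulates a module that reads its configuration from os.environ once, at import time. The module name fake_airflow_stats and the helper functions are hypothetical, and AIRFLOW__METRICS__STATSD_ON is used only as an example variable name.

```python
import os
import types

ENV_VAR = "AIRFLOW__METRICS__STATSD_ON"  # example variable, for illustration only

# Hypothetical stand-in for a module that reads its configuration at import time.
FAKE_MODULE_SOURCE = """
import os
STATSD_ON = os.environ.get("AIRFLOW__METRICS__STATSD_ON", "False")
"""


def import_fake_module() -> types.ModuleType:
    """Simulate a fresh import by executing the module body once."""
    module = types.ModuleType("fake_airflow_stats")
    exec(FAKE_MODULE_SOURCE, module.__dict__)
    return module


def demo() -> tuple:
    """Return the value seen by an early import and by a late import."""
    original = os.environ.pop(ENV_VAR, None)
    try:
        # Imported before the environment is updated: captures the default.
        too_early = import_fake_module()
        os.environ[ENV_VAR] = "True"
        # Imported after the environment is updated: captures the new value.
        in_order = import_fake_module()
        # The early module keeps its stale value even though os.environ changed.
        return too_early.STATSD_ON, in_order.STATSD_ON
    finally:
        # Restore the caller's environment.
        os.environ.pop(ENV_VAR, None)
        if original is not None:
            os.environ[ENV_VAR] = original
```

The module imported too early keeps the default it captured, which is the same kind of staleness described for the Stats object above.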
Acceptance Criteria
Refactor the entrypoint.py code such that it doesn't import any airflow module at the top level, directly or indirectly, before we have updated the environment variables.
The main() method of the entrypoint should then do the following:
1. First, ensure that no Airflow module has been imported by checking sys.modules.
2. Update the environment variables before importing any Airflow module.
3. Set up logging. This has to come after updating the environment variables, as our logging relies on Airflow's, meaning that it will inevitably import Airflow.
4. Execute the startup script.
5. Update the environment variables (again) if the customer has exported some environment variables.
6. Execute the command (scheduler, worker, etc.).
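The ordering above could be sketched roughly as follows. This is only an outline under stated assumptions: the four callables passed to main() are hypothetical placeholders for the real environment-building, logging, startup-script, and command-execution code, and the actual entrypoint.py will look different.

```python
import os
import sys
from typing import Callable, List, Mapping, Optional


def find_airflow_modules(modules: Optional[Mapping[str, object]] = None) -> List[str]:
    """Return the names of already-imported airflow modules, if any."""
    modules = sys.modules if modules is None else modules
    return sorted(
        name for name in modules
        if name == "airflow" or name.startswith("airflow.")
    )


def main(
    build_environ: Callable[[], Mapping[str, str]],       # hypothetical
    setup_logging: Callable[[], None],                    # hypothetical
    run_startup_script: Callable[[], Mapping[str, str]],  # hypothetical
    run_command: Callable[[], None],                      # hypothetical
) -> None:
    # Step 1: fail fast if Airflow was imported before configuration.
    loaded = find_airflow_modules()
    if loaded:
        raise RuntimeError(f"Airflow imported too early: {loaded}")

    # Step 2: apply the environment before any Airflow import happens.
    os.environ.update(build_environ())

    # Step 3: logging may import Airflow, so it comes after step 2.
    setup_logging()

    # Steps 4 and 5: run the startup script, then apply whatever it exported.
    os.environ.update(run_startup_script())

    # Step 6: run the requested command (scheduler, worker, etc.).
    run_command()
```

Passing the steps in as callables keeps the sketch free of top-level imports that might pull in Airflow. In the real code, the same effect would come from importing those modules inside main() rather than at module level.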
Additional Info
Things to keep in mind:
We need to update our logging config to make sure it doesn't import any Airflow module. Unfortunately, this means that we will have to create our own copy of the DEFAULT_LOGGING_CONFIG dictionary and maintain it across Airflow versions. We should therefore find a way to automate this, as otherwise the copy will become outdated.
Notice that our TaskLogHandler makes use of Airflow's CloudwatchTaskHandler. Hence, even if we remove all Airflow imports from our logging config module, they will still be imported when we call logging.config.dictConfig, as Python will then import the cloudwatch_handlers.py module, which in turn imports Airflow modules. This means that we CANNOT set up logging (via the logging.config.dictConfig method) before we have set the required environment variables. Thus, the logging setup should be moved under the main() method (as mentioned above in the Acceptance Criteria).
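The import-on-dictConfig behaviour can be reproduced without Airflow. The sketch below writes a throwaway handler module to a temporary directory and references it by dotted name in the logging config. The module name demo_cloudwatch_handlers and the class DemoTaskLogHandler are hypothetical stand-ins for cloudwatch_handlers.py and TaskLogHandler.

```python
import logging
import logging.config
import sys
import tempfile
import textwrap
from pathlib import Path

HANDLER_MODULE = "demo_cloudwatch_handlers"  # hypothetical stand-in module

LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        # Referenced by dotted string, so nothing is imported yet.
        "task": {"class": f"{HANDLER_MODULE}.DemoTaskLogHandler"},
    },
    "loggers": {"demo.task": {"handlers": ["task"], "level": "INFO"}},
}


def write_handler_module(directory: str) -> None:
    """Create a tiny handler module on disk for the demonstration."""
    source = textwrap.dedent(
        """
        import logging

        class DemoTaskLogHandler(logging.StreamHandler):
            pass
        """
    )
    Path(directory, f"{HANDLER_MODULE}.py").write_text(source)


def setup_logging() -> None:
    """Apply the config; this is the point where handler modules get imported."""
    logging.config.dictConfig(LOGGING_CONFIG)


def demo() -> tuple:
    """Return whether the handler module was loaded before and after dictConfig."""
    with tempfile.TemporaryDirectory() as tmp:
        write_handler_module(tmp)
        sys.path.insert(0, tmp)
        try:
            before = HANDLER_MODULE in sys.modules
            setup_logging()
            after = HANDLER_MODULE in sys.modules
        finally:
            sys.path.remove(tmp)
    return before, after
```

The handler module only shows up in sys.modules once setup_logging() has run. That is why calling dictConfig from main(), after the environment has been updated, keeps the Airflow imports behind the configuration step.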
rafidka changed the title from "Applies the necessary Airflow configuration on the top process itself (entrypoint.py) instead of just child processes." to "Apply the necessary Airflow configuration on the top process itself (entrypoint.py) instead of just child processes." on Jun 27, 2024.