Support multiple projects' dbt docs #862

Open
dwreeves opened this issue Feb 26, 2024 · 3 comments
Labels
  • area:config: Related to configuration, like YAML files, environment variables, or executer configuration
  • dbt:docs: Primarily related to dbt docs command or functionality
  • execution:local: Related to Local execution environment
  • parsing:custom: Related to custom parsing, like custom DAG parsing, custom DBT parsing, etc
  • triage-needed: Items need to be reviewed / assigned to milestone
Milestone

Comments

@dwreeves
Collaborator

Discussion here: #737 (comment)

Better to get this done now before 1.4 ships than after, I think.

tl;dr: My main worry is that the API for supporting multiple projects' docs could get very convoluted. The ini file format and environment variables do not natively support arrays of objects, so basically any solution ends up either ugly or complicated (e.g. reading from a YAML file).
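To make the limitation concrete, here is a minimal sketch using Python's standard configparser. It assumes a [cosmos] section with a dbt_docs_dir key; the dbt_docs_projects key and its JSON shape are hypothetical and only illustrate one possible workaround, not an agreed design.

```python
import configparser
import json

# Sample airflow.cfg-style content. `dbt_docs_projects` is a hypothetical key
# used only to illustrate packing an array of objects into a single string.
CFG = """
[cosmos]
dbt_docs_dir = /docs/project_a, s3://bucket/project_b
dbt_docs_projects = [{"name": "a", "dir": "/docs/project_a"}, {"name": "b", "dir": "s3://bucket/project_b"}]
"""

parser = configparser.ConfigParser()
parser.read_string(CFG)

# Every ini value comes back as one flat string, never as a list.
flat = parser.get("cosmos", "dbt_docs_dir")

# Workaround 1: split a comma separated string by hand.
as_list = [part.strip() for part in flat.split(",")]

# Workaround 2: store JSON inside the string and decode it.
as_objects = json.loads(parser.get("cosmos", "dbt_docs_projects"))

print(type(flat).__name__, as_list, as_objects[1]["name"])
```

Both workarounds function, but each pushes parsing and validation onto the plugin, which is the convolution described above.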

I'm leaving this as an open ended issue to discuss solutions.


dosubot bot commented Feb 26, 2024

🤖

Hey there @dwreeves! Long time no chat on GitHub issues, huh? 😄

To support multiple projects' dbt docs in the Cosmos repository without complicating the API, you could consider creating multiple instances of the ProjectConfig class, each representing a different dbt project.

The ProjectConfig class in the Cosmos repository is designed to handle the configuration of a single dbt project. It includes parameters such as dbt_project_path, models_relative_path, seeds_relative_path, snapshots_relative_path, manifest_path, project_name, env_vars, dbt_vars, and partial_parse.

Each instance of ProjectConfig can be associated with a different dbt project by specifying a different dbt_project_path (the path to the dbt project directory) and project_name (the name of the project).

Here is an example of how you can create multiple ProjectConfig instances:

from cosmos import ProjectConfig

# One ProjectConfig per dbt project; the paths and values are placeholders.
project_config1 = ProjectConfig(
    dbt_project_path="/path/to/dbt/project1",
    project_name="project1",
    env_vars={"DBT_ENV_VAR1": "value1"},
    dbt_vars={"dbt_var1": "value1"},
)

project_config2 = ProjectConfig(
    dbt_project_path="/path/to/dbt/project2",
    project_name="project2",
    env_vars={"DBT_ENV_VAR2": "value2"},
    dbt_vars={"dbt_var2": "value2"},
)

In this example, project_config1 and project_config2 represent two different dbt projects. You can create as many ProjectConfig instances as you need, each representing a different dbt project.

This approach allows you to support multiple dbt projects without complicating the API or dealing with the limitations of ini file format and environment variables for configuration. Each ProjectConfig instance encapsulates all the configuration details of a dbt project, making it easy to manage multiple projects.

For more details, you can refer to the ProjectConfig class in the cosmos/config.py file.


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


@dosubot dosubot bot added area:config Related to configuration, like YAML files, environment variables, or executer configuration dbt:docs Primarily related to dbt docs command or functionality execution:local Related to Local execution environment parsing:custom Related to custom parsing, like custom DAG parsing, custom DBT parsing, etc labels Feb 26, 2024
@dwreeves
Collaborator Author

So one implementation that is fully compatible with the existing API would be to do something like this:

  • The user can provide a comma-separated list of paths.
  • If they do, build a list of the dbt projects and use it as the landing page for the docs.
  • If they don't, render the docs normally.
  • If there is one connection_id or none, apply it to every item in the list.
  • If there are multiple comma-separated connections, zip() them with the paths.
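The rules in the list above could be sketched roughly as follows. This is only an illustration, not the plugin's implementation: the function names are hypothetical, and it assumes paths never contain commas.

```python
from itertools import repeat
from typing import Optional


def split_csv(raw: Optional[str]) -> list[str]:
    """Split a comma separated config value, dropping blanks and whitespace."""
    if not raw:
        return []
    return [item.strip() for item in raw.split(",") if item.strip()]


def is_multi_project(paths_raw: Optional[str]) -> bool:
    """True when more than one docs path is configured (show a landing page)."""
    return len(split_csv(paths_raw)) > 1


def pair_docs_with_connections(
    paths_raw: Optional[str], conn_ids_raw: Optional[str] = None
) -> list[tuple[str, Optional[str]]]:
    """Return (docs_path, conn_id) pairs for each configured project."""
    paths = split_csv(paths_raw)
    conn_ids = split_csv(conn_ids_raw)

    if len(conn_ids) <= 1:
        # Zero or one connection: reuse it (or None) for every path.
        shared = conn_ids[0] if conn_ids else None
        return list(zip(paths, repeat(shared)))

    # Multiple connections: they must line up one-to-one with the paths,
    # otherwise zip() would silently drop the extras.
    if len(conn_ids) != len(paths):
        raise ValueError(
            f"Got {len(paths)} docs paths but {len(conn_ids)} connection ids"
        )
    return list(zip(paths, conn_ids))
```

A single path with no connection yields one pair with a None connection, so the existing single-project behavior falls out of the same code path.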

The main challenge here is getting the names of the dbt projects. This requires downloading each manifest JSON and then reading manifest["metadata"]["project_name"]. It could be prudent, albeit a minor over-optimization, to use flask-caching (letting users set the cache timeout to 0 if they don't want caching), because this does end up downloading a lot of files. (Note: flask-caching is already a dependency of Airflow.) Another option is to allow users to pass in their own labels.
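A rough sketch of the name lookup with a time-based cache is below. To stay self-contained it uses a small hand-rolled cache in place of flask-caching, and the `fetch` callable (which returns the manifest text for a path) is a hypothetical stand-in for whatever downloads the file.

```python
import json
import time
from typing import Callable


def project_name_from_manifest(manifest_text: str) -> str:
    """Read the project name out of a manifest.json document."""
    manifest = json.loads(manifest_text)
    return manifest["metadata"]["project_name"]


class TTLNameCache:
    """Cache project names per docs path for `ttl_seconds`; 0 disables caching."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 300.0):
        self._fetch = fetch  # hypothetical downloader: path -> manifest text
        self._ttl = ttl_seconds
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, path: str) -> str:
        now = time.monotonic()
        cached = self._entries.get(path)
        if self._ttl > 0 and cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Cache miss, expired entry, or caching disabled: download again.
        name = project_name_from_manifest(self._fetch(path))
        self._entries[path] = (now, name)
        return name
```

With a positive TTL, repeated page loads within the window reuse the stored name and skip the download; with a TTL of 0, every call fetches the manifest again.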

If the project names come from manifest.metadata.project_name, you will probably want some exception handling or timeout logic in case something goes wrong, e.g. the S3 docs aren't loading but the local ones are. I dunno. Maybe that is too complicated.
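One possible shape for that error handling is sketched below. The `fetch` callable and the `fallback` label are hypothetical, and any timeout is assumed to be enforced inside `fetch` itself (surfacing as a TimeoutError or other OSError).

```python
import json
from typing import Callable, Optional


def resolve_label(
    path: str,
    fetch: Callable[[str], str],
    fallback: Optional[str] = None,
) -> str:
    """Return the dbt project name for `path`, or a fallback label on failure."""
    try:
        manifest = json.loads(fetch(path))
        return str(manifest["metadata"]["project_name"])
    except (OSError, ValueError, KeyError, TypeError):
        # OSError covers I/O failures and timeouts, ValueError covers invalid
        # JSON, KeyError/TypeError cover a manifest with an unexpected shape.
        return fallback or path
```

This way one unreachable or corrupt manifest degrades to showing the raw path (or a user-supplied label) without taking the other projects' links down with it.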

The issue with creating a new menu item for dbt docs is twofold. First, it's not appropriate for most users, who have just one project. Second, dbt project names cannot realistically be automated in this context, because loading the manifest.json can block the Airflow UI from loading on normal, non-dbt docs pages, and it also wastes S3 reads. Or worse: imagine a scenario where your manifest.json is messed up and your entire Airflow UI crashes because the plugin is attempting to read a corrupt or nonexistent JSON, but you also need access to the Airflow UI to diagnose the problem... not good. You cannot automate the names of the UI elements from the manifest.jsons if you are taking an app-builder menu item approach.

I'm not a happy camper setting any of this up, because airflow.cfg (.ini file format) isn't well suited for this (.ini has no native array type, which is the proper data model here), and it feels like there is no way to avoid one of the two setups (solo docs or multi docs) getting the short end of the stick. Most users have just one dbt project, so it makes sense to have an abstraction that prioritizes single-project deploys (which the current API does well) and to let people with complicated setups deal with a slightly more complicated API, which is fair and even congruent with what they're already doing. I think this is the least intrusive way to support multiple projects while keeping the user experience friendly for the majority of users with just one project. (Also, some users split into multiple projects because large projects execute slowly, which is also being addressed in 1.4.)

@tatiana tatiana added this to the 1.6.0 milestone May 17, 2024
@tatiana
Collaborator

tatiana commented Jul 5, 2024

@dwreeves, we aim to release Cosmos 1.6 by the end of the month. Do you think you may have the bandwidth to work on this before? If not, what do you think about moving it to the 1.7 release or after?

@tatiana tatiana added the triage-needed Items need to be reviewed / assigned to milestone label Jul 5, 2024
@tatiana tatiana modified the milestones: Cosmos 1.6.0, Cosmos 1.7.0 Jul 30, 2024
@tatiana tatiana modified the milestones: Cosmos 1.7.0, Triage Sep 20, 2024

2 participants