Feedback on our Documentation #23031

PedramNavid · 2024-07-16T15:58:12Z

PedramNavid
Jul 16, 2024
Maintainer

Hi all!

This quarter, our team is focusing on improving our documentation. We know that Dagster can sometimes be complex to understand, and we're hoping to improve the overall experience from your first Hello World all the way to guides, detailed explanations, and API docs.

I'd love to hear what you love and hate about our docs.

Please use the template below and provide as much detail as you can:

Role:
Experience with Dagster (Low/Med/High):
Your feedback (be as specific as possible, e.g., what were you trying to do, where did you end up going, were you successful):

chrishiste · 2024-07-16T16:04:21Z

chrishiste
Jul 16, 2024

Role: Data Platform Engineer
Experience with Dagster (Low/Med/High): High

Most example snippets are not runnable as is: they are missing imports or variable definitions. They require us to figure out the rest of the code, which is not always straightforward. I would like to be able to copy and paste.

Love the ability to time-travel to old versions. Love your detailed changelogs on Github.

0 replies

ClaytonSmith · 2024-07-16T16:13:36Z

ClaytonSmith
Jul 16, 2024

Role: Data Platform Engineer, ML dev
Experience with Dagster (Low/Med/High): Med

Dagster.yaml, Dagster.yaml, Dagster.yaml.....
There isn't enough coverage of the parameters this file uses. Most code examples are useless without some config set in Dagster.yaml. Real example from yesterday: https://docs.dagster.io/guides/limiting-concurrency-in-data-pipelines#limiting-opasset-concurrency-across-runs. There isn't any example of how to configure these settings in Dagster.yaml.

Side note: CLI-based config should be a last resort, not the first example in documentation. I outright refuse to use any config that can't be versioned in code or at least in some .env file. (this includes the cloud)

2 replies

nickvazz Jul 16, 2024

to add: would love to see a Dagster.yaml with every option enumerated

drewagentsync Jul 16, 2024

This bug that prevents me from using Env Vars to populate dagster.yaml if the value I'm inputting isn't a string needs some attention... The whole dagster.yaml config method needs a lot more attention in the docs.

https://github.com/dagster-io/dagster/issues/18729

Many spots in that file are int or bool but I'm unable to populate them without using a tool like envsubst or doppler secrets substitute to hard code the file during the container build. It'd be so much easier if this just worked as advertised. The docs make no mention that only strings are supported.

I noticed one of the errors enumerates all the options for the file. I'd have to look and see if I saved it somewhere.
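
The build-time substitution workaround described above (envsubst or similar) can also be sketched in plain Python. This is a minimal illustration, not a Dagster feature: the template contents, the `MAX_CONCURRENT_RUNS` variable name, and the `render_dagster_yaml` helper are all assumptions made for the example. The point is that the value lands in the file unquoted, so the YAML parser reads it as an int.

```python
import os
from string import Template

# Hypothetical template; ${MAX_CONCURRENT_RUNS} is a placeholder name chosen for this sketch.
DAGSTER_YAML_TEMPLATE = """\
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: ${MAX_CONCURRENT_RUNS}
"""


def render_dagster_yaml(template: str, env=None) -> str:
    """Replace ${VAR} placeholders with values from env (defaults to os.environ)."""
    values = os.environ if env is None else env
    # substitute() raises KeyError when a variable is missing, so a bad build fails early.
    return Template(template).substitute(values)
```

During the container build, the rendered text would be written to dagster.yaml, for example `open("dagster.yaml", "w").write(render_dagster_yaml(DAGSTER_YAML_TEMPLATE))`.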

askvinni · 2024-07-16T16:23:19Z

askvinni
Jul 16, 2024

Role: Lead Data Architect
Experience with Dagster (Low/Med/High): High

I haven't had too many people ping me asking for how to build basic assets (incl resources and all of that), so my anecdotal impression is that the high-level explanation of concepts to get up and running seems to be fairly straightforward for most developers (even those without data experience, which has been the majority of people slotting into the system), so props on that side.

On the other hand, I myself don't frequent the documentation quite as often anymore since I've been working with the system for so long, but one thing that I've been noticing more and more is that there's a lot of "hidden" functionality that hasn't been in the docs and that I had to dig through the codebase to find. Specific system tags (e.g. storage_kind, or compute_kind when not set through the @asset decorator) and metadata (e.g. dagster/row_count) are two examples that I recently started using, but I'm sure there are more that I missed.

Overall it seems like there's a lot of functionality added more or less recently (Definitions.merge() comes to mind as well) that fell through the cracks, and I only found out about a lot of it since I'm following the open PRs and discussions pretty closely.

3 replies

nickvazz Jul 16, 2024

to add: would love to see all of the metadata you mention with all the current dagster/SOMETHING tags

will-holley Jul 16, 2024

Related: recommended best-practices are changing often, for example assets no longer returning an Output / setting an IO manager. This has necessitated refreshing my understanding / practices every time I onboard someone new into Dagster, which isn't ideal and has led to implementation-divergence friction (typically because an engineer defaults to looking at the latest documentation / examples rather than for our pinned version and I have to explain why we're doing something the "old way").

erinkcochran87 Jul 17, 2024

to add: would love to see all of the metadata you mention with all the current dagster/SOMETHING tags

Then we've got a surprise for you! 🎁 #23045

gnilrets · 2024-07-16T16:26:15Z

gnilrets
Jul 16, 2024

Role: Sr Staff Data Engineer
Experience with Dagster (Low/Med/High): Med

Well, either I just understand Dagster better, or the docs have gotten MUCH better in the last year. When I started you were transitioning from ops/graphs to assets, and I struggled a lot with understanding WHY you'd want to use certain abstractions or methods (e.g., what problem does it solve, with concrete examples). I still get that feeling a bit when I read through new concepts, so maybe that's still an issue. I haven't actually gone through the course, but Dagster University helped a colleague get up to speed and start contributing to our repo very quickly.

One thing I often run into is some random missing piece of documentation - e.g., the link from a class to the API docs/code is broken, or a class doc is missing a method or return value. Unfortunately it doesn't usually feel worthwhile to go through the process of submitting a change request in GitHub. It might be nice if there was some quick way to provide in-line feedback for the docs.

3 replies

gnilrets Jul 16, 2024

Add: I started following the Recommended Project Structure, and discovered that I really don't like it. It's very unnatural to have an asset-driven pipeline where the assets are all grouped into one place, but the related jobs, schedules, and sensors are somewhere else mixed in with other pipelines. I'm starting a project soon to reorganize our project structure around the business focus of the pipelines rather than the dagster abstractions.

kristianandre Jul 17, 2024

The dagster-open-platform repo was recently restructured and no longer groups all assets into one folder. I think the new structure makes a lot more sense. They have grouped schedules, resources, assets and definitions based on asset type, but one could easily do something similar based on business focus instead. It's very helpful that one can now merge Definitions objects. Hopefully the docs can recommend the new project structure as an alternative at least.

PedramNavid Jul 17, 2024
Maintainer Author

@kristianandre You're spot on. We are hoping to update the recommended structure once we've had a little bit of time to validate that we prefer the updated methodology with Definitions.merge.

ReneTC · 2024-07-16T16:37:35Z

ReneTC
Jul 16, 2024

Role:
freelance data scientist

Experience with Dagster (Low/Med/High):
low

Your feedback:
I think the examples are great, but they contain a lot of unimportant extra work. I saw two guides where you pull the GitHub stars into a Postgres database.
To get this example to work you need to not only create a database, set an env password and so on, you also need to fetch GitHub credentials.

Why not just show the example fetching something publicly available, and use SQLite or DuckDB? The code would be copy-paste and working without any extra work.

I ended up deviating from the example, I was successful but it took some extra

0 replies

zyd14 · 2024-07-16T16:40:26Z

zyd14
Jul 16, 2024

Role: Data Platform team lead
Experience: High
Some common themes / questions I see in the Dagster slack:

- difference between asset jobs and op jobs, when to use @job vs. define_asset_job
- difference between ad-hoc asset materialization and assets materialized through an asset job, when to use one over the other (I still don't really know the differences here personally, I just default to using asset jobs because random things seem to be missing from ad-hoc materializations, but I don't think the difference is documented anywhere)
  - how to apply tags for ad-hoc materializations, limitations of ad-hoc materializations for run launchers like ECS that use job run tags to set memory / cpu
- how to optimize for latency; would be helpful to point folks towards certain deployment patterns that can better support low-latency startup / initialization. Usually folks want to see how they can reduce op / job / asset initialization time. The possibility of a ThreadPoolExecutor gets brought up every couple weeks but appears not to be something anyone is interested in working on, so alternative suggestions to users who are looking for this kind of low-latency setup would be helpful
  - similarly, how to optimize for high concurrency, limits on concurrency of jobs / ops / assets
- ideal granularity of ops / assets
- dynamic graph generation, along with notes about the limitation that Dagster requires assets to be defined at code deployment time, so you can't dynamically add assets without reloading your code location
- mixing assets into op-jobs - what works and what doesn't
- rough scale that Dagster can handle for assets / jobs / partitions
- patterns for accessing asset values from outside of Dagster
- using config mappings to simplify repeated configurations
- using the DagsterInstance to access event logs and other Dagster metadata (a dedicated page of examples here would greatly help folks who want to achieve additional metadata monitoring, this is a very common ask)
- like @chrishiste said, most snippets aren't runnable and prioritize brevity over completeness. Often snippets will implicitly build on one another, with variables in one snippet being referenced in a different snippet. And they rarely provide imports. It may be helpful in situations where the snippets build on each other to provide the full code in one snippet at the start of a page, and then break it down into individual snippets with explanations (or provide the full snippet after the breakdowns).
- the WHY of resources and IO managers - why should users use them? (hint - dependency injection, separation of storage and transformation, both leading to better testability)
- folks often want to dynamically add / modify / remove assets from their deployment without redeploying. An explanation of why this can't be done would probably be useful

I'll add more as it comes to mind

2 replies

askvinni Jul 16, 2024

using the DagsterInstance to access event logs and other Dagster metadata (a dedicated page of examples here would greatly help folks who want to achieve additional metadata monitoring, this is a very common ask)

+1 here, I forgot to add this to my comment but it's something else we've done that I really only figured out through looking at the internals.

will-holley Jul 16, 2024

+1 for optimization on low latency, high-throughput jobs. I've found Dagster to be semi-effective for observing streaming-triggered jobs (via sensors) and would like to use it to orchestrate / observe / maintain web scraped-assets but node startup times & limited support for >1k job concurrency (in my experience) is a blocker.

drewagentsync · 2024-07-16T17:10:31Z

drewagentsync
Jul 16, 2024

role: Staff DevOps Engineer
experience with dagster: Low

I would appreciate some implementation examples for the infrastructure for Dagster+ Hybrid on ECS Fargate that goes more in depth about optimal/possible configurations. Go over some common, successful deployments of both the Self-Service and the Pro/Enterprise plan you see in the field and weigh some of the pros and cons of the approaches and explore benefits of the features you get with Pro/Enterprise that might make your implementation better. It's pretty hard to tell if the use-case I'm trying to serve can be handled with the self-service plan or if I need the pro/ent plan to handle some of our requirements.

What I'm trying to do is implement dagster in the usual 1 AWS Account per Development environment setup with github actions for cicd in our data engineer repos to test changes with branch deployments in lower environments and scheduled prod jobs in our prod environment.

example scenarios I'd like to see explored

As a self-service customer, how can I use my single deployment to serve multiple development environments and multiple data projects that have different resources they need to access? Is this something I can handle from a single Dagster ECS cluster with network access to all environments and a specific configuration in my Dagster GitHub workflows?
Should I use agent queues (A Pro/Ent/Plus feature) assigned to each data-service and environment and setup clusters in each environment to run the jobs (this seems right but we're still on self-service)?
What is the preferred way to set env/project specific environment variables? In the Dagster Console? In the cloud_deploy.yaml? Using a secrets injection tool like Doppler in the Docker Image entry point? I get they're all an option but what are the pros of using the Dagster specific methods?
Walk through the feature differences that would change how you implement the self-service option vs the pro/enterprise plan.

The docs currently present a lot of possible options but they'd benefit from more prescriptive advice for those with little experience with dagster trying to implement it in a fairly mature cloud environment that goes beyond deploying the cloudformation templates.

Some Devops/Infra/Platform Engineer focused docs would let me get things running and get my data engineers in and using the product to make the case for us to use it for more projects.

1 reply

zyd14 Jul 16, 2024

FYI you can isolate environments using code locations, which can be managed from a single ECS cluster. Agent queues are unnecessary for this. As far as I know there isn't really a difference in deploying for self-service vs. pro/enterprise, it's all the same infrastructure.

caelan-schneider · 2024-07-16T17:15:36Z

caelan-schneider
Jul 16, 2024

Role: SWE, Data
Experience with Dagster (Low/Med/High): High
Clarity on the different partition key fields inside OpExecutionContext:
partition_key
partition_keys
partition_key_range
partition_time_window
asset_partition_key_range
asset_partition_key_for_input
asset_partition_key_for_output
asset_partition_keys_for_input
asset_partition_keys_for_output
asset_partition_time_window_for_input
asset_partition_time_window_for_output

0 replies

josh-gree · 2024-07-16T18:36:56Z

josh-gree
Jul 16, 2024

Role: Software Engineer
Experience with Dagster (Low/Med/High): Med

I want to stop seeing all these AI generated answers in google search results please - I want the docs not some unverifiable spewing of a llm - thankz!

4 replies

erinkcochran87 Jul 16, 2024

Howdy! Dagster docs manager here. Do you have an example of this? I'm interested in seeing if it's something in the docs we can fix, or if this is something on Google's side.

caelan-schneider Jul 16, 2024

Hi Erin,

Not sure if this type of search result is what he's talking about, but the third result of this google search, for example, is a slack answer from Scout.

Link:
https://discuss.dagster.io/t/16656744/how-can-i-run-a-partitionned-job-on-mutiples-partitions-but-

erinkcochran87 Jul 16, 2024

That's helpful @caelan-schneider! Thanks a bunch.

PedramNavid Aug 27, 2024
Maintainer Author

Hey @josh-gree, @caelan-schneider this has been fixed : )

geoHeil · 2024-07-16T20:08:13Z

geoHeil
Jul 16, 2024

Role: Architecting our new future enterprise data platform for a telco

Experience with Dagster High

feedback

More documentation around complex examples for integrations such as dbt-loom

3 replies

erinkcochran87 Jul 16, 2024

Just to ensure I'm not assuming anything, could you elaborate more on what you think of as a 'complex' example? This obviously excludes things like foo bar and Hello world!, but I've heard similar feedback before and want to make sure we're approaching this correctly when the time comes.

geoHeil Jul 17, 2024

For this particular case I care about sample/documented instructions how to set up:

dbt loom
with dagster/dagster cloud
in a multi project setup with many dbt projects
where each dbt project refers to its unique data domain/dagster code location

We have such an internal setup now - but I imagine it could be worthwhile for Elementl but also the dagster community to see more documentation on how things like this work

erinkcochran87 Jul 17, 2024

Thanks for the additional detail! I definitely agree more 'real world' examples would be useful to add.

thesyntaxinator · 2024-07-16T22:43:55Z

thesyntaxinator
Jul 16, 2024

Role: Senior Software Engineer

Experience with Dagster: Low

Feedback:
Trying to implement an E2E local ML model + post processing validation pipeline. Dagster seems like a good choice as there are multiple interchangeable steps in the overall process (eg download ground truth from DB, fetch inferences/other post processing inputs from S3, submit a cloud inference job, run postprocessing locally, compare post-processed results to ground truth and generate validation report locally). I'm currently trying to figure out the best way to use Dagster assets and ops to implement this. For instance, querying from the DB makes sense as an asset and running the postprocessing code locally seems to make sense as an op (can accept different inputs, isn't stateful) but I need to read up a bit more on ops to confirm this.

I like that the Dagster university tutorial goes over how to use assets but I would have liked it to also go over ops, when to use assets vs ops, and how to tie things together when you have a multistep pipeline with multiple interchangeable parts (eg I could download pre-run inferences from S3 or submit a job to rerun a set of fresh inferences on a given model version, I could use validation code to compare post processed output to ground truth or provide a baseline set of post processed output to compare against both baseline and ground truth). In this case I'm trying to use Dagster to build a multistep customizable pipeline where I can either run all steps or use pre-generated outputs for some steps.

Dagster seems to be quite powerful in terms of data lineage and asset materialization but I'm still trying to figure out the best way to apply these concepts to my use case and I think the getting started tutorials could go into more detail in these areas or link me to other tutorials which do. I feel like the detailed documentation is good but I'm looking for a higher-level intro that covers all the main concepts and how they fit together, as well as how to build a multistep, customizable, flexible pipeline like the above.

0 replies

aleewen · 2024-07-16T22:46:15Z

aleewen
Jul 16, 2024

Role: Jr Quant & Data Engineer
Experience with Dagster (Low/Med/High): Low
Your feedback:

I've been trying to learn this for the last two months and am struggling... the documentation is really poor and incomplete, so I'm glad that it's a priority for this quarter!
I've completed the Dagster University course and found it to be incomplete; the abstract examples with baking were not helpful. I spent days just trying to figure out how to string together assets and ops properly. The exception handling is not clear to pinpoint where I'm going wrong either.
Most examples are incomplete, unrealistic, and over-simplistic. People will learn best if you provide complete examples because it allows us to learn the patterns and syntax.

4 replies

PedramNavid Jul 17, 2024
Maintainer Author

Hi @aleewen, have you taken a look at https://github.com/dagster-io/dagster-open-platform? This is our actual production pipeline, with only a few sensitive things removed. It doesn't solve the documentation problem, but maybe can help bridge the gap until we get this fixed.

Thanks again for your feedback!

aleewen Jul 17, 2024

I haven't looked at this, but certainly will now! A fully-constructed example should be very helpful for me. Thank you!

sryza Jul 17, 2024

I spent days just trying to figure out how to string together assets and ops properly. The exception handling is not clear to pinpoint where I'm going wrong either.

@aleewen if you'd be open to sharing I'd be curious to hear more about the difficulties you faced with this. What did you need assets for, what did you need ops for, and why did you need to combine them?

aleewen Jul 17, 2024

@sryza It took me a while to understand even just this concept, and I'm still not sure if it's 100% correct. But this is what I understand from trial and error:

Assets can be connected together in a pipeline by writing the previous asset's function name as an argument. For example:

from dagster import asset

@asset
def first_asset() -> int:
	return 1

@asset
def final_asset(first_asset: int) -> None:
	with open('my_file.txt', 'w') as file:
		file.write(str(first_asset))

If the previous function does not have a return, dependencies can be defined in the deps argument in the @asset decorator. As a dummy example:

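from datetime import date
import sqlite3

from dagster import asset

@asset
def text_file() -> None:
    # The file name is the implicit contract between the two assets.
    with open(f'text_file_{date.today()}.txt', 'w') as file:
        file.write("Today's text!")

@asset(
    deps=['text_file']
)
def load_text_to_sql() -> None:
    with open(f'text_file_{date.today()}.txt', 'r') as file:
        text_to_load = file.read().strip()

    # Placeholder connection: a local SQLite file is assumed here; swap in your own database.
    with sqlite3.connect('my_database.db') as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS tblMyTextTable (my_text_col TEXT)")
        # Bind the value as a parameter rather than formatting it into the SQL string.
        conn.execute("INSERT INTO tblMyTextTable (my_text_col) VALUES (?)", (text_to_load,))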

This is simple, but was not clear to me. The asset decorator also has the ins argument, which I tried as a possibility when going in circles figuring out the correct way to connect assets together.

From what I understand, Ops are more 'adhoc', unofficial assets that provide helper functional utility. Among all of the confusion, the kicker was that I did not realize that a graph_asset cannot be asynchronous, but assets can be. My bugs went away when I re-coded my async graph_asset to be an async asset.

I also did not understand that with a partition pipeline, all assets in the pipeline must specify the partition, not just the asset(s) that uses the partition data from the context argument. Otherwise, Dagster will look for data in the wrong storage folders when running other assets.

Auric-Manteo · 2024-07-17T06:05:38Z

Auric-Manteo
Jul 17, 2024

Hi,
And thank you for your effort in creating great documentation and going beyond that to improve it further!

Role: Data Engineer
Experience: High
zyd14 gave a great rundown of the issues already.

I would emphasize code examples that can be run out of the box, which was a bit of a hindrance, especially in the beginning since there are some oddities that need to be considered that are not obvious - I remember spending unnecessary time on figuring out that the Definition needs to be stored in a variable for Dagster to pick up on it.
Also there are some limitations on tests that are not well documented, like it's really hard to test a single op within a job even with extensive mocking. It's possible to test only the op, but not as part of the entire job.
This is a bit more of a special case maybe, but it would be great to know more about how you can reuse an op. It seems to be quite hard to, for instance, change the tag of an existing op from a third party library. Tags are used to control concurrency, so that seems like it should be more straightforward.

1 reply

ClaytonSmith Jul 17, 2024

I would emphasize code examples that can be run out of the box

Def want to emphasize this should include full examples of dagster.yaml or similar configs.

medecau · 2024-07-17T11:16:47Z

medecau
Jul 17, 2024

Role: research scientist + whatever (it's a startup okay?)
Experience with Dagster: Low (three days)

i'm trying to bring dagster into the org because i'm frustrated with the status quo (run a script in a vm, maybe have cron entry to do that for you)

docs feedback

this is a bit more than docs, think summarised friction log

organization

please look into using the 'write the docs' framework

this is the format larger projects use in the python ecosystem, it's what people expect as quality documentation. see: django, pandas, keras

concepts

please explain concepts for new users, think users that have never used these kinds of tools. lots of folks using just files, http apis, maybe a file server, or an embedded database. explain it more than once too, with different approaches of course.

if i'm used to ftplib, requests, and maybe pathlib then what is an 'asset' and what is a 'resource'? where are my files going?
how do i represent this database connection?
okay, once the asset is 'materialized' what path am i supposed to share with my colleagues working on jupyter notebooks?

tutorials

there are like three different tutorials.
maybe they're the same, i have no idea. i picked up on one in the docs site, then i ran the dagster+ hybrid deployment command and there was another 'quickstart' folder in there. i felt lost.
have multiple tutorials, sure, but don't 'let' them mix.

examples

these are either incomplete or too complicated - consider the basic use and how it connects with other parts of the system, slowly introduce variations appropriate to each example.

show me how to grab some csv files and put them in a sqlite database. good because it keeps context manageable and the moving parts are understandable by new users.
what if the files are on an http server?
okay, but this server requires authentication, how do i do that safely?
oh, there's some request limit here, can we reduce concurrency?
the cloud is down. how do we do retries? exponential backoff?
alright, let's backtrack, the files are now on a sftp server and i can only use one connection, how do i download 32 large files per day efficiently? can i use partitions and backfill with that setup? how do i put that on a schedule? again, can we retry when it fails?
let's backtrack once more, we've been writing this to sqlite using duckDB, but we already have this sqlalchemy model backed by pydantic.
no, hold on, we have to put it on mongoDB now.

roleplay

have your people build projects from scratch. then improve documentation, api, and overall experience using their friction logs. get them to do this for every minor version but change the requirements each time. this should include anyone technical - devrel, engineering, devops, etc. the founder/ceo too.

test the docs

the thursday after a new minor version a pair of devrels live streams a fresh deployment, debugs any eventual hiccups, and consults the docs as they go.

astroturfing

... as documentation

get devrel folks to setup a personal deployment they run full time and where they can experiment with unique use cases. have them write blog posts and publish new integrations from their adventures. remove the training wheels by using the open source mode on a lightly supported cloud provider.

deploy dagster on raspberrypi/heroku/browser
how i deploy my blog using dagster
how to run your ci/cd on dagster
building assets from scrapy
put your assets on kafka
run a job 15 minutes after sunset in barcelona

invite weirdness
can you convince a devrel to live-stream speedruns? could that be a regular event that happens when a new minor version drops?

1 reply

PedramNavid Jul 17, 2024
Maintainer Author

@medecau this is really great feedback! thank you. we're hiring for devrel by the way, feel free to reach out if you're interested. :)

raiAmagi · 2024-07-17T12:52:33Z

raiAmagi
Jul 17, 2024

Role:Software Engineer
Experience with Dagster (Low/Med/High): Low

I don't know where to start learning so I read page by page from the top of the document tree and put it into practice.

However, the order of the documents is not organised and I often have to read subsequent documents to understand and implement them.

I would like to see a structure that takes into account the order of the documents and their relation to the different pages.

Also, as others have said, many of the sample programs cannot be run as they are, so I would like you to use samples that can be run.

1 reply

PedramNavid Jul 17, 2024
Maintainer Author

@raiAmagi well put, we will definitely take this into account.

svagier · 2024-07-19T13:14:14Z

svagier
Jul 19, 2024

Role: Data Engineer
Experience with Dagster (Low/Med/High): Med
Your feedback (be as specific as possible, e.g., what were you trying to do, where did you end up going, were you successful):

In the Dagster UI, what happens after clicking the "Re-execute" -> "From failure" button is extremely counter-intuitive and can lead to silent errors that are very hard to detect. It is also not covered in the docs. Details and example: https://dagster.slack.com/archives/C066HKS7EG1/p1718886752562149 . To achieve what I wanted I ended up writing a SQL query that gets all the ops that I should rerun, and then running the GraphQL LaunchPipelineReexecution mutation with these ops specified as stepKeys.
Hardly any useful info about GraphQL mutations and their parameters in the docs. The only useful info I found was actually on the /graphql endpoint in the "Docs" tab, as stated here: https://docs.dagster.io/concepts/webserver/graphql#exploring-the-graphql-schema-and-documentation. Why not include this in the documentation on docs.dagster.io?
Documentation lacking clear info and policy on: can you (and should you) call an op from inside of another op? https://dagster.slack.com/archives/C066HKS7EG1/p1721118864077859

2 replies

erinkcochran87 Jul 19, 2024

This is really helpful - thanks for being so specific and providing links to examples! I've added this to the list of things to improve in our project.

svagier Jul 25, 2024

@erinkcochran87 also just realized the docs are missing info that it is indeed possible to set tags for aliased ops: https://dagster.slack.com/archives/C066HKS7EG1/p1721917746053649

j-blackwell · 2024-07-19T16:49:34Z

j-blackwell
Jul 19, 2024

Role: Platform Engineer
Experience with Dagster: High
Your feedback: Dagster+Hybrid documentation is severely lacking and disjointed. There is not a full end-to-end example of how to deploy a code location and agent using the most basic deployment option (local docker). Overall, documentation for hybrid is lacking.

0 replies

ClaytonSmith · 2024-07-22T13:36:24Z

ClaytonSmith
Jul 22, 2024

This topic should get more documentation: #12251

I know I came in with the assumption that Dagster would have some mechanism to manage concurrency. It took waaayy too long to figure out that technically yes, it does but not in a way that is expected or useful.

0 replies

food-spotter · 2024-08-02T08:32:40Z

food-spotter
Aug 2, 2024

Role: Engineering Team Lead
Experience with Dagster: Med

The API documentation source code should support syntax highlighting. This makes readability much better. Example:

https://docs.dagster.io/_modules/dagster/_core/definitions/decorators/asset_decorator#asset

This becomes a blob of white text on a black background, it makes it hard to digest.

1 reply

PedramNavid Aug 27, 2024
Maintainer Author

This is high on our list of things to fix! We'll have more to show soon.

v1gnesh · 2024-08-02T10:35:47Z

v1gnesh
Aug 2, 2024

Role: Mainframer
Experience with Dagster: Low-Med
Feedback:

I don't know why, but I feel completely lost in the API reference pages. So much so that it's far easier to get snippets of it via helpful IDE pop-ups.
Code examples in the docs should be automatically validated as new releases come out.
There must be < 20 common patterns with native Dagster (i.e., without dbt, Snowflake, or other external components). Enumerate them all as examples so we can get going right away. Right now, I find it frustrating to switch between guides, concepts, and API reference, left on my own to patch them together.
This may be hard, but maybe try not to surface any info about in-progress features until they're ready in some capacity. Trying to make source/external assets work with partitions and asset checks has really irritated me. I can't even get the given asset check example to work in any Dagster 1.6 or 1.7.x release.

Looking forward to 1.8. I hope this is the release that shows external assets the same as dagster-materialized assets in the UI, along with the excellent partition & checks sections in the "pill/card".
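
The request for automatically validated code examples could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not how the Dagster docs are actually built: the names `extract_snippets`, `check_snippet`, `check_page`, and `SAMPLE_PAGE` are hypothetical, and it assumes snippets live in Markdown-style fenced blocks tagged `python`. Each snippet is executed in a fresh namespace, so a snippet with a missing import or undefined variable is reported instead of silently shipping.

```python
import re
from typing import Dict, List, Optional

# Build the fence marker programmatically so this example can itself
# live inside a fenced block.
TICKS = "`" * 3
FENCE = re.compile(TICKS + r"python\n(.*?)" + TICKS, re.DOTALL)

# Hypothetical docs page: the first snippet is self-contained,
# the second is missing its import and a variable definition.
SAMPLE_PAGE = "\n".join([
    "Intro text.",
    TICKS + "python",
    "import math",
    "radius = math.sqrt(4)",
    TICKS,
    "More text.",
    TICKS + "python",
    "df = read_html(url)",
    TICKS,
])


def extract_snippets(markdown: str) -> List[str]:
    """Return the body of every fenced python block, in page order."""
    return [match.group(1) for match in FENCE.finditer(markdown)]


def check_snippet(source: str) -> Optional[str]:
    """Run one snippet in a fresh namespace; return an error string or None."""
    try:
        exec(compile(source, "<doc-snippet>", "exec"), {"__name__": "__doc_snippet__"})
    except Exception as exc:  # any failure means the snippet is not copy-pasteable
        return f"{type(exc).__name__}: {exc}"
    return None


def check_page(markdown: str) -> Dict[int, str]:
    """Map snippet index to error message for every snippet that fails."""
    failures = {}
    for index, snippet in enumerate(extract_snippets(markdown)):
        error = check_snippet(snippet)
        if error is not None:
            failures[index] = error
    return failures
```

Calling `check_page(SAMPLE_PAGE)` flags only the second snippet. In a real pipeline this kind of check would presumably run in an isolated environment against each new release, since `exec` runs whatever the snippet contains.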

0 replies

gnilrets · 2024-08-08T16:15:31Z

gnilrets
Aug 8, 2024

Role: Sr Staff Data Engineer
Experience with Dagster (Low/Med/High): Med

The docs often fail to help us understand why to use a certain aspect of dagster. Here's one I just ran into today while I was trying to learn more about asset checks:

"Sometimes, it makes sense for a single function to materialize an asset and execute a check on it."

Can you elaborate, or provide examples of when using this method might be beneficial over other methods?

0 replies

charlesbmi · 2024-08-20T04:23:01Z

charlesbmi
Aug 20, 2024

Role: Software tech-lead
Experience with Dagster: Low
Your feedback:
I was trying to figure out when an @asset should return a value vs a MaterializeResult, and I didn't see much information on it.

For example, the starter code on dagster.io:

from dagster import asset
from pandas import DataFrame, read_html, get_dummies
from sklearn.linear_model import LinearRegression

@asset
def country_populations() -> DataFrame:
    # Scrape the first HTML table on the page and normalize its columns.
    df = read_html("https://tinyurl.com/mry64ebh")[0]
    df.columns = ["country", "pop2022", "pop2023", "change", "continent", "region"]
    # Convert strings like "−1.2%" (Unicode minus) to floats.
    df["change"] = df["change"].str.rstrip("%").str.replace("−", "-").astype("float")
    return df

@asset
def continent_change_model(country_populations: DataFrame) -> LinearRegression:
    # Fit population change against one-hot encoded continents.
    data = country_populations.dropna(subset=["change"])
    return LinearRegression().fit(get_dummies(data[["continent"]]), data["change"])

@asset
def continent_stats(country_populations: DataFrame, continent_change_model: LinearRegression) -> DataFrame:
    # Aggregate per continent and attach the fitted coefficients.
    result = country_populations.groupby("continent").sum()
    result["pop_change_factor"] = continent_change_model.coef_
    return result

but I was confused that most of the docs' @asset tutorials actually just write to a file and return None or MaterializeResult. Should a function return both a value and a MaterializeResult to attach metadata to it? Or does a return value mean the metadata should be handled by an I/O manager?

0 replies

j-blackwell · 2024-09-16T12:42:55Z

j-blackwell
Sep 16, 2024

To add, documentation on code locations is also sparse:

How many assets, jobs, etc. are "too big" for one code location?
How can you use assets as downstream dependencies between code locations?
How can you share resources / I/O managers between code locations?
How can you share utility functions / helpers between code locations?
How should you structure your repo(s??) and CI/CD to handle this?

0 replies

PedramNavid · 2024-09-26T19:20:44Z

PedramNavid
Sep 26, 2024
Maintainer Author

Thank you all for your input! Closing this discussion in favor of feedback on our new docs site: #23031

0 replies

Replies: 24 comments · 28 replies
