Decide on data exploration tool needs based on user conversations #3832
Comments
TL;DR: By creating Superset datasets for every table in our database, users can easily create a chart and download it as a CSV or image. However, filtering all the rows and columns without SQL feels a little cumbersome. It's still unclear what the cost and performance repercussions are of increasing the max row size for chart and table creation.
1. Can a user use a pre-built dashboard to filter/subset data without SQL, then download as Excel? What are the limitations?
Short answer is yes, but the number of rows they can view in a dashboard is limited. I was able to create a dashboard with a single table chart that can be filtered and downloaded as a CSV or Excel file. This feels like the most user-friendly no-SQL method for filtering and downloading a large subset of table rows and columns. There are a couple of drawbacks:
2. Can a user use a pre-defined "Dataset" to filter/subset data without SQL, then download as Excel? What are the limitations?
Short answer is yes, but the Superset Explore UI is more for building charts and not for quickly exploring and downloading data. When you click on a dataset, it brings you to the Explore UI. The left-hand side shows a list of columns and any metrics (aggregated metrics defined in Superset, we probably don't need these rn). You then have to select the Table chart, add the columns you need, apply any filters, specify the number of rows you want, and click "update chart". You can then download the displayed data as a CSV or Excel file using the sideways hamburger button in the upper right. There are some limitations:
3. Can a no-SQL user create charts & graphs?
Yes! It is the same process as for creating a table chart. This felt like the most natural exploration method. Users can quickly add a couple of columns and metrics to a line or bar chart, plot the results, and download it as a PNG. However, I think we run into the same row limit issue as in the other workflows. This is the description of the Row Limit option:
This makes me think that if your filter returns more rows than the row limit, your chart/table will be missing data. Again, it's unclear what the cost and performance repercussions are of increasing the row limit beyond the default max of 50k.
We would also need to programmatically create datasets using the API. This seems pretty doable by creating dataset YAML files for each table and uploading them using the API. Another advantage of creating datasets for every table is that we'd get table and column descriptions! Yay! You specify the description in the UI or in the YAML file. However, the formatting is limited: |
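For the per-table dataset definitions mentioned above, a minimal sketch of generating them from a list of tables might look like this. The field names, the database ID, the schema, and the example table names are all assumptions for illustration, not Superset's confirmed export format; they would need checking against the Superset API docs before anything is uploaded.

```python
from dataclasses import dataclass


@dataclass
class TableMeta:
    """Minimal table metadata; the table names used below are hypothetical."""
    name: str
    description: str


def build_dataset_definitions(tables, database_id, schema):
    """Return one dataset definition dict per table.

    The keys are an assumption about what a dataset definition needs,
    not a confirmed Superset schema.
    """
    return [
        {
            "database": database_id,
            "schema": schema,
            "table_name": t.name,
            "description": t.description,
        }
        for t in tables
    ]


tables = [
    TableMeta("example_plants", "One row per plant."),
    TableMeta("example_generation", "Monthly net generation by plant."),
]
definitions = build_dataset_definitions(tables, database_id=1, schema="main")
for d in definitions:
    print(d["table_name"], "->", d["description"])
```

Each dict could then be dumped to a YAML file or sent through the API, whichever route turns out to work.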
I think it's important to be able to download the data for local use in Excel. Many of the orgs / individuals we spoke with are very comfortable doing analysis and charting in Excel, and if they have to re-learn how to do that analysis or make charts through a web UI, I got the sense that many would not be interested. At the same time, if we know that the "filter & download via a web UI" folks are using Excel locally, that puts an upper bound on the size of the data they'll actually be able to work with, and we probably don't need to worry about making it easy to download more than that. What are the hard limits / practical limits for how many rows Excel can handle? In many cases I think you'd see folks selecting data associated with a single utility or power plant, which would often be less than 100K rows.
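On the hard-limit question: a modern Excel worksheet (.xlsx) holds at most 1,048,576 rows and 16,384 columns. The practical limit is lower and depends on the machine and the workbook, and nothing here measures it. A tiny sketch of a pre-download check, where the function name and the single header row are illustrative assumptions:

```python
# Hard worksheet limits for the .xlsx format.
EXCEL_MAX_ROWS = 1_048_576
EXCEL_MAX_COLS = 16_384


def fits_in_excel(n_data_rows, n_cols, header_rows=1):
    """Return True if the data plus header rows fit in one worksheet."""
    return (
        n_data_rows + header_rows <= EXCEL_MAX_ROWS
        and n_cols <= EXCEL_MAX_COLS
    )


# A single-plant slice of about 100K rows fits comfortably.
print(fits_in_excel(100_000, 30))
```

By this measure the "less than 100K rows" case has roughly a tenfold margin under the hard limit.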
At a minimum I think we need:
This is why I started poking around to see what kinds of open source "data catalog" projects exist. It seems like those might (?) be closer to providing some of the services we need, at least for data discoverability. Not sure if any of them allow filter-and-download-CSV though. |
We might want to talk to some potential Excel users and see if they would be comfortable using Excel's built-in PowerQuery (Get & Transform) functionality to read data from an API. It seems like somewhat advanced spreadsheet usage, but would have the benefit of being an incremental thing to learn within a tool that the user is already familiar with and committed to, and hopefully it's a piece of functionality that's well documented and supported by Microsoft. It looks like similar functionality in Google Sheets requires a 3rd party extension. |
Would love for people to test out filtering a table chart on a Dashboard before we test this with users! |
Poked around OpenMetadata.
Pros/Cons
Pros:
* there's a search bar & it does a nice fuzzy search of the full text. great for the "do you have data that interests me?" part.
* better than ctrl-f in that it does real search things, presumably doing stemming & TF-IDF stuff.
* it shows sample data!
* it also shows column-based metrics: how many of these values are unique? how many are null? etc.
* you can download tables directly as CSVs
* I guess collaborative commenting on data is cool?
Cons:
Unsurprisingly, it focuses on answering the "do you have data that interests me?" but does a good job of it - giving lots of high level information about a dataset. Seems like it also does a lot more than we really need. While our metadata is too big and complicated for a human brain to comprehend, it's not at the scale that a lot of these data catalog tools are designed for. The core thing we need is, I think:
Which we could get with Ctrl-F on our existing data dictionary, and replacing the "browse or query this table in Datasette" link with the other ones. And we could incrementally improve that by injecting some JavaScript to allow filtering of the view. I guess we could do the "Superset Meta-dashboard", though I think that doesn't support full-text search either. You'd have to go into the data explorer to be able to filter by "table description ILIKE 'generation'" |
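To make the "filter by table description" idea concrete, here is a rough sketch of a case-insensitive substring search over a data dictionary. It behaves like `ILIKE '%generation%'` (with wildcards; without them, `ILIKE` only matches the whole string). The dictionary contents and table names are made up for the example.

```python
def search_tables(data_dictionary, term):
    """Return sorted table names whose name or description contains term.

    Matching is case-insensitive and substring-based.
    """
    needle = term.casefold()
    return sorted(
        name
        for name, description in data_dictionary.items()
        if needle in name.casefold() or needle in description.casefold()
    )


# Hypothetical data dictionary: table name -> description.
data_dictionary = {
    "example_plants": "One row per power plant.",
    "example_generation": "Monthly net generation by plant.",
    "example_fuel": "Fuel receipts and costs.",
}

matches = search_tables(data_dictionary, "Generation")
print(matches)
```

This is plain substring matching, so it has none of the stemming or ranking a real search tool would give.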
Thanks for looking into open metadata @jdangerx! I've also been poking around at the metadata tools. Let's create a separate issue or discussion to add our thoughts. |
Questions for other Inframundo folks
|
We'll need to scream at people about the 100k row limit, probably by enforcing at least one column as a filter on these charts (e.g., plant ID, year or state). And, we'll need to scream about where the download link is. But I think this dashboard does the trick. Note that selecting "Show all" even for a year of data in this dashboard causes some major slowdowns, so we'll want to encourage fairly small slices. In terms of chart settings:
I think this would be mostly a welcome page with some information about the project, links to the metadata etc. We could show a few visualizations without the option to download (if this is possible) to encourage registration, but as an MVP I think this could mostly be pulled from our site and the draft dashboard I'd made some time ago? |
@bendnorman I did some playing around with filtering on the dashboard, and overall it feels great! Here's a couple thoughts I had while playing around:
As for the I could see it being a kind of big problem if someone downloaded data thinking it was complete, then did analysis on that data and came up with totally erroneous results due to missing data. |
I tried setting this up! You can "Add/Edit filters" on the dashboard and add a "numerical range" filter. This turns into a slider, without any options to type in specific range values. It's a "meh" experience. Fast, but you'll have to set the range slightly wider than you need, then download and re-filter locally, which is annoying.
I think it's not good but good enough. We can funnel users with greater needs towards more advanced functionality. Side note: while I believe people don't want to "skill up", I also think people are smart enough to modify the
I think we'd mostly be using them to show off our cool data & the tools you can build with it. Then people who are interested in building their own tools can register. I like Ella's point about the landing page being public too! |
@zaneselvans thoughts from Slack:
|
Great feedback, thanks y'all. I'll start creating some new issues for user testing. I was able to add the row count as a big number to the dashboard and have it turn red if it goes over our row limit, which is another simple way to remind folks. I think ideally we'd have a very clear download button that pops up a warning window if a user is trying to download more rows than our limit. |
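The "warn before an over-limit download" logic could be sketched roughly like this. It is only the decision logic, not anything Superset provides out of the box; the function name and message wording are hypothetical, and the 100k default is the row limit discussed earlier in this thread.

```python
def download_warning(row_count, row_limit=100_000):
    """Return a warning string if the download would be truncated, else None."""
    if row_count > row_limit:
        missing = row_count - row_limit
        return (
            f"Your filter matches {row_count:,} rows but only {row_limit:,} "
            f"can be downloaded; {missing:,} rows would be missing. "
            "Narrow the filter (e.g. by plant ID, year or state)."
        )
    return None


for count in (42_000, 250_000):
    message = download_warning(count)
    print(count, "->", message if message else "ok to download")
```

The same comparison is what would drive the red big-number indicator, so the two reminders would always agree.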
I also added a sample data dictionary to that dashboard. |
Overview
We had some notion that we wanted to make a tool that would replace Datasette & allow access to all of our data, make charts & graphs easier, and let us gather some limited information about who our users are.
We've learned a lot by talking to many users over the last month - does that change anything about what our actual requirements are?
We've also learned about Superset & its strengths/limitations by prototyping it and testing it out over the last two months. We have a few questions to answer about its capabilities, and need to evaluate whether or not we want to continue trying to push it out to our users.
Success criteria
We also need answers for the following questions about Superset:
Tasks