ENH: col descriptions that'd save in df schemas, helping users avoid creating separate documentation? #42582
Comments
I suspect that this will not gain much traction (I could be wrong) for the following reasons:
a) pd.to_() is too broad. Most output formats are rigid, with no room for descriptions. For others it may be possible to add them, but how to do so might be ambiguous or subjective. You are better off picking one format (e.g. JSON) and making a concrete suggestion.
b) A few details are missing, such as how "descriptions" would interact with the levels of a MultiIndex, or whether a description would apply only to a single column, i.e.
c) I can think of many ways of getting around this problem without adding metadata to a pandas DataFrame and adjusting the output formats. What about creating a second DataFrame with the same columns, whose content is your description strings, and then delivering two DataFrames (one for content, one for meta info)? In Python you could also very easily take those two DataFrames, create a JSON from the first, and then use the json library to augment the JSON with the material from the second DataFrame.
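A minimal sketch of the workaround in (c), assuming a one-row "descriptions" DataFrame that shares its columns with the data. The column names, the description strings, and the `descriptions` / `data` keys are made up for illustration; they are not a pandas convention.

```python
import json

import pandas as pd

# Hypothetical example data; the column names are illustrative only.
data = pd.DataFrame({"price": [1.5, 2.0], "qty": [3, 4]})

# Second DataFrame with the same columns, holding one description per column.
meta = pd.DataFrame(
    [{"price": "Unit price in USD, excluding tax", "qty": "Units sold per order"}]
)

# Serialize the data with pandas, then augment the parsed JSON with the
# descriptions using the standard json library.
payload = {
    "descriptions": meta.iloc[0].to_dict(),
    "data": json.loads(data.to_json(orient="records")),
}
document = json.dumps(payload, indent=2)
print(document)
```

The recipient can open the result with any JSON reader, but the layout is ad hoc, so the `descriptions` key would still need to be agreed on by whoever consumes the file.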
We already have metadata in the form of
pandas DataFrames live in memory and can be loaded from various forms of data on disk. It seems to me that documentation should live on disk; it's not clear what having documentation in memory provides.
@lithomas1 Awesome! Could we use
Agreed. I'm indifferent with the format but
Your suggestions are great, but at least in my case sharing files that require custom instructions on how to open them isn't appropriate, and for others, solutions requiring custom code would impede usage. Also, typical data analysts and the people who set technology policies barely know what pandas is, and it'd be nice to be accommodating and welcoming to beginners.
I'm no expert, but I'd imagine the cost of keeping a collection of strings in memory to be negligible, and I don't know of a requirement to keep documentation on disk. Having aspects of descriptions change programmatically (e.g. when renaming a col) seems desirable, as this relieves users of making such updates themselves. Happy to be corrected.
@chrisjdixon You can attach metadata with
This seems like a fine use case for There's work on propagating the metadata through operations, and in reading / writing them for the various backends.
Where to from here? Would Is there anything I (beginner) can do to help to progress this?
What can I do to progress this? Soon I'll have to spend hours writing documentation and I'd far rather spend that time helping develop this. What can I do to help?
@chrisjdixon https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.attrs.html?highlight=attrs#pandas.DataFrame.attrs already exists and mostly propagates. Better docstrings here would be super helpful. Of course, adding an option to serialize these would be good too (I think we preserve these to parquet), and JSON wouldn't be too hard.
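As a rough illustration of the above, column descriptions can be kept in `DataFrame.attrs` (a plain dict that pandas documents as experimental) and carried through JSON by hand. The `column_descriptions` key and both helper functions are hypothetical names invented for this sketch, not part of pandas.

```python
import json

import pandas as pd

df = pd.DataFrame({"price": [1.5, 2.0], "qty": [3, 4]})

# DataFrame.attrs is a dict for dataset-level metadata.
# The "column_descriptions" key is a convention made up for this example.
df.attrs["column_descriptions"] = {
    "price": "Unit price in USD, excluding tax",
    "qty": "Units sold per order",
}


def to_json_with_attrs(frame: pd.DataFrame) -> str:
    """Hypothetical helper: bundle attrs and records into one JSON string."""
    return json.dumps(
        {
            "attrs": frame.attrs,
            "data": json.loads(frame.to_json(orient="records")),
        }
    )


def from_json_with_attrs(text: str) -> pd.DataFrame:
    """Hypothetical helper: rebuild the frame and restore its attrs."""
    payload = json.loads(text)
    frame = pd.DataFrame(payload["data"])
    frame.attrs.update(payload["attrs"])
    return frame


restored = from_json_with_attrs(to_json_with_attrs(df))
print(restored.attrs["column_descriptions"]["price"])
```

This only works while everything in `attrs` is JSON-serializable; a built-in serialization option would have to decide how to handle values that are not.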
@jreback OK, great! I'm a beginner and barely know basic Python, but what would the next steps from here be? What can I specifically do?
Try adding docstrings with some examples for using .attrs
This should be a lot easier since ⬇️ went in on
Asked at SO: I need to share well described data and want to do this in a modern way that avoids managing bureaucratic documentation no one will read. Fields require some description or note (e.g. "values don't include ABC because XYZ") which I'd like to associate to columns that'll be saved with `pd.to_<whatever>()`. Looks like JSON supports annotations and I'd love to have the option of using them with pandas, but couldn't figure out how to.

Could we please develop functionality to add descriptions in a convenient way (e.g. `df[col].description = 'string'`) and have that save in output schemas? And maybe have that be selectable and show with `df.info(verbose=True)` or similar?

I know documentation is boring but maintaining bureaucratic paperwork is even worse. Also, data documentation is a requirement common to big orgs and schools / unis, and I reckon providing innovative functionality to make boring tasks more enjoyable is an efficient way of getting more people to ~~stop using excel~~ use pandas and newer technology in general, making the world a better place.

Unfortunately I don't understand pandas enough to see how this might be a stupid idea. Is this possible?