-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extension for DataFrame + Jupyter notebooks that adds properties that return concrete columns #25
base: main
Are you sure you want to change the base?
Conversation
Just after I put this up, Jon pinged me about how to make the following option work in the magic command:
I'll update the PR for it soon, but that should be a minor update |
{ | ||
Handler = CommandHandler.Create(async (FileInfo csv, string typeName, KernelInvocationContext context) => | ||
Handler = CommandHandler.Create(async (string dataFrameName, string typeName, KernelInvocationContext context) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious, why did you remove the csv argument from the command arguments? The current approach requires 3 commands to initialize a frame with strongly typed columns, shouldn't it be possible to do it with just one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed I think we took a step back in usability here. I think it is a good addition to add the option to take an existing DataFrame
as input, as that could be useful. But I would expect a "simple" case to still be possible.
My thinking on the simple case would be:
#!LoadDataFrame csv C:\blah\housing.csv
housing.Description()
Note, we can try to infer both the generated type and the variable name from the file name. In this case, we would generate HousingDataFrame
Type and housing
variable which is populated with the data from that file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, my initial thought was for the magic command to generate strongly typed columns on any DataFrame. So in my head, I kinda separated out reading a csv file from generating a new type. I'll add the csv option back in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, so our LoadCsv
method can sometimes fail to infer the correct types. In these cases, this extension will generate the wrong column types for it's properties, but this is ok since we can pass in a regular DataFrame
and generate strong properties on it.
"typeName", | ||
getDefaultValue:() => "Foo")); | ||
"--type-name", | ||
getDefaultValue: () => "InteractiveDataFrame")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't using a constant as the default type name result in shadowing of any previous InteractiveDataFrame
declarations? The default should probably be qualified by some kind of cell identifier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does result in shadowing but I figured that would be the expected behavior, since everything else in a notebook behaves this way. Adding a cell identifier would make that more obvious, you're right. I'll look into what identifier information is available here and update this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tricky. Cell identifiers aren't something you can consistently get access to across Juptyer frontends, despite various workarounds that exist to accomplish this. Remember that cells are transient. Submissions are more important in the interactive session, and the types will align to the specific submissions. In some languages you won't be able to access shadowed types except via closure, so this can also be hard to reason about.
src/Microsoft.Data.Analysis.Interactive/DataFrameKernelExtension.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Data.Analysis.Interactive/Microsoft.Data.Analysis.Interactive.csproj
Outdated
Show resolved
Hide resolved
{{ | ||
get | ||
{{ | ||
int columnIndex = Columns.IndexOf(""{columnName}""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't this just return Columns[""{columnName}""];
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because Columns[columnName]
doesn't exist on the DataFrame
side in 0.2.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then why not return this[""{columnName}""];
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because return this["columnName"]
will go away in the next preview in favor of Column[columnName]
.
{ | ||
string typedDataFrame = GetTypedDataFrameWithProperties(df, typeName, out StringBuilder prettyFormatter); | ||
context.Publish(new DisplayedValueProduced($"Created type {typeName}: ", context.Command)); | ||
await context.DisplayAsync(prettyFormatter.ToString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we displaying different code than what we are generating/compiling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this looked better since we're only seeing the API rather than spitting out the code being compiled. Would you rather see the implementation too? I tried that first and it looked rather noisy. I would think most folks aren't interested in looking at the implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking is that someone could use this tool to generate code that they can use in their project and use it in the library/application. Not just in a notebook.
Outputting the generated code could be an option that is off by default.
fdd83dd
to
1a37a94
Compare
|
||
namespace Microsoft.Data.Analysis.Interactive | ||
{ | ||
public class DataFrameKernelExtension : IKernelExtension | ||
{ | ||
public Task OnLoadAsync(IKernel kernel) | ||
bool _generateCsvMethod = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to flow this bool
down through the method calls as a parameter, instead of setting a field property here? What happens if multiple threads are calling this object at the same time?
{ | ||
friendlyName = friendlyName.Remove(backTick); | ||
} | ||
friendlyName += "<"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like there should be an existing utility function somewhere that does this: Given a Type, write out its language specific name.
Note that as written, this will only work for C#.
{ | ||
if (csv != null) | ||
{ | ||
HandleCsvAsync(csv, context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These HandleXXXAsync
calls need to be await
'd.
{ | ||
if (csv != null) | ||
{ | ||
HandleCsvAsync(csv, context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why doesn't HandleCsvAsync
get the --type-name
and --dataframe-name
parameters passed to it?
Also spits out the API for the generated type. Looks like this at the moment:
The magic command also takes in an optional
TypeName OutputType
to generatepublic class OutputType: DataFrame
instead of the defaultInteractiveDataFrame
.cc: @eiriktsarpalis