Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extension for DataFrame + Jupyter notebooks that adds properties that return concrete columns #25

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

pgovind
Copy link

@pgovind pgovind commented Feb 11, 2020

Also spits out the API for the generated type. Looks like this at the moment:

InteractiveDataFrame

The magic command also takes in an optional TypeName OutputType to generate public class OutputType: DataFrame instead of the default InteractiveDataFrame.

cc: @eiriktsarpalis

@pgovind
Copy link
Author

pgovind commented Feb 11, 2020

Just after I put this up, Jon pinged me about how to make the following option work in the magic command:

#!AddStrongTypes DataFrame df

I'll update the PR for it soon, but that should be a minor update

{
Handler = CommandHandler.Create(async (FileInfo csv, string typeName, KernelInvocationContext context) =>
Handler = CommandHandler.Create(async (string dataFrameName, string typeName, KernelInvocationContext context) =>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious, why did you remove the csv argument from the command arguments? The current approach requires 3 commands to initialize a frame with strongly typed columns, shouldn't it be possible to do it with just one?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed I think we took a step back in usability here. I think it is a good addition to add the option to take an existing DataFrame as input, as that could be useful. But I would expect a "simple" case to still be possible.

My thinking on the simple case would be:

#!LoadDataFrame csv C:\blah\housing.csv
housing.Description()

Note, we can try to infer both the generated type and the variable name from the file name. In this case, we would generate HousingDataFrame Type and housing variable which is populated with the data from that file.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, my initial thought was for the magic command to generate strongly typed columns on any DataFrame. So in my head, I kinda separated out reading a csv file from generating a new type. I'll add the csv option back in.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, so our LoadCsv method can sometimes fail to infer the correct types. In these cases, this extension will generate the wrong column types for it's properties, but this is ok since we can pass in a regular DataFrame and generate strong properties on it.

"typeName",
getDefaultValue:() => "Foo"));
"--type-name",
getDefaultValue: () => "InteractiveDataFrame"));
Copy link

@eiriktsarpalis eiriktsarpalis Feb 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't using a constant as the default type name result in shadowing of any previous InteractiveDataFrame declarations? The default should probably be qualified by some kind of cell identifier.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does result in shadowing but I figured that would be the expected behavior, since everything else in a notebook behaves this way. Adding a cell identifier would make that more obvious, you're right. I'll look into what identifier information is available here and update this code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tricky. Cell identifiers aren't something you can consistently get access to across Juptyer frontends, despite various workarounds that exist to accomplish this. Remember that cells are transient. Submissions are more important in the interactive session, and the types will align to the specific submissions. In some languages you won't be able to access shadowed types except via closure, so this can also be hard to reason about.

{{
get
{{
int columnIndex = Columns.IndexOf(""{columnName}"");
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this just return Columns[""{columnName}""];?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because Columns[columnName] doesn't exist on the DataFrame side in 0.2.0.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then why not return this[""{columnName}""];?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because return this["columnName"] will go away in the next preview in favor of Column[columnName].

{
string typedDataFrame = GetTypedDataFrameWithProperties(df, typeName, out StringBuilder prettyFormatter);
context.Publish(new DisplayedValueProduced($"Created type {typeName}: ", context.Command));
await context.DisplayAsync(prettyFormatter.ToString());
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we displaying different code than what we are generating/compiling?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this looked better since we're only seeing the API rather than spitting out the code being compiled. Would you rather see the implementation too? I tried that first and it looked rather noisy. I would think most folks aren't interested in looking at the implementation?

Copy link
Owner

@eerhardt eerhardt Feb 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking is that someone could use this tool to generate code that they can use in their project and use it in the library/application. Not just in a notebook.

Outputting the generated code could be an option that is off by default.


namespace Microsoft.Data.Analysis.Interactive
{
public class DataFrameKernelExtension : IKernelExtension
{
public Task OnLoadAsync(IKernel kernel)
bool _generateCsvMethod = false;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to flow this bool down through the method calls as a parameter, instead of setting a field property here? What happens if multiple threads are calling this object at the same time?

{
friendlyName = friendlyName.Remove(backTick);
}
friendlyName += "<";
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like there should be an existing utility function somewhere that does this: Given a Type, write out its language specific name.

Note that as written, this will only work for C#.

{
if (csv != null)
{
HandleCsvAsync(csv, context);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These HandleXXXAsync calls need to be await'd.

{
if (csv != null)
{
HandleCsvAsync(csv, context);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't HandleCsvAsync get the --type-name and --dataframe-name parameters passed to it?

Base automatically changed from master to main March 5, 2021 21:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants