Embedded code outputs abstraction #681
Thanks @chrisjsewell for this really well thought out summary of the issue. I am still digesting the options and issues here, but here are some initial comments. I suspect the "optimal" approach may be a phased one.

Short term: the proposal for delaying the output-to-AST conversion seems workable; I guess the main cost is the delayed conversion to AST in phase (3).

Medium term: assigning some kind of ID to code cells. Would it be a valid assumption/limitation that a cell maps to a single referenceable output?
My response to this is thinking through three aspects: (1) javascript variables and interactive components; (2) not having the notebooks available when you want to re-render markup (I want to pull someone else's variables into my new document); and (3) integration of live computation and pulling in kernels from Thebe. A few recent prototypes that @stevejpurves did are here and here; these show some different prototype implementations of Thebe on how to hook back into Jupyter.

TLDR: Rowan likes using cell IDs, with a target syntax (comment) to make these more accessible to existing interfaces. Rowan is anti "dynamic kernel injection", as it is more difficult to store results and doesn't compose well across documents.

I believe a target comment on the first line of a cell could label its output:

```python
# (myPlot)=
import matplotlib.pyplot as plt
plt.plot([1, 2, 3])
```

This could be easily glued into other pages. For naming outputs, we could render a custom mimetype (fancy json):

```python
# (myDataForMyST) =
from IPython import display
display.JSON({"a": 1, "b": [1, 2, 3]})
```

This could be accessed by the renderer. Exposing the MyST target comment in a code cell also means, I think, that we wouldn't have to work with the cell metadata directly.
thanks @rowanc1!

Are you suggesting here that the target comment would be parsed out of the cell's source, and mapped to that cell's outputs, during execution?
The other consideration here, that I didn't mention in the initial comment: if you want to reference "cross-document", then how do you cache outputs, and also ensure they are up-to-date? Let's say I have two notebooks, where the second embeds an output from the first: editing the first notebook then needs to invalidate the second's cached page as well.
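A minimal sketch of one way to address this (the scheme and names are assumptions, not myst-nb's actual cache): key the cached outputs on a hash of the notebook's code cells, so that any code change in the source notebook forces pages embedding from it to re-render.

```python
import hashlib
import json
from pathlib import Path

def notebook_cache_key(nb_path: Path) -> str:
    """Hash the code-cell sources, so any code change invalidates
    outputs that other documents have embedded from this notebook."""
    nb = json.loads(nb_path.read_text())
    code = "\n".join(
        "".join(cell["source"])
        for cell in nb["cells"]
        if cell["cell_type"] == "code"
    )
    return hashlib.sha256(code.encode("utf-8")).hexdigest()
```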
Thanks @chrisjsewell for this issue and all the information/options added in one place!

I think that a cell-id based approach, with some mechanism around exposing variables within the cell, is a good way to go, and using the mimetype output mechanism in some way for communicating output variables seems a natural choice. Don't we only need a language/kernel implementation for convenience, or if/where there is no native equivalent? Cell IDs are essentially how we are already hooking back into documents in our prototypes.

Note: the following is maybe a bit out of scope, but related. Regarding reactive controls in a Jupyter Book which can drive computation and fresh outputs post sphinx build, we also have to think about how to handle "inputs" for a given cell, where we identify variables that can be changed outside the cell but should trigger computation and a new output to be generated. We're currently experimenting with cell ids plus markup in comments to implement that in the prototypes that we have, which works fine and is the approach that, for example, Colab also takes in its notebook widget hookup. However, "inputs" could also be exposed via the same mechanism as outputs, since the code cell is executed at first with its default value. The advantage of the comment-based markup approach we've taken so far on inputs is that it is definitely kernel agnostic, but also more accessible to the majority of users than editing cell metadata.
The "use MyST target syntax within code cell comments" is a clever idea, I hadn't thought of that one. I agree with @chrisjsewell's concern re: language-specific, but if you made some strict rules about how that label could be defined, maybe it wouldn't be too bad as @rowanc1 notes above. What I like about that idea:
Note: this also reminds me a bit of how nbdev uses comments to control rendering behavior, and that has proven popular in that community AFAIK. So how would it work in implementation? It sounds something like:
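A rough sketch of that pipeline (all names are hypothetical; this is the shape of the idea, not a real implementation):

```python
import json
import re
from pathlib import Path

# matches a first-line target comment such as "# (myPlot)="
TARGET = re.compile(r"^#\s*\((?P<label>[\w-]+)\)\s*=\s*$")

def collect_labelled_outputs(nb_path: Path) -> dict:
    """Map target labels to the outputs of the cells they annotate."""
    nb = json.loads(nb_path.read_text())
    labelled = {}
    for cell in nb["cells"]:
        if cell["cell_type"] != "code":
            continue
        lines = "".join(cell["source"]).splitlines()
        match = TARGET.match(lines[0]) if lines else None
        if match:
            labelled[match["label"]] = cell.get("outputs", [])
    return labelled

# a placeholder in the Markdown (e.g. a hypothetical {embed}`myPlot` role)
# would then be resolved against this mapping at render time
```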
Things I think we should strive for: something kernel-agnostic, and accessible from the notebook interfaces people already use today.
This could certainly be viable 👍
This is the primary issue of executablebooks/MyST-NB#380 though:

TLDR: it makes life a lot easier to embed the outputs (converted to AST) as early as possible. This will likely be an issue across any implementation, not just sphinx.
One thing that might confuse people, though, is that we would now be mixing metadata about a cell that is embedded as comments (target labels) with other metadata that piggybacks on the cell JSON metadata (tags, and everything else basically). Would that be confusing to people?
To @choldgraf's point: putting this in similarly to a tag might be better (vscode still making that hard). Especially if that is similar to other existing workflows that are working today. The advantage of that is that you can edit the tags easily, whereas the ID is likely something that is less exposed to end-user manipulation (and is still new). I think I only suggested the manual tag because of the initial options proposed; reflecting now, going with a unique tag seems like a really strong approach!

For caching on the AST: I am in the process of writing our own "version of sphinx" in node, and I don't think that the caching challenge you mention @chrisjsewell is really much of an issue. There are just a few different stages where you need to cache, rather than just the end state for a page (i.e. before filling in cross-references, on image manipulations/screenshots, etc.). Ours is a custom implementation that is designed for this though, and I recognize that this is more difficult with sphinx. This is a similar problem to sphinx caching the left-navigation pane on pages, when it is out of date on a cached page. This always annoys me, but is fixed by a clear/re-render of all pages. I think that is exactly the same as this issue, no?
re: tags, do you mean something like tagging the cell with the label you want to expose?
Yep! However, the tag should maybe be namespaced rather than a bare label; see the sketch below.
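For instance, the raw cell JSON might carry a namespaced tag (shown here as a Python dict; the `expose:` prefix is purely illustrative):

```python
# an nbformat code cell, as stored in the .ipynb JSON
cell = {
    "cell_type": "code",
    "id": "a1b2c3d4",  # the nbformat 4.5+ cell ID
    "metadata": {"tags": ["expose:myPlot"]},  # hypothetical tag scheme
    "source": ["plt.plot([1, 2, 3])"],
    "outputs": [],
    "execution_count": None,
}
```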
ughh, I'm not a great fan of "abusing" tags to do everything: they have their place and are a helpful user UI, but a tag is semantically very different from an identifier.
Well, knowing our work on https://github.com/executablebooks/myst-spec and https://github.com/executablebooks/mystjs, I'd say there is still a decent way to go to reach the complexity of sphinx 😅
Should we consider transparency of the process of exposing variables over being kernel agnostic? This thought comes from a conversation with someone who is an RMarkdown user.

This is readily possible in RMarkdown because of the single-language implementation, I guess, combined with the fact that they are not bound/constrained by the sphinx render process outlined at the top of this issue. Either way, should transparency be a priority requirement in this?
Thanks @stevejpurves, but can you explain a little more what you mean by transparency here?
@chrisjsewell what I meant is that no markup or additional code is required in order to expose an output (the glue-ing, or whatever the equivalent process is, is not visible to the user). Instead, variables computed in a code chunk are by default "available", and their value can be displayed with a role-like syntax in the content, e.g. `` `r total_area` ``, which would display the value of `total_area`.

The last section of https://www.hzaharchuk.com/rmarkdown-guide/content.html shows the example and, as far as I can tell, no declaration is needed in the front matter to achieve that. (Perhaps because referenced variables are identified early and resolved during computation?)
I think it's worth describing two related but different workflows here:

1. inserting computed values inline while a single document is executed (the RMarkdown-style pattern)
2. embedding the stored outputs of one document within another document, without re-executing the source
It sounds like the use-case @stevejpurves mentions from RMarkdown is from use-case (1), but how does this pattern work for use-case (2)? Also, regarding @chrisjsewell's point about over-loading tags, I do agree this isn't the "intended use case" for tags either. I'd feel fine just using a dedicated cell metadata key for this, rather than over-loading tags, and use this as an impetus to improve the UI/UX around cell-level metadata in JupyterLab (e.g. maybe doing something similar to how they recently overhauled the settings UI). If people wrote their notebooks as text files, it would be pretty simple, e.g., something like:
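(a sketch using the jupytext percent format; the tag name is purely illustrative)

```python
# %% tags=["expose:total_area"]
region_areas = [1.5, 2.25, 3.0]
total_area = sum(region_areas)
total_area  # the execute_result that the label would expose
```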
(note: maybe we'd just use a dedicated metadata key, rather than a tag, for this)
yep, that's also what I felt @stevejpurves was referring to, and it relates to my "Using dynamic kernel injection" section in the initial comment. It may indeed be desirable to support both use cases, as separate concerns (as per my initial comment).
Hello! @stevejpurves pointed me to this conversation. There are two types of documents used in scientific communication: lab notebooks and traditional scientific articles (see "The computational notebook of the future" by Konrad Hinsen).
Jupyter Notebook is great for writing lab notebooks. I don't see a great need to be able to embed the value of variables inside Markdown cells there, as the user can change the writing style to accommodate the lab notebook structure. To write "literate programming" scientific articles, it is interesting to be able to embed the value of variables inside Markdown cells. For example, the user calculates a Spearman correlation coefficient with associated p-value using SciPy, and glues/references the p-value inside the Markdown cell. Embedding the value of variables inside Markdown cells has many technical challenges, as mentioned in the first comment. From a user perspective, I see three approaches to writing "literate programming" scientific articles.

**Concatenate and Run All**

Example: bookdown. The master document is created by the concatenation of each raw individual document. The master document is executed linearly, and references to code output are replaced by the code output itself. After the code output replacement, the master document is converted to the desired output (HTML, PDF, DOCX).

Pros:

Cons:
**Link and Embed**

Example: Curvenote. The user has two independent documents: the lab notebook and the traditional scientific article. All the computation is done in the lab notebook. Users can link the lab notebook to the traditional scientific article and embed any code output of the lab notebook into the article. Both documents can be converted to the desired output (HTML, PDF, DOCX).

Pros:

Cons:
**Staple when Finished**

Example: Jupyter Book. Each individual document is executed independently. All the individual documents with their code output are stapled together and converted to the desired output.

Pros:

Cons:
**Final Consideration**

The "Link and Embed" approach used by Curvenote has the biggest potential to lower the barrier for new adopters of "literate programming" scientific articles, and is worth investing in. The "Concatenate and Run All" approach used by bookdown does not work well for long documents or documents that include costly computational code. The "Staple when Finished" approach used by Jupyter Book is ideal for technical tutorials or books, but has big limitations for "literate programming" scientific articles.
@chrisjsewell it seems that this would mean executing code on demand when a markdown cell containing a variable is encountered? Wouldn't that lead to unexpected results? i.e. computation is designed, and execution order set, by the notebook's linear flow, yet we would also be executing code from myst content (which could equally live outside a notebook, in an `.md` file).

Apart from that, and answering @choldgraf's question: could cross-document support not be provided by first performing a full pass over all markdown content, to build a map of all required variables (perhaps scoped to notebooks somehow via their identifier in the myst expression) ahead of any computation, which can then be harvested at compute time while the kernel is live? i.e. what currently constrains notebook computation to be the first step?

Another point I'm a bit confused over: my understanding is that this evaluates code against the kernel, so could that itself change the kernel's state?
Not in executablebooks/MyST-NB#382 no, because it "restricts" itself to only running variable evaluation, i.e. nothing that would change the state of the kernel, only "querying" its current state |
This is definitely a key consideration for me: the trade-off between execution and caching. I believe the famous quote is: "There are only two hard things in Computer Science: cache invalidation and naming things."
Ideally, users want their rendered pages to show the outputs that correspond to the latest input code, but they also want to minimise the amount of re-executing / re-rendering they have to do to stay up-to-date.
Aim
Within jupyter-book, and EBP in general, there is a desire for users to be able to embed the outputs of Jupyter Notebook code cell execution within the body of their documents.
For example, referencing the value of a calculation with some form of "placeholder" syntax, as illustrated below.
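A minimal illustration (the variable and key names are hypothetical): a code cell computes a value,

```python
# a notebook code cell computing a value we want to reference in prose
radius = 2.0
area = 3.14159 * radius ** 2  # -> 12.56636
```

and the Markdown body then contains a placeholder (e.g. a role such as `` {glue:}`area` ``) that renders as `12.56636` in the built page.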
As well as simple variable referencing, one would also like to embed "richer" outputs, such as images and HTML.
Sphinx recap
Before discussing potential abstractions, and their pros/cons, it will be helpful to recap the basic sphinx build process phases:

1. Reading: each source document is parsed (and, for notebooks, executed) into an AST (doctree), which is cached.
2. Resolving: cross-document state (references, TOCs, etc.) is collected and resolved.
3. Writing: a format-specific builder (HTML, LaTeX, ...) converts each doctree to the output format.
One difficulty with the outputs of Jupyter notebook code cells is that they can provide multiple output formats (a.k.a. mime types), which can only be "selected" in phase (3), once the target output format is known.
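As a sketch of why this selection is builder-dependent (the priority lists here are illustrative assumptions):

```python
# each builder prefers different representations of the same output
MIME_PRIORITY = {
    "html": ["text/html", "image/png", "text/plain"],
    "latex": ["image/png", "text/latex", "text/plain"],
}

def select_mimetype(builder: str, mime_bundle: dict) -> str:
    """Pick the richest mime type this builder can render."""
    for mime in MIME_PRIORITY[builder]:
        if mime in mime_bundle:
            return mime
    raise ValueError(f"no renderable mime type for {builder!r}")
```

Because the builder is only known in phase (3), this choice cannot happen earlier without storing every representation.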
Potential abstractions
A number of potential abstractions are discussed below, with their pros and cons.
Current myst-nb glue abstraction
In myst-nb v0.13, there is the `glue` function & roles/directives. This is implemented for IPython kernels only, whereby one "binds" a variable name to a variable via the `glue` function, with placeholder roles/directives then referencing that name:
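For example (the key and value are illustrative; the `glue` function and `{glue:}` role are the documented v0.13 API):

```python
from myst_nb import glue

# in a notebook code cell: bind the key "my_area" to a computed value
area = 3.14159 * 2.0 ** 2
glue("my_area", area)
```

In the Markdown body, a placeholder such as the `` {glue:}`my_area` `` role (or the `{glue:figure}` directive for rich outputs) is then replaced by the glued output.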
All mime types for such outputs (such as `text/plain`, `text/html`, `image/png`, ...) are saved to a cache during phase (1) of the sphinx build. Then, during phase (3), placeholder roles/directives are replaced with a suitable mime type for that output format, taking the mime type's content and converting it to AST, before injecting it into the document's AST.
Pros:

- all mime types for an output are cached, so each builder can select the best representation for its format
- the binding of keys to outputs is explicit and user controlled

Cons:

- only implemented for IPython kernels, so other languages need their own implementations
- keys must be unique across the whole project
- conversion of outputs to AST is deferred to phase (3)
Refactored myst-nb glue abstraction
The refactor in executablebooks/MyST-NB#380 is not primarily aimed at `glue`, but it does intrinsically change how it works. It primarily addresses the issue of AST creation in phase (3), moving it to phase (1).

In its current form, the implementation precludes cross-document use; a proposal, though, is to use a form with an explicit `doc` option naming the source document. This would fix the issue of requiring variable names to be unique across the project. It would require a "bi-modal" approach though, whereby `glue` without the `doc` option would proceed by directly converting outputs to AST in phase (1), but with the `doc` option the AST would still need to be generated in phase (3).

Using code cell IDs (or metadata)
As discussed above, a big issue with the `glue` abstraction is that it is only currently implemented for Python, and would require different implementations for different kernels. One way round this is to assign an ID to each code cell, then use this as the reference for embedding code outputs.
This ID could either be assigned within the cell's metadata, or via the recent addition of first-class cell IDs: https://nbformat.readthedocs.io/en/latest/format_description.html#cell-ids
For example, if one had a code cell like the following (an illustrative sketch; the exact content is immaterial):
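```python
import matplotlib.pyplot as plt

print("the fit results are:")  # output 1: a stream output
plt.plot([1, 2, 3])            # output 2: a display_data output (the figure)
plt.show()
print("goodness of fit:")      # output 3: another stream output
0.98                           # output 4: the execute_result
```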
This cell actually has four outputs, and so this may require additional logic to specify which output is being referred to (or limiting references to only the final output).
Using the `user_expressions` kernel feature

`user_expressions` are a feature of the Jupyter client/kernel, which allow expressions to be evaluated after execution of the code cell's main content, and bound to names, see: https://jupyter-client.readthedocs.io/en/stable/messaging.html#execute

It would be implemented, for example, like:
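A proof-of-principle sketch using `jupyter_client` directly (the kernel name and expressions are illustrative; a real implementation would match replies to their parent message IDs):

```python
from jupyter_client.manager import start_new_kernel

km, kc = start_new_kernel(kernel_name="python3")
try:
    # run the cell's main content, additionally asking the kernel to
    # evaluate named expressions against the resulting state
    kc.execute("a = 1 + 1", user_expressions={"doubled": "a * 2"})
    reply = kc.get_shell_msg(timeout=10)
    # the reply maps each name to a mime bundle, e.g.
    # reply["content"]["user_expressions"]["doubled"]["data"]["text/plain"] == "4"
    print(reply["content"]["user_expressions"])
finally:
    kc.stop_channels()
    km.shutdown_kernel()
```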
This overcomes an issue with the above cell-ID approach: each expression is bound to a name, avoiding the ambiguity of a cell having multiple outputs.

However, similar to IDs, `user_expressions` are not currently implemented in any Notebook editor/renderer. Additional to this limitation, it should be noted that this feature of the client is quite under-documented and appears to be unimplemented in some kernels.
The IPython kernel's implementation is to call https://docs.python.org/3/library/functions.html#eval on each expression: https://github.com/ipython/ipython/blob/d9b5e550b673db900a08d03740ec0ce94e1b8feb/IPython/core/interactiveshell.py#L2606-L2631
This is somewhat problematic, since it means that it is technically possible for the expression to change the "state" of the Python interpreter. This makes the order of execution important, and one feels it would have been a better design choice to make the `user_expressions` format a list rather than a dict.

For nbclient, a proof-of-principle implementation can be found at jupyter/nbclient#160
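To illustrate why a dict makes evaluation order fragile when expressions can have side effects (a contrived, kernel-free sketch):

```python
items = [1, 2, 3]

# both are valid expressions for eval(), but the first mutates state,
# so the value of the second depends on which is evaluated first
user_expressions = {"last": "items.pop()", "length": "len(items)"}

for name, expr in user_expressions.items():
    print(name, "=", eval(expr))
# dict insertion order gives: last = 3, then length = 2;
# the reverse order would have given length = 3
```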
Using dynamic kernel injection
A somewhat radically different approach would be to allow the Jupyter client to evaluate variables within the Markdown cells, during execution.
For example, as demonstrated in executablebooks/MyST-NB#382
Here, the user does not need to provide any "additional" binding of variables to variable names, it simply utilises the binding already present in the target kernel language.
As shown, the variable's output is also specific to where in the documentation it is evaluated, dependent on the state of the kernel at that point in the execution flow.
Pros

- no additional binding of names is required, beyond what the kernel language already provides
- the evaluated value reflects the kernel state at that point in the document

Cons

- rendering requires a live kernel, so results are harder to store/cache
- it does not compose well across documents
This is also somewhat similar to https://github.com/agoose77/jupyterlab-imarkdown, which arose from the discussion in https://discourse.jupyter.org/t/inline-variable-insertion-in-markdown/10525/126.
Here, the outputs of such evaluations are stored as attachments on the markdown cell.