Markdown Embedding

Add support for reading markdown files that contain embedded dftxt data within one or more fenced code blocks.
rocketboosters · Jan 24, 2024 · 993235a
1 parent 78c0f8c
Showing 16 changed files with 579 additions and 37 deletions.
diff --git a/README.md b/README.md
@@ -67,6 +67,16 @@ Clarissa Dalloway  Mrs. Dalloway          1925
 Toad               The Wind & the Willow  1906
 ```
 
+dftxt can also be embedded in Markdown files using fenced code blocks tagged with the
+`df` or `dftxt` type signifier. All such blocks in a file are extracted together when
+the file is loaded, so a DataFrame can be split across several blocks with ordinary
+Markdown commentary in between.
+
+For examples, see:
+
+- [Markdown with dftxt Example](./dftxt/tests/_io/_markdown/scenarios/multiple_frames/source.md)
+- [Single DataFrame broken out across multiple blocks](./dftxt/tests/_io/_markdown/scenarios/single_frame/source.md)
+
 ## Benefits
 
 The benefits of the dftxt DataFrame serialization format include:
@@ -194,6 +204,20 @@ Euro Zone       Euro      0.924       0.951       0.846       0.877       0.893
 # The values here are yearly average currency exchange rates converting into USD.
 ```
 
+#### Embedded in Markdown
+
+Markdown is a widely used format for human-readable documentation that also renders
+nicely in IDEs and code collaboration tools. For that reason, dftxt data can be
+embedded in Markdown as fenced code blocks (triple backticks or tildes) whose opening
+fence is followed immediately by the `df` or `dftxt` specifier. Multiple DataFrames can
+be defined this way, and a single DataFrame can be split across several fenced code
+blocks to allow inline commentary between them.
+
+For examples, see:
+
+- [Markdown with dftxt Example](./dftxt/tests/_io/_markdown/scenarios/multiple_frames/source.md)
+- [Single DataFrame broken out across multiple blocks](./dftxt/tests/_io/_markdown/scenarios/single_frame/source.md)
+
 ### 3. Diff/Code Review Friendly
 
 The benefits of the dftxt file format that make it human-friendly are also what make it

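The fenced-block convention described in the README changes above can be illustrated with a minimal, standalone sketch. The `FENCED_BLOCK` pattern and the `find_blocks` helper are hypothetical and are not part of dftxt. Unlike the patterns added in this commit, this sketch assumes the closing fence must use the same characters as the opening fence. The block bodies are placeholder rows borrowed from the README context lines, not verified dftxt.

```python
import re
import typing

# Hypothetical pattern (an assumption, not the commit's): the (?P=fence)
# backreference requires the closing fence to match the opening fence.
FENCED_BLOCK = re.compile(
    r"^(?P<fence>```|~~~)(?P<type>dftxt|df)[ \t]*(?P<args>[^\n]*)\n"
    r"(?P<body>.*?)^(?P=fence)[ \t]*$",
    re.MULTILINE | re.DOTALL,
)

# Tilde fences are used here so the sample can sit inside a backtick fence.
SAMPLE = """\
# Characters

~~~df
Clarissa Dalloway  Mrs. Dalloway          1925
~~~

Commentary between blocks is not part of any block.

~~~dftxt
Toad               The Wind & the Willow  1906
~~~

~~~python
print("not a dftxt block")
~~~
"""


def find_blocks(markdown: str) -> typing.List[typing.Tuple[str, str]]:
    """Return (type, body) pairs for every df/dftxt fenced block."""
    return [
        (match.group("type"), match.group("body"))
        for match in FENCED_BLOCK.finditer(markdown)
    ]


blocks = find_blocks(SAMPLE)
for block_type, body in blocks:
    print(block_type, repr(body))
```

Only the two blocks tagged `df` and `dftxt` are returned; the `python` block and the prose between fences are skipped.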
diff --git a/dftxt/_io/_markdown.py b/dftxt/_io/_markdown.py
@@ -0,0 +1,75 @@
+import re
+import typing
+
+_START_FENCE_PATTERN = re.compile(
+    r"(^|\n)(?P<fence>(```|~~~))(?P<type>(dftxt|df))[ \t]*(?P<args>[^\n]*)\n"
+)
+_END_FENCE_PATTERN = re.compile(r"\n(```|~~~)")
+
+
+def _parse_args(args: str) -> typing.Dict[str, str]:
+    """Parse a dftxt fence's arguments into a frame name and an action."""
+    exploded = args.split()  # unlike re.split, yields [] for empty/blank input
+    if not exploded:
+        return {"name": "", "action": "append"}
+
+    if exploded[0] == "...":
+        return {"name": "", "action": "wrap"}
+
+    name = exploded[0]
+    if name.endswith("..."):
+        name = name[:-3]
+        action = "wrap"
+    else:
+        action = "append"
+    return {"name": name, "action": action}
+
+
+def _combine_frame_sections(frame_sections: typing.Dict[str, typing.List[str]]) -> str:
+    """Combine the extracted frame sections into a dftxt file."""
+    keys = list(frame_sections.keys())
+    if not keys:
+        return ""
+
+    if len(keys) == 1 and keys[0] == "":
+        return "{}\n".format("\n".join(frame_sections[""]).rstrip())
+
+    frames: typing.List[str] = []
+    for key in keys:
+        header = "{}---".format("" if not frames else "\n")
+        if key:
+            header += f" {key} ---"
+        frames.append(
+            "{}\n\n{}".format(header, "\n".join(frame_sections[key]).rstrip("\n"))
+        )
+
+    return "{}\n".format("\n".join(frames).rstrip())
+
+
+def extract(markdown: str) -> str:
+    """Extract dftxt from a markdown file."""
+    cleaned = markdown.replace("\r", "")
+    offset = 0
+    frame_sections: typing.Dict[str, typing.List[str]] = {}
+    while offset < len(cleaned):
+        opening_match = _START_FENCE_PATTERN.search(cleaned, offset)
+        if opening_match is None:
+            offset = len(cleaned)
+            break
+
+        ending_match = _END_FENCE_PATTERN.search(cleaned, opening_match.end())
+        if ending_match is None:
+            offset = len(cleaned)
+            break
+
+        offset = ending_match.end()
+
+        args = _parse_args(opening_match.group("args"))
+        section = cleaned[opening_match.end() : ending_match.start()]
+        if args["action"] == "wrap":
+            section = f"\n\n{section.strip()}"
+        if args["name"] not in frame_sections:
+            frame_sections[args["name"]] = []
+        frame_sections[args["name"]].append(section)
+
+    return _combine_frame_sections(frame_sections)
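To make the fence-argument handling above easier to follow, here is a minimal standalone sketch. `parse_fence_args` is a hypothetical stand-in that mirrors the logic of `_parse_args` in the diff: the first token is the frame name, and a trailing `...` switches the action from `append` to `wrap`. It is not imported from dftxt, and the `fences` list is made-up input for illustration only.

```python
import typing


def parse_fence_args(args: str) -> typing.Dict[str, str]:
    """Hypothetical stand-in mirroring the diff's _parse_args logic."""
    tokens = args.split()
    if not tokens:
        # No arguments: unnamed frame, sections are appended.
        return {"name": "", "action": "append"}
    name = tokens[0]
    if name.endswith("..."):
        # Covers both a bare "..." and a "name..." token.
        return {"name": name[:-3], "action": "wrap"}
    return {"name": name, "action": "append"}


# Made-up (fence arguments, block body) pairs, grouped by frame name in
# the same way extract() groups sections in the diff.
fences = [("", "a"), ("rates", "b"), ("rates...", "c"), ("...", "d")]
grouped: typing.Dict[str, typing.List[str]] = {}
for raw_args, body in fences:
    parsed = parse_fence_args(raw_args)
    grouped.setdefault(parsed["name"], []).append(f"{parsed['action']}:{body}")

print(grouped)
```

Blocks that share a name end up in the same list, which is how one DataFrame can be spread across several fenced blocks. In the diff, a `wrap` section is additionally stripped and prefixed with two newlines before it is stored.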