From a8820fee67975f14dbaf5ec2df94645028cfa8b1 Mon Sep 17 00:00:00 2001 From: JessicaXYWang <108437381+JessicaXYWang@users.noreply.github.com> Date: Tue, 21 May 2024 08:50:17 -0700 Subject: [PATCH] chore: update generate fabric doc (#2214) * update generate fabric doc * update doc pipeline * update doc pipeline * update docgen * update doc pipeline * update doc pipeline * style --------- Co-authored-by: Mark Hamilton --- .../Multivariate Anomaly Detection.ipynb | 7 +- .../AI Services/Overview.ipynb | 82 ++++-- .../Quickstart - Isolation Forests.ipynb | 10 +- .../Quickstart - Measure Causal Effects.ipynb | 4 +- .../Quickstart - SparkML vs SynapseML.ipynb | 10 +- docs/Explore Algorithms/OpenAI/OpenAI.ipynb | 13 +- tools/docgen/docgen/channels.py | 252 ++++++++---------- tools/docgen/docgen/fabric_helpers.py | 206 ++++++++++++++ tools/docgen/docgen/manifest.yaml | 85 ++++-- 9 files changed, 459 insertions(+), 210 deletions(-) create mode 100644 tools/docgen/docgen/fabric_helpers.py diff --git a/docs/Explore Algorithms/AI Services/Multivariate Anomaly Detection.ipynb b/docs/Explore Algorithms/AI Services/Multivariate Anomaly Detection.ipynb index 4c01bb5c59..d96d707df9 100644 --- a/docs/Explore Algorithms/AI Services/Multivariate Anomaly Detection.ipynb +++ b/docs/Explore Algorithms/AI Services/Multivariate Anomaly Detection.ipynb @@ -13,7 +13,12 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "tags": [ + "alert", + "important" + ] + }, "source": [ "## Important\n", "Starting on the 20th of September, 2023 you won’t be able to create new Anomaly Detector resources. The Anomaly Detector service is being retired on the 1st of October, 2026." 
diff --git a/docs/Explore Algorithms/AI Services/Overview.ipynb b/docs/Explore Algorithms/AI Services/Overview.ipynb index a9eeca1d32..0ca4381453 100644 --- a/docs/Explore Algorithms/AI Services/Overview.ipynb +++ b/docs/Explore Algorithms/AI Services/Overview.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Azure AI Services" + "# Azure AI services" ] }, { @@ -12,7 +12,8 @@ "cell_type": "markdown", "metadata": { "tags": [ - "hide-synapse-internal" + "hide-synapse-internal", + "hide-azure" ] }, "source": [ @@ -22,11 +23,57 @@ { "cell_type": "markdown", "metadata": {}, + "source": [ + "Azure AI services help developers and organizations rapidly create intelligent, cutting-edge, market-ready, and responsible applications with out-of-the-box and pre-built and customizable APIs and models.\n", + "\n", + "SynapseML allows you to build powerful and highly scalable predictive and analytical models from various Spark data sources. Synapse Spark provide built-in SynapseML libraries including synapse.ml.services." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "important", + "alert" + ] + }, "source": [ "## Important\n", "Starting on the 20th of September, 2023 you won’t be able to create new Anomaly Detector resources. The Anomaly Detector service is being retired on the 1st of October, 2026." ] }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "hide-synapse-internal", + "hide-azure" + ] + }, + "source": [ + "## Prerequisites on Azure Databricks\n", + "\n", + "1. Follow the steps in [Getting started](https://docs.microsoft.com/azure/services-services/big-data/getting-started) to set up your Azure Databricks and Azure AI services environment. This tutorial shows you how to install SynapseML and how to create your Spark cluster in Databricks.\n", + "1. After you create a new notebook in Azure Databricks, copy the **Shared code** below and paste into a new cell in your notebook.\n", + "1. 
Choose a service sample, below, and copy paste it into a second new cell in your notebook.\n", + "1. Replace any of the service subscription key placeholders with your own key.\n", + "1. Choose the run button (triangle icon) in the upper right corner of the cell, then select **Run Cell**.\n", + "1. View results in a table below the cell." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "hide-synapse-internal" + ] + }, + "source": [ + "## Prerequisites on Azure Synapse Analytics\n", + "\n", + "The tutorial, [Pre-requisites for using Azure AI services in Azure Synapse](https://learn.microsoft.com/azure/synapse-analytics/machine-learning/tutorial-configure-cognitive-services-synapse), walks you through a couple steps you need to perform before using Azure AI services in Synapse Analytics.\n" + ] + }, { "attachments": {}, "cell_type": "markdown", @@ -54,7 +101,7 @@ "- Group: divides a group of faces into disjoint groups based on similarity ([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.4/scala/com/microsoft/azure/synapse/ml/services/face/GroupFaces.html), [Python](https://mmlspark.blob.core.windows.net/docs/1.0.4/pyspark/synapse.ml.services.face.html#module-synapse.ml.services.face.GroupFaces))\n", "\n", "### Speech\n", - "[**Speech Services**](https://azure.microsoft.com/services/cognitive-services/speech-services/)\n", + "[**Speech Services**](https://azure.microsoft.com/products/ai-services/ai-speech)\n", "- Speech-to-text: transcribes audio streams ([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.4/scala/com/microsoft/azure/synapse/ml/services/speech/SpeechToText.html), [Python](https://mmlspark.blob.core.windows.net/docs/1.0.4/pyspark/synapse.ml.services.speech.html#module-synapse.ml.services.speech.SpeechToText))\n", "- Conversation Transcription: transcribes audio streams into live transcripts with identified speakers. 
([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.4/scala/com/microsoft/azure/synapse/ml/services/speech/ConversationTranscription.html), [Python](https://mmlspark.blob.core.windows.net/docs/1.0.4/pyspark/synapse.ml.services.speech.html#module-synapse.ml.services.speech.ConversationTranscription))\n", "- Text to Speech: Converts text to realistic audio ([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.4/scala/com/microsoft/azure/synapse/ml/services/speech/TextToSpeech.html), [Python](https://mmlspark.blob.core.windows.net/docs/1.0.4/pyspark/synapse.ml.services.speech.html#module-synapse.ml.services.speech.TextToSpeech))\n", @@ -70,7 +117,7 @@ "\n", "\n", "### Translation\n", - "[**Translator**](https://azure.microsoft.com/services/cognitive-services/translator/)\n", + "[**Translator**](https://azure.microsoft.com/products/ai-services/translator)\n", "- Translate: Translates text. ([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.4/scala/com/microsoft/azure/synapse/ml/services/translate/Translate.html), [Python](https://mmlspark.blob.core.windows.net/docs/1.0.4/pyspark/synapse.ml.services.translate.html#module-synapse.ml.services.translate.Translate))\n", "- Transliterate: Converts text in one language from one script to another script. ([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.4/scala/com/microsoft/azure/synapse/ml/services/translate/Transliterate.html), [Python](https://mmlspark.blob.core.windows.net/docs/1.0.4/pyspark/synapse.ml.services.translate.html#module-synapse.ml.services.translate.Transliterate))\n", "- Detect: Identifies the language of a piece of text. ([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.4/scala/com/microsoft/azure/synapse/ml/services/translate/Detect.html), [Python](https://mmlspark.blob.core.windows.net/docs/1.0.4/pyspark/synapse.ml.services.translate.html#module-synapse.ml.services.translate.Detect))\n", @@ -91,32 +138,13 @@ "- List Custom Models: Get information about all custom models. 
([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.4/scala/com/microsoft/azure/synapse/ml/services/form/ListCustomModels.html), [Python](https://mmlspark.blob.core.windows.net/docs/1.0.4/pyspark/synapse.ml.services.form.html#module-synapse.ml.services.form.ListCustomModels))\n", "\n", "### Decision\n", - "[**Anomaly Detector**](https://azure.microsoft.com/services/cognitive-services/anomaly-detector/)\n", + "[**Anomaly Detector**](https://azure.microsoft.com/products/ai-services/ai-anomaly-detector)\n", "- Anomaly status of latest point: generates a model using preceding points and determines whether the latest point is anomalous ([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.4/scala/com/microsoft/azure/synapse/ml/services/anomaly/DetectLastAnomaly.html), [Python](https://mmlspark.blob.core.windows.net/docs/1.0.4/pyspark/synapse.ml.services.anomaly.html#module-synapse.ml.services.anomaly.DetectLastAnomaly))\n", "- Find anomalies: generates a model using an entire series and finds anomalies in the series ([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.4/scala/com/microsoft/azure/synapse/ml/services/anomaly/DetectAnomalies.html), [Python](https://mmlspark.blob.core.windows.net/docs/1.0.4/pyspark/synapse.ml.services.anomaly.html#module-synapse.ml.services.anomaly.DetectAnomalies))\n", "\n", "### Search\n", - "- [Bing Image search](https://azure.microsoft.com/services/services-services/bing-image-search-api/) ([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.4/scala/com/microsoft/azure/synapse/ml/services/bing/BingImageSearch.html), [Python](https://mmlspark.blob.core.windows.net/docs/1.0.4/pyspark/synapse.ml.services.bing.html#module-synapse.ml.services.bing.BingImageSearch))\n", - "- [Azure Cognitive search](https://docs.microsoft.com/azure/search/search-what-is-azure-search) ([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.4/scala/com/microsoft/azure/synapse/ml/services/search/AzureSearchWriter$.html), 
[Python](https://mmlspark.blob.core.windows.net/docs/1.0.4/pyspark/synapse.ml.services.search.html#module-synapse.ml.services.search.AzureSearchWriter))" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "tags": [ - "hide-synapse-internal" - ] - }, - "source": [ - "## Prerequisites\n", - "\n", - "1. Follow the steps in [Getting started](https://docs.microsoft.com/azure/services-services/big-data/getting-started) to set up your Azure Databricks and Azure AI services environment. This tutorial shows you how to install SynapseML and how to create your Spark cluster in Databricks.\n", - "1. After you create a new notebook in Azure Databricks, copy the **Shared code** below and paste into a new cell in your notebook.\n", - "1. Choose a service sample, below, and copy paste it into a second new cell in your notebook.\n", - "1. Replace any of the service subscription key placeholders with your own key.\n", - "1. Choose the run button (triangle icon) in the upper right corner of the cell, then select **Run Cell**.\n", - "1. View results in a table below the cell." 
+ "- [**Bing Image search**](https://azure.microsoft.com/services/services-services/bing-image-search-api/) ([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.4/scala/com/microsoft/azure/synapse/ml/services/bing/BingImageSearch.html), [Python](https://mmlspark.blob.core.windows.net/docs/1.0.4/pyspark/synapse.ml.services.bing.html#module-synapse.ml.services.bing.BingImageSearch))\n", + "- [**Azure Cognitive search**](https://docs.microsoft.com/azure/search/search-what-is-azure-search) ([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.4/scala/com/microsoft/azure/synapse/ml/services/search/AzureSearchWriter$.html), [Python](https://mmlspark.blob.core.windows.net/docs/1.0.4/pyspark/synapse.ml.services.search.html#module-synapse.ml.services.search.AzureSearchWriter))" ] }, { @@ -662,7 +690,7 @@ ] }, "source": [ - "## Azure Cognitive search sample\n", + "## Azure AI search sample\n", "\n", "In this example, we show how you can enrich data using Cognitive Skills and write to an Azure Search Index using SynapseML." ] diff --git a/docs/Explore Algorithms/Anomaly Detection/Quickstart - Isolation Forests.ipynb b/docs/Explore Algorithms/Anomaly Detection/Quickstart - Isolation Forests.ipynb index 0a4f0e38ea..6461ff518c 100644 --- a/docs/Explore Algorithms/Anomaly Detection/Quickstart - Isolation Forests.ipynb +++ b/docs/Explore Algorithms/Anomaly Detection/Quickstart - Isolation Forests.ipynb @@ -12,8 +12,8 @@ } }, "source": [ - "# Recipe: Multivariate Anomaly Detection with Isolation Forest\n", - "This recipe shows how you can use SynapseML on Apache Spark for multivariate anomaly detection. Multivariate anomaly detection allows for the detection of anomalies among many variables or time series, taking into account all the inter-correlations and dependencies between the different variables. 
In this scenario, we use SynapseML to train an Isolation Forest model for multivariate anomaly detection, and we then use to the trained model to infer multivariate anomalies within a dataset containing synthetic measurements from three IoT sensors.\n", + "# Multivariate Anomaly Detection with Isolation Forest\n", + "This article shows how you can use SynapseML on Apache Spark for multivariate anomaly detection. Multivariate anomaly detection allows for the detection of anomalies among many variables or time series, taking into account all the inter-correlations and dependencies between the different variables. In this scenario, we use SynapseML to train an Isolation Forest model for multivariate anomaly detection, and we then use to the trained model to infer multivariate anomalies within a dataset containing synthetic measurements from three IoT sensors.\n", "\n", "To learn more about the Isolation Forest model please refer to the original paper by [Liu _et al._](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest)." 
] @@ -21,7 +21,11 @@ { "attachments": {}, "cell_type": "markdown", - "metadata": {}, + "metadata": { + "tags": [ + "hide-synapse-internal" + ] + }, "source": [ "## Prerequisites\n", " - If running on Synapse, you'll need to [create an AML workspace and set up linked Service](../../Use%20with%20MLFlow/Overview.md) and add the following installation cell.\n", diff --git a/docs/Explore Algorithms/Causal Inference/Quickstart - Measure Causal Effects.ipynb b/docs/Explore Algorithms/Causal Inference/Quickstart - Measure Causal Effects.ipynb index 6bed6649fd..ef2e5381db 100644 --- a/docs/Explore Algorithms/Causal Inference/Quickstart - Measure Causal Effects.ipynb +++ b/docs/Explore Algorithms/Causal Inference/Quickstart - Measure Causal Effects.ipynb @@ -10,7 +10,7 @@ } }, "source": [ - "# Startup Investment Attribution - Understand Outreach Effort's Effect\"" + "# Startup Investment Attribution - Understand Outreach Effort's Effect" ] }, { @@ -120,7 +120,7 @@ } }, "source": [ - "# Get Causal Effects with SynapseML DoubleMLEstimator" + "## Get Causal Effects with SynapseML DoubleMLEstimator" ] }, { diff --git a/docs/Explore Algorithms/Classification/Quickstart - SparkML vs SynapseML.ipynb b/docs/Explore Algorithms/Classification/Quickstart - SparkML vs SynapseML.ipynb index e68d4bafe3..f5cac1a64d 100644 --- a/docs/Explore Algorithms/Classification/Quickstart - SparkML vs SynapseML.ipynb +++ b/docs/Explore Algorithms/Classification/Quickstart - SparkML vs SynapseML.ipynb @@ -5,7 +5,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Classification - before and after SynapseML" + "# Classification - SparkML vs SynapseML" ] }, { @@ -157,12 +157,12 @@ "## Classify using pyspark\n", "\n", "To choose the best LogisticRegression classifier using the `pyspark`\n", - "library, you need to *explicitly* perform the following steps:\n", + "library, we need to *explicitly* perform the following steps:\n", "\n", "1. 
Process the features:\n", - " * Tokenize the text column\n", - " * Hash the tokenized column into a vector using hashing\n", - " * Merge the numeric features with the vector\n", + " - Tokenize the text column\n", + " - Hash the tokenized column into a vector using hashing\n", + " - Merge the numeric features with the vector\n", "2. Process the label column: cast it into the proper type.\n", "3. Train multiple LogisticRegression algorithms on the `train` dataset\n", " with different hyperparameters\n", diff --git a/docs/Explore Algorithms/OpenAI/OpenAI.ipynb b/docs/Explore Algorithms/OpenAI/OpenAI.ipynb index efccf13565..b2036da722 100644 --- a/docs/Explore Algorithms/OpenAI/OpenAI.ipynb +++ b/docs/Explore Algorithms/OpenAI/OpenAI.ipynb @@ -322,7 +322,18 @@ "source": [ "### Generating Text Embeddings\n", "\n", - "In addition to completing text, we can also embed text for use in downstream algorithms or vector retrieval architectures. Creating embeddings allows you to search and retrieve documents from large collections and can be used when prompt engineering isn't sufficient for the task. For more information on using `OpenAIEmbedding`, see our [embedding guide](./Quickstart%20-%20OpenAI%20Embedding)." + "In addition to completing text, we can also embed text for use in downstream algorithms or vector retrieval architectures. Creating embeddings allows you to search and retrieve documents from large collections and can be used when prompt engineering isn't sufficient for the task." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "hide-synapse-internal" + ] + }, + "source": [ + "For more information on using `OpenAIEmbedding` see our [embedding guide](./Quickstart%20-%20OpenAI%20Embedding)." 
] }, { diff --git a/tools/docgen/docgen/channels.py b/tools/docgen/docgen/channels.py index de1b79e02c..a291da285a 100644 --- a/tools/docgen/docgen/channels.py +++ b/tools/docgen/docgen/channels.py @@ -13,6 +13,7 @@ import requests from bs4 import BeautifulSoup from docgen.core import Channel, ParallelChannel +from docgen.fabric_helpers import LearnDocPreprocessor, HTMLFormatter, sentence_to_snake from markdownify import ATX, MarkdownConverter from nbconvert import MarkdownExporter from nbformat import read @@ -54,111 +55,26 @@ def process(self, input_file: str) -> (): class FabricChannel(Channel): - def __init__(self, input_dir: str, output_dir: str, notebooks: List[dict]): + def __init__( + self, + input_dir: str, + output_dir: str, + notebooks: List[dict], + output_structure, + auto_pre_req, + ): self.input_dir = input_dir self.output_dir = output_dir self.notebooks = notebooks + self.output_structure = output_structure + self.auto_pre_req = auto_pre_req + self.channel = "fabric" self.hide_tag = "hide-synapse-internal" self.media_dir = os.path.join(self.output_dir, "media") def list_input_files(self) -> List[str]: return [n["path"] for n in self.notebooks] - def _sentence_to_snake(self, path: str): - return ( - path.lower() - .replace(" - ", "-") - .replace(" ", "-") - .replace(",", "") - .replace(".ipynb", "") - .replace(".rst", "") - ) - - def _is_valid_url(self, url): - try: - result = urlparse(url) - return all([result.scheme, result.netloc]) - except: - return False - - def _replace_img_tag(self, img_tag, img_path_rel): - img_tag.replace_with( - f':::image type="content" source="{img_path_rel}" ' - f'alt-text="{img_tag.get("alt", "placeholder alt text")}":::' - ) - - def _download_and_replace_images( - self, - html_soup, - resources, - output_folder, - relative_to, - notebook_path, - get_image_from_local=False, - ): - output_folder = output_folder.replace("/", os.sep) - os.makedirs(output_folder, exist_ok=True) - - if resources: - # resources converted from 
notebook - resources_img, i = [], 0 - for img_filename, content in resources.get("outputs", {}).items(): - img_path = os.path.join(output_folder, img_filename.replace("_", "-")) - with open(img_path, "wb") as img_file: - img_file.write(content) - img_path_rel = os.path.relpath(img_path, relative_to).replace( - os.sep, "/" - ) - resources_img.append(img_path_rel) - - img_tags = html_soup.find_all("img") - for img_tag in img_tags: - img_loc = img_tag["src"] - - if self._is_valid_url(img_loc): - # downloaded image - response = requests.get(img_loc) - if response.status_code == 200: - img_filename = self._sentence_to_snake(img_loc.split("/")[-1]) - img_path = os.path.join(output_folder, img_filename) - with open(img_path, "wb") as img_file: - img_file.write(response.content) - img_path_rel = os.path.relpath(img_path, relative_to).replace( - os.sep, "/" - ) - img_tag["src"] = img_path_rel - else: - raise ValueError(f"Could not download image from {img_loc}") - - elif get_image_from_local: - # process local images - img_filename = self._sentence_to_snake(img_loc.split("/")[-1]).replace( - "_", "-" - ) - file_folder = "/".join( - notebook_path.split("/")[:-1] - ) # path read from manifest file - img_input_path = os.path.join( - self.input_dir, file_folder, img_loc - ).replace("/", os.sep) - if not os.path.exists(img_input_path): - raise ValueError(f"Could not get image from {img_loc}") - img_path = os.path.join(output_folder, img_filename) - img_path_rel = os.path.relpath(img_path, relative_to).replace( - os.sep, "/" - ) - shutil.copy(img_input_path, img_path) - - else: - # process image got from notebook resources - img_path_rel = resources_img[i] - img_tag["src"] = img_path_rel - i += 1 - - self._replace_img_tag(img_tag, img_path_rel) - - return html_soup - def _validate_metadata(self, metadata): required_metadata = [ "author", @@ -224,13 +140,64 @@ def _convert_to_markdown_links(self, parsed_html): link["href"] = new_href return parsed_html + def 
_generate_related_content(self, index, output_file): + related_content_index = index + 1 + max_index = len(self.notebooks) + related_content = [] + if max_index > 3: + related_content = ["""## Related content\n"""] + for i in range(3): + if related_content_index >= max_index: + related_content_index = 0 + title = self.notebooks[related_content_index]["metadata"]["title"] + if self.output_structure == "hierarchy": + path = self.notebooks[related_content_index]["path"] + filename = sentence_to_snake( + self.output_dir + + self.notebooks[related_content_index].get("filename", path) + + ".md" + ) + rel_path = os.path.relpath( + filename, os.path.dirname(output_file) + ).replace(os.sep, "/") + related_content.append(f"""- [{title}]({rel_path})""") + elif self.output_structure == "flat": + path = self.notebooks[related_content_index]["path"].split("/")[-1] + filename = sentence_to_snake( + self.notebooks[related_content_index].get("filename", path) + + ".md" + ) + related_content.append(f"""- [{title}]({filename})""") + related_content_index += 1 + return "\n".join(related_content) + def process(self, input_file: str, index: int) -> (): - print(f"Processing {input_file} for fabric") - output_file = os.path.join(self.output_dir, input_file) - output_img_dir = self.media_dir + "/" + self._sentence_to_snake(input_file) + print(f"Processing {input_file} for {self.channel}") full_input_file = os.path.join(self.input_dir, input_file) notebook_path = self.notebooks[index]["path"] + manifest_file_name = self.notebooks[index].get("filename", "") metadata = self.notebooks[index]["metadata"] + + if self.output_structure == "hierarchy": + # keep structure of input file + output_file = os.path.join(self.output_dir, input_file) + output_img_dir = os.path.join(self.media_dir, sentence_to_snake(input_file)) + elif self.output_structure == "flat": + # put under one directory + media_folder = ( + manifest_file_name + if manifest_file_name + else 
sentence_to_snake(input_file.split("/")[-1]) + ) + output_img_dir = os.path.join(self.media_dir, media_folder) + output_file = os.path.join(self.output_dir, input_file.split("/")[-1]) + + if manifest_file_name: + output_file = output_file.replace( + output_file.split("/")[-1].split(".")[0], manifest_file_name + ) + + auto_related_content = self._generate_related_content(index, output_file) self._validate_metadata(metadata) def callback(el): @@ -247,59 +214,40 @@ def convert_soup_to_md(soup, **options): return MarkdownConverter(**options).convert_soup(soup) if str(input_file).endswith(".rst"): - output_file = self._sentence_to_snake( - str(output_file).replace(".rst", ".md") - ) - html = self._read_rst(full_input_file) - parsed_html = markdown.markdown( - html, - extensions=[ - "markdown.extensions.tables", - "markdown.extensions.fenced_code", - ], - ) - parsed_html = BeautifulSoup(parsed_html, features="html.parser") - parsed_html = self._download_and_replace_images( - parsed_html, - None, - output_img_dir, - os.path.dirname(output_file), - notebook_path, - True, + output_file = sentence_to_snake(str(output_file).replace(".rst", ".md")) + content = self._read_rst(full_input_file) + html = HTMLFormatter( + content, + resources=resources, + input_dir=self.input_dir, + notebook_path=input_file, + output_img_dir=output_img_dir, + output_file=output_file, ) parsed_html = self._convert_to_markdown_links(parsed_html) elif str(input_file).endswith(".ipynb"): - output_file = self._sentence_to_snake( - str(output_file).replace(".ipynb", ".md") - ) + output_file = sentence_to_snake(str(output_file).replace(".ipynb", ".md")) parsed = read(full_input_file, as_version=4) c = Config() - c.TagRemovePreprocessor.remove_cell_tags = (self.hide_tag,) - c.TagRemovePreprocessor.enabled = True c.MarkdownExporter.preprocessors = [ - "nbconvert.preprocessors.TagRemovePreprocessor" + LearnDocPreprocessor( + tags_to_remove=[self.hide_tag], auto_pre_req=self.auto_pre_req + ) ] - md, resources 
= MarkdownExporter(config=c).from_notebook_node(parsed) - - html = markdown.markdown( - md, - extensions=[ - "markdown.extensions.tables", - "markdown.extensions.fenced_code", - ], - ) - parsed_html = BeautifulSoup(html) - # Download images and place them in media directory while updating their links - parsed_html = self._download_and_replace_images( - parsed_html, - resources, - output_img_dir, - os.path.dirname(output_file), - None, - False, + content, resources = MarkdownExporter(config=c).from_notebook_node(parsed) + + html = HTMLFormatter( + content, + resources=resources, + input_dir=self.input_dir, + notebook_path=input_file, + output_img_dir=output_img_dir, + output_file=output_file, ) + html.run() + parsed_html = html.bs_html # Remove StatementMeta for element in parsed_html.find_all( @@ -324,8 +272,26 @@ def convert_soup_to_md(soup, **options): ) # Post processing new_md = f"{self._generate_metadata_header(metadata)}\n{new_md}" - output_md = self._remove_content(new_md) + if "## Related content" not in new_md: + new_md += auto_related_content + output_md = re.sub(r"\n{3,}", "\n\n", self._remove_content(new_md)) os.makedirs(dirname(output_file), exist_ok=True) with open(output_file, "w+", encoding="utf-8") as f: f.write(output_md) + + +class AzureChannel(FabricChannel): + def __init__( + self, + input_dir: str, + output_dir: str, + notebooks: List[dict], + output_structure, + auto_pre_req, + ): + super().__init__( + input_dir, output_dir, notebooks, output_structure, auto_pre_req + ) + self.hide_tag = "hide-azure" + self.channel = "azure" diff --git a/tools/docgen/docgen/fabric_helpers.py b/tools/docgen/docgen/fabric_helpers.py new file mode 100644 index 0000000000..f4387558b0 --- /dev/null +++ b/tools/docgen/docgen/fabric_helpers.py @@ -0,0 +1,206 @@ +import difflib +import markdown +from bs4 import BeautifulSoup +from nbconvert.preprocessors import Preprocessor +import os +from urllib.parse import urlparse +import requests +import shutil + + +class 
LearnDocPreprocessor(Preprocessor): + def __init__(self, tags_to_remove=None, auto_pre_req=None, **kwargs): + """ + Initializes the preprocessor with optional remove tags. + :param remove_tags: A list of tags based on which cells will be removed. + """ + super(LearnDocPreprocessor, self).__init__(**kwargs) + self.tags_to_remove = tags_to_remove if tags_to_remove else [] + self.auto_pre_req = auto_pre_req + + def preprocess(self, nb, resources): + """ + Preprocess the entire notebook, removing cells tagged with in remove tag list + and process other cells. + """ + if self.tags_to_remove: + nb.cells = [ + cell + for cell in nb.cells + if not set(self.tags_to_remove).intersection( + cell.metadata.get("tags", []) + ) + ] + + for index, cell in enumerate(nb.cells): + nb.cells[index], resources = self.process_cell(cell, resources, index) + return nb, resources + + def add_auto_prereqs(self): + prerequisites = [ + "## Prerequisites\n\n[!INCLUDE [prerequisites](includes/prerequisites.md)]" + ] + prerequisites.append( + "- Attach your notebook to a lakehouse. On the left side, select **Add** to add an existing lakehouse or create a lakehouse." + ) + return "\n".join(prerequisites) + + def process_cell(self, cell, resources, index): + """ + Adds '> ' before Markdown cells tagged with 'alert' and an alert type. 
+        """
+        if (
+            cell.cell_type == "markdown"
+            and ("tags" in cell.metadata)
+            and ("alert" in cell.metadata["tags"])
+        ):
+            for tag in cell.metadata["tags"]:
+                if tag in ["note", "tip", "important", "warning", "caution"]:
+                    head = f"> [!{tag.upper()}]\n"
+                    cell.source = head + "\n".join(
+                        "> " + line
+                        for line in cell.source.splitlines()
+                        if not line.startswith(f"## {tag.capitalize()}")
+                    )
+        if self.auto_pre_req and index == 1 and cell.cell_type == "markdown":
+            cell.source = self.add_auto_prereqs() + "\n" + cell.source
+        return cell, resources
+
+
+class HTMLFormatter:
+    def __init__(self, content, **kwargs):
+        self.content = content
+        self.attributes = kwargs
+        self.bs_html = None
+        self.resource_images_path_dict = {}
+        self.resources = self.attributes.get("resources", None)
+        self.input_dir = self.attributes.get("input_dir", None)
+        self.notebook_path = self.attributes.get("notebook_path", None)
+        self.output_img_dir = self.attributes.get("output_img_dir", None)
+        self.output_file = self.attributes.get("output_file", None)
+
+    def parse_html(self):
+        extensions = ["markdown.extensions.tables", "markdown.extensions.fenced_code"]
+        html_str = markdown.markdown(self.content, extensions=extensions)
+        input_format = self.attributes.get("input_format", None)
+        features = {"rst": "html.parser", "ipynb": None}.get(input_format, None)
+        self.bs_html = BeautifulSoup(html_str, features=features)
+
+    def manage_images(self):
+        self.process_resource_images()
+        for img in self.bs_html.find_all("img"):
+            img_path = img.get("src")
+            if img_path.startswith("http"):
+                img_path_rel = self.process_external_images(
+                    img_path, output_img_dir=self.output_img_dir
+                )
+            else:
+                img_path_rel = self.process_local_images(img_path)
+            img["src"] = img_path_rel
+            if not img.get("alt"):
+                img["alt"] = img_path_rel.split("/")[-1].split(".")[0].replace("-", " ")
+            self._replace_img_tag(img, img_path_rel)
+
+    def process_resource_images(self):
+        if self.resources:
+            for img_filename, content in self.resources.get("outputs", {}).items():
+                img_path = os.path.join(
+                    self.output_img_dir, img_filename.replace("_", "-")
+                )
+                with open(img_path, "wb") as img_file:
+                    img_file.write(content)
+                img_path_rel = os.path.relpath(
+                    img_path, os.path.dirname(self.output_file)
+                ).replace(os.sep, "/")
+                self.resource_images_path_dict[img_filename] = img_path_rel
+
+    def process_local_images(self, img_loc):
+        # From Resources
+        if img_loc in self.resource_images_path_dict:
+            return self.resource_images_path_dict[img_loc]
+        img_filename = sentence_to_snake(img_loc.split("/")[-1]).replace("_", "-")
+        file_folder = "/".join(
+            self.notebook_path.split("/")[:-1]
+        )  # path read from manifest file
+        img_input_path = os.path.join(self.input_dir, file_folder, img_loc).replace(
+            "/", os.sep
+        )
+        if not os.path.exists(img_input_path):
+            raise ValueError(
+                f"Could not get image from {img_loc} from {img_input_path}"
+            )
+        img_path = os.path.join(self.output_img_dir, img_filename)
+        img_path_rel = os.path.relpath(
+            img_path, os.path.dirname(self.output_file)
+        ).replace(os.sep, "/")
+        shutil.copy(img_input_path, img_path)
+        return img_path_rel
+
+    def process_external_images(self, img_loc, output_img_dir):
+        if self._is_valid_url(img_loc):
+            # downloaded image
+            response = requests.get(img_loc)
+            if response.status_code == 200:
+                img_filename = sentence_to_snake(img_loc.split("/")[-1])
+                if not os.path.exists(output_img_dir):
+                    os.makedirs(output_img_dir)
+                img_path = os.path.join(output_img_dir, img_filename)
+                with open(img_path, "wb") as img_file:
+                    img_file.write(response.content)
+                img_path_rel = os.path.relpath(
+                    img_path, os.path.dirname(self.output_file)
+                ).replace(os.sep, "/")
+                return img_path_rel
+            else:
+                raise ValueError(f"Could not download image from {img_loc}")
+
+    def _is_valid_url(self, url):
+        try:
+            result = urlparse(url)
+            return all([result.scheme, result.netloc])
+        except:
+            return False
+
+    def _replace_img_tag(self, img_tag, img_path_rel):
+        img_name = img_path_rel.split("/")[-1].split(".")[0].replace("-", " ")
+        img_tag.replace_with(
+            f':::image type="content" source="{img_path_rel}" '
+            f'alt-text="{img_tag.get("alt", img_name)}":::'
+        )
+
+    def run(self):
+        self.parse_html()
+        self.manage_images()
+
+
+def compare_doc(fabric_file_path, generated):
+    if fabric_file_path:
+        with open(fabric_file_path, "r") as f:
+            md_content = f.readlines()
+        differ = difflib.Differ()
+        diff = differ.compare(md_content, generated.splitlines())
+        diff_with_row_numbers = [
+            (line[0], line[2:])
+            for line in diff
+            if line.startswith("+") or line.startswith("-")
+        ]
+        diff_with_row_numbers = [
+            (line[0], line[1], index + 1)
+            for index, line in enumerate(diff_with_row_numbers)
+        ]
+        return "\n".join(
+            f"{symbol} {line} (row {row_num})"
+            for symbol, line, row_num in diff_with_row_numbers
+        )
+
+
+def sentence_to_snake(path: str):
+    return (
+        path.lower()
+        .replace(" - ", "-")
+        .replace("_", "-")
+        .replace(" ", "-")
+        .replace(",", "")
+        .replace(".ipynb", "")
+        .replace(".rst", "")
+    )
diff --git a/tools/docgen/docgen/manifest.yaml b/tools/docgen/docgen/manifest.yaml
index 96848562c6..77302a46d8 100644
--- a/tools/docgen/docgen/manifest.yaml
+++ b/tools/docgen/docgen/manifest.yaml
@@ -4,26 +4,31 @@ channels:
     output_dir: "../../../website/docs/"
   - name: channels.FabricChannel
     input_dir: ../../../docs/
-    output_dir: ../../../target/fabric-docs-pr/
+    output_dir: ../../../target/fabric-docs-pr/
+    output_structure: flat # flat / hierarchy
+    auto_pre_req: true
     notebooks:
       - path: Explore Algorithms/AI Services/Multivariate Anomaly Detection.ipynb
+        filename: multivariate-anomaly-detection
         metadata:
           title: Analyze time series
-          description: Use SynapseML and Azure Cognitive Services for multivariate anomaly detection.
+          description: Use SynapseML and Azure AI services for multivariate anomaly detection.
           ms.topic: overview
-          ms.custom: build-2023
-          ms.reviewer: jessiwang
+          ms.custom: "\n - build-2023\n - ignite-2023"
+          ms.reviewer: fsolomon
           author: JessicaXYWang
           ms.author: jessiwang
       - path: Explore Algorithms/AI Services/Overview.ipynb
+        filename: how-to-use-ai-services-with-synapseml
         metadata:
-          title: Cognitive Services in Azure Synapse Analytics
+          title: Azure AI Services Overview
           description: Enrich your data with artificial intelligence (AI) in Azure Synapse Analytics using pretrained models from Azure Cognitive Services.
           ms.topic: overview
           ms.reviewer: jessiwang
           author: JessicaXYWang
           ms.author: jessiwang
       - path: Explore Algorithms/Anomaly Detection/Quickstart - Isolation Forests.ipynb
+        filename: isolation-forest-multivariate-anomaly-detection
         metadata:
           title: Outlier and Anomaly Detection
           description: Use SynapseML on Apache Spark for multivariate anomaly detection with Isolation Forest model.
@@ -33,6 +38,7 @@ channels:
           author: JessicaXYWang
           ms.author: jessiwang
       - path: Explore Algorithms/Causal Inference/Quickstart - Measure Causal Effects.ipynb
+        filename: synapseml-measure-causal-structure
         metadata:
           title: Causal Structure
           description: Causal Structure
@@ -44,22 +50,22 @@ channels:
       - path: Explore Algorithms/Classification/Quickstart - SparkML vs SynapseML.ipynb
         filename: classification-before-and-after-synapseml
         metadata:
-          title: Classification - before and after SynapseML
+          title: Classification using SynapseML
           description: Perform the same classification task with and without SynapseML.
           ms.topic: how-to
           ms.custom: build-2023
           ms.reviewer: jessiwang
           author: JessicaXYWang
           ms.author: jessiwang
-      - path: Explore Algorithms/Deep Learning/Quickstart - Fine-tune a Text Classifier.ipynb
-        metadata:
-          title: Train a Text Classifier
-          description: Train a Text Classifier
-          ms.topic: overview
-          ms.custom: build-2023
-          ms.reviewer: jessiwang
-          author: JessicaXYWang
-          ms.author: jessiwang
+      # - path: Explore Algorithms/Deep Learning/Quickstart - Fine-tune a Text Classifier.ipynb
+      #   metadata:
+      #     title: Train a Text Classifier
+      #     description: Train a Text Classifier
+      #     ms.topic: overview
+      #     ms.custom: build-2023
+      #     ms.reviewer: jessiwang
+      #     author: JessicaXYWang
+      #     ms.author: jessiwang
       - path: Explore Algorithms/Deep Learning/Quickstart - ONNX Model Inference.ipynb
         filename: onnx-overview
         metadata:
@@ -71,23 +77,26 @@ channels:
           author: JessicaXYWang
           ms.author: jessiwang
       - path: Explore Algorithms/Hyperparameter Tuning/Quickstart - Random Search.ipynb
+        filename: hyperparameter-tuning-fighting-breast-cancer
         metadata:
           title: Hyperparameter tuning
           description: Identify the best combination of hyperparameters for your chosen classifiers with SynapseML.
           ms.topic: overview
-          ms.custom: build-2023
+          ms.custom: "\n - build-2023\n - ignite-2023"
           ms.reviewer: jessiwang
           author: JessicaXYWang
           ms.author: jessiwang
-      - path: Explore Algorithms/LightGBM/Quickstart - Classification, Ranking, and Regression.ipynb
-        metadata:
-          title: LightGBM Overview
-          description: build LightGBM model with SynapseML
-          ms.topic: overview
-          ms.reviewer: mopeakande
-          author: JessicaXYWang
-          ms.author: jessiwang
+      # - path: Explore Algorithms/LightGBM/Quickstart - Classification, Ranking, and Regression.ipynb
+      #   filename:
+      #   metadata:
+      #     title: LightGBM Overview
+      #     description: build LightGBM model with SynapseML
+      #     ms.topic: overview
+      #     ms.reviewer: mopeakande
+      #     author: JessicaXYWang
+      #     ms.author: jessiwang
       - path: Explore Algorithms/OpenAI/OpenAI.ipynb
+        filename: open-ai
         metadata:
           title: Azure OpenAI for big data
           description: Use Azure OpenAI service to solve a large number of natural language tasks through prompting the completion API.
@@ -97,6 +106,7 @@ channels:
           author: JessicaXYWang
           ms.author: jessiwang
       - path: Explore Algorithms/OpenAI/Quickstart - Understand and Search Forms.ipynb
+        filename: create-a-multilingual-search-engine-from-forms
         metadata:
           title: Build a Search Engine
           description: Build a custom search engine and question-answering system with SynapseML.
@@ -104,7 +114,7 @@ channels:
           ms.custom: build-2023
           ms.reviewer: jessiwang
           author: JessicaXYWang
-          ms.author: JessicaXYWang
+          ms.author: jessiwang
       - path: Explore Algorithms/Other Algorithms/Quickstart - Exploring Art Across Cultures.ipynb
         filename: conditional-k-nearest-neighbors-exploring-art
         metadata:
@@ -116,6 +126,7 @@ channels:
           author: JessicaXYWang
           ms.author: jessiwang
       - path: Explore Algorithms/Responsible AI/Tabular Explainers.ipynb
+        filename: tabular-shap-explainer
         metadata:
           title: Interpretability - Tabular SHAP explainer
           description: Use Kernel SHAP to explain a tabular classification model.
@@ -125,11 +136,29 @@ channels:
           author: JessicaXYWang
           ms.author: jessiwang
       - path: Get Started/Quickstart - Your First Models.ipynb
+        filename: synapseml-first-model
         metadata:
-          title: SynapseMl first model
+          title: SynapseML first model
           description: A quick introduction to building your first machine learning model with SynapseML.
           ms.topic: how-to
-          ms.custom: build-2023
+          ms.custom: "\n - build-2023\n - ignite-2023"
           ms.reviewer: mopeakande
           author: JessicaXYWang
-          ms.author: jessiwang
\ No newline at end of file
+          ms.author: jessiwang
+  - name: channels.AzureChannel
+    input_dir: ../../../docs/
+    output_dir: ../../../target/azure-docs-pr/
+    output_structure: flat
+    auto_pre_req: false
+    notebooks:
+      - path: Explore Algorithms/AI Services/Overview.ipynb
+        filename: overview-cognitive-services
+        metadata:
+          title: Azure AI services in Azure Synapse Analytics
+          description: Enrich your data with artificial intelligence (AI) in Azure Synapse Analytics using pretrained models from Azure AI services.
+          ms.service: synapse-analytics
+          ms.subservice: machine-learning
+          ms.topic: overview
+          ms.reviewer: sngun, garye, negust, ruxu, jessiwang
+          author: WilliamDAssafMSFT
+          ms.author: wiassaf
\ No newline at end of file
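
Review note: the "alert" tags added to the notebook cells in this patch are consumed by the cell preprocessor in fabric_helpers.py. A minimal standalone sketch of that conversion follows, assuming only the logic visible in the patch. The function name `to_alert_block` and the constant `ALERT_TAGS` are hypothetical and do not appear in the patch; the real code mutates `cell.source` inside a preprocessor method.

```python
# Hypothetical constant: the tag names checked by the patch's preprocessor.
ALERT_TAGS = ["note", "tip", "important", "warning", "caution"]


def to_alert_block(source, tags):
    """Hypothetical helper mirroring the alert handling shown in the patch.

    Turns a markdown cell tagged "alert" plus one alert type into a
    "> [!TYPE]" blockquote, dropping the "## Type" heading line.
    """
    if "alert" not in tags:
        return source
    for tag in tags:
        if tag in ALERT_TAGS:
            head = f"> [!{tag.upper()}]\n"
            source = head + "\n".join(
                "> " + line
                for line in source.splitlines()
                if not line.startswith(f"## {tag.capitalize()}")
            )
    return source


cell_source = "## Important\nThe Anomaly Detector service is being retired."
print(to_alert_block(cell_source, ["alert", "important"]))
```

With the tags used in the notebooks above (`alert` plus `important`), the heading line is dropped and the remaining lines become a quoted alert block. A cell without the `alert` tag is returned unchanged.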