From 74885302c4f8382992b964615b0b989a74578ddc Mon Sep 17 00:00:00 2001
From: Jenny So
Date: Wed, 23 Sep 2020 16:37:39 -0700
Subject: [PATCH] Fix Batch Scoring docs (#333)

* docs

* more fixes
---
 data/README.md          |  2 +-
 docs/custom_model.md    | 29 +++++++++++++++++++++++++++++
 docs/getting_started.md | 14 ++++----------
 3 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/data/README.md b/data/README.md
index a25aa451..d43d139c 100644
--- a/data/README.md
+++ b/data/README.md
@@ -1,3 +1,3 @@
 This folder is used for example data, and it is not meant to be used for storing training data.
 
-Follow steps to [Configure Training Data]('docs/custom_model.md#configure-training-data.md') to use your own data for training.
\ No newline at end of file
+Follow the steps in [Configure Training Data](../docs/custom_model.md#configure-custom-training) to use your own data for training.
\ No newline at end of file
diff --git a/docs/custom_model.md b/docs/custom_model.md
index d21c8b8d..a554f376 100644
--- a/docs/custom_model.md
+++ b/docs/custom_model.md
@@ -10,6 +10,7 @@ This document provides steps to follow when using this repository as a template
 1. [Optional] Update the evaluation code
 1. Customize the build agent environment
 1. [If appropriate] Replace the score code
+1. [If appropriate] Configure batch scoring data
 
 ## Follow the Getting Started guide
 
@@ -35,6 +36,8 @@ To bootstrap from the existing MLOpsPython repository:
    * `[dirpath]` is the absolute path to the root of the directory where MLOpsPython is cloned
    * `[projectname]` is the name of your ML project
 
+# Configure Custom Training
+
 ## Configure training data
 
 The training ML pipeline uses a [sample diabetes dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html) as training data.
@@ -83,6 +86,8 @@ The DevOps pipeline definitions in the MLOpsPython template run several steps in
 * Create a new Docker image containing your dependencies. See [docs/custom_container.md](custom_container.md). Recommended if you have a larger number of dependencies, or if the overhead of installing additional dependencies on each run is too high.
 * Remove the container references from the pipeline definition files and run the pipelines on self hosted agents with dependencies pre-installed.
 
+# Configure Custom Scoring
+
 ## Replace score code
 
 For the model to provide real-time inference capabilities, the score code needs to be replaced. The MLOpsPython template uses the score code to deploy the model to do real-time scoring on ACI, AKS, or Web apps.
@@ -92,3 +97,27 @@ If you want to keep scoring:
 1. Update or replace `[project name]/scoring/score.py`
 1. Add any dependencies required by scoring to `[project name]/conda_dependencies.yml`
 1. Modify the test cases in the `ml_service/util/smoke_test_scoring_service.py` script to match the schema of the training features in your data
+
+# Configure Custom Batch Scoring
+
+## Configure input and output data
+
+By default, the batch scoring pipeline uses the ML workspace's default datastore for input and output and scores the included sample data.
+
+To use your own input and output datastores instead, specify an Azure Blob Storage account and set up input and output containers in it.
+
+Configure the variables below in your variable group.
+
+**Note: The datastore storage resource, input/output containers, and scoring data are not created automatically. Make sure that you have manually provisioned these resources and placed your scoring data in your input container with the proper name.**
+
+
+| Variable Name | Suggested Value | Short description |
+| ------------- | --------------- | ----------------- |
+| SCORING_DATASTORE_STORAGE_NAME | | [Azure Blob Storage Account](https://docs.microsoft.com/en-us/azure/storage/blobs/) name. |
+| SCORING_DATASTORE_ACCESS_KEY | | [Azure Storage Account Key](https://docs.microsoft.com/en-us/rest/api/storageservices/authorize-requests-to-azure-storage). Consider linking this variable to Azure KeyVault to avoid storing the access key in plain text. |
+| SCORING_DATASTORE_INPUT_CONTAINER | | The name of the container for input data. Defaults to `input` if not set. |
+| SCORING_DATASTORE_OUTPUT_CONTAINER | | The name of the container for output data. Defaults to `output` if not set. |
+| SCORING_DATASTORE_INPUT_FILENAME | | The filename of the input data in your container. Defaults to `diabetes_scoring_input.csv` if not set. |
+| SCORING_DATASET_NAME | | The AzureML Dataset name to use (optional). Defaults to `diabetes_scoring_ds` if not set. |
+| SCORING_DATASTORE_OUTPUT_FILENAME | | The filename for the output data; the pipeline will create this file (optional). Defaults to `diabetes_scoring_output.csv` if not set. |
+
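+To make the flow concrete, here is a minimal sketch of how these variables could be
+turned into a registered datastore and dataset with the azureml-core SDK v1. The
+`scoring_input` datastore name, the `Workspace.from_config()` lookup, and reading the
+values from environment variables are illustrative assumptions, not the template's
+exact code:
+
+```python
+import os
+
+from azureml.core import Dataset, Datastore, Workspace
+
+ws = Workspace.from_config()  # assumes a local workspace config.json
+
+# Register the input blob container as a datastore; the names here are
+# illustrative, not the template's actual identifiers.
+input_datastore = Datastore.register_azure_blob_container(
+    workspace=ws,
+    datastore_name="scoring_input",
+    container_name=os.environ.get("SCORING_DATASTORE_INPUT_CONTAINER", "input"),
+    account_name=os.environ["SCORING_DATASTORE_STORAGE_NAME"],
+    account_key=os.environ["SCORING_DATASTORE_ACCESS_KEY"],
+)
+
+# Wrap the input file in a tabular dataset and register it so the batch
+# scoring pipeline can consume it by name.
+input_file = os.environ.get("SCORING_DATASTORE_INPUT_FILENAME", "diabetes_scoring_input.csv")
+dataset = Dataset.Tabular.from_delimited_files(path=[(input_datastore, input_file)])
+dataset.register(workspace=ws, name=os.environ.get("SCORING_DATASET_NAME", "diabetes_scoring_ds"))
+```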
diff --git a/docs/getting_started.md b/docs/getting_started.md
index a59cad0a..7a311cf8 100644
--- a/docs/getting_started.md
+++ b/docs/getting_started.md
@@ -64,9 +64,8 @@ The variable group should contain the following required variables. **Azure reso
 | WORKSPACE_NAME | mlops-AML-WS | Azure ML Workspace name |
 | AZURE_RM_SVC_CONNECTION | azure-resource-connection | [Azure Resource Manager Service Connection](#create-an-azure-devops-service-connection-for-the-azure-resource-manager) name |
 | WORKSPACE_SVC_CONNECTION | aml-workspace-connection | [Azure ML Workspace Service Connection](#create-an-azure-devops-azure-ml-workspace-service-connection) name |
-| ACI_DEPLOYMENT_NAME | mlops-aci | [Azure Container Instances](https://azure.microsoft.com/en-us/services/container-instances/) name |
-| SCORING_DATASTORE_STORAGE_NAME | [your project name]scoredata | [Azure Blob Storage Account](https://docs.microsoft.com/en-us/azure/storage/blobs/) name (optional) |
-| SCORING_DATASTORE_ACCESS_KEY | | [Azure Storage Account Key](https://docs.microsoft.com/en-us/rest/api/storageservices/authorize-requests-to-azure-storage) (optional) |
+| ACI_DEPLOYMENT_NAME | mlops-aci | [Azure Container Instances](https://azure.microsoft.com/en-us/services/container-instances/) name |
+
 Make sure you select the **Allow access to all pipelines** checkbox in the variable group configuration.
 
@@ -88,10 +87,6 @@ More variables are available for further tweaking, but the above variables are a
 
 **ACI_DEPLOYMENT_NAME** is used for naming the scoring service during deployment to [Azure Container Instances](https://azure.microsoft.com/en-us/services/container-instances/).
 
-**SCORING_DATASTORE_STORAGE_NAME** is the name for an Azure Blob Storage account that will contain both data used as input to batch scoring, as well as the batch scoring outputs. This variable is optional and only needed if you intend to use the batch scoring facility. Note that since this resource is optional, the resource provisioning pipelines mentioned below do not create this resource automatically, and manual creation is required before use.
-
-**SCORING_DATASTORE_ACCESS_KEY** is the access key for the scoring data Azure storage account mentioned above. You may want to consider linking this variable to Azure KeyVault to avoid storing the access key in plain text. This variable is optional and only needed if you intend to use the batch scoring facility.
-
 ## Provisioning resources using Azure Pipelines
 
@@ -295,11 +290,10 @@ The pipeline stages are summarized below:
   - If run locally without the model version, the batch scoring pipeline will use the model's latest version.
 - Trigger the *ML Batch Scoring Pipeline* and waits for it to complete.
   - This is an **agentless** job. The CI pipeline can wait for ML pipeline completion for hours or even days without using agent resources.
-- Use the scoring input data supplied via the SCORING_DATASTORE_INPUT_* configuration variables.
+- Use the scoring input data supplied via the SCORING_DATASTORE_INPUT_* configuration variables, or the default datastore and sample data if those variables are not set.
 - Once scoring is completed, the scores are made available in the same blob storage at the locations specified via the SCORING_DATASTORE_OUTPUT_* configuration variables.
 
-**Note** In the event a scoring data store is not yet configured, you can still try out batch scoring by supplying a scoring input data file within the data folder. Do make sure to set the SCORING_DATASTORE_INPUT_FILENAME variable to the name of the file. This approach will cause the score output to be written to the ML workspace's default datastore.
-
+To configure your own scoring data, see [Configure Custom Batch Scoring](custom_model.md#configure-custom-batch-scoring).
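+
+For reference, here is a minimal sketch of retrieving the scores with the azureml-core
+SDK v1. It assumes the output container was registered as a datastore named
+`scoring_output` (analogous to the input registration sketched in
+[custom_model.md](custom_model.md)); the names and the `Workspace.from_config()`
+lookup are illustrative, not the template's exact code:
+
+```python
+from azureml.core import Datastore, Workspace
+
+ws = Workspace.from_config()  # assumes a local workspace config.json
+
+# Look up the (hypothetical) registered output datastore and download the
+# scores written by the batch scoring pipeline.
+output_datastore = Datastore.get(ws, "scoring_output")
+output_datastore.download(
+    target_path=".",
+    prefix="diabetes_scoring_output.csv",  # the default SCORING_DATASTORE_OUTPUT_FILENAME
+)
+```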
 
 ## Further Exploration