This solution helps build out a process to collect data contained in an Azure storage account, internal or external. The PowerShell scripts build out all the components of this architecture using a Synapse Analytics workspace, Azure Data Lake Storage, and an Azure Key Vault. They also build the parameter-driven pipelines to automate sharing the data (the example uses CSVs) across environments.

GIANT SECURITY DISCLAIMER: Due to what this does, it's not the most secure way to set up an environment!! You need to verify that you're going to be allowed to do this in your Azure environment and potentially plan the location based on policies, etc. You will need the connection string for the storage account you pull data from, so there will need to be clearance and blessings on both sides of this solution. The connection string contains the storage keys, which are the highest level of access you can grant at the storage account level. You may need to set up separate subscriptions/storage accounts that aren't under restrictions or policies such as HIPAA. PROCEED WITH CAUTION AND MAKE SURE YOU'VE GOTTEN APPROVAL FROM YOUR AZURE ADMINS AND DATA GOVERNANCE ENTITIES!
The architecture of the solution is diagrammed below.
- This full solution will create an Azure Synapse Analytics workspace, 2 ADLS Gen 2 storage accounts, and an Azure Key Vault.
- This also builds a parameterized pipeline that can reach into storage accounts to grab CSV files, process them, and land them as Parquet files in a consolidated ADLS Gen 2 data lake storage account.
- It utilizes Azure Key Vault secrets that contain the storage account connection strings (connection info containing storage keys - https://docs.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage?tabs=azure-portal#regenerate-access-keys); see the sketch after this list.
- The example CSV files can be landed in any storage account for which you have access to the storage keys. You can put them in any location you'd like; just note the location.
- You can run the pipeline passing in parameters detailing the secret name, container, filename prefix, and further directory pathing if needed.
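The creation script prompts you for the connection string text and creates these secrets for you; purely as a rough illustration of what one of those secrets amounts to (the vault and secret names here are placeholders, not values from the paramfile), it is roughly equivalent to:

```powershell
# Illustration only - "MyKeyVault" and "StorageConn1" are placeholder names; the
# "01 - Create Resources DataShare.ps1" script creates the real secrets after prompting you.
# Assumes you're already signed in with Connect-AzAccount and have rights on the vault.
$connString = Read-Host -Prompt "Paste the storage account connection string" -AsSecureString
Set-AzKeyVaultSecret -VaultName "MyKeyVault" -Name "StorageConn1" -SecretValue $connString
```

The pipeline later looks the secret up by name, which is what the ParamSecret pipeline parameter refers to.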
The PowerShell scripts will create all the items below and are driven by values in the paramfile. You will only need to edit the paramfile03.json file. The table below details all the components the solution will create and each one's corresponding variable in the paramfile.
Azure Item | paramfile value | Details |
---|---|---|
Azure Resource Group | resourceGroupName | All the components are built within 1 resource group |
Azure Synapse Analytics workspace | azsynapsename | Synapse Analytics workspace name |
Azure Data Lake Gen 2 | azstoragename | ADLS Gen 2 for system use of Synapse workspace |
Azure Data Lake Gen 2 | azstoragename2 | ADLS Gen 2 to land processed parquet files |
Azure Key Vault | akvname | Azure Key Vault to store connection string info in secret |
3 Synapse Linked Services | LinkedServiceNameX | Linked Services to use in the pipeline |
2 Synapse datasets | DatasetNameX | Synapse Datasets to use in the pipeline |
1 Synapse pipeline | PipelineName1 | Synapse parameter driven pipeline |
- Download the files contained within DataShare Full Sample locally to your machine.
- Update the paramfile03.json with the values and naming you want and save.
- Save CSV files in the appropriate source Azure storage locations you want to pull data from. Capture the text of the connection string for later use when running the scripts (see the sketch below).
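If you need to grab that connection string text, one hedged way to do it (the resource group and storage account names are placeholders, and this assumes you're already signed in with Connect-AzAccount) is:

```powershell
# Sketch only - replace the resource group and storage account names with your own.
# Assumes you've already run Connect-AzAccount for the subscription holding the source storage account.
$rg      = "my-source-rg"
$account = "mysourcestorage"

$key = (Get-AzStorageAccountKey -ResourceGroupName $rg -Name $account)[0].Value
$connString = "DefaultEndpointsProtocol=https;AccountName=$account;AccountKey=$key;EndpointSuffix=core.windows.net"
$connString   # capture this text for the script prompts
```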
- Run the PowerShell scripts you need. Below are the commands you'll run using Azure CLI to build the full solution (blog on setting up Azure CLI to run scripts: https://hopefoley.com/2021/09/27/powershell-in-the-clouds/).
- Build all the pieces in Azure (you'll be prompted a few times for information: 1 - username and password for the Synapse SQL admin account; 2 - text of the connection string to the storage account, which builds 2 secrets you can use for the pipeline):
./"01 - Create Resources DataShare.ps1" -filepath ./paramfile03.json
- Grant rights for storage and AKV:
./"02 - GrantRightsDataShareCLI.ps1" -filepath ./paramfile03.json
- Build the Synapse pipeline components:
./"03 - Create Pipeline Parts DataShare.ps1" -filepath ./paramfile03.json
- Run the pipeline with appropriate parameter values (a sketch follows the parameter table below). You will also want to update the Synapse workspace firewall from Azure CLI to allow your current client IP; the DisClient firewall rule is set up during automation to allow the Azure CLI client to access Synapse while the components are created.
Parameter Name | Value details |
---|---|
ParamSecret | Name of the secret containing connection string |
Container | Name of container holding CSV files |
FilePrefix | All CSVs that begin with this text will be processed |
DirectoryPath | Directory path within the container, if needed (e.g. foldername1/foldername2/) |
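As a hedged sketch of both steps (this uses the Az.Synapse module rather than the az CLI; the workspace name, pipeline name, firewall rule name, IP address, and parameter values below are all placeholders - the real names come from your paramfile03.json and your environment):

```powershell
# Sketch only - all names, IPs, and parameter values here are placeholders.
# Assumes you're signed in (Connect-AzAccount) and the Az.Synapse module is installed.
$workspace = "mysynapsews"   # azsynapsename from paramfile03.json
$myIp      = "203.0.113.10"  # your current client IP

# Open the Synapse workspace firewall to your client IP.
New-AzSynapseFirewallRule -WorkspaceName $workspace -Name "MyClientIP" `
    -StartIpAddress $myIp -EndIpAddress $myIp

# Trigger the parameter-driven pipeline (PipelineName1 in the paramfile).
Invoke-AzSynapsePipeline -WorkspaceName $workspace -PipelineName "SharePipeline" -Parameter @{
    ParamSecret   = "StorageConn1"
    Container     = "sourcedata"
    FilePrefix    = "sales"
    DirectoryPath = "foldername1/foldername2/"
}
```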
There is a script, 00 - PreReqCheck.ps1, you can use to help check for these prerequisites and install them. Navigate to DataShare Full Sample for more details on running the script.
- Open that file in your preferred PowerShell IDE (administrator mode is required to install anything).
- PowerShell 5.1 or above needs to be installed locally. You can check this by running: $PSVersionTable.PSVersion
- PowerShell modules needed - the latest versions I tested are listed below. 00 - PreReqCheck.ps1 has syntax to install them (a manual install sketch follows this list).
- Az.Resources (2.2.0) - https://docs.microsoft.com/en-us/powershell/module/az.resources
- Az.Storage (3.10.0) - https://docs.microsoft.com/en-us/powershell/module/az.storage
- Az.Synapse (0.13.0) - https://docs.microsoft.com/en-us/powershell/module/az.synapse
- Az.KeyVault (1.0.0) - https://docs.microsoft.com/en-us/powershell/module/az.keyvault
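If you'd rather install or update the modules by hand instead of using 00 - PreReqCheck.ps1, a minimal sketch is:

```powershell
# Installs the Az modules used by the scripts for the current user (run PowerShell as
# administrator and use -Scope AllUsers if you want a machine-wide install).
# Versions newer than those listed above should generally work.
Install-Module -Name Az.Resources, Az.Storage, Az.Synapse, Az.KeyVault -Scope CurrentUser -Force

# Confirm what's installed:
Get-InstalledModule -Name Az.Resources, Az.Storage, Az.Synapse, Az.KeyVault
```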