Hadoop Migration on Azure PaaS

Note: This is part of the Enabling Hadoop Migrations to Azure reference implementation. For more information, check out the [readme file in the root](https://github.com/Azure/Hadoop-Migrations/blob/main/README.md).

One of the challenges in migrating workloads from on-premises Hadoop to Azure is getting a deployment that aligns with the desired end-state architecture and the application. This Bicep project aims to significantly reduce the effort that goes into deploying the PaaS services on Azure and getting a production-ready architecture up and running.

We will look at the end-state architecture for big data workloads on Azure PaaS, listing all the components deployed as part of the Bicep template deployment. Bicep also gives us the additional advantage of deploying only the modules we prefer for a customised architecture. The later sections cover the prerequisites for the template and the different methods of deploying the resources on Azure: one-click, Azure CLI, GitHub Actions, and Azure DevOps Pipelines.

Reference Architecture Deployment

By default, all the services in the reference architecture are enabled, and you must explicitly disable any services you don't want deployed: from the parameters prompted on the ARM screen in the portal, in the *.parameters.json template files, or directly in the *.bicep files.

Note: Before deploying the resources, we recommend checking the registration status of the required resource providers in your subscription. For more information, see Resource providers for Azure services.
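For example, you can check (and, if needed, register) a provider from the CLI. A minimal sketch, using Microsoft.Synapse as an assumed example; substitute the providers your deployment needs:

# Check the current registration state of a provider (Microsoft.Synapse is just an example)
az provider show --namespace Microsoft.Synapse --query registrationState -o tsv

# Register it if the state is "NotRegistered"
az provider register --namespace Microsoft.Synapse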

Reference Architecture - Modernization, Concept

Based on the detailed architecture above, the end-state deployment is simplified below for better understanding.

Deployment Architecture

For the reference architecture, the following services are created.

For more details regarding the services that will be deployed, please read the Domains guide (link) in the Hadoop Migration documentation.

Before you start

If you don't have an Azure subscription, create your Azure free account today.

Prerequisites

  1. Azure CLI
  2. Bicep

In this quickstart, you will create:

  1. Resource Group
  2. Service Principal and access
  3. Public Key for SSH (Optional)

1. Resource Group

The Azure CLI's default authentication method uses a web browser and access token to sign in.

Run the login command

az login

Once the authentication is successful, you should see output similar to the following:

  {
    "cloudName": "AzureCloud",
    "homeTenantId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "isDefault": true,
    "managedByTenants": [],
    "name": "xxxxxxxxxxxx",
    "state": "Enabled",
    "tenantId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "user": {
      "name": "[email protected]",
      "type": "user"
    }
  },

Copy the subscription id from the output above; you will need it to create more resources.
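If you need to retrieve it again later, the Azure CLI can print the current subscription id directly (a standard az account query, shown here as a convenience):

# Print the id of the currently active subscription
az account show --query id -o tsv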

Create a resource group using the command below:

az group create -l <Your Region> -n <Resource Group Name> --subscription <Your Subscription Id>

2. Service Principal and access

An Azure service principal is an identity created for use with applications, hosted services, and automated tools to access Azure resources programmatically. One needs to be generated for authentication and authorization by Key Vault. Get the subscription id from the output saved earlier. Open Cloud Shell or the Azure CLI, set the Azure context, and execute the following commands to generate the required credentials:

Note: The purpose of this new Service Principal is to assign least-privilege rights. Therefore, it requires the Contributor role at resource group scope in order to deploy the resources inside the resource group dedicated to a specific data domain. The Network Contributor role assignment is also required in this repository in order to assign the resources to the dedicated subnet.

**** to-be-updated ****

az ad sp create-for-rbac -n <Your App Name>

You should see output similar to the following:

{
  "appId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "displayName": "<Your App Name>",
  "name": "http://<Your App Name>",
  "password": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "tenant": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
}

Save the appId and password for the upcoming steps.
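If you prefer to grant the least-privilege scope at creation time, as the note above suggests, az ad sp create-for-rbac also accepts --role and --scopes. A minimal sketch with placeholder values, not the repo's prescribed command:

# Create the service principal with Contributor scoped to a single resource group
az ad sp create-for-rbac -n <Your App Name> \
 --role Contributor \
 --scopes /subscriptions/<Your Subscription Id>/resourceGroups/<Resource Group Name>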

3. Public Key for SSH

This is an optional step; follow it when you want to deploy VMs in the VNets for testing purposes. This section shows you how to quickly generate and use an SSH public-private key file pair for Linux VMs.

To create and use an SSH public-private key pair for Linux VMs in Azure:
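If you don't already have a key pair at ~/.ssh/id_rsa, you can generate one first. A minimal sketch using standard OpenSSH tooling (the key type and size here are common defaults, not requirements of this repo):

# Generate a 4096-bit RSA key pair at the default path
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa

Then print the public key: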

cat ~/.ssh/id_rsa.pub

Save the output of the above command as the public key and store it in a safe location, to be used in the upcoming commands.

Supported Regions

Most Azure regions have the majority of data & analytics services available; some of them are listed below:

  • Canada Central
  • Canada East
  • Central US
  • East US
  • East US 2
  • North Central US
  • South Central US
  • West Central US
  • West US
  • West US 2

Deployment methods

There are four methods available for deploying this reference architecture; let's look at each one individually:

  1. One-click Quickstart button
  2. Azure CLI
  3. GitHub Actions
  4. Azure DevOps Pipeline

1. Quickstart Button

  • Infrastructure

Deploy To Azure Visualize

  • Key Vault

Deploy To Azure Visualize

  • Services all-at-once

Deploy To Azure Visualize

2. Deploying using CLI

Double-check that you're logged in.

az login

Clone this repo to your environment

git clone https://github.com/nudbeach/data-platform-migration.git

You can run all of the following commands from the home directory of data-platform-migration. Create a resource group with a location, using your subscription id from the previous step:

az group create -l koreacentral -n <Your Resource Group Name> \
 --subscription xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

Deploy the components by running these commands sequentially:

az deployment group create -g <Your Resource Group Name> -f main/main-infra.bicep

az deployment group create -g <Your Resource Group Name> -f main/main-keyvault.bicep

az deployment group create -g <Your Resource Group Name> -f main/main-service-all-at-once.bicep

or

az deployment group create -g <Your Resource Group Name> \
 -f main/main-infra.bicep \
 --parameters main/main-service-infra.json

az deployment group create -g <Your Resource Group Name> \
 -f main/main-service-keyvault.bicep \
 --parameters main/main-service-keyvault.json

az deployment group create -g <Your Resource Group Name> \
 -f main/main-service-all-at-once.bicep \
 --parameters main/main-service-all-at-once.parameters.json

--parameters <parameter filename> is optional
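Before committing to a full deployment, you can also preview the changes a template would make. A sketch using the Azure CLI's what-if option (a general Azure CLI feature, not specific to this repo):

# Show what would be created, changed, or deleted without deploying anything
az deployment group create -g <Your Resource Group Name> \
 -f main/main-infra.bicep \
 --what-if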

or, after you run ./build.sh from the command line:

az deployment group create -g <Your Resource Group Name> \
 -f build/main-infra.json \
 --parameters build/main-service-infra.json

az deployment group create -g <Your Resource Group Name> \
 -f build/main-service-keyvault.json \
 --parameters build/main-service-keyvault.json

az deployment group create -g <Your Resource Group Name> \
 -f build/main-service-all-at-once.json \
 --parameters build/main-service-all-at-once.parameters.json

3. Deploying using GitHub Actions with automation

This option consists of four steps:

  1. Role assignments to Service Principal
  2. Setting up AZURE_CREDENTIAL
  3. Pipeline implementation
  4. Running Workflow

1. Role assignments to Service Principal

In the previous step, you already created a Service Principal for Key Vault. But now we're going to create another one for client authentication backed by Azure AD, dedicated to GitHub Actions and the Azure DevOps Pipeline:

az ad sp create-for-rbac --name <Your App Name -2> \
 --role contributor \
 --scopes <Your Resource Group Id> \
 --sdk-auth

<Your App Name -2> is a different name from the previous one. You'll then get something like this:

{
  "clientId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "clientSecret": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "subscriptionId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "tenantId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "activeDirectoryEndpointUrl": "https://login.microsoftonline.com",
  "resourceManagerEndpointUrl": "https://management.azure.com/",
  "activeDirectoryGraphResourceId": "https://graph.windows.net/",
  "sqlManagementEndpointUrl": "https://management.core.windows.net:8443/",
  "galleryEndpointUrl": "https://gallery.azure.com/",
  "managementEndpointUrl": "https://management.core.windows.net/"
}

Keep the entire JSON document for the next step. Then, all of the required roles below have to be assigned to this brand-new service principal:

  • Contributor
  • Private DNS Zone Contributor
  • Network Contributor
  • User Access Administrator

User Access Administrator is needed for the role assignments that grant the data platform services access to the storage account.

az role assignment create \
 --assignee "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" \
 --role "<Role Name>" \
 --resource-group <Your Resource Group>

Run this command once for each of the required roles listed above against your resource group. Or run this script from your command line:

./roleassign.sh <Your App Name -2> <Your Resource Group Name>
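A minimal sketch of what such a loop looks like if you script it yourself (hypothetical; the roleassign.sh in the repo may differ):

# Resolve the appId of the service principal by display name
APP_ID=$(az ad sp list --display-name "<Your App Name -2>" --query "[0].appId" -o tsv)

# Assign each required role at resource group scope
for ROLE in "Contributor" "Private DNS Zone Contributor" "Network Contributor" "User Access Administrator"; do
  az role assignment create --assignee "$APP_ID" --role "$ROLE" --resource-group "<Your Resource Group Name>"
done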

2. Setting up AZURE_CREDENTIAL

Next, you need to let GitHub Actions authenticate for all access to your resources. This is simple. Before you do this, I recommend forking the repo under your GitHub account so that you can easily update actions. From the Settings menu on the repo, go to Secrets and click on 'New repository secret'. Put the name as AZURE_CREDENTIAL, paste the entire JSON document you got from the previous step into the value, and click the 'Add secret' button below. From now on, all access to your resources on Azure will be authenticated using this token.

Save AZURE_CREDENTIAL
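If you prefer the command line, the GitHub CLI can set the secret as well. A sketch, assuming gh is installed and authenticated and that the JSON from the previous step is saved to a file named creds.json (a hypothetical filename):

# Store the service principal JSON as a repository secret
gh secret set AZURE_CREDENTIAL < creds.json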

3. Pipeline implementation

Check out your repo forked from data-platform-migration and go to .github/workflows. Open the workflow file 'deployment-hdmi001.yml' and simply update the environment variables with yours:

  • AZURE_SUBSCRIPTION_ID
  • AZURE_RESOURCE_GROUP_NAME
  • AZURE_LOCATION

You can externalize these environment variables to an env file. See this for further details.

Run this command from your command line to push the update to your repo:

git add . ; git commit -m "my first commit" ; git push

In this example, the workflow only runs manually via the workflow_dispatch event; it never runs automatically on push and pull request events. You can enable that by uncommenting the initial parts like this:

# Controls when the action will run. 
on:
  # Triggers the workflow on push or pull request events but only for the main branch
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

4. Running Workflow

Go to the 'Actions' tab on your repo and click on 'Deployment for Project HDMI001'. Then click on 'Run workflow' on the right.

Run Workflow

Now you can see the running workflow

Running Workflow
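Alternatively, you can trigger the same workflow_dispatch event from the command line with the GitHub CLI, assuming gh is installed and authenticated and using the workflow file name from above:

# Trigger the workflow manually, equivalent to clicking 'Run workflow'
gh workflow run deployment-hdmi001.yml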

4. Deploying using Azure DevOps with automation

GitHub Actions and Azure DevOps look similar in terms of structure and concepts, but they are not 100% the same. Compared with the GitHub Actions instructions, you can reuse most of what you created in the initial steps 'Role assignments to Service Principal' and 'Setting up AZURE_CREDENTIAL'. For the 'Pipeline implementation' step, there are a few differences in workflow syntax; for example, on and env are not supported in Azure DevOps, so you can just remove them and externalize them to "Environment" and "Variables". I would therefore recommend making a copy of the repo in Azure DevOps before you start.

  1. Create Azure ARM connection
  2. Configure your Pipeline
  3. Run the pipeline

1. Create Azure ARM connection

First of all, create your project and select it. From "Project Settings" at the bottom left, go to "Service connections" and click on "New service connection" at the top right.

In the "New Service Connection" tab, select "Azure Resource Manager", and select "Next"

Select "Service principal (manual)" and clock on "Next"

New Azure Service Connection

From "New Azure service connection" tab, select "Subscription", input your Subscriotion id and Subscriotion id, Enter details with service principal that we have generated in GitHub Actionconfig.

  • Service Principal Id = clientId
  • Service Principal Key = clientSecret
  • Tenant ID = tenantId

And click on Verify to make sure that the connection works.

Verify Settings

Input a user-friendly connection name to use when referring to this service connection, and a description if you want. Take note of the name, because it will be required in the parameter update process.

That's it

2. Configure your Pipeline

Go to Pipelines, click on "New Pipeline" button on top right.

Select "Azure Repo Git" if you've already made clone to your repo, or you can select Github to connect to your repo which's forked from this repo

Configure Your Pipeline

Select "Existing Azure Pipelines YAML file", then you'll see the Github Action workflow file under .github/workflows, in the next step, you can make some changes directly on the web UI so that you can configure your pipeline and try to test it

Known issues

Warning: no-unused-params

Warning Message:

data-platform-migration/modules/create-vnets-with-peering/azuredeploy.bicep(25,7) : Warning no-unused-params: Parameter is declared but never used. [https://aka.ms/bicep/linter/no-unused-params]
data-platform-migration/modules/create-vnets-with-peering/azuredeploy.bicep(28,7) : Warning no-unused-params: Parameter is declared but never used. [https://aka.ms/bicep/linter/no-unused-params]
data-platform-migration/modules/create-private-dns-zone/azuredeploy.bicep(11,7) : Warning no-unused-params: Parameter is declared but never used. [https://aka.ms/bicep/linter/no-unused-params]
data-platform-migration/modules/create-private-dns-zone/azuredeploy.bicep(14,7) : Warning no-unused-params: Parameter is declared but never used. [https://aka.ms/bicep/linter/no-unused-params]
data-platform-migration/modules/create-vm-simple-linux/azuredeploy.bicep(19,7) : Warning no-unused-params: Parameter is declared but never used. [https://aka.ms/bicep/linter/no-unused-params]

Solution: Simply ignore these warnings. They come from optional settings for VM, HDI, Synapse and so on; if you set these to false, the corresponding parameters are never used. You can just ignore the warnings when you deploy with the CLI or the Quickstart button, but in a GitHub Actions or Azure DevOps pipeline you need to skip them by adding continue-on-error: true to the jobs or steps. That's because the current version of the deployment agent in the Azure CLI treats them as errors rather than warnings. As of this writing, that is Bicep CLI version 0.3.255 (589f0375df).

The warning is guidance to reduce confusion in your template: delete any parameters that are defined but not used. This test finds any parameters that aren't used anywhere in the template. Eliminating unused parameters also makes it easier to deploy your template, because you don't have to provide unnecessary values. You can find further details here.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.

[//]: # (This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.)
