Uses Azure cloud services to architect and orchestrate a weekly data pipeline that performs ETL on the COVID-19 dataset for European countries published by the European Centre for Disease Prevention and Control (ECDC).
Overview • Tools • Architecture • Support • License
The European Centre for Disease Prevention and Control (ECDC) was established in 2005. It is an EU agency aimed at strengthening Europe's defenses against infectious diseases.
Covid 19 Analysis uses Azure services to collect, analyze, and visualize COVID-19 data, with credentials secured through Azure Key Vault and Azure service principals. The project retrieves data from the European Centre for Disease Prevention and Control (ECDC) and combines it with population data to analyze the pandemic's impact. Data is ingested into Azure Data Lake Storage Gen2, which acts as the central storage repository, and is then transformed and explored using Azure Data Factory data flows and Azure Databricks. Azure Key Vault stores and manages sensitive credentials and secrets. Processed data is loaded into an Azure SQL Database for efficient querying, while Azure Data Lake Storage Gen2 holds the intermediate and refined datasets. Power BI reports show the spread of COVID-19 and the level of testing across European countries.
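As a rough illustration of the "combine with population data" step, the sketch below totals weekly case counts per country and scales them by population. It is plain Python rather than the project's PySpark notebooks, and the field names (`country`, `weekly_count`), the function name, and the sample figures are all assumptions made up for the example, not the actual ECDC schema or real data.

```python
from collections import defaultdict


def cases_per_100k(case_rows, population_by_country):
    """Sum weekly case counts per country and express them per 100,000 people.

    Field names are hypothetical; the real ECDC files may use different ones.
    """
    totals = defaultdict(int)
    for row in case_rows:
        totals[row["country"]] += row["weekly_count"]

    rates = {}
    for country, total in totals.items():
        population = population_by_country.get(country)
        if not population:
            # Skip countries that are missing from the population lookup.
            continue
        rates[country] = round(total / population * 100_000, 2)
    return rates


# Made-up sample rows and populations, for illustration only.
sample_cases = [
    {"country": "Country A", "weekly_count": 1000},
    {"country": "Country A", "weekly_count": 500},
    {"country": "Country B", "weekly_count": 50},
    {"country": "Country C", "weekly_count": 10},  # no population entry
]
sample_population = {"Country A": 2_000_000, "Country B": 500_000}

rates = cases_per_100k(sample_cases, sample_population)
```

In Spark the same idea would typically be a join between the cases data and the lookup data followed by a grouped aggregation; the dictionary version here only shows the logic.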
The repository directory structure is as follows:
├── README.md <- The top-level README for developers using this project.
│
├── Data <- Contains data extracted, processed, and used throughout the project.
│ ├── Raw <- Contains raw data folders
│ │
│ ├── Processed <- Contains processed data produced by the Databricks Spark notebooks and Azure data flows.
│ │
│ ├── Lookup <- Contains lookup files used for population and country info.
│ │
│ ├── Config <- Contains the file used to automate the extraction step in ADF.
│
│
├── Databricks Notebooks <- Scripts to aggregate and transform data
│ ├── configuration <- Contains configurations used for mounting ADLS and Azure Key Vault.
│ │
│ ├── transformation <- Contains transformation notebooks
│
├── Resources <- Resources for the README file.
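For orientation, mounting ADLS Gen2 from Databricks with a service principal whose credentials are held in a Key Vault-backed secret scope can be sketched roughly as follows. This is not the content of the repository's `configuration` notebooks: the helper names, the secret key names (`client-id`, `client-secret`, `tenant-id`), and the mount point layout are assumptions chosen for the example. `dbutils` is only available inside a Databricks notebook, so it is passed in as a parameter.

```python
def build_adls_mount_args(storage_account, container, client_id, client_secret, tenant_id):
    """Build the arguments for dbutils.fs.mount using OAuth with a service principal."""
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
    return {
        "source": f"abfss://{container}@{storage_account}.dfs.core.windows.net/",
        # Hypothetical mount point convention: /mnt/<account>/<container>
        "mount_point": f"/mnt/{storage_account}/{container}",
        "extra_configs": configs,
    }


def mount_container(dbutils, secret_scope, storage_account, container):
    """Read credentials from a secret scope and mount the container.

    The secret key names below are assumptions; use whatever names
    the Key Vault actually holds.
    """
    client_id = dbutils.secrets.get(scope=secret_scope, key="client-id")
    client_secret = dbutils.secrets.get(scope=secret_scope, key="client-secret")
    tenant_id = dbutils.secrets.get(scope=secret_scope, key="tenant-id")
    args = build_adls_mount_args(
        storage_account, container, client_id, client_secret, tenant_id
    )
    dbutils.fs.mount(**args)
    return args["mount_point"]
```

Keeping the argument builder separate from the `dbutils` calls means the configuration can be checked outside a Databricks cluster, and no secret value ever needs to appear in the notebook source.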
To build this project, the following tools were used:
- Azure Databricks
- Azure Key Vault
- Azure Active Directory
- Azure Data Lake Storage Gen2
- Azure Data Factory
- Azure SQL Database
- Power BI
- PySpark
- SQL
- Git
The architecture of this project is inspired by the following architecture.
If you have any questions or suggestions, please connect with me on any of the following platforms:
This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.