GicharuElvis/Complete-Data-Engineering

 
 

Complete Data Engineering with Projects Series

This repository contains everything you need to become proficient in Data Engineering.


Pic credits: infra

Complete Cheat Sheet for Tech Interviews - How to prepare efficiently

I took these Project-Based Courses to Build Industry-Aligned Data Science and ML Skills

Part 1 - How to solve Any ML System Design Problem


Pre-requisite : Day 1 — Day 60 : Quick Recap of 60 days of Data Science and ML


  1. Data Engineering

What is Data Engineering?

Why Data Engineering?

Data Engineers, ML Engineers, and Data Scientists

Purpose and Scope


  2. Python for Data Engineering

Basic Python with Project

Advanced Python with Project

Techniques to write efficient and optimized code


  3. SQL Basics

Structured Query Language

Query Structure

Conditions

Joins

Stored Procedures


  4. Aggregations

Wildcards

Grouping Data

Aggregation Functions

Filtering

Sequences

Group By, Order By

Having Clause

Writing Subqueries

Grouping Sets

Analytical Functions


  5. Window Functions

Row Numbering

Percentile

Advanced windowing techniques


  6. BigQuery

BigQuery Basics

SELECT, FROM, WHERE, DATE, and EXTRACT in BigQuery

Common Table Expressions (CTEs)

UNNEST Clause

SQL vs NoSQL Database


  7. Advanced Functions

Triggers

Pivot

Cursors

Views

Indexes

Auto Increment


  8. Performance Tuning SQL Queries

Query Optimizations in SQL


  9. MySQL, PostgreSQL and MongoDB

Introduction to MySQL

Introduction to PostgreSQL

Introduction to MongoDB

Comparison between MySQL, PostgreSQL, and MongoDB

Introduction to SQL and NoSQL Databases

MySQL in Depth


  10. Scripting and Automation

Shell Scripting

ETL (Extract, Transform, and Load) basics

Why is ETL important?

How ETL works

ETL Tools


  11. Relational Databases and SQL

Basic SQL

Advanced SQL


  12. NoSQL Databases and MapReduce

Data Warehouses

Data Lakes

Structured Data

Semi-Structured Data

Unstructured Data

Data Mart

MapReduce


  13. Data Analysis

Pandas

NumPy

Advanced Pandas Techniques

Data Pre-processing

Handling missing values

Data Cleaning

Mean/mode/median Imputation

Hot Deck Imputation

Rescale Data

Binarize Data

Regression Imputation

Stochastic regression imputation

Feature Scaling

Data Augmentation

Read and Process Large Datasets

Data Visualization basics

Data Visualization Projects

Data Visualization using Plotly and Bokeh

Data Profiling

Summary Functions

Indexing

Grouping

Linear Regression

Multiple Linear Regression

Polynomial Regression

Regression

Support Vector Regression

Decision Tree Regression

Random Forest Regression

Feature Engineering

GroupBy Features

Categorical and Numerical Features

Missing Value Analysis

Filling Missing Values

Unique Value Analysis

Univariate Analysis

Bivariate Analysis

Multivariate Analysis

Correlation Analysis

Spearman’s ρ

Pearson’s r

Kendall’s τ

Cramér’s V (φc)

Phik (φk)


  14. Data Processing Techniques

Batch Processing

Stream Processing

Apache Spark

Apache Spark Commands

Apache Kafka

How Apache Kafka works


  15. Big Data

Big Data

Types of Big Data

Big data tools

SQL and NoSQL Databases

Hadoop

Hadoop HDFS

Hadoop YARN

Hive

ZooKeeper

Pig

Cassandra

Sqoop


  16. Data Pipelines and Workflows

Data Pipelines

Transformation

Processing

Workflow

Monitoring

Airflow

DAG


  17. Infrastructure

Docker

Docker vs Virtual Machines

Most important Docker commands

Kubernetes

Snowflake


  18. Power BI

Power BI

Which chart to use and when?

Power BI: Data Analysis Expressions (DAX)

Joins

Data Profiling


  19. Cloud Data Engineering

Data Engineering on cloud

AWS

AWS Services

Google Cloud Platform

Google Cloud Platform services


  20. Machine Learning Algorithms

Linear Regression

Logistic Regression

Decision Trees

Random Forest

Support Vector Machines

K-Nearest Neighbors

K-Means Clustering

Hierarchical Clustering

Neural Networks


Some of the other best series:

Complete 60 Days of Data Science and Machine Learning Series

30 days of Machine Learning Ops

30 Days of Natural Language Processing (NLP) Series

Data Science and Machine Learning Research (Papers) Simplified

30 days of Data Engineering with projects Series

60 days of Data Science and ML Series with projects

100 days : Your Data Science and Machine Learning Degree Series with projects

23 Data Science Techniques You Should Know

Tech Interview Series — Curated List of coding questions

Complete System Design with most popular Questions Series

Complete Data Visualization and Pre-processing Series with projects

Complete Python Series with Projects

Complete Advanced Python Series with Projects

Kaggle Best Notebooks that will teach you the most

Complete Developers Guide to Git

Exceptional Github Repos — Part 1

Exceptional Github Repos — Part 2

All the Data Science and Machine Learning Resources

210 Machine Learning Projects


6 Highly Recommended Data Science and Machine Learning Courses that you MUST take (with certificate):

  1. Complete Data Scientist : https://bit.ly/3wiIo8u

Learn to run data pipelines, design experiments, build recommendation systems, and deploy solutions to the cloud.


  2. Complete Data Engineering : https://bit.ly/3A9oVs5

Learn to design data models, build data warehouses and data lakes, automate data pipelines, and work with massive datasets.


  3. Complete Machine Learning Engineer : https://bit.ly/3Tir8ub

Learn advanced machine learning techniques and algorithms - including how to package and deploy your models to a production environment.


  4. Complete Data Product Manager : https://bit.ly/3QGUtwi

Leverage data to build products that deliver the right experiences, to the right users, at the right time. Lead the development of data-driven products that position businesses to win in their market.


  5. Complete Natural Language Processing : https://bit.ly/3T7J8qY

Build models on real data, and get hands-on experience with sentiment analysis, machine translation, and more.


  6. Complete Deep Learning : https://bit.ly/3T5ppIo

Learn to implement neural networks using the deep learning framework PyTorch.

