Jenny_Wu_F24_MP11

Purpose of Project

This project implements an ETL (Extract, Transform, Load) and query pipeline in the Databricks environment. The pipeline is written in PySpark: it extracts raw data, applies transformations, and stores the results in Delta tables for efficient querying and analysis.

Project Overview

The ETL-Query pipeline includes the following steps:

Extract: Raw data is downloaded from an online data source and saved to a local file path.
Transform: The data is cleaned, formatted, and enriched with PySpark so that it is ready for analysis.
Load: Both the raw data and the transformed data are stored in Delta tables within the Databricks environment.
Query: The Delta tables serve as the foundation for running SQL queries and performing analysis efficiently.
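
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the CSV input format, the column-name cleanup, and the specific transformations (dropping empty and duplicate rows) are all assumptions made for the example. It expects an existing `spark` session, such as the one Databricks notebooks provide, and a `local_path` that Spark is able to read.

```python
import re
import urllib.request


def extract(url, local_path):
    """Extract: download the raw file from `url` and save it to `local_path`."""
    urllib.request.urlretrieve(url, local_path)
    return local_path


def clean_column_name(name):
    """Lower-case a column name and replace runs of other characters with '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def read_raw(spark, path):
    """Read the downloaded CSV (assumed format) into a Spark DataFrame.

    Column names are normalized here because Delta tables reject names
    containing spaces and some punctuation by default.
    """
    df = spark.read.csv(path, header=True, inferSchema=True)
    return df.toDF(*[clean_column_name(c) for c in df.columns])


def transform(raw_df):
    """Transform: hypothetical cleaning step (drop empty and duplicate rows)."""
    return raw_df.dropna(how="all").dropDuplicates()


def load(df, table_name):
    """Load: write the DataFrame to a Delta table, replacing existing contents."""
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)


def query(spark, table_name):
    """Query: run a simple SQL aggregate against the Delta table."""
    return spark.sql(f"SELECT COUNT(*) AS row_count FROM {table_name}")


def run_pipeline(spark, url, local_path, raw_table, clean_table):
    """Run extract, transform, load, and query in order."""
    raw_df = read_raw(spark, extract(url, local_path))
    load(raw_df, raw_table)                 # store the raw data
    load(transform(raw_df), clean_table)    # store the transformed data
    return query(spark, clean_table)
```

In a Databricks notebook, calling `run_pipeline(spark, url, local_path, raw_table, clean_table).show()` with your own source URL, file path, and table names would print the row count of the transformed table. Writing in `overwrite` mode keeps reruns idempotent, at the cost of replacing whatever the tables held before.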
