diff --git a/README.md b/README.md
index bab0be9..80038ec 100644
--- a/README.md
+++ b/README.md
@@ -35,79 +35,137 @@ By utilizing this pipeline, educators, policymakers, and students can better und
 ├── .github/
 │   └── workflows/cicd.yml
 ├── data/
-│   ├── airline_safety.csv
-    └── airline_safety.csv
+│   ├── grad-students.csv
+    └── recent_grads.csv
 ├── myLib/
 │   ├── __init__.py
 │   ├── __pycache__/
 │   ├── extract.py
 │   ├── query.py
 │   └── transform_load.py
-├── AirlineSafetyDB.db
+├── .gitignore
 ├── main.py
 ├── Makefile
 ├── query_log.md
-├── query_output.md
 ├── README.md
-├── ServeTimesDB.db requirements.txt
+├── requirements.txt
 └── test_main.py
 ```
 
-## CRUD operations :
-
 ## Usage
+To run the ETL process or execute queries, use the following commands:
 
-Run the script using the following command:
-
-```python
-python main.py [arguments]
-```
-## Arguments:
-
-`record_id`, `airline`, `avail_seat_km_per_week`, `incidents_85_99`, `fatal_accidents_85_99`, `fatalities_85_99`, `incidents_00_14`, `fatal_accidents_00_14`, `fatalities_00_14`
-
-## Actions:
-
-extract: Extract data from the source.
+### Extract Data
+To extract data from the CSV files, run:
 
 ```python
 python main.py extract
 ```
-transform_load: Transform and load data into the database.
-
+### Load Data
+To transform and load data into the Databricks database, execute:
 ```python
-python main.py transform_load
+python main.py load
 ```
-update_record: Update an existing record in the database.
+### Load Data
+To transform and load data into the Databricks database, execute:
 
 ```python
-python main.py update_record(1, "Air Canada", 2000000000, 3, 1, 5, 2, 0, 0)
+python main.py load
 ```
-create_record: Create a new record in the database
+## Execute SQL Query
+To run a SQL query against the Databricks database, use:
 
 ```python
-python main.py create_record("Air Canada", 1865253802, 2, 0, 0, 2, 0, 0)
+python main.py query ""
 ```
-delete_record: delete an existing record in the database.
-```python
-python main.py delete_record(1)
-```
-read_data: Read and display the top N rows from the database.
+## Complex SQL query 1:
+
+```sql
+SELECT
+    rg.Major,
+    rg.Major_category,
+    rg.Total AS Total_Undergrad_Grads,
+    gs.Grad_total AS Total_Grad_Students,
+    AVG(rg.Unemployment_rate) AS Avg_Undergrad_Unemployment_Rate,
+    AVG(gs.Grad_unemployment_rate) AS Avg_Grad_Unemployment_Rate,
+    AVG(rg.Median) AS Avg_Undergrad_Median_Salary,
+    AVG(gs.Grad_median) AS Avg_Grad_Median_Salary
+FROM
+    RecentGradsDB rg
+JOIN
+    GradStudentsDB gs
+ON
+    rg.Major_code = gs.Major_code
+GROUP BY
+    rg.Major_category,
+    rg.Major,
+    rg.Total,
+    gs.Grad_total
+HAVING
+    AVG(rg.Unemployment_rate) < 0.06
+ORDER BY
+    rg.Total DESC;
 
-```python
-python main.py read_data(10) # Reads the top 10 rows
 ```
-general_query: Run a custom SQL query on the database.
+This SQL query joins two tables, `RecentGradsDB` and `GradStudentsDB`, on `Major_code` and retrieves aggregate employment, salary, and unemployment statistics for each major.
+
+The query lists each major with its total number of undergraduate and graduate students, the average unemployment rates, and the average median salaries at both the undergraduate and graduate levels. The results are filtered to include only majors where the average undergraduate unemployment rate is below 6%, and the majors are sorted by the total number of undergraduates in descending order.
+
+### Expected output:
+
+This output highlights majors with low unemployment rates and compares undergraduate and graduate outcomes.
+
+![SQL_query1_output](SQL_query1_output.PNG)
+
+## Complex SQL query 2:
+
+```sql
+
+SELECT Major, 'Undergrad' AS Degree_Level, Total AS Total_Students
+FROM RecentGradsDB
+WHERE Total > 5000
+UNION
+SELECT Major, 'Graduate' AS Degree_Level, Grad_total AS Total_Students
+FROM GradStudentsDB
+WHERE Grad_total > 5000
+ORDER BY Total_Students DESC;
 
-```python
-python main.py "SELECT * FROM AirlineSafety WHERE airline = 'Aeroflot*'"
 ```
-## Testing
-To run the test suite, use:
+This SQL query combines data from two tables (`RecentGradsDB` and `GradStudentsDB`) to list majors that have more than 5,000 students at either the undergraduate or the graduate level, and it orders the results by the total number of students in descending order.
+
+`SELECT` statement, part 1 (**Undergraduate data**):
+
+- Retrieves the Major, assigns the string `'Undergrad'` to the Degree_Level, and selects the total number of undergraduate students (Total) from the `RecentGradsDB` table.
+
+- **Filters** (`WHERE Total > 5000`) to include only majors with more than 5,000 undergraduate students.
+
+`SELECT` statement, part 2 (**Graduate data**):
+
+- Retrieves the Major, assigns the string `'Graduate'` to the Degree_Level, and selects the total number of graduate students (Grad_total) from the `GradStudentsDB` table.
+
+- **Filters** (`WHERE Grad_total > 5000`) to include only majors with more than 5,000 graduate students.
+
+`UNION` operator:
+Combines the results from the two `SELECT` statements and removes any duplicate rows.
+
+`ORDER BY` Total_Students DESC:
+
+Orders the combined result set by the total number of students (Total_Students) in descending order, showing majors with the highest total first.
+
+### Expected output:
+
+The output is a combined, sorted list of majors that have more than 5,000 students, with each entry labeled by degree level. The majors are ordered by total number of students, highest first.
+
+![SQL_query2_output](SQL_query2_output.PNG)
+
+## Testing
+Run the command below to test the script:
 
 ```python
-python -m pytest -vv --cov=main --cov=myLib test_*.py
+pytest test_main.py
 ```
+
 ## References
 
-https://github.com/nogibjj/sqlite-lab
\ No newline at end of file
+1. https://github.com/nogibjj/sqlite-lab
+2. https://learn.microsoft.com/en-us/azure/databricks/dev-tools/python-sql-connector
\ No newline at end of file
diff --git a/SQL_query1_output.PNG b/SQL_query1_output.PNG
new file mode 100644
index 0000000..e60eab6
Binary files /dev/null and b/SQL_query1_output.PNG differ
diff --git a/SQL_query2_output.PNG b/SQL_query2_output.PNG
new file mode 100644
index 0000000..714dd49
Binary files /dev/null and b/SQL_query2_output.PNG differ
diff --git a/test_main.py b/test_main.py
index 8d86feb..8920448 100644
--- a/test_main.py
+++ b/test_main.py
@@ -67,6 +67,6 @@ def test_general_query():
 
 
 if __name__ == "__main__":
-    #test_extract()
-    #test_load()
+    test_extract()
+    test_load()
     test_general_query()