fixed test_main issues
mobasserulHaque committed Oct 20, 2024
1 parent 1e9477e commit 61fe193
Showing 4 changed files with 100 additions and 42 deletions.
138 changes: 98 additions & 40 deletions README.md
@@ -35,79 +35,137 @@ By utilizing this pipeline, educators, policymakers, and students can better und
├── .github/
│ └── workflows/cicd.yml
├── data/
│ ├── airline_safety.csv
└── airline_safety.csv
│ ├── grad-students.csv
└── recent_grads.csv
├── myLib/
│ ├── __init__.py
│ ├── __pycache__/
│ ├── extract.py
│ ├── query.py
│ └── transform_load.py
├── AirlineSafetyDB.db
├── .gitignore
├── main.py
├── Makefile
├── query_log.md
├── query_output.md
├── README.md
├── ServeTimesDB.db requirements.txt
├── requirements.txt
└── test_main.py
```
## CRUD operations :

## Usage
To run the ETL process or execute queries, use the following commands:

Run the script using the following command:

```python
python main.py <action> [arguments]
```
## Arguments:

`record_id`, `airline`, `avail_seat_km_per_week`, `incidents_85_99`, `fatal_accidents_85_99`, `fatalities_85_99`, `incidents_00_14`, `fatal_accidents_00_14`, `fatalities_00_14`

## Actions:

extract: Extract data from the source.
### Extract Data
To extract data from the CSV files, run:

```python
python main.py extract
```
transform_load: Transform and load data into the database.

### Load Data
To transform and load data into the Databricks database, execute:
```python
python main.py transform_load
python main.py load
```
update_record: Update an existing record in the database.

### Load Data
To transform and load data into the Databricks database, execute:
```python
python main.py update_record(1, "Air Canada", 2000000000, 3, 1, 5, 2, 0, 0)
python main.py load
```
create_record: Create a new record in the database
## Execute SQL Query
To run a SQL query against the Databricks database, use:

```python
python main.py create_record("Air Canada", 1865253802, 2, 0, 0, 2, 0, 0)
python main.py query "<your_sql_query>"
```
delete_record: delete an existing record in the database.

```python
python main.py delete_record(1)
```
read_data: Read and display the top N rows from the database.
## Complex SQL query 1:

```sql
SELECT
rg.Major,
rg.Major_category,
rg.Total AS Total_Undergrad_Grads,
gs.Grad_total AS Total_Grad_Students,
AVG(rg.Unemployment_rate) AS Avg_Undergrad_Unemployment_Rate,
AVG(gs.Grad_unemployment_rate) AS Avg_Grad_Unemployment_Rate,
AVG(rg.Median) AS Avg_Undergrad_Median_Salary,
AVG(gs.Grad_median) AS Avg_Grad_Median_Salary
FROM
RecentGradsDB rg
JOIN
GradStudentsDB gs
ON
rg.Major_code = gs.Major_code
GROUP BY
rg.Major_category,
rg.Major,
rg.Total,
gs.Grad_total
HAVING
AVG(rg.Unemployment_rate) < 0.06
ORDER BY
rg.Total DESC;

```python
python main.py read_data(10) # Reads the top 10 rows
```
general_query: Run a custom SQL query on the database.
This SQL query joins two tables, `RecentGradsDB` and `GradStudentsDB`, on `Major_code` and retrieves aggregate information about employment, salaries, and unemployment rates at the undergraduate and graduate levels for each major.

For each major, the query returns the total number of undergraduate and graduate students, the average unemployment rates, and the average median salaries at both levels. The results are filtered to include only majors whose average undergraduate unemployment rate is below 6%, and they are sorted by the total number of undergraduates in descending order.

### Expected output:

This output highlights majors with low unemployment rates and compares undergraduate and graduate outcomes for each of them.

![SQL_query1_output](SQL_query1_output.PNG)
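
As a rough illustration of the `HAVING` filter and the sort order, the sketch below runs a trimmed version of query 1 against an in-memory SQLite database. SQLite stands in for Databricks here, and every row is invented for the example; only the table and column names come from the query above.

```python
import sqlite3

# Assumption: in-memory SQLite as a stand-in for the Databricks tables.
# All rows below are made-up sample values, not real data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RecentGradsDB (Major_code INTEGER, Major TEXT, Major_category TEXT,
                            Total INTEGER, Unemployment_rate REAL, Median INTEGER);
CREATE TABLE GradStudentsDB (Major_code INTEGER, Major TEXT, Grad_total INTEGER,
                             Grad_unemployment_rate REAL, Grad_median INTEGER);
INSERT INTO RecentGradsDB VALUES
  (1, 'NURSING',    'Health',   200000, 0.04, 48000),
  (2, 'FINE ARTS',  'Arts',      70000, 0.08, 30000),
  (3, 'ACCOUNTING', 'Business', 190000, 0.05, 45000);
INSERT INTO GradStudentsDB VALUES
  (1, 'NURSING',    500000, 0.02, 75000),
  (2, 'FINE ARTS',  150000, 0.06, 50000),
  (3, 'ACCOUNTING', 400000, 0.03, 85000);
""")

# Same join, grouping, filter, and ordering as query 1, with fewer columns.
rows = conn.execute("""
SELECT rg.Major, AVG(rg.Unemployment_rate) AS Avg_Undergrad_Unemployment_Rate
FROM RecentGradsDB rg
JOIN GradStudentsDB gs ON rg.Major_code = gs.Major_code
GROUP BY rg.Major_category, rg.Major, rg.Total, gs.Grad_total
HAVING AVG(rg.Unemployment_rate) < 0.06
ORDER BY rg.Total DESC
""").fetchall()

majors = [row[0] for row in rows]
print(majors)  # FINE ARTS is dropped because its rate (0.08) is not below 0.06
```

With this sample data, only the two majors below the 6% threshold survive, ordered by undergraduate total.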

## Complex SQL query 2:

```sql

SELECT Major, 'Undergrad' AS Degree_Level, Total AS Total_Students
FROM RecentGradsDB
WHERE Total > 5000
UNION
SELECT Major, 'Graduate' AS Degree_Level, Grad_total AS Total_Students
FROM GradStudentsDB
WHERE Grad_total > 5000
ORDER BY Total_Students DESC;

```python
python main.py "SELECT * FROM AirlineSafety WHERE airline = 'Aeroflot*'"
```

## Testing
To run the test suite, use:
This SQL query combines data from two tables (`RecentGradsDB` and `GradStudentsDB`) to list majors that have more than 5,000 students at either the undergraduate or the graduate level. Each row is labeled with its degree level, and the results are ordered by the total number of students in descending order.

`SELECT` statement, part 1 (**undergraduate data**):

- Retrieves `Major`, assigns the string `'Undergrad'` to `Degree_Level`, and selects the total number of undergraduate students (`Total`) from the `RecentGradsDB` table.

- **Filters** (`WHERE Total > 5000`) to include only majors with more than 5,000 undergraduate students.

`SELECT` statement, part 2 (**graduate data**):

- Retrieves `Major`, assigns the string `'Graduate'` to `Degree_Level`, and selects the total number of graduate students (`Grad_total`) from the `GradStudentsDB` table.

- **Filters** (`WHERE Grad_total > 5000`) to include only majors with more than 5,000 graduate students.

`UNION` operator:

Combines the results of the two `SELECT` statements and removes any duplicate rows.

`ORDER BY Total_Students DESC`:

Orders the combined result set by the total number of students (`Total_Students`) in descending order, so the rows with the highest totals come first.

### Expected output:

The output consists of a combined and sorted list of majors that have more than 5,000 students, with each entry labeled according to the degree level. The majors are ordered by the total number of students, showing those with the highest student counts first.

![SQL_query2_output](SQL_query2_output.PNG)
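
To make the `UNION` behavior concrete, here is a minimal sketch that runs query 2 against an in-memory SQLite database. SQLite is only a stand-in for Databricks, and the rows are invented sample values; the table and column names are the ones used in the query above.

```python
import sqlite3

# Assumption: in-memory SQLite instead of Databricks; rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RecentGradsDB (Major TEXT, Total INTEGER);
CREATE TABLE GradStudentsDB (Major TEXT, Grad_total INTEGER);
INSERT INTO RecentGradsDB VALUES ('BIOLOGY', 280000), ('ASTRONOMY', 1800);
INSERT INTO GradStudentsDB VALUES ('BIOLOGY', 600000), ('ASTRONOMY', 9000);
""")

query = """
SELECT Major, 'Undergrad' AS Degree_Level, Total AS Total_Students
FROM RecentGradsDB
WHERE Total > 5000
UNION
SELECT Major, 'Graduate' AS Degree_Level, Grad_total AS Total_Students
FROM GradStudentsDB
WHERE Grad_total > 5000
ORDER BY Total_Students DESC
"""

union_rows = conn.execute(query).fetchall()
for row in union_rows:
    print(row)
```

In this sample, ASTRONOMY appears only with the `'Graduate'` label because its undergraduate total is below the threshold: each half of the `UNION` is filtered independently.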

## Testing
Run the command below to test the script:
```python
python -m pytest -vv --cov=main --cov=myLib test_*.py
pytest test_main.py
```

## References
https://github.com/nogibjj/sqlite-lab
1. https://github.com/nogibjj/sqlite-lab
2. https://learn.microsoft.com/en-us/azure/databricks/dev-tools/python-sql-connector
Binary file added SQL_query1_output.PNG
Binary file added SQL_query2_output.PNG
4 changes: 2 additions & 2 deletions test_main.py
@@ -67,6 +67,6 @@ def test_general_query():


if __name__ == "__main__":
#test_extract()
#test_load()
test_extract()
test_load()
test_general_query()
