Merge pull request #78 from haileyplusplus/refactor2
Refactor and add incremental workflow functionality.
lauriemerrell authored Oct 30, 2024
2 parents e1eaf06 + 2c94553 commit 42daeb2
Showing 13 changed files with 14,368 additions and 589 deletions.
21 changes: 21 additions & 0 deletions README.md
@@ -19,6 +19,27 @@ BUCKET_PRIVATE='chn-ghost-buses-private'
```
Note that there may be permissions issues when running `rt_daily_aggregations.py` with `BUCKET_PUBLIC` as the `BUCKET_NAME`.

You may also need to set the `PROJECT_NAME` environment variable:

```
PROJECT_NAME=chn-ghost-buses
```

## Updating data

The `update_data` script reads realtime and schedule data, combines it, and produces output files used by the frontend. The common use case will be to process data that has arrived since the previous frontend data deploy. To do this, check out the `ghost-buses-frontend` repository alongside this backend repository. Then run the following:

`python3 -m update_data --update ../ghost-buses-frontend/src/Routes/schedule_vs_realtime_all_day_types_routes.json`

To calculate all data from the start of the project, run the script with no arguments:

`python3 -m update_data`

In either case, output files will be written under the `data_output/scratch` directory with names indicating the date ranges that they cover. You can then deploy by copying the following output files:

- `schedule_vs_realtime_all_day_types_overall_<date>_to_<date>.json` to `ghost-buses-frontend/src/Routes/schedule_vs_realtime_all_day_types_routes.json`
- `frontend_data_<date>_to_<date>_wk.json` to `ghost-buses-frontend/src/Routes/data.json`
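The copy step above can also be scripted. Below is a minimal sketch using only the standard library, assuming the directory layout described in this README; the `deploy` helper and its pattern-matching approach are illustrative, not part of the repository:

```python
# Sketch: copy the newest update_data outputs into the frontend checkout.
# The glob patterns stand in for the <date>_to_<date> portions of the
# filenames, which are filled in by the update_data script at run time.
import shutil
from pathlib import Path


def deploy(scratch: Path, frontend_routes: Path) -> None:
    targets = {
        "schedule_vs_realtime_all_day_types_overall_*_to_*.json":
            "schedule_vs_realtime_all_day_types_routes.json",
        "frontend_data_*_to_*_wk.json": "data.json",
    }
    for pattern, dest_name in targets.items():
        # If several runs left outputs behind, pick the most recently written.
        matches = sorted(scratch.glob(pattern), key=lambda p: p.stat().st_mtime)
        if matches:
            shutil.copy(matches[-1], frontend_routes / dest_name)
```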

## Pre-commit hooks
Pre-commit hooks are defined in `.pre-commit-config.yaml`, based on this [post](https://towardsdatascience.com/pre-commit-hooks-you-must-know-ff247f5feb7e). The hooks enforce style with the `black` linter, check for committed credentials and extraneous debugger breakpoints, and sort imports, among other things.

51 changes: 51 additions & 0 deletions data_analysis/common.py
@@ -0,0 +1,51 @@
"""
Utility functions common to both schedule and realtime analysis.
"""
from dataclasses import dataclass, field
from typing import List

import pandas as pd


@dataclass
class AggInfo:
    """A class for storing information about
    aggregation of route and schedule data.

    Args:
        freq (str, optional): An offset alias described in the Pandas
            time series docs. Defaults to 'D'.
            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
        aggvar (str, optional): variable to aggregate by.
            Defaults to 'trip_count'.
        byvars (List[str], optional): variables passed to
            pd.DataFrame.groupby. Defaults to ['date', 'route_id'].
    """
    freq: str = 'D'
    aggvar: str = 'trip_count'
    byvars: List[str] = field(default_factory=lambda: ['date', 'route_id'])


def sum_by_frequency(
        df: pd.DataFrame,
        agg_info: AggInfo) -> pd.DataFrame:
    """Calculate total trips per route per frequency.

    Args:
        df (pd.DataFrame): A DataFrame of route or scheduled route data.
        agg_info (AggInfo): An AggInfo object describing how data
            is to be aggregated.

    Returns:
        pd.DataFrame: A DataFrame with the total number of trips per route
            by the specified frequency.
    """
    df = df.copy()
    out = (
        df.set_index(agg_info.byvars)
        .groupby(
            [pd.Grouper(level='date', freq=agg_info.freq),
             pd.Grouper(level='route_id')])[agg_info.aggvar]
        .sum().reset_index()
    )
    return out
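A minimal usage sketch of `sum_by_frequency` with `AggInfo`; the sample data is invented for illustration:

```python
# AggInfo and sum_by_frequency are repeated here (condensed, without
# docstrings) so the snippet runs standalone; in the repository, import
# them from data_analysis.common instead.
from dataclasses import dataclass, field
from typing import List

import pandas as pd


@dataclass
class AggInfo:
    freq: str = 'D'
    aggvar: str = 'trip_count'
    byvars: List[str] = field(default_factory=lambda: ['date', 'route_id'])


def sum_by_frequency(df: pd.DataFrame, agg_info: AggInfo) -> pd.DataFrame:
    df = df.copy()
    return (
        df.set_index(agg_info.byvars)
        .groupby([pd.Grouper(level='date', freq=agg_info.freq),
                  pd.Grouper(level='route_id')])[agg_info.aggvar]
        .sum().reset_index()
    )


# Invented sample data: two days of trips on a single route.
trips = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-01', '2024-01-02']),
    'route_id': ['6', '6', '6'],
    'trip_count': [10, 5, 7],
})

daily = sum_by_frequency(trips, AggInfo(freq='D'))   # one row per day
weekly = sum_by_frequency(trips, AggInfo(freq='W'))  # one row per week
```

Changing `freq` is enough to re-bin the same data: the daily result keeps both dates, while the weekly result collapses them into a single week's total.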
