
All data scraped as part of this project is originally provided by the CTA and is subject to CTA's terms of use. Realtime data refers to data from the CTA's [bus tracker API's `getvehicles` feed](https://www.transitchicago.com/developers/bustracker/) (which we scrape every 5 minutes), and schedule data refers to data from the [CTA's GTFS feed](https://www.transitchicago.com/developers/gtfs/) (which we do not yet automatically scrape), sometimes collected via the [Transitfeeds archive](https://transitfeeds.com/p/chicago-transit-authority/165) to access historical/prior versions of the feed. Data collection started the evening of May 19, 2022; the first few files from that date are experimental, so it's recommended to start analysis at 5/20, when the data is cleaner.

As of 10/15/22, the high-level data flow for the CHN Ghost Buses project was as follows:

```mermaid
flowchart LR;
scraper[scrape_data.py scrapes raw realtime bus location data every 5 minutes] --writes data to--> s3[JSON files in chn-ghost-buses-public S3 bucket] --daily processed into--> csv[CSV files in chn-ghost-buses-public bucket]
```
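
To make the first arrow concrete, here is a hypothetical sketch of a single scrape run every 5 minutes (e.g., by cron or a scheduled Lambda). This is not the contents of `scrape_data.py`: the API parameters, S3 key layout, and batching are all assumptions.

```python
import json
from datetime import datetime, timezone

import boto3
import requests

# Hypothetical values: the real scraper's API parameters and S3 key layout may differ.
API_URL = "http://www.ctabustracker.com/bustime/api/v2/getvehicles"
BUCKET = "chn-ghost-buses-public"


def scrape_once(api_key: str, routes: str) -> None:
    """Fetch one vehicle snapshot and write the raw JSON to S3.

    `routes` is a comma-separated list of route IDs; the API caps each
    request at 10 routes, so a real scraper would batch its requests.
    """
    response = requests.get(
        API_URL,
        params={"key": api_key, "rt": routes, "format": "json"},
        timeout=30,
    )
    response.raise_for_status()
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=f"bus_data/{timestamp}.json",
        Body=json.dumps(response.json()),
    )
```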

As of fall 2024, the Mansueto Institute is maintaining the back-end data flow.

## Accessing data in S3

The data can be accessed in several ways. To access an individual file, you can:

* Go directly to the URL for a resource in your browser to download the file locally. URLs follow the form `https://d2v7z51jmtm0iq.cloudfront.net/cta-stop-watch/{path_to_specific_file}` (paths are described below); for example: `https://d2v7z51jmtm0iq.cloudfront.net/cta-stop-watch/full_day_data/2022-10-01.csv`.

* Use standard request libraries or command-line tools. For example, you can use [wget](https://www.gnu.org/software/wget/manual/) like `wget https://d2v7z51jmtm0iq.cloudfront.net/cta-stop-watch/full_day_data/2022-10-01.csv` (see above for notes on the URL format/construction).

* Use Pandas, which can read a CSV directly from a URL (it also recognizes `s3://` URIs if you install the `s3fs` dependency). So, for example, you can use `pandas.read_csv('https://d2v7z51jmtm0iq.cloudfront.net/cta-stop-watch/full_day_data/2022-10-01.csv')` to load a file as a Pandas dataframe.
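
If you prefer to script downloads without Pandas, a minimal sketch using the `requests` library (with the same example file as above) looks like this:

```python
import requests

# Example file from the notes above; swap in any date with available data.
url = "https://d2v7z51jmtm0iq.cloudfront.net/cta-stop-watch/full_day_data/2022-10-01.csv"

response = requests.get(url, timeout=60)
response.raise_for_status()  # fail loudly on 4xx/5xx instead of saving an error page

with open("2022-10-01.csv", "wb") as f:
    f.write(response.content)
```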

## Available data

There are a few types of data available in the public S3 bucket:

* **Data**:
    * These are CSV files that contain all the vehicle data we received from the API, concatenated together for a full day, with a few added fields, including:
        * `scrape_file`: The name of the S3 resource where the original JSON response that contained this row is saved.
        * `data_hour`: The hour extracted from the `data_time` (integer between 0 and 23).
        * `data_date`: The date extracted from the `data_time`.
    * These files are generated daily between 10 and 11am Central for the prior day. There is one file per full day from `2022-05-20` until the day before you are making the request. So, if you are checking on `2022-10-02` after 11am Central, data will be available up to and including `2022-10-01`.
    * In S3, these are available in a folder called `full_day_data`. Full filenames are like `full_day_data/{date in YYYY-MM-DD format}.csv`. So, to load the data for `2022-10-01` in Pandas, you could do: `pandas.read_csv('https://d2v7z51jmtm0iq.cloudfront.net/cta-stop-watch/full_day_data/2022-10-01.csv')`. A snippet for sanity-checking these columns appears after this list.
* **Errors**:
    * These are CSV files that contain all the error data we received from the API, concatenated together for a full day. The schema of the data is exactly what is returned from the API, with only a `scrape_file` field (see above) added that records the name of the S3 resource where the original JSON response that contained this row is saved; see [the `getvehicles` section of the CTA documentation](https://www.transitchicago.com/assets/1/6/cta_Bus_Tracker_API_Developer_Guide_and_Documentation_20160929.pdf) for field definitions from the API.
    * These files are generated daily between 10 and 11am Central for the prior day. There is one file per full day from `2022-05-20` until the day before you are making the request. So, if you are checking on `2022-10-02` after 11am Central, errors will be available up to and including `2022-10-01`.
    * In S3, these are available in a folder called `bus_full_day_errors_v2`. Full filenames are like `bus_full_day_errors_v2/{date in YYYY-MM-DD format}.csv`. So, to load the data for `2022-10-01` in Pandas, you could do: `pandas.read_csv('https://d2v7z51jmtm0iq.cloudfront.net/cta-stop-watch/bus_full_day_errors_v2/2022-10-01.csv')`.
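
As a quick sanity check on the schema described above, the following sketch loads one day of data and confirms the added columns look as documented. It assumes `data_date` is serialized as `YYYY-MM-DD`, matching the filenames.

```python
import pandas as pd

BASE = "https://d2v7z51jmtm0iq.cloudfront.net/cta-stop-watch"
day = "2022-10-01"

df = pd.read_csv(f"{BASE}/full_day_data/{day}.csv", low_memory=False)

# The added columns documented above should be present...
assert {"scrape_file", "data_hour", "data_date"} <= set(df.columns)
# ...`data_hour` should be an integer between 0 and 23...
assert df["data_hour"].between(0, 23).all()
# ...and every row in a daily file should carry that file's date.
assert (df["data_date"] == day).all()
```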


## Script to load data

If you want to load many days of data at once, you can use a script like the following. This is a minimal sketch: the imports, date-list construction, and error handling are reconstructed and may differ from the original script.

```python
import pandas as pd
import pendulum

START_DATE = "2022-05-20"
END_DATE = ""  # leave empty to load through the most recent complete day
BUCKET_URL = "https://d2v7z51jmtm0iq.cloudfront.net/cta-stop-watch"

start_date = pendulum.parse(START_DATE, tz="America/Chicago")
is_after_11 = pendulum.now("America/Chicago").hour >= 11

# Daily files for the prior day land between 10 and 11am Central, so the most
# recent complete day is yesterday only after 11am.
if END_DATE:
    end_date = pendulum.parse(END_DATE, tz="America/Chicago")
else:
    end_date = pendulum.now("America/Chicago").subtract(days=1 if is_after_11 else 2)

date_list = []
day = start_date
while day <= end_date:
    date_list.append(day.to_date_string())
    day = day.add(days=1)

data_list = []
errors_list = []

for d in date_list:
    data_url = f"{BUCKET_URL}/full_day_data/{d}.csv"
    errors_url = f"{BUCKET_URL}/bus_full_day_errors_v2/{d}.csv"
    print(f"{pendulum.now()}: processing {d} data")
    try:
        daily_data = pd.read_csv(data_url, low_memory=False)
        data_list.append(daily_data)
        errors_list.append(pd.read_csv(errors_url, low_memory=False))
    except Exception:
        print(f"{pendulum.now()}: could not load {d}")

data = pd.concat(data_list, ignore_index=True)
errors = pd.concat(errors_list, ignore_index=True)
```
