
Automate schedule downloads #61

Merged (32 commits) Sep 20, 2023
Conversation

dcjohnson24 (Collaborator) commented Jul 26, 2023

Description

Create a GitHub Action to download the schedule data from the CTA and save it to S3. The workflow runs every day at 5:30 PM UTC, or on a push to the automate-schedule-downloads branch; the push trigger can be removed once the PR has been approved.
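The triggers described above might look roughly like the following workflow fragment. This is a sketch, not the PR's actual workflow file: only the two triggers (daily 5:30 PM UTC cron, push to automate-schedule-downloads) come from the description; the workflow name is an assumption.

```yaml
# Hypothetical trigger block matching the PR description
name: Download CTA schedule   # name is an assumption
on:
  schedule:
    - cron: '30 17 * * *'     # 17:30 UTC daily
  push:
    branches:
      - automate-schedule-downloads   # to be removed once the PR is approved
```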

Resolves #18, working toward #50.

Type of change

  • Bug fix
  • New functionality
  • Documentation

How has this been tested?

Locally

lauriemerrell (Member) left a comment:
Thank you so much Dylan this looks great! Just two comments about organization/file sizes in the bucket.

f'https://www.transitchicago.com/downloads/sch_data/google_transit.zip '
f'on {date} to public bucket')
zipfile_bytes_io.seek(0)
client.upload_fileobj(zipfile_bytes_io, 'chn-ghost-buses-public', f'google_transit_{date}.zip')
lauriemerrell (Member):

Can we add another directory level in this path? Right now the zipfiles are just being written to the root directory, which doesn't pose a technical problem but will probably get a bit messy quickly. Maybe something like f'cta_schedule_zipfiles_raw/google_transit_{date}.zip'?
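A small sketch of the suggested key layout. The helper function name is hypothetical; the bucket name and the 'cta_schedule_zipfiles_raw/' prefix come from the review comments above.

```python
def schedule_zip_key(date: str) -> str:
    """Build the S3 key for a raw schedule zip.

    Hypothetical helper: the 'cta_schedule_zipfiles_raw/' prefix is the
    directory level suggested above, keeping zips out of the bucket root.
    """
    return f'cta_schedule_zipfiles_raw/google_transit_{date}.zip'

# The upload call from the PR would then become something like:
# client.upload_fileobj(zipfile_bytes_io, 'chn-ghost-buses-public',
#                       schedule_zip_key(date))
```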

dcjohnson24 (Collaborator, author):

Good point, I can add a new directory.

route_daily_summary.to_csv(csv_buffer)

print(f'Saving cta_route_daily_summary_{date}.csv to public bucket')
s3.Object('chn-ghost-buses-public', f'cta_route_daily_summary_{date}.csv')\
lauriemerrell (Member):

Same comment as above regarding an intermediate directory level for this path. This one is a bit trickier since we already generate these files from the current batched process (currently in schedule_summaries/route_level), so we'll probably need to figure out how to cut over from the old process to the new one.

Maybe f'schedule_summaries/daily_job/'?

Another question here is whether we want to save only that day's activity (i.e., route_daily_summary[route_daily_summary.date == date]), because otherwise these files are pretty big to be saving in full every single day.
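The one-day filter suggested above could look like the following. This is an illustration with made-up sample data; the 'date' column name is taken from the comment, while 'route_id' and 'trip_count' are assumed column names.

```python
import pandas as pd

# Hypothetical sample of the daily summary frame
route_daily_summary = pd.DataFrame({
    'date': ['2023-07-25', '2023-07-26'],
    'route_id': ['8', '22'],
    'trip_count': [120, 95],
})

date = '2023-07-26'
# Keep only the rows for the day being saved, rather than the full history
todays_rows = route_daily_summary[route_daily_summary['date'] == date]
```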

dcjohnson24 (Collaborator, author):

Saving only that date would probably be good since we could concatenate the data later.
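Recombining the one-day files later could be a simple concatenation, sketched below with hypothetical in-memory frames (in practice each frame would be read back from its dated CSV in the bucket):

```python
import pandas as pd

# Hypothetical stand-ins for two dated summary files read back from S3
daily_files = [
    pd.DataFrame({'date': ['2023-07-25'], 'trip_count': [120]}),
    pd.DataFrame({'date': ['2023-07-26'], 'trip_count': [95]}),
]

# Stitch the per-day files back into one summary frame
combined = pd.concat(daily_files, ignore_index=True)
```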

lauriemerrell (Member):

Note from discussion 8/1: Maybe make two GitHub Actions, one that only downloads, and a second that creates the processed route trip count file based on the downloaded file.

@lauriemerrell lauriemerrell merged commit 0a73534 into test_cronjob Sep 20, 2023
@lauriemerrell lauriemerrell deleted the automate-schedule-downloads branch September 20, 2023 02:15
haileyplusplus pushed a commit to haileyplusplus/chn-ghost-buses that referenced this pull request Apr 1, 2024