
add date range option for downloads. #65

Closed · wants to merge 9 commits
Conversation

dcjohnson24
Collaborator

Description

Add a date range argument for saving schedule summaries. Download files from transitfeeds.com if they are not available on the CTA website, transitchicago.com.

Resolves # [issue]

Type of change

  • Bug fix
  • New functionality
  • Documentation

How has this been tested?

Locally


@mesterhammerfic mesterhammerfic left a comment


I will take another look tonight, but my brain cannot follow these helper functions at the moment.

I can't quite parse it right now, so I'll just ask.
There are two ways we could do this (probably a million, but you get it):

  1. pass the date range to the API we're using, and that will return exactly the data we want
  2. pull all the data and then, on our end, filter out the data points that do not fall within our date range.

Which one are you doing?

Comment on lines +362 to +370
zipfile_bytes_io = BytesIO(
    requests.get(
        f"https://transitfeeds.com/p/chicago-transit-authority"
        f"/165/{version_id}/download"
    ).content
)
CTA_GTFS = zipfile.ZipFile(zipfile_bytes_io)
logging.info('Download complete')
return CTA_GTFS, zipfile_bytes_io


Based on this change, I think it's true that the BytesIO data is necessary to do what we want with the date ranges; however, other parts of the code expect the ZipFile data. Is it then true that we are essentially returning duplicate data here?
I'm not suggesting that we change this, just want to make sure I'm understanding it correctly.

Collaborator Author


Yeah, I didn't have a good way of obtaining the BytesIO object that is needed for uploading files to s3. For other parts of the code that expect the ZipFile output, I would use something like CTA_GTFS, _ = download_cta_zip(), so that the BytesIO object is discarded where it isn't needed.

Maybe creating a different data type would be better?
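
For instance, a small named return type could make the dual output explicit. This is only a sketch of that idea, not code from this PR; the download_cta_zip name and signature here are assumed for illustration:

from io import BytesIO
from typing import NamedTuple
import zipfile

import requests

class ScheduleFeed(NamedTuple):
    gtfs: zipfile.ZipFile   # parsed archive used by the schedule summary code
    raw_bytes: BytesIO      # raw bytes needed for the s3 upload

def download_cta_zip(version_id: str) -> ScheduleFeed:
    # URL pattern follows the snippet above; version_id is a transitfeeds version.
    zipfile_bytes_io = BytesIO(
        requests.get(
            f"https://transitfeeds.com/p/chicago-transit-authority"
            f"/165/{version_id}/download"
        ).content
    )
    return ScheduleFeed(zipfile.ZipFile(zipfile_bytes_io), zipfile_bytes_io)

# Callers that only need the archive can use download_cta_zip(version_id).gtfs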

_ = keys(csrt.BUCKET_PUBLIC, file_dict[fname])

def extract_date(fname: str) -> str:
    return fname.split('_')[-1].split('.')[0]


I trust that this is the right way to parse it, but is there a way I can see an example of the type of file that is being parsed? If it's difficult, don't worry about it; just curious.

Collaborator Author


So let's say you have

import pendulum

date_range = ['2023-01-01', '2023-05-05']
start_date = pendulum.parse(min(date_range))
end_date = pendulum.parse(max(date_range))
period = pendulum.period(start_date, end_date)
full_date_range = [dt.to_date_string() for dt in period.range('days')]
zip_filename_list = [f'cta_schedule_zipfiles_raw/google_transit_{date}.zip'
                     for date in full_date_range]

An example filename would look like

print(zip_filename_list[0])
cta_schedule_zipfiles_raw/google_transit_2023-01-01.zip

Calling extract_date gives

print(extract_date(zip_filename_list[0]))
2023-01-01

It will split on '_' and take the last entry which is '2023-01-01.zip'. It then splits on '.' and takes the first entry, which is '2023-01-01'.

There's probably a library or regex that would be more robust though.
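
For example, a regex version might look like this; just a sketch, not something in this PR:

import re

def extract_date(fname: str) -> str:
    # Pull an ISO date (YYYY-MM-DD) out of a filename such as
    # 'cta_schedule_zipfiles_raw/google_transit_2023-01-01.zip'.
    match = re.search(r'\d{4}-\d{2}-\d{2}', fname)
    if match is None:
        raise ValueError(f'No date found in {fname}')
    return match.group(0)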

for fname in ['csv_filenames', 'zip_filenames']:
    print('Confirm that ' + ', '.join(file_dict[fname])
          + ' exist in bucket')
    _ = keys(csrt.BUCKET_PUBLIC, file_dict[fname])


The code in the keys function implicitly confirms that the file has been saved?
Is it true that file_dict is a mapping from a string to a list of strings?

Collaborator Author


Yes, the keys function confirms that a list of filenames exists in the bucket, and file_dict has values that are lists of strings.
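
So the shape being assumed is roughly this (illustrative values only; the csv name below is hypothetical, and keys(bucket, filenames) is the call already shown in the loop above):

file_dict = {
    'zip_filenames': ['cta_schedule_zipfiles_raw/google_transit_2023-01-01.zip',
                      'cta_schedule_zipfiles_raw/google_transit_2023-01-02.zip'],
    'csv_filenames': ['route_daily_summary_2023-01-01.csv'],  # hypothetical name
}

for fname in ['csv_filenames', 'zip_filenames']:
    # keys confirms each of these names exists in csrt.BUCKET_PUBLIC.
    _ = keys(csrt.BUCKET_PUBLIC, file_dict[fname])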

@dcjohnson24
Collaborator Author

dcjohnson24 commented Sep 9, 2023

It's a bit of both options. The schedule data that comes from the CTA is saved daily in s3. Only the specific files that fall in the date range are taken from there. For earlier schedule files that are unavailable from the CTA or do not exist on s3 for some reason, all of the schedule data since the start of our data collection on May 20, 2022 is pulled from transitfeeds.com and then filtered by the date range.
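
Roughly, the selection logic described above looks like this; a sketch only, with found_list standing in for whichever zip filenames were actually found in s3:

import pendulum

def dates_to_backfill(date_range, found_list):
    # Every date in the requested range...
    start = pendulum.parse(min(date_range))
    end = pendulum.parse(max(date_range))
    wanted = {dt.to_date_string() for dt in pendulum.period(start, end).range('days')}
    # ...minus the dates whose zipfiles already exist in s3. Whatever is left
    # has to come from the transitfeeds.com history (which starts 2022-05-20)
    # and then gets filtered back down to the requested range.
    return sorted(wanted - {extract_date(f) for f in found_list})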

Base automatically changed from automate-schedule-downloads to test_cronjob September 20, 2023 02:15

confirm_saved_files(s3_route_daily_summary_dict)

transitfeeds_list = list(set(zip_filename_list).difference(set(found_list)))
Member

@lauriemerrell lauriemerrell Sep 26, 2023


This is the part that I think I would break out into a one-time backfill... So instead of having this within the daily summaries, have one function in a one-time script that checks every date between 2022-05-20 and the present (maybe you can specify a smaller date range) and saves the zipfiles to S3 if they don't already exist. Maybe zipfiles from transitfeeds could have a different name or something to distinguish them.

And then all the daily summary stuff would only look for individual zipfiles in S3; if a zipfile is not present, it can just print an error and keep going, and we can fully decouple the zipfile downloading from the daily summary generation... if that makes sense?
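
A rough sketch of that backfill idea; the zip_exists_in_s3, download_from_transitfeeds, and save_zip_to_s3 helpers below are placeholders, not functions in this repo:

import pendulum

def backfill_schedule_zips(start='2022-05-20', end=None):
    # One-time script: for every date in the range, make sure a schedule
    # zipfile exists in S3, pulling it from transitfeeds.com if it does not.
    # (Zipfiles sourced from transitfeeds could get a distinguishing name,
    # per the comment above.)
    end = end or pendulum.now().to_date_string()
    period = pendulum.period(pendulum.parse(start), pendulum.parse(end))
    for dt in period.range('days'):
        date_str = dt.to_date_string()
        key = f'cta_schedule_zipfiles_raw/google_transit_{date_str}.zip'
        if not zip_exists_in_s3(key):                                   # placeholder helper
            save_zip_to_s3(key, download_from_transitfeeds(date_str))   # placeholder helpers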

Base automatically changed from test_cronjob to main October 11, 2023 01:13
@lauriemerrell
Member

We think this is superseded by #69
