add date range option for downloads. #65
Conversation
I will take another look tonight, but my brain cannot follow these helper functions at the moment. I can't quite parse it right now, so I'll just ask.
There are two ways we could do this (probably a million, but you get it):
- pass the date range to the API we're using, so that it returns exactly the data we want
- pull all the data and then, on our end, filter out the data points that do not fall within our date range

Which one are you doing?
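For concreteness, here's a minimal sketch of the second option, assuming a helper like the `extract_date` defined later in this PR; `filter_by_date_range` is a hypothetical name, not something in the codebase:

```python
import pendulum

def extract_date(fname: str) -> str:
    # e.g. 'google_transit_2023-01-01.zip' -> '2023-01-01'
    return fname.split('_')[-1].split('.')[0]

def filter_by_date_range(filenames: list, date_range: list) -> list:
    # Option 2: pull everything, then drop the files whose embedded
    # date falls outside the requested range.
    start = pendulum.parse(min(date_range))
    end = pendulum.parse(max(date_range))
    return [f for f in filenames
            if start <= pendulum.parse(extract_date(f)) <= end]
```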
```diff
 zipfile_bytes_io = BytesIO(
     requests.get(
         f"https://transitfeeds.com/p/chicago-transit-authority"
         f"/165/{version_id}/download"
     ).content
 )
 CTA_GTFS = zipfile.ZipFile(zipfile_bytes_io)
 logging.info('Download complete')
-return CTA_GTFS
+return CTA_GTFS, zipfile_bytes_io
```
Based on this change, I think it's true that we need the `BytesIO` data in order to do what we want with the date ranges; however, other parts of the code expect the `ZipFile` data. Is it then true that we are essentially returning duplicate data here?
I'm not suggesting that we change this, just want to make sure I'm understanding it correctly.
Yeah, I didn't have a good way of obtaining the `BytesIO` object that is needed for uploading files to s3. For other parts of the code that expect the `ZipFile` output, I would use something like `CTA_GTFS, _ = download_cta_zip()`, so that the `BytesIO` object isn't created where it isn't needed.
Maybe creating a different data type would be better?
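For instance, a small `NamedTuple` could bundle both values. This is just a sketch, and the name `GTFSDownload` is made up:

```python
from io import BytesIO
from typing import NamedTuple
import zipfile

class GTFSDownload(NamedTuple):
    # Hypothetical return type bundling both views of the same download.
    archive: zipfile.ZipFile  # for code that reads the schedule contents
    buffer: BytesIO           # for code that uploads the raw bytes to s3
```

Since a `NamedTuple` still unpacks like a regular tuple, existing callers like `CTA_GTFS, _ = download_cta_zip()` would keep working, while new code could write `download_cta_zip().buffer` to make the intent explicit.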
```python
def extract_date(fname: str) -> str:
    return fname.split('_')[-1].split('.')[0]
```
I trust that this is the right way to parse it, but is there a way I can see an example of the kind of filename that is being parsed? If it's difficult, don't worry about it, just curious.
So let's say you have:

```python
date_range = ['2023-01-01', '2023-05-05']
start_date = pendulum.parse(min(date_range))
end_date = pendulum.parse(max(date_range))
period = pendulum.period(start_date, end_date)
full_date_range = [dt.to_date_string() for dt in period.range('days')]
zip_filename_list = [f'cta_schedule_zipfiles_raw/google_transit_{date}.zip'
                     for date in full_date_range]
```

An example filename would look like:

```python
print(zip_filename_list[0])
# cta_schedule_zipfiles_raw/google_transit_2023-01-01.zip
```
Calling `extract_date` gives:

```python
print(extract_date(zip_filename_list[0]))
# 2023-01-01
```
It splits on '_' and takes the last entry, which is '2023-01-01.zip'. It then splits on '.' and takes the first entry, which is '2023-01-01'.
There's probably a library or regex that would be more robust though.
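For example, a regex version might look like this; a sketch, not tested against every filename in the bucket:

```python
import re

def extract_date(fname: str) -> str:
    # Look for one ISO-style date (YYYY-MM-DD) anywhere in the filename
    # and fail loudly if it isn't there, instead of returning garbage.
    match = re.search(r'\d{4}-\d{2}-\d{2}', fname)
    if match is None:
        raise ValueError(f'No date found in filename: {fname}')
    return match.group(0)
```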
```python
for fname in ['csv_filenames', 'zip_filenames']:
    print('Confirm that ' + ', '.join(file_dict[fname])
          + ' exist in bucket')
    _ = keys(csrt.BUCKET_PUBLIC, file_dict[fname])
```
The code in the `keys` function implicitly confirms that the file has been saved?
Is it true that `file_dict` is a mapping from a string to a list of strings?
Yes, the `keys` function confirms that a list of filenames exists in the bucket, and `file_dict` has values that are lists of strings.

As for the earlier question: it's a bit of both options. The schedule data that comes from the CTA is saved daily in S3, and only the specific files that fall within the date range are taken from there. For earlier schedule files that are unavailable from the CTA or do not exist on S3 for some reason, all of the schedule data since the start of our data collection on May 20, 2022 is pulled from transitfeeds.com and then filtered by the date range.
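For reference, a `keys`-style existence check could be as simple as the following boto3 sketch; the actual helper in this repo may differ, so treat the body as an assumption:

```python
import boto3

def keys(bucket: str, filenames: list) -> list:
    # Sketch: return the subset of filenames that exist in the bucket,
    # printing a message for any that are missing.
    s3 = boto3.client('s3')
    found = []
    for fname in filenames:
        resp = s3.list_objects_v2(Bucket=bucket, Prefix=fname)
        if resp.get('KeyCount', 0) > 0:
            found.append(fname)
        else:
            print(f'{fname} not found in {bucket}')
    return found
```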
```python
confirm_saved_files(s3_route_daily_summary_dict)

transitfeeds_list = list(set(zip_filename_list).difference(set(found_list)))
```
This is the part that I think I would break out into a one-time backfill. So instead of having this within the daily summaries, have one function in a one-time script that checks every date between 2022-05-20 and the present (maybe you can specify a smaller date range) and saves the zipfiles to S3 if they don't already exist. Maybe zipfiles from transitfeeds could have a different name or something to distinguish them.
And then all the daily summary stuff would only look for individual zipfiles in S3, and if a zipfile is not present it can just print an error and keep going, so we can fully decouple the zipfile downloading from the daily summary generation. Does that make sense?
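Something like this sketch, where `exists_in_s3` and `download_from_transitfeeds_and_save` are hypothetical helpers, not functions from this repo:

```python
from typing import Optional

import pendulum

def backfill_schedules(start: str = '2022-05-20',
                       end: Optional[str] = None) -> None:
    # One-time backfill: for every date in the range, make sure a schedule
    # zipfile exists in S3, pulling it from transitfeeds.com if it doesn't.
    end = end or pendulum.now().to_date_string()
    period = pendulum.period(pendulum.parse(start), pendulum.parse(end))
    for dt in period.range('days'):
        date = dt.to_date_string()
        key = f'cta_schedule_zipfiles_raw/google_transit_{date}.zip'
        if not exists_in_s3(csrt.BUCKET_PUBLIC, key):       # hypothetical helper
            download_from_transitfeeds_and_save(date, key)  # hypothetical helper
```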
We think this is superseded by #69.
Description
Add a date range argument for saving schedule summaries. Download files from transitfeeds.com if they are not available on the CTA website, transitchicago.com.
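As a sketch, the date range argument could be wired up along these lines (the flag name and defaults here are illustrative, not the PR's exact interface):

```python
import argparse

import pendulum

parser = argparse.ArgumentParser(description='Save schedule summaries')
parser.add_argument(
    '--date-range', nargs=2, metavar=('START', 'END'),
    default=['2022-05-20', pendulum.now().to_date_string()],
    help='Start and end dates (YYYY-MM-DD) for the schedule summaries')
args = parser.parse_args()
date_range = args.date_range
```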
Resolves # [issue]
Type of change
How has this been tested?
Locally