
add date range option for downloads. #65

Closed · wants to merge 9 commits
Conversation

dcjohnson24
Collaborator

Description

Add a date range argument for saving schedule summaries. Download files from transitfeeds.com if they are not available on the CTA website, transitchicago.com.

Resolves # [issue]

Type of change

  • Bug fix
  • New functionality
  • Documentation

How has this been tested?

Locally


@mesterhammerfic mesterhammerfic left a comment


I will take another look tonight, but my brain cannot follow these helper functions at the moment.

I can't quite parse it right now, so I'll just ask.
There are two ways we could do this (probably a million, but you get it):

  1. pass the date range to the API we're using, and that will return exactly the data we want
  2. pull all the data and then, on our end, filter out the data points that do not fall within our date range.

Which one are you doing?

Comment on lines +362 to +370
zipfile_bytes_io = BytesIO(
    requests.get(
        f"https://transitfeeds.com/p/chicago-transit-authority"
        f"/165/{version_id}/download"
    ).content
)
CTA_GTFS = zipfile.ZipFile(zipfile_bytes_io)
logging.info('Download complete')
return CTA_GTFS, zipfile_bytes_io


Based on this change, I think it's true that the BytesIO data is necessary to do what we want with the date ranges; however, other parts of the code expect the ZipFile data. Is it then true that we are essentially returning duplicate data here?
I'm not suggesting that we change this, just want to make sure I'm understanding it correctly.

Collaborator Author


Yeah, I didn't have a good way of obtaining the BytesIO object that is needed for uploading files to s3. For other parts of the code that expect the ZipFile output, I would use something like CTA_GTFS, _ = download_cta_zip(), so that the BytesIO object is discarded where it isn't needed.

Maybe creating a different data type would be better?
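
For instance, a small named return type could make the dual output explicit. This is only a sketch of that idea, not code from this PR; the download_cta_zip name and signature here are assumed for illustration:

from io import BytesIO
from typing import NamedTuple
import zipfile

import requests

class ScheduleFeed(NamedTuple):
    gtfs: zipfile.ZipFile   # parsed archive used by the schedule summary code
    raw_bytes: BytesIO      # raw bytes needed for the s3 upload

def download_cta_zip(version_id: str) -> ScheduleFeed:
    # URL pattern follows the snippet above; version_id is a transitfeeds version.
    zipfile_bytes_io = BytesIO(
        requests.get(
            f"https://transitfeeds.com/p/chicago-transit-authority"
            f"/165/{version_id}/download"
        ).content
    )
    return ScheduleFeed(zipfile.ZipFile(zipfile_bytes_io), zipfile_bytes_io)

# Callers that only need the archive can use download_cta_zip(version_id).gtfs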

_ = keys(csrt.BUCKET_PUBLIC, file_dict[fname])

def extract_date(fname: str) -> str:
    return fname.split('_')[-1].split('.')[0]


I trust that this is the right way to parse it, but is there a way I can see an example of the type of file that is being parsed? If it's difficult, don't worry about it; just curious.

Collaborator Author


So let's say you have

import pendulum

date_range = ['2023-01-01', '2023-05-05']
start_date = pendulum.parse(min(date_range))
end_date = pendulum.parse(max(date_range))
period = pendulum.period(start_date, end_date)
full_date_range = [dt.to_date_string() for dt in period.range('days')]
zip_filename_list = [f'cta_schedule_zipfiles_raw/google_transit_{date}.zip'
                     for date in full_date_range]

An example filename would look like

print(zip_filename_list[0])
cta_schedule_zipfiles_raw/google_transit_2023-01-01.zip

Calling extract_date gives

print(extract_date(zip_filename_list[0]))
2023-01-01

It will split on '_' and take the last entry which is '2023-01-01.zip'. It then splits on '.' and takes the first entry, which is '2023-01-01'.

There's probably a library or regex that would be more robust though.
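
For example, a regex version might look like this; just a sketch, not something in this PR:

import re

def extract_date(fname: str) -> str:
    # Pull an ISO date (YYYY-MM-DD) out of a filename such as
    # 'cta_schedule_zipfiles_raw/google_transit_2023-01-01.zip'.
    match = re.search(r'\d{4}-\d{2}-\d{2}', fname)
    if match is None:
        raise ValueError(f'No date found in {fname}')
    return match.group(0)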

for fname in ['csv_filenames', 'zip_filenames']:
    print('Confirm that ' + ', '.join(file_dict[fname])
          + ' exist in bucket')
    _ = keys(csrt.BUCKET_PUBLIC, file_dict[fname])


The code in the keys function implicitly confirms that the file has been saved?
Is it true that file_dict is a mapping from a string to a list of strings?

Collaborator Author


Yes, the keys function confirms that a list of filenames exists in the bucket, and file_dict has values that are lists of strings.
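
So the shape being assumed is roughly this (illustrative values only; the csv name below is hypothetical, and keys(bucket, filenames) is the call already shown in the loop above):

file_dict = {
    'zip_filenames': ['cta_schedule_zipfiles_raw/google_transit_2023-01-01.zip',
                      'cta_schedule_zipfiles_raw/google_transit_2023-01-02.zip'],
    'csv_filenames': ['route_daily_summary_2023-01-01.csv'],  # hypothetical name
}

for fname in ['csv_filenames', 'zip_filenames']:
    # keys confirms each of these names exists in csrt.BUCKET_PUBLIC.
    _ = keys(csrt.BUCKET_PUBLIC, file_dict[fname])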

@dcjohnson24
Collaborator Author

dcjohnson24 commented Sep 9, 2023

It's a bit of both options. The schedule data that comes from the CTA is saved daily in s3. Only the specific files that fall in the date range are taken from there. For earlier schedule files that are unavailable from the CTA or do not exist on s3 for some reason, all of the schedule data since the start of our data collection on May 20, 2022 is pulled from transitfeeds.com and then filtered by the date range.
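
Roughly, the selection logic described above looks like this; a sketch only, with found_list standing in for whichever zip filenames were actually found in s3:

import pendulum

def dates_to_backfill(date_range, found_list):
    # Every date in the requested range...
    start = pendulum.parse(min(date_range))
    end = pendulum.parse(max(date_range))
    wanted = {dt.to_date_string() for dt in pendulum.period(start, end).range('days')}
    # ...minus the dates whose zipfiles already exist in s3. Whatever is left
    # has to come from the transitfeeds.com history (which starts 2022-05-20)
    # and then gets filtered back down to the requested range.
    return sorted(wanted - {extract_date(f) for f in found_list})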

Base automatically changed from automate-schedule-downloads to test_cronjob September 20, 2023 02:15

confirm_saved_files(s3_route_daily_summary_dict)

transitfeeds_list = list(set(zip_filename_list).difference(set(found_list)))
Member

@lauriemerrell lauriemerrell Sep 26, 2023


This is the part that I think I would break out into a one-time backfill... So instead of having this within the daily summaries, have one function in a one-time script that checks every date between 2022-05-20 and the present (maybe you can specify a smaller date range) and saves the zipfiles to S3 if they don't already exist. Maybe zipfiles from transitfeeds could have a different name or something to distinguish them.

And then all the daily summary stuff would only look for individual zipfiles in S3; if a zipfile is not present, it can just print an error and keep going, and we can fully decouple the zipfile downloading from the daily summary generation... if that makes sense?
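
A rough sketch of that backfill idea; the zip_exists_in_s3, download_from_transitfeeds, and save_zip_to_s3 helpers below are placeholders, not functions in this repo:

import pendulum

def backfill_schedule_zips(start='2022-05-20', end=None):
    # One-time script: for every date in the range, make sure a schedule
    # zipfile exists in S3, pulling it from transitfeeds.com if it does not.
    # (Zipfiles sourced from transitfeeds could get a distinguishing name,
    # per the comment above.)
    end = end or pendulum.now().to_date_string()
    period = pendulum.period(pendulum.parse(start), pendulum.parse(end))
    for dt in period.range('days'):
        date_str = dt.to_date_string()
        key = f'cta_schedule_zipfiles_raw/google_transit_{date_str}.zip'
        if not zip_exists_in_s3(key):                                   # placeholder helper
            save_zip_to_s3(key, download_from_transitfeeds(date_str))   # placeholder helpers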

Base automatically changed from test_cronjob to main October 11, 2023 01:13
@lauriemerrell
Member

We think this is superseded by #69
