Keep secondary page data separate from home page data (for now) #93
rviscomi added a commit that referenced this issue on Jul 1, 2022:
Reimplement HAR data pipeline with Apache Beam in Python (#93)

* python dataflow pipeline
* example dataflow job
* add support for LH, requests, and pages
* run the whole pipeline
* implement a maximum content size limit; make a deepcopy of the request before deleting the response body (see the sketch after this list)
* file notice
* optimize pipeline input (see https://stackoverflow.com/questions/60874942/avoid-recomputing-size-of-all-cloud-storage-files-in-beam-python-sdk)
* improve pipeline: fix edge case in technologies transform; try to do more work in more CPUs
* use beam bq sink
* almost working
* integrate python pipeline into sync_har
* ujson
* handle nulls
* install ujson deps on workers
* FIXES: undo ujson, fix flatmap, upgrade to beam 2.26
* it works!
* Fix response_bodies pipeline (#123): python dataflow pipeline; fix response_bodies
* Omit null response bodies (#125)
* Update HAR location to crawls directory
* Add extra check to prevent errors
* Improved exceptions
* Add support for LH desktop (#165)
* Use page URL from metadata (#166)
* Use 4 partitions for requests and response_bodies (#169)
* Update bigquery_import.py (x2)
* default url
* partitions
* debugging
* baa baa
* fix
* Ignore secondary pages in non-summary pipeline (#174): home pages only; fix crawl_depth to 0
* move bigquery_import.py
* rename `ImportHarJson` to `HarJsonToSummary`
* refactor bigquery_import.py
* rename `import_har.py` to `summary_pipeline.py`
* rename `bigquery_import.py` to `non_summary_pipeline.py`
* Summary and non-summary pipeline refactors:
  * summary_pipeline.py: factor out steps following the read (i.e. flattening, BigQuery writes, dead-lettering)
  * new custom `PipelineOptions` class for both summary_pipeline.py and non_summary_pipeline.py
* Add combined_pipeline.py
* Working progress on combined_pipeline.py
* combined_pipeline.py: non-summary tables working correctly
* constants.py: added non-summary tables and schemas
* non_summary_pipeline.py: updated partitioning logic
* non_summary_pipeline.py: added `client` and `date` to parsed data for programmatic routing to BigQuery tables
* run_pipeline.py: updated module name
* summary_pipeline.py: updated naming conventions
* transformation.py: separated HAR-to-summary code into separate classes for re-usability
* utils.py: added and updated some helper functions
* Rename summary BQ write ptransform
* Fix unittests
* Working progress on combined_pipeline.py
* combined_pipeline.py: summary tables working correctly
* constants.py: corrected non-summary schema formatting
* schemas: added non-summary table json files
* transformation.py: added deadletter logging helper; updated requestid logic
* Fix dataflow runner pickling
* add python pipeline runner scripts
* rename `non_summary_pipeline.WriteBigQuery` to `WriteNonSummaryToBigQuery`
* Remove call to `WriteNonSummaryToBigQuery.__init__().super()`
* remove `non_summary_pipeline.get_gcs_dir()`
* various updates to summary_pipeline.py
* linting fixes (x6)
* Update scripts and docs for run_pipeline.py changes
* linting fixes
* Partitioning: parameterize and add unit test
* Partitioning: add unit test
* delete run_combined_pipeline.py
* Various updates:
  * combined_pipeline.py: added `CombinedPipelineOptions`; removed `run()`; added combined/summary/non-summary pipeline conditional logic
  * non_summary_pipeline.py: explicit options for `WriteNonSummaryToBigQuery`; removed `NonSummaryPipelineOptions`, `create_pipeline()` and `run()`
  * run_pipeline.py: removed conditional pipeline logic; added/centralized `run()`
  * summary_pipeline.py: removed `SummaryPipelineOptions`, `create_pipeline()` and `run()`
* Add pipeline serialization unittest
* linting fixes
* Add home-only/secondary logic to non-summary pipeline
* Update non-summary partitioning logic
* linting fixes
* Add `--input_file` argument and fix partitioning
* trim parsed_css from pages payload (#99)
* response type (#100)
* Update modules/non_summary_pipeline.py
* Update modules/combined_pipeline.py
* linter (x2)

Co-authored-by: Rick Viscomi <[email protected]>
Co-authored-by: Barry <[email protected]>
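As a rough illustration of the pipeline shape those commits describe — not the repo's actual code — here is a minimal Apache Beam sketch: read HAR JSON from GCS, drop oversized response bodies (deep-copying the request first, as the commit log notes), and append rows to BigQuery. The bucket path, table name, and 2 MB cap are placeholders.

```python
import copy
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Assumed 2 MB cap for illustration; the real limit isn't stated here.
MAX_CONTENT_SIZE = 2 * 1024 * 1024


def trim_oversized_bodies(har):
    """Yield request entries, dropping response bodies over the cap."""
    for request in har.get('log', {}).get('entries', []):
        body = request.get('response', {}).get('content', {}).get('text') or ''
        if len(body) > MAX_CONTENT_SIZE:
            # Deep-copy before deleting so the original HAR (which other
            # branches of the pipeline may still read) is left intact.
            request = copy.deepcopy(request)
            del request['response']['content']['text']
        yield request


def run(argv=None):
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         # Placeholder path; also assumes one HAR JSON object per line.
         | 'ReadHars' >> beam.io.ReadFromText('gs://example-bucket/crawls/*.har')
         | 'ParseJson' >> beam.Map(json.loads)
         | 'TrimBodies' >> beam.FlatMap(trim_oversized_bodies)
         | 'WriteRequests' >> beam.io.WriteToBigQuery(
             'example-project:httparchive.requests',  # placeholder table
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


if __name__ == '__main__':
    run()
```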
The `pages.2022_06_01_desktop` table contains desktop page data for home pages. It's not clear that the `pages.2022_06_09_desktop` table contains desktop page data for both home pages and secondary pages. Secondary pages are still an experimental feature, so we should keep them separate from the stable home page data. We're doing that to some extent with the `_01` DD naming scheme for home-only tables, but this may still be causing confusion about which table to use, and it's already interfering with our automated analysis pipeline for reports on httparchive.org.
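To make the ambiguity concrete: nothing in the table name tells a consumer whether a row is a home page or a secondary page. Per the commit log above, home pages are crawled at depth 0, so a query or transform would have to inspect page metadata, roughly like this hypothetical check (the metadata/`crawl_depth` dict shape is assumed, not the actual schema):

```python
import json


def is_home_page(page_row):
    # Hypothetical: parse the page's metadata JSON and check its crawl
    # depth; the commits above use crawl_depth == 0 for home pages.
    metadata = json.loads(page_row.get('metadata') or '{}')
    return metadata.get('crawl_depth', 0) == 0
```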
To alleviate this until the `all` dataset is ready in #15, move the home+secondary page tables to `experimental_`-prefixed datasets. Their DD names could be renamed back to `01` for consistency with their corresponding home-only tables.
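As a sketch of the proposed layout (the function and its parameters are assumptions drawn from this issue, not the actual implementation), the destination could be derived like so, with home+secondary data routed to an `experimental_`-prefixed dataset and both variants sharing the `_01` DD:

```python
from datetime import date


def destination_table(dataset, crawl_date, client, includes_secondary):
    # Both variants use the _01 DD for consistency; the experimental_
    # prefix is what separates home+secondary data from home-only data.
    prefix = 'experimental_' if includes_secondary else ''
    return f'{prefix}{dataset}.{crawl_date.strftime("%Y_%m_01")}_{client}'


# e.g. destination_table('pages', date(2022, 6, 1), 'desktop', True)
#      -> 'experimental_pages.2022_06_01_desktop'
```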