Keep secondary page data separate from home page data (for now) #93

Closed
rviscomi opened this issue Jun 22, 2022 · 0 comments
The pages.2022_06_01_desktop table contains desktop page data for home pages only. It's not clear from its name that the pages.2022_06_09_desktop table contains desktop page data for both home pages and secondary pages.

Secondary pages are still an experimental feature, so we should keep them separate from the stable home page data. We're doing that to some extent with the _01 DD naming scheme for home-only tables, but this may still be causing confusion about which table to use and is already interfering with our automated analysis pipeline for reports on httparchive.org.

To alleviate this until the `all` dataset is ready in #15, move the home+secondary page tables to `experimental_`-prefixed datasets. The DD part of their names could be set back to 01 for consistency with their corresponding home-only tables.

| Old | New | Eventually |
| --- | --- | --- |
| pages.2022_06_09_desktop | experimental_pages.2022_06_01_desktop | all.pages (date=2022-06-01, client=desktop) |
| lighthouse.2022_05_16_mobile | experimental_lighthouse.2022_05_01_mobile | all.pages (date=2022-05-01, client=mobile) |
| summary_requests.2022_06_09_desktop | experimental_summary_requests.2022_06_01_desktop | all.requests (date=2022-06-01, client=desktop) |
| summary_pages.2022_06_01_mobile | (no change, already home-only) | all.pages (date=2022-06-01, client=mobile) |
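
The renaming rule above can be sketched in Python as a small mapping from an old table name to its proposed name. This is only an illustration: `experimental_name` is a hypothetical helper, not something in the HTTP Archive codebase, and it assumes that any table whose DD part is already `01` is home-only and stays where it is.

```python
import re

# Hypothetical pattern for "<dataset>.<YYYY>_<MM>_<DD>_<client>" table names.
_TABLE_RE = re.compile(
    r"^(?P<dataset>\w+)\.(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})"
    r"_(?P<client>desktop|mobile)$"
)


def experimental_name(table: str) -> str:
    """Return the proposed name for a table under the renaming rule.

    Assumption: tables dated _01 are home-only and are left unchanged;
    any other DD marks a home+secondary table that moves to an
    experimental_-prefixed dataset with DD reset to 01.
    """
    match = _TABLE_RE.match(table)
    if match is None:
        raise ValueError(f"unrecognized table name: {table!r}")
    if match["day"] == "01":
        return table  # already home-only, no change
    return (
        f"experimental_{match['dataset']}."
        f"{match['year']}_{match['month']}_01_{match['client']}"
    )
```

For example, `experimental_name("pages.2022_06_09_desktop")` gives `experimental_pages.2022_06_01_desktop`, matching the first row of the table.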
rviscomi added this to the M2: Utilizing capacity milestone Jun 22, 2022
rviscomi self-assigned this Jun 22, 2022
rviscomi added a commit that referenced this issue Jul 1, 2022
* Reimplement HAR data pipeline with Apache Beam in Python (#93)

* python dataflow pipeline

* example dataflow job

* add support for LH, requests, and pages

* run the whole pipeline

* implement a maximum content size limit

make a deepcopy of the request before deleting the response body

* file notice

* optimize pipeline input

see https://stackoverflow.com/questions/60874942/avoid-recomputing-size-of-all-cloud-storage-files-in-beam-python-sdk

* improve pipeline

fix edge case in technologies transform
try to do more work in more CPUs

* use beam bq sink

* almost working

* integrate python pipeline into sync_har

* ujson

* handle nulls

* install ujson deps on workers

* FIXES

undo ujson, fix flatmap, upgrade to beam 2.26

* it works!

* Fix response_bodies pipeline (#123)

* python dataflow pipeline

* fix response_bodies

* Omit null response bodies (#125)

* Update HAR location to crawls directory

* Add extra check to prevent errors

* Improved exceptions

* Add support for LH desktop (#165)

* Use page URL from metadata (#166)

* Use 4 partitions for requests and response_bodies (#169)

* Update bigquery_import.py

* Update bigquery_import.py

* default url

* partitions

* debugging

* baa baa

* fix

* Ignore secondary pages in non-summary pipeline (#174)

* home pages only

* fix crawl_depth to 0

* move bigquery_import.py

* rename `ImportHarJson` to `HarJsonToSummary`

* refactor bigquery_import.py

* rename `import_har.py` to `summary_pipeline.py`

* rename `bigquery_import.py` to `non_summary_pipeline.py`

* Summary and non-summary pipeline refactors

* summary_pipeline.py: factor out steps following the read (i.e. flattening, BigQuery writes, dead-lettering)

* new custom `PipelineOptions` class for both summary_pipeline.py and non_summary_pipeline.py

* Add combined_pipeline.py

* Work in progress on combined_pipeline.py

* combined_pipeline.py: non-summary tables working correctly

* constants.py: added non-summary tables and schemas

* non_summary_pipeline.py: updated partitioning logic

* non_summary_pipeline.py: added `client` and `date` to parsed data for programmatic routing to BigQuery tables

* run_pipeline.py: updated module name

* summary_pipeline.py: updated naming conventions

* transformation.py: separated HAR to summary code into separate classes for re-usability

* utils.py: added and updated some helper functions

* Rename summary BQ write ptransform

* Fix unittests

* Work in progress on combined_pipeline.py

* combined_pipeline.py: summary tables working correctly

* constants.py: corrected non-summary schema formatting

* schemas: added non-summary table json files

* transformation.py: added deadletter logging helper; updated requestid logic

* Fix dataflow runner pickling

* add python pipeline runner scripts

* rename `non_summary_pipeline.WriteBigQuery` to `WriteNonSummaryToBigQuery`

* Remove call to `WriteNonSummaryToBigQuery.__init__().super()`

* remove `non_summary_pipeline.get_gcs_dir()`

* various updates to summary_pipeline.py

* linting fixes

* linting fixes

* linting fixes

* linting fixes

* linting fixes

* linting fixes

* Update scripts and docs for run_pipeline.py changes

* linting fixes

* Partitioning: parameterize and add unit test

* Partitioning: add unit test

* delete run_combined_pipeline.py

* Various updates

* combined_pipeline.py: added `CombinedPipelineOptions`; removed `run()`; added combined/summary/non-summary pipeline conditional logic

* non_summary_pipeline.py: explicit options for `WriteNonSummaryToBigQuery`; removed `NonSummaryPipelineOptions`, `create_pipeline()` and `run()`

* run_pipeline.py: removed conditional pipeline logic; added/centralized `run()`

* summary_pipeline.py: removed `SummaryPipelineOptions`, `create_pipeline()` and `run()`

* Add pipeline serialization unittest

* linting fixes

* Add home-only/secondary logic to non-summary pipeline

* Update non-summary partitioning logic

* linting fixes

* Add `--input_file` argument and fix partitioning

* trim parsed_css from pages payload (#99)

* response type (#100)

* Update modules/non_summary_pipeline.py

* Update modules/combined_pipeline.py

* linter

* linter

Co-authored-by: Rick Viscomi <[email protected]>
Co-authored-by: Barry <[email protected]>
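
One step in the commit message, "make a deepcopy of the request before deleting the response body", may be worth illustrating. The sketch below is not the pipeline's actual code: `trim_request`, `MAX_CONTENT_SIZE`, and the 2 MiB value are hypothetical, and the `response.content.text` layout is assumed from the HAR format. It shows why the copy matters: deleting the body from a copy leaves the original record intact for any other consumer, such as a separate response-bodies output.

```python
import copy

# Assumed limit for illustration only; the real pipeline's value may differ.
MAX_CONTENT_SIZE = 2 * 1024 * 1024


def trim_request(request: dict, max_size: int = MAX_CONTENT_SIZE) -> dict:
    """Return the request, dropping an oversized response body from a copy.

    The input dict is never mutated: when the body exceeds max_size bytes,
    a deep copy is made and the body is deleted from the copy only.
    """
    content = request.get("response", {}).get("content", {})
    text = content.get("text")
    if text is None or len(text.encode("utf-8")) <= max_size:
        return request  # nothing to trim
    trimmed = copy.deepcopy(request)
    del trimmed["response"]["content"]["text"]
    return trimmed
```

A shallow copy would not be enough here, because the nested `response` and `content` dicts would still be shared with the original record.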