Keep secondary page data separate from home page data (for now) #93
rviscomi added a commit that referenced this issue on Jul 1, 2022:
Reimplement HAR data pipeline with Apache Beam in Python (#93)

* python dataflow pipeline
* example dataflow job
* add support for LH, requests, and pages
* run the whole pipeline
* implement a maximum content size limit; make a deepcopy of the request before deleting the response body (see the sketch after this list)
* file notice
* optimize pipeline input (see https://stackoverflow.com/questions/60874942/avoid-recomputing-size-of-all-cloud-storage-files-in-beam-python-sdk)
* improve pipeline: fix edge case in technologies transform; try to do more work in more CPUs
* use beam bq sink
* almost working
* integrate python pipeline into sync_har
* ujson
* handle nulls
* install ujson deps on workers
* FIXES: undo ujson, fix flatmap, upgrade to beam 2.26
* it works!
* Fix response_bodies pipeline (#123): python dataflow pipeline; fix response_bodies
* Omit null response bodies (#125)
* Update HAR location to crawls directory
* Add extra check to prevent errors
* Improved exceptions
* Add support for LH desktop (#165)
* Use page URL from metadata (#166)
* Use 4 partitions for requests and response_bodies (#169)
* Update bigquery_import.py (x2)
* default url
* partitions
* debugging
* baa baa
* fix
* Ignore secondary pages in non-summary pipeline (#174): home pages only; fix crawl_depth to 0
* move bigquery_import.py
* rename `ImportHarJson` to `HarJsonToSummary`
* refactor bigquery_import.py
* rename `import_har.py` to `summary_pipeline.py`
* rename `bigquery_import.py` to `non_summary_pipeline.py`
* Summary and non-summary pipeline refactors:
  * summary_pipeline.py: factor out steps following the read (i.e. flattening, BigQuery writes, dead-lettering)
  * new custom `PipelineOptions` class for both summary_pipeline.py and non_summary_pipeline.py
* Add combined_pipeline.py
* Working progress on combined_pipeline.py
* combined_pipeline.py: non-summary tables working correctly
* constants.py: added non-summary tables and schemas
* non_summary_pipeline.py: updated partitioning logic
* non_summary_pipeline.py: added `client` and `date` to parsed data for programmatic routing to BigQuery tables
* run_pipeline.py: updated module name
* summary_pipeline.py: updated naming conventions
* transformation.py: separated HAR-to-summary code into separate classes for re-usability
* utils.py: added and updated some helper functions
* Rename summary BQ write ptransform
* Fix unittests
* Working progress on combined_pipeline.py
* combined_pipeline.py: summary tables working correctly
* constants.py: corrected non-summary schema formatting
* schemas: added non-summary table json files
* transformation.py: added deadletter logging helper; updated requestid logic
* Fix dataflow runner pickling
* add python pipeline runner scripts
* rename `non_summary_pipeline.WriteBigQuery` to `WriteNonSummaryToBigQuery`
* Remove call to `WriteNonSummaryToBigQuery.__init__().super()`
* remove `non_summary_pipeline.get_gcs_dir()`
* various updates to summary_pipeline.py
* linting fixes (x6)
* Update scripts and docs for run_pipeline.py changes
* linting fixes
* Partitioning: parameterize and add unit test
* Partitioning: add unit test
* delete run_combined_pipeline.py
* Various updates:
  * combined_pipeline.py: added `CombinedPipelineOptions`; removed `run()`; added combined/summary/non-summary pipeline conditional logic
  * non_summary_pipeline.py: explicit options for `WriteNonSummaryToBigQuery`; removed `NonSummaryPipelineOptions`, `create_pipeline()` and `run()`
  * run_pipeline.py: removed conditional pipeline logic; added/centralized `run()`
  * summary_pipeline.py: removed `SummaryPipelineOptions`, `create_pipeline()` and `run()`
* Add pipeline serialization unittest
* linting fixes
* Add home-only/secondary logic to non-summary pipeline
* Update non-summary partitioning logic
* linting fixes
* Add `--input_file` argument and fix partitioning
* trim parsed_css from pages payload (#99)
* response type (#100)
* Update modules/non_summary_pipeline.py
* Update modules/combined_pipeline.py
* linter (x2)

Co-authored-by: Rick Viscomi <[email protected]>
Co-authored-by: Barry <[email protected]>
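As a rough illustration of the pipeline shape those commits describe — not the repo's actual code — here is a minimal Apache Beam sketch: read HAR JSON from GCS, drop oversized response bodies (deep-copying the request first, as the commit log notes), and append rows to BigQuery. The bucket path, table name, and 2 MB cap are placeholders.

```python
import copy
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Assumed 2 MB cap for illustration; the real limit isn't stated here.
MAX_CONTENT_SIZE = 2 * 1024 * 1024


def trim_oversized_bodies(har):
    """Yield request entries, dropping response bodies over the cap."""
    for request in har.get('log', {}).get('entries', []):
        body = request.get('response', {}).get('content', {}).get('text') or ''
        if len(body) > MAX_CONTENT_SIZE:
            # Deep-copy before deleting so the original HAR (which other
            # branches of the pipeline may still read) is left intact.
            request = copy.deepcopy(request)
            del request['response']['content']['text']
        yield request


def run(argv=None):
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         # Placeholder path; also assumes one HAR JSON object per line.
         | 'ReadHars' >> beam.io.ReadFromText('gs://example-bucket/crawls/*.har')
         | 'ParseJson' >> beam.Map(json.loads)
         | 'TrimBodies' >> beam.FlatMap(trim_oversized_bodies)
         | 'WriteRequests' >> beam.io.WriteToBigQuery(
             'example-project:httparchive.requests',  # placeholder table
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


if __name__ == '__main__':
    run()
```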
The `pages.2022_06_01_desktop` table contains desktop page data for home pages. It's not clear that the `pages.2022_06_09_desktop` table contains desktop page data for both home pages and secondary pages. Secondary pages are still an experimental feature, so we should keep them separate from the stable home page data. We're doing that to some extent with the `_01` DD naming scheme for home-only tables, but this may still be causing confusion about which table to use, and it's already interfering with our automated analysis pipeline for reports on httparchive.org.
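To make the ambiguity concrete: nothing in the table name tells a consumer whether a row is a home page or a secondary page. Per the commit log above, home pages are crawled at depth 0, so a query or transform would have to inspect page metadata, roughly like this hypothetical check (the metadata/`crawl_depth` dict shape is assumed, not the actual schema):

```python
import json


def is_home_page(page_row):
    # Hypothetical: parse the page's metadata JSON and check its crawl
    # depth; the commits above use crawl_depth == 0 for home pages.
    metadata = json.loads(page_row.get('metadata') or '{}')
    return metadata.get('crawl_depth', 0) == 0
```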
To alleviate this until the `all` dataset is ready in #15, move the home+secondary page tables to `experimental_`-prefixed datasets. Their DD names could be renamed back to `01` for consistency with their corresponding home-only tables.
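As a sketch of the proposed layout (the function and its parameters are assumptions drawn from this issue, not the actual implementation), the destination could be derived like so, with home+secondary data routed to an `experimental_`-prefixed dataset and both variants sharing the `_01` DD:

```python
from datetime import date


def destination_table(dataset, crawl_date, client, includes_secondary):
    # Both variants use the _01 DD for consistency; the experimental_
    # prefix is what separates home+secondary data from home-only data.
    prefix = 'experimental_' if includes_secondary else ''
    return f'{prefix}{dataset}.{crawl_date.strftime("%Y_%m_01")}_{client}'


# e.g. destination_table('pages', date(2022, 6, 1), 'desktop', True)
#      -> 'experimental_pages.2022_06_01_desktop'
```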