
Operations cleanup #3589

Merged: 2 commits merged into distributed-system-analysis:main from bigindex on Jan 2, 2024
Conversation

@dbutenhof dbutenhof (Member) commented Dec 25, 2023

This addresses several issues encountered while monitoring the migration of tarballs from the passthrough server backup directories to the new production server.

First, I've seen `PUT /upload` problems more frequently than anticipated, and when transferring thousands of tarballs the error details are easily hidden: I've improved the way they're captured and reported at the end. Also, having observed many of the NGINX HTML-format response messages, I decided to try scraping the `<title>` tag text, which seems to contain the real error message, using BeautifulSoup.
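
The title scrape is roughly this shape (a minimal sketch; the function name and the fallback to the raw response text are illustrative, not the actual utility code):

```python
from bs4 import BeautifulSoup


def extract_error(response_text: str) -> str:
    """Pull the useful error message out of an NGINX HTML error page.

    NGINX wraps the real message (e.g. "413 Request Entity Too Large")
    in the <title> tag, so prefer that; otherwise fall back to the raw
    response text.
    """
    soup = BeautifulSoup(response_text, "html.parser")
    if soup.title and soup.title.string:
        return soup.title.string.strip()
    return response_text.strip()
```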

Second, I ran into a set of tarballs from 2020 whose `metadata.log` files don't contain `run.controller` values. These, it turns out, fall into a hole in intake processing. Without a `metadata.log` at all, we just ignore the problem and use a default "controller" of `unknown`, but if only that specific value is missing we fail the upload entirely with a poorly worded error message. It makes more sense to treat a missing `run.controller` the same way as a missing `metadata.log`.
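
In other words, the intake fallback ends up looking something like this (an illustrative sketch only; `get_controller` and the dict-based metadata access are stand-ins for the actual intake code):

```python
from typing import Optional


def get_controller(metadata: Optional[dict]) -> str:
    """Resolve the dataset controller, defaulting when metadata is absent."""
    if not metadata:
        # No metadata.log at all: already handled with a default controller.
        return "unknown"
    controller = metadata.get("run", {}).get("controller")
    # Treat a missing run.controller value the same way instead of failing
    # the upload with an obscure error.
    return controller if controller else "unknown"
```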

Third, I've seen indexing failures on large "batches" (trying to index thousands of datasets in one run of the indexer) blowing up with memory problems that don't reproduce. Although it's not obvious from glancing through the main indexer loop, it seems likely there's a memory leak somewhere that gradually builds up. Since I can't find it (and I'm on vacation, so I didn't look excessively hard), I took another approach I'd considered earlier anyway and rejiggered `Sync.update` to allow adding a SQL `LIMIT` to the query for `READY` datasets. This shouldn't have much impact on throughput, as the indexer is serial and restarts every minute if it's not already/still busy, but it may keep the memory buildup below the danger threshold.
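
The `LIMIT` plumbing amounts to something like the following (a sketch assuming a SQLAlchemy query; the `Dataset` model and the `ready_datasets` helper are stand-ins here, not the real `Sync.update` signature):

```python
from typing import Optional

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Dataset(Base):
    """Stand-in ORM model; the real Dataset model lives in the server code."""

    __tablename__ = "datasets"
    id = Column(Integer, primary_key=True)
    state = Column(String)


def ready_datasets(session: Session, limit: Optional[int] = None) -> list:
    """Query datasets in the READY state, optionally capped with a SQL LIMIT."""
    query = session.query(Dataset).filter(Dataset.state == "READY")
    if limit is not None:
        # Bound how many datasets one indexer pass picks up, so any
        # per-dataset memory growth stays below the danger threshold; the
        # next pass (a fresh run a minute later) picks up the remainder.
        query = query.limit(limit)
    return query.all()
```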

Only the migration utility changes have actually been tested "live", but the tests run.

@dbutenhof dbutenhof added the Server, Contrib, Indexing, API, Database, and Operations labels Dec 25, 2023
@dbutenhof dbutenhof self-assigned this Dec 25, 2023
@webbnh webbnh left a comment


Looks great!

@dbutenhof dbutenhof merged commit 010037d into distributed-system-analysis:main Jan 2, 2024
4 checks passed
@dbutenhof dbutenhof deleted the bigindex branch January 2, 2024 23:35
webbnh pushed a commit that referenced this pull request Jan 4, 2024