-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Operations cleanup #3589
Merged
dbutenhof
merged 2 commits into
distributed-system-analysis:main
from
dbutenhof:bigindex
Jan 2, 2024
Merged
Operations cleanup #3589
dbutenhof
merged 2 commits into
distributed-system-analysis:main
from
dbutenhof:bigindex
Jan 2, 2024
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
dbutenhof
added
Server
Contrib
Indexing
API
Of and relating to application programming interfaces to services and functions
Database
Operations
Related to operation and monitoring of a service
labels
Dec 25, 2023
This addresses several issues encountered while monitoring the migration of tarballs from the passthrough server backup directories to the new production server. First, I've seen `PUT /upload` problems more frequently than anticipated, and when transferring thousands of tarballs the error details get easily hidden: I've improved the way they're captured and reported at the end. Also, having observed many of the NGINX `html` format response messages, I decided to try scaping the text for the `<title>` tag text, which seems to contain the real error message, using BeautifulSoup. Second, I ran into a set of tarballs from 2020 which seem to have `metadata.log` files which don't contain `run.controller` values. These, it turns out, fall into a hole in intake processing. Without a `metadata.log` at all, we just ignore the problem and use a default "controller" of `unknown`, but if the specific value is missing we fail the upload entirely with a poorly worded error message. It makes more sense to treat a missing `run.controller` the same way as a missing `metadata.log`. Third, I've seen indexing failures on large "batches" (trying to index thousands of datasets in one run of the indexer) blowing up with memory problems that don't reproduce. Although it's not obvious from glancing through the main indexer loop, it seems likely there's a memory leak somewhere that's gradually building up. Since I can't find it (and I'm on vacation, so I didn't look excessively hard), I took another approach I'd considered earlier anyway and rejiggered the `Sync.update` to allow adding a SQL `LIMIT` to the query for `READY` datasets. This shouldn't have much impact on throughput as the indexer is serial and restarts every minute if it's not already/still busy, but it may keep the memory buildup below the danger threshold. Only the migration utility changes have actually been tested "live", but the tests run.
webbnh
approved these changes
Jan 2, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This addresses several issues encountered while monitoring the migration of tarballs from the passthrough server backup directories to the new production server.
First, I've seen
PUT /upload
problems more frequently than anticipated, and when transferring thousands of tarballs the error details get easily hidden: I've improved the way they're captured and reported at the end. Also, having observed many of the NGINXhtml
format response messages, I decided to try scraping the text for the<title>
tag text, which seems to contain the real error message, using BeautifulSoup.Second, I ran into a set of tarballs from 2020 which seem to have
metadata.log
files which don't containrun.controller
values. These, it turns out, fall into a hole in intake processing. Without ametadata.log
at all, we just ignore the problem and use a default "controller" ofunknown
, but if the specific value is missing we fail the upload entirely with a poorly worded error message. It makes more sense to treat a missingrun.controller
the same way as a missingmetadata.log
.Third, I've seen indexing failures on large "batches" (trying to index thousands of datasets in one run of the indexer) blowing up with memory problems that don't reproduce. Although it's not obvious from glancing through the main indexer loop, it seems likely there's a memory leak somewhere that's gradually building up. Since I can't find it (and I'm on vacation, so I didn't look excessively hard), I took another approach I'd considered earlier anyway and rejiggered the
Sync.update
to allow adding a SQLLIMIT
to the query forREADY
datasets. This shouldn't have much impact on throughput as the indexer is serial and restarts every minute if it's not already/still busy, but it may keep the memory buildup below the danger threshold.Only the migration utility changes have actually been tested "live", but the tests run.