-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Script for deleting userless histories from database + testing + drop unused model testing code #18079
Script for deleting userless histories from database + testing + drop unused model testing code #18079
Conversation
Packages test failure relevant, due to failed import of pytest fixtures. Will fix. |
d038a75
to
018ecfe
Compare
Add __init__.py to test.unit.data.model.db
018ecfe
to
2d9f78f
Compare
I've moved fixtures into |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks! Should be a lot for us:
galaxy_main=> SELECT count(*) FROM history WHERE user_id IS NULL AND hid_counter = 1 AND create_time < now() - interval '30 days';
count
---------
5255712
(1 row)
galaxy_main=> SELECT count(*) FROM history;
count
---------
8914573
(1 row)
Deleting unused histories | ||
------------------------- | ||
|
||
Galaxy accommodates anonymous usage by creating a default history. Often, such histories will remain unused, as a result of which the database may contain a considerable number of anonymous histories along with associated records, which serve no purpose. Deleting such records will declutter the database and free up space. However, given that a row in the history table may be referenced from multiple other tables, manually deleting such data may leave the database in an inconsistent state. Furthermore, whereas some types of data associated with such histories are clearly obsolete and can be safely deleted, others may require preservation for a variety of reasons. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is unused defined as never-used? Or does it also mean "was used, but no user associated and all datasets are purged"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Unused" here is defined as a history having hid_counter = 1 and no associated user. If, for whatever reason, the history has an associated job or dataset, that history will not be deleted. Even if the associated datasets were purged, those rows remain in the database and have to refer to a history row; deleting those histories would require deleting the associated rows, and even that is possible, that's the kind of data admins may want to preserve.
For reference, here's the output from a sample run of the script:
INFO:galaxy.model.scripts.history_table_pruner:Running batch 2 of 3: history id range 50-79:
INFO:galaxy.model.scripts.history_table_pruner: Marking histories as deleted and purged
INFO:galaxy.model.scripts.history_table_pruner: Collecting history ids
INFO:galaxy.model.scripts.history_table_pruner: Collecting ids of histories to exclude based on 6 associated tables:
INFO:galaxy.model.scripts.history_table_pruner: Collecting history_id from job_import_history_archive
INFO:galaxy.model.scripts.history_table_pruner: Collecting history_id from job_export_history_archive
INFO:galaxy.model.scripts.history_table_pruner: Collecting history_id from workflow_invocation
INFO:galaxy.model.scripts.history_table_pruner: Collecting history_id from history_dataset_collection_association
INFO:galaxy.model.scripts.history_table_pruner: Collecting history_id from job
INFO:galaxy.model.scripts.history_table_pruner: Collecting history_id from history_dataset_association
INFO:galaxy.model.scripts.history_table_pruner: Populating temporary table
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from event
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from history_audit
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from history_tag_association
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from history_annotation_association
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from history_rating_association
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from history_user_share_association
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from default_history_permissions
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from data_manager_history_association
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from cleanup_event_history_association
INFO:galaxy.model.scripts.history_table_pruner: Deleting associated records from galaxy_session_to_history
INFO:galaxy.model.scripts.history_table_pruner: Set history_id to null in galaxy_session
INFO:galaxy.model.scripts.history_table_pruner: Deleting histories in id range 50-79
This broke py312-lint tests, the following will fix it: diff --git a/lib/galaxy/model/scripts/history_table_pruner.py b/lib/galaxy/model/scripts/history_table_pruner.py
index c23813cbd7..ebd60ba063 100644
--- a/lib/galaxy/model/scripts/history_table_pruner.py
+++ b/lib/galaxy/model/scripts/history_table_pruner.py
@@ -64,7 +64,7 @@ class HistoryTablePruner:
high = get_high(low)
batch_counter = 1
while low <= self.max_id:
- log.info(f"Running batch {batch_counter} of {self.batches}: history id range {low}-{high-1}:")
+ log.info(f"Running batch {batch_counter} of {self.batches}: history id range {low}-{high - 1}:")
self._run_batch(low, high)
low = high
high = get_high(high)
@@ -76,7 +76,7 @@ class HistoryTablePruner:
return today.replace(month=today.month - 1)
def _run_batch(self, low, high):
- empty_batch_msg = f" No histories to delete in id range {low}-{high-1}"
+ empty_batch_msg = f" No histories to delete in id range {low}-{high - 1}"
self._mark_histories_as_deleted_and_purged(low, high)
histories = self._get_histories(low, high)
if not histories:
@@ -200,7 +200,7 @@ class HistoryTablePruner:
def _delete_histories(self, low, high):
"""Last step: delete histories that are safe to delete."""
- log.info(f" Deleting histories in id range {low}-{high-1}")
+ log.info(f" Deleting histories in id range {low}-{high - 1}")
stmt = text(f"DELETE FROM history WHERE id IN (SELECT id FROM {TMP_TABLE})")
with self.engine.begin() as conn:
conn.execute(stmt) |
Added commit to fix this to PR #18152 . |
Thank you for the fix, @nsoranzo! Sorry I missed that! |
Replaces #18058.
For description + initial review, please refer to #18058. This branch addresses code review comments + targets the dev branch (I think there's too much new functionality here to justify treating this as a bug fix and targeting an existing release).
I'm not sure if this is the correct way to make a script installable? Also, do we need to modify
sys.path?
(tests and fixtures are mostly extracted from #17662 and replace the dropped testing code)
TODO:
How to test the changes?
(Select all options that apply)
License