Script for hard removal of unused histories #17725
@mvdbeek So do you think we should add this as an executable script to
I think this is likely only required for large, public instances that allow anonymous access. I think the first step should be just improving it with fixes for what you already found, running it against test and main, and then maybe including it in gxadmin or pgcleanup.py.
Here's my proposal. There are multiple tables that reference the history table. Among those tables there are some where we don't want to keep records that don't have a corresponding history record (like a history tag association), and others where we do (e.g. job, hda, etc.). For the former case we delete the row; for the latter we set the column referencing the history to NULL. For that to be possible, we need to modify the database schema and alter the column definitions in 6 tables (job, history_dataset_association, history_dataset_collection_association, workflow_invocation, job_export_history_archive, job_import_history_archive) so they accept NULL values.
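For illustration, a minimal sketch of that schema change, assuming the six tables are the ones whose history_id the script below sets to NULL (in practice this would go through a regular schema migration rather than raw DDL):
-- allow NULL in the history-referencing columns so the script below can clear them
ALTER TABLE job ALTER COLUMN history_id DROP NOT NULL;
ALTER TABLE history_dataset_association ALTER COLUMN history_id DROP NOT NULL;
ALTER TABLE history_dataset_collection_association ALTER COLUMN history_id DROP NOT NULL;
ALTER TABLE workflow_invocation ALTER COLUMN history_id DROP NOT NULL;
ALTER TABLE job_export_history_archive ALTER COLUMN history_id DROP NOT NULL;
ALTER TABLE job_import_history_archive ALTER COLUMN history_id DROP NOT NULL;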
That would require a db migration. Next, we'd run this script:
CREATE TEMPORARY TABLE tmp_unused_history(id INT PRIMARY KEY);
INSERT INTO tmp_unused_history
SELECT id FROM history WHERE user_id IS NULL AND hid_counter = 1 AND update_time < '01-01-2015';
BEGIN TRANSACTION;
-- delete rows we don't need to keep
DELETE FROM event
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM history_audit
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM history_tag_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM history_annotation_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM history_rating_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM history_user_share_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM default_history_permissions
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM data_manager_history_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM cleanup_event_history_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM galaxy_session_to_history
WHERE history_id IN (SELECT id FROM tmp_unused_history);
-- set history id to NULL in rows we want to keep
UPDATE job
SET history_id = NULL
WHERE history_id in (SELECT id FROM tmp_unused_history);
UPDATE history_dataset_association
SET history_id = NULL
WHERE history_id in (SELECT id FROM tmp_unused_history);
UPDATE history_dataset_collection_association
SET history_id = NULL
WHERE history_id in (SELECT id FROM tmp_unused_history);
UPDATE workflow_invocation
SET history_id = NULL
WHERE history_id in (SELECT id FROM tmp_unused_history);
UPDATE job_export_history_archive
SET history_id = NULL
WHERE history_id in (SELECT id FROM tmp_unused_history);
UPDATE job_import_history_archive
SET history_id = NULL
WHERE history_id in (SELECT id FROM tmp_unused_history);
UPDATE galaxy_session
SET current_history_id = NULL
WHERE current_history_id in (SELECT id FROM tmp_unused_history);
-- delete history rows
DELETE FROM history WHERE id IN (SELECT id FROM tmp_unused_history);
COMMIT TRANSACTION;
Questions:
Does this make sense? Also, some numbers to help select the cutoff date:
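The figures themselves were not preserved here; a hypothetical query of roughly this shape could produce them, counting unused anonymous histories per year of last update:
-- rough breakdown to help pick a cutoff date (does not check for references from other tables)
SELECT date_trunc('year', update_time) AS year, count(*) AS unused_histories
FROM history
WHERE user_id IS NULL AND hid_counter = 1
GROUP BY 1
ORDER BY 1;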
Can we instead exclude histories that are referenced in these tables? This is going to be a very small fraction of histories with no datasets.
Just accepting NULLs, or changing a default? (Older Postgres rewrote the table on changing a default.) It might be worth testing this migration on a replica of a big database, just to be sure we document whether the migration will take a long time.
Anything before, like, a month ago should be fine.
I am not able to think of a situation where that wouldn't be safe: the user id is null, so these can only be for anonymous sessions.
Yes, this could run for ages. And folks will want to consider vacuuming after, since it'll leave a lot of dead tuples in the middle of the database. Did you consider batching, rather than nested transactions or multiple small ones? Something like this to pull the oldest empty histories and clean those up; admins could change the batch size to be what they're comfortable with:
SELECT id FROM history
WHERE user_id IS NULL AND hid_counter = 1 AND update_time < '01-01-2015'
ORDER BY update_time ASC
LIMIT 1000;
Might also drop the temp table afterwards, otherwise subsequent runs are going to be an issue.
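The between-batch hygiene suggested here amounts to something like the following sketch (assuming the temp table name from the script above; note VACUUM must run outside a transaction block):
-- drop the temp table so the next batch can recreate it
DROP TABLE IF EXISTS tmp_unused_history;
-- reclaim the dead tuples left behind by the deletes
VACUUM ANALYZE history;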
It doesn't seem like a good idea to add migrations that allow nullable columns where we don't want nullable columns. This seems antithetical to the clean data model that was brought up in the discussion yesterday. 👍 to batches, that's how I tested https://gist.github.com/mvdbeek/fc352f0f21a5eca14b8f7416f34e95e5
IIRC we've done that in the past already.
Thanks for the detailed feedback! Posting the draft here one more time before wrapping it in a script. Edits:
I've considered deleting …
Batching seems to be the way to go, especially given that it gives other admins the flexibility to set the batch size in addition to the cut-off date.
-- update time and size of limit will be configurable
BEGIN TRANSACTION;
-- setup and populate temporary table
CREATE TEMPORARY TABLE tmp_unused_history(id INT PRIMARY KEY) ON COMMIT DROP;
INSERT INTO tmp_unused_history
SELECT id FROM history
WHERE
user_id IS NULL
AND hid_counter = 1
AND update_time < '01-01-2015'
AND id NOT IN (SELECT history_id FROM job)
AND id NOT IN (SELECT history_id FROM history_dataset_association)
AND id NOT IN (SELECT history_id FROM history_dataset_collection_association)
AND id NOT IN (SELECT history_id FROM workflow_invocation)
AND id NOT IN (SELECT history_id FROM job_export_history_archive)
AND id NOT IN (SELECT history_id FROM job_import_history_archive)
LIMIT 10;
-- delete rows we don't need to keep
DELETE FROM event
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM history_audit
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM history_tag_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM history_annotation_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM history_rating_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM history_user_share_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM default_history_permissions
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM data_manager_history_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM cleanup_event_history_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM galaxy_session_to_history
WHERE history_id IN (SELECT id FROM tmp_unused_history);
-- set history id to NULL in rows in galaxy_session
UPDATE galaxy_session
SET current_history_id = NULL
WHERE current_history_id in (SELECT id FROM tmp_unused_history);
-- delete history rows
DELETE FROM history WHERE id IN (SELECT id FROM tmp_unused_history);
COMMIT TRANSACTION;
Yeah, this is starting to look fantastic!
Oh, this will be very slow. Hmm. You can make it … you could stuff the six queries in a temp table via UNION, since they're all the same shape, and do a …
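A sketch of the UNION idea being suggested here, under some assumptions: the temp table name tmp_referenced_history is made up, the cutoff date and table names are taken from the draft above, and a NOT EXISTS anti-join stands in for the truncated "do a …" step.
-- collect every history id referenced from the six tables into one temp table
CREATE TEMPORARY TABLE tmp_referenced_history AS
SELECT history_id FROM job WHERE history_id IS NOT NULL
UNION
SELECT history_id FROM history_dataset_association
UNION
SELECT history_id FROM history_dataset_collection_association
UNION
SELECT history_id FROM workflow_invocation
UNION
SELECT history_id FROM job_export_history_archive
UNION
SELECT history_id FROM job_import_history_archive;
-- then build the candidate list with a single anti-join instead of six NOT IN subqueries
INSERT INTO tmp_unused_history
SELECT h.id FROM history h
WHERE h.user_id IS NULL
AND h.hid_counter = 1
AND h.update_time < '01-01-2015'
AND NOT EXISTS (SELECT 1 FROM tmp_referenced_history r WHERE r.history_id = h.id)
LIMIT 10;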
Yes, and not just that one: hda is just as bad. I don't have a clear solution yet. One approach might be to run deletes skipping over constraint errors (I don't know how to efficiently implement that yet). Another is to add WHERE clauses, which makes building the tmp table with history ids manageable. However, the big issue is that the db is live, and every minute we get new jobs. So, even if we filter out the ids of histories we want to delete that are not referenced from other tables, the next second a job may be created referencing a history from our list, and that will raise an error when trying to delete it. Locking the table is not an option, of course.
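One hedged illustration of the "add WHERE clauses" idea mentioned above: re-check the references inside the final DELETE itself (only two referencing tables are shown; the others would need similar clauses). This narrows the race window but does not close it, since a job committed between the statement's snapshot and the delete can still trigger a foreign-key error.
-- delete only histories that are still unreferenced at delete time;
-- a concurrent insert can still slip in, so the surrounding transaction
-- must be prepared to roll back and retry
DELETE FROM history h
WHERE h.id IN (SELECT id FROM tmp_unused_history)
AND NOT EXISTS (SELECT 1 FROM job j WHERE j.history_id = h.id)
AND NOT EXISTS (SELECT 1 FROM history_dataset_association hda WHERE hda.history_id = h.id);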
@jdavcs yeah, that's a great suggestion. At that point you might even wrap it in a function. What about something like this, where we identify the range of integers that we're going to target, and then do those? (It's not necessary to use the table really, it's just a list of integers, but it reads potentially a bit cleaner, and there's less chance to miss one left/right bound on a WHERE and end up in a strange state.)
CREATE OR REPLACE FUNCTION trim_empty_histories(start integer, count integer) RETURNS void AS $$
-- note: the body of a SQL function runs inside the caller's transaction,
-- so explicit BEGIN/COMMIT statements are omitted here
-- update time and size of limit will be configurable
-- find our target history IDs
CREATE TEMPORARY TABLE tmp_target_unused_history(id INT) ON COMMIT DROP;
INSERT INTO tmp_target_unused_history
SELECT id FROM history
WHERE id > $1 AND id <= $1 + $2;
-- setup and populate temporary table
CREATE TEMPORARY TABLE tmp_unused_history(id INT PRIMARY KEY) ON COMMIT DROP;
INSERT INTO tmp_unused_history
SELECT id FROM history
WHERE
user_id IS NULL
AND hid_counter = 1
AND id in (select id from tmp_target_unused_history)
AND id NOT IN (SELECT history_id FROM job WHERE history_id in (select id from tmp_target_unused_history))
AND id NOT IN (SELECT history_id FROM history_dataset_association WHERE history_id in (select id from tmp_target_unused_history))
AND id NOT IN (SELECT history_id FROM history_dataset_collection_association WHERE history_id in (select id from tmp_target_unused_history))
AND id NOT IN (SELECT history_id FROM workflow_invocation WHERE history_id in (select id from tmp_target_unused_history))
AND id NOT IN (SELECT history_id FROM job_export_history_archive WHERE history_id in (select id from tmp_target_unused_history))
AND id NOT IN (SELECT history_id FROM job_import_history_archive WHERE history_id in (select id from tmp_target_unused_history))
LIMIT 10;
-- delete rows we don't need to keep
DELETE FROM event
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM history_audit
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM history_tag_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM history_annotation_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM history_rating_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM history_user_share_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM default_history_permissions
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM data_manager_history_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM cleanup_event_history_association
WHERE history_id IN (SELECT id FROM tmp_unused_history);
DELETE FROM galaxy_session_to_history
WHERE history_id IN (SELECT id FROM tmp_unused_history);
-- set history id to NULL in rows in galaxy_session
UPDATE galaxy_session
SET current_history_id = NULL
WHERE current_history_id in (SELECT id FROM tmp_unused_history);
-- delete history rows
DELETE FROM history WHERE id IN (SELECT id FROM tmp_unused_history);
$$ LANGUAGE sql;
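For reference, each batch would then be a single call in its own transaction, e.g. as below (the range values are arbitrary; depending on settings, check_function_bodies may need to be turned off while creating a function whose body references temp tables it creates itself):
-- process histories with ids in (0, 100000] as one batch
SELECT trim_empty_histories(0, 100000);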
Good idea about wrapping it in a function, thanks! I think the first tmp table is not necessary. If we want to work on ranges of history IDs, it'll be much faster to use a simple WHERE clause when constructing the tmp_unused_history table. I think the main problem is that the db is live: no matter what table we pre-populate with history ids, a job may be created with a history from that list after we do that, but before we try to delete that particular history. That'll raise an error. I think I'll try something like this to handle this problem:
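The snippet that followed did not survive here; a sketch of the range-bounded filter being described, reusing the $1/$2 parameters and cutoff date from the function draft above, might look like this:
-- build the candidate list directly over the target id range,
-- skipping the intermediate tmp_target_unused_history table
INSERT INTO tmp_unused_history
SELECT id FROM history
WHERE id > $1 AND id <= $1 + $2
AND user_id IS NULL
AND hid_counter = 1
AND update_time < '01-01-2015';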
No, definitely not; it was just to simplify the …
But the jobs we'll be working on are ancient, so surely that's a low risk, right? If we're assuming the admin goes by history ID starting from low numbers.
If this is happening in a transaction, it won't be visible to anything creating jobs, and you'll have the same problem, unless you do that in a separate transaction.
Oh yes, separate transaction for sure. I think we can treat these as separate steps: (1) render unused histories unusable; (2) physical cleanup. So, I think, it's perfectly OK if the first commits but the second rolls back.
Yes, of course! But once we get close enough (within a month?), it's not impossible. These extra steps would be a "due diligence" kind of thing, and they're not expensive.
Not suggesting you don't bother with due diligence, of course; I just wanted to be sure whether writing those measures was worth it vs. simply letting a transaction get rolled back and trying again.
https://gist.github.com/mvdbeek/fc352f0f21a5eca14b8f7416f34e95e5 seems to "work" on usegalaxy.org, but I've rolled it back for now.
@jdavcs mentioned that we need to exclude some additional items: