Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider rewriting insertions using COPY #45

Open
albireox opened this issue Mar 16, 2021 · 0 comments
Open

Consider rewriting insertions using COPY #45

albireox opened this issue Mar 16, 2021 · 0 comments
Assignees

Comments

@albireox
Copy link
Member

albireox commented Mar 16, 2021

An alternative to creating a temporary table and then loading it into catalog and catalog_to_X is to COPY the query to a temporary CSV file and then COPY it back to both tables.

This probably would require create the temporary file with all the necessary columns and then select which ones to actualy copy depending on the destination table.

I think this would be useful for phases 1 and 3, probably not for 2.

To test this, we can compare it with some CREATE TABLE AS + INSERT. Using enable_hashjoin=false, this query

CREATE TEMPORARY TABLE IF NOT EXISTS "6cbada9d" AS SELECT DISTINCT ON ("t1"."object_id") "t1"."object_id" AS "target_id", "t2"."catalogid", (first_value("t1"."object_id") OVER (PARTITION BY "t2"."catalogid" ORDER BY "t1"."object_id" ASC) = "t1"."object_id") AS "best" FROM "catalogdb"."skymapper_dr2" AS "t1" INNER JOIN "catalogdb"."tic_v8" AS "t3" ON ("t1"."gaia_dr2_id1" = "t3"."gaia_int") INNER JOIN "catalogdb"."catalog_to_tic_v8" AS "t2" ON ("t2"."target_id" = "t3"."id") WHERE ((("t2"."version_id" = 25) AND ("t2"."best" IS true)) AND NOT EXISTS(SELECT 1 FROM "catalogdb"."catalog_to_skymapper_dr2" AS "t4" WHERE (("t4"."version_id" = 25) AND (("t4"."target_id" = "t1"."object_id") OR ("t4"."catalogid" = "t2"."catalogid")))))

takes ~4:30h while

INSERT INTO "catalogdb"."catalog_to_skymapper_dr2" ("target_id", "catalogid", "version_id", "best") SELECT "t1"."target_id", "t1"."catalogid", 25, "t1"."best" FROM "6cbada9d" AS "t1"

takes 2.5h. We can replace it with a

COPY (SELECT DISTINCT ON ("t1"."object_id") "t1"."object_id" AS "target_id", "t2"."catalogid", (first_value("t1"."object_id") OVER (PARTITION BY "t2"."catalogid" ORDER BY "t1"."object_id" ASC) = "t1"."object_id") AS "best" FROM "catalogdb"."skymapper_dr2" AS "t1" INNER JOIN "catalogdb"."tic_v8" AS "t3" ON ("t1"."gaia_dr2_id1" = "t3"."gaia_int") INNER JOIN "catalogdb"."catalog_to_tic_v8" AS "t2" ON ("t2"."target_id" = "t3"."id") WHERE ((("t2"."version_id" = 25) AND ("t2"."best" IS true))) TO '<tempfile>'

and then COPY to an empty table. Note that we need to remove the NOT EXISTS because if we repeat the query all the entries will already be inserted.

@albireox albireox self-assigned this Mar 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant