perf: optimize pg query for fk #146

@aokiji aokiji commented Apr 9, 2024

Summary

The query that retrieves foreign keys for a schema joins several information schema views, which is less efficient than querying the underlying catalog tables directly.

Set up the environment

We will create a PostgreSQL 12 database and populate it with 100 tables, each with a foreign key constraint.

Using the following init.sql script:

-- init.sql

CREATE TABLE IF NOT EXISTS public.client_types (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255)
);

-- create multiple tables with foreign keys in public
DO $$
DECLARE
    table_prefix TEXT := 'client_catalog_';
    number_of_tables_to_create INT := 100;
BEGIN

    -- create tables
    FOR i IN 1..number_of_tables_to_create LOOP
        EXECUTE format('
            CREATE TABLE IF NOT EXISTS public.%I ( 
                id SERIAL PRIMARY KEY,
                name VARCHAR(255),
                type_id INTEGER NOT NULL,
                FOREIGN KEY(type_id) REFERENCES client_types(id)
            )' , table_prefix || i::text);

    END LOOP;
END $$;

First, start a Postgres server that initializes itself from init.sql:

$ docker run --rm -p 15432:5432 -e POSTGRES_USER=example_user -e POSTGRES_PASSWORD=example_password -e POSTGRES_DB=example_db -v $PWD/init.sql:/docker-entrypoint-initdb.d/init.sql:ro postgres:12-alpine 

Proposal

Let's first test that both the current query and the new proposal produce the same result.

The current query joins several information schema views, and its filters cannot trim rows before the joins happen. This is the current query:

-- current_query.sql
	select kcu.CONSTRAINT_NAME,
       kcu.TABLE_NAME,
       kcu.COLUMN_NAME,
       rel_kcu.TABLE_NAME,
       rel_kcu.COLUMN_NAME
	from INFORMATION_SCHEMA.TABLE_CONSTRAINTS tco
			 join INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu
				  on tco.CONSTRAINT_SCHEMA = kcu.CONSTRAINT_SCHEMA
					  and tco.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME
			 join INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS rco
				  on tco.CONSTRAINT_SCHEMA = rco.CONSTRAINT_SCHEMA
					  and tco.CONSTRAINT_NAME = rco.CONSTRAINT_NAME
			 join INFORMATION_SCHEMA.KEY_COLUMN_USAGE rel_kcu
				  on rco.UNIQUE_CONSTRAINT_SCHEMA = rel_kcu.CONSTRAINT_SCHEMA
					  and rco.UNIQUE_CONSTRAINT_NAME = rel_kcu.CONSTRAINT_NAME
					  and kcu.ORDINAL_POSITION = rel_kcu.ORDINAL_POSITION
	where tco.CONSTRAINT_TYPE = 'FOREIGN KEY'
	  and tco.CONSTRAINT_SCHEMA = 'public'
	order by kcu.CONSTRAINT_NAME,
			 kcu.ORDINAL_POSITION

The optimized query reads the catalog tables directly, so rows are filtered before they are joined:

-- new_query.sql
	SELECT fk.conname AS constraint_name,
	    c1.relname AS table_name,
	    a1.attname AS column_name,
	    c2.relname AS foreign_table_name,
	    a2.attname AS foreign_column_name
	FROM pg_catalog.pg_constraint fk
	    JOIN pg_catalog.pg_class c1 ON c1.oid = fk.conrelid
	    JOIN pg_catalog.pg_attribute a1 ON a1.attrelid = c1.oid
		AND a1.attnum = ANY (fk.conkey)
	    JOIN pg_catalog.pg_class c2 ON c2.oid = fk.confrelid
	    JOIN pg_catalog.pg_attribute a2 ON a2.attrelid = c2.oid
		AND a2.attnum = ANY (fk.confkey)
	WHERE fk.contype = 'f'
	    AND fk.connamespace = 'public'::regnamespace::oid
	    AND (pg_has_role(c1.relowner, 'USAGE')
		OR has_table_privilege(c1.oid, 'INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER')
		OR has_any_column_privilege(c1.oid, 'INSERT, UPDATE, REFERENCES'))
	    AND (pg_has_role(c2.relowner, 'USAGE')
		OR has_table_privilege(c2.oid, 'INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER')
		OR has_any_column_privilege(c2.oid, 'INSERT, UPDATE, REFERENCES'))
	ORDER BY constraint_name, a1.attnum;
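One case the test data does not cover is a multi-column foreign key. As far as I can tell, the two `= ANY (...)` joins above match every referencing column against every referenced column, so a two-column key would yield four rows, whereas the current query pairs columns by ORDINAL_POSITION. The following is only a sketch of one way to pair them, using PostgreSQL's multi-argument `unnest ... WITH ORDINALITY`; the file name `new_query_composite.sql` and the alias `k` are assumptions made for illustration, not part of this PR.

```shell
# Write a hypothetical variant of new_query.sql that pairs conkey[i] with
# confkey[i] instead of matching each array independently.
cat > new_query_composite.sql <<'SQL'
-- new_query_composite.sql (illustrative sketch, not part of the PR)
SELECT fk.conname AS constraint_name,
    c1.relname AS table_name,
    a1.attname AS column_name,
    c2.relname AS foreign_table_name,
    a2.attname AS foreign_column_name
FROM pg_catalog.pg_constraint fk
    JOIN pg_catalog.pg_class c1 ON c1.oid = fk.conrelid
    JOIN pg_catalog.pg_class c2 ON c2.oid = fk.confrelid
    -- one row per key position: (referencing attnum, referenced attnum, position)
    CROSS JOIN LATERAL unnest(fk.conkey, fk.confkey)
        WITH ORDINALITY AS k(conkey, confkey, ord)
    JOIN pg_catalog.pg_attribute a1 ON a1.attrelid = c1.oid
        AND a1.attnum = k.conkey
    JOIN pg_catalog.pg_attribute a2 ON a2.attrelid = c2.oid
        AND a2.attnum = k.confkey
WHERE fk.contype = 'f'
    AND fk.connamespace = 'public'::regnamespace::oid
    AND (pg_has_role(c1.relowner, 'USAGE')
        OR has_table_privilege(c1.oid, 'INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER')
        OR has_any_column_privilege(c1.oid, 'INSERT, UPDATE, REFERENCES'))
    AND (pg_has_role(c2.relowner, 'USAGE')
        OR has_table_privilege(c2.oid, 'INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER')
        OR has_any_column_privilege(c2.oid, 'INSERT, UPDATE, REFERENCES'))
ORDER BY constraint_name, k.ord;
SQL
```

For the single-column keys created by init.sql this variant should return the same rows as new_query.sql; it has not been benchmarked here.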

We can compare the results with the following commands:

$ psql postgres://example_user:example_password@localhost:15432/example_db -t -A < current_query.sql > current_query_results
$ psql postgres://example_user:example_password@localhost:15432/example_db -t -A < new_query.sql > new_query_results
$ diff -s current_query_results new_query_results
Files current_query_results and new_query_results are identical
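`diff -s` compares the files line by line, so it also depends on row order. The two queries use different secondary sort keys (ORDINAL_POSITION versus attnum), so on other schemas the order could differ even when the rows match. A minimal sketch of an order-insensitive check follows; the helper name `same_rows` is hypothetical.

```shell
# Hypothetical helper: succeed when two result files contain the same rows,
# regardless of the order in which the rows appear.
same_rows() {
    # Sort each file into a temporary copy, then compare the copies byte for byte.
    a=$(mktemp) && b=$(mktemp) || return 2
    sort "$1" > "$a"
    sort "$2" > "$b"
    cmp -s "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

It could be used as `same_rows current_query_results new_query_results && echo "same rows"`.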

Now that we know that both produce the same result, let's check the performance

Performance analysis

First, the performance for the current query:

$ hyperfine 'psql postgres://example_user:example_password@localhost:15432/example_db -t -A < current_query.sql'
Benchmark 1: psql postgres://example_user:example_password@localhost:15432/example_db -t -A < current_query.sql
  Time (mean ± σ):     689.1 ms ±  19.0 ms    [User: 30.8 ms, System: 10.1 ms]
  Range (min … max):   650.6 ms … 724.4 ms    10 runs

And the result for the new query:

$ hyperfine 'psql postgres://example_user:example_password@localhost:15432/example_db -t -A < new_query.sql'
Benchmark 1: psql postgres://example_user:example_password@localhost:15432/example_db -t -A < new_query.sql
  Time (mean ± σ):      54.7 ms ±  13.1 ms    [User: 32.7 ms, System: 10.1 ms]
  Range (min … max):    41.1 ms …  85.5 ms    57 runs

Do note that if we create the database with 200 tables instead of 100, the results change greatly:

$ hyperfine 'psql postgres://example_user:example_password@localhost:15432/example_db -t -A < current_query.sql'
Benchmark 1: psql postgres://example_user:example_password@localhost:15432/example_db -t -A < current_query.sql
  Time (mean ± σ):      4.641 s ±  0.246 s    [User: 0.032 s, System: 0.009 s]
  Range (min … max):    4.406 s …  4.977 s    10 runs

$ hyperfine 'psql postgres://example_user:example_password@localhost:15432/example_db -t -A < new_query.sql'
Benchmark 1: psql postgres://example_user:example_password@localhost:15432/example_db -t -A < new_query.sql
  Time (mean ± σ):      61.8 ms ±  13.7 ms    [User: 37.4 ms, System: 9.2 ms]
  Range (min … max):    45.6 ms …  86.4 ms    34 runs

There is a big jump in the current query's run time (689.1 ms to 4.641 s), but not in the new query's (54.7 ms to 61.8 ms).
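To put the measurements in relative terms, the ratios can be computed from the mean times reported by hyperfine. This is a small sketch; the helper name `ratio` is hypothetical, and the inputs are the means above converted to milliseconds.

```shell
# ratio A B: print A / B rounded to one decimal place.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Mean times in milliseconds, copied from the hyperfine output.
echo "speedup at 100 tables: $(ratio 689.1 54.7)x"               # 12.6x
echo "speedup at 200 tables: $(ratio 4641 61.8)x"                # 75.1x
echo "current query, 100 -> 200 tables: $(ratio 4641 689.1)x"    # 6.7x
echo "new query, 100 -> 200 tables: $(ratio 61.8 54.7)x"         # 1.1x
```

Doubling the number of tables makes the current query about 6.7 times slower, while the new query slows down by about 1.1 times, which is within the spread of its own runs.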
