Generate a random sample of rows from a relational database that preserves referential integrity - so long as constraints are defined, all parent rows will exist for child rows.
Good for creating test/development databases from production. It's slow, but how often do you need to generate a test/development database?
Usage:
rdbms-subsetter <source SQLAlchemy connection string> <destination connection string> <fraction of rows to use>
Example:
rdbms-subsetter postgresql://:@/bigdb postgresql://:@/littledb 0.05
Valid SQLAlchemy connection strings are described in the SQLAlchemy documentation on database URLs.
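For instance, a fully spelled-out connection string follows the standard SQLAlchemy URL form; the user, password, host, and database name below are placeholders:
postgresql://myuser:mypassword@dbhost:5432/bigdb
The terse strings in the example above (postgresql://:@/bigdb) simply leave the user, password, and host blank so the driver falls back to its local defaults.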
rdbms-subsetter promises that each child row will have whatever parent rows are required by its foreign keys. It will also try to include most child rows belonging to each parent row (up to the number given by the --children parameter, default 3 each), but it can't make any promises. (Demanding all children can lead to infinite propagation in thoroughly interlinked databases, as every child record demands new parent records, which demand new child records, which demand new parent records... so increase --children with caution.)
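For example, to keep propagation small on a heavily interlinked schema, you could lower the limit to one child per parent (the connection strings here are the placeholder ones from the example above):
rdbms-subsetter --children=1 postgresql://:@/bigdb postgresql://:@/littledb 0.05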
When row counts in your tables vary wildly (tens to billions, for example), consider using the -l flag, which reduces row counts by a logarithmic formula. If f is the fraction specified, -l is set, and the original table has n rows, then each new table's row target will be:
math.pow(10, math.log10(n)*f)
A fraction of 0.5 seems to produce good results, converting 10 rows to 3, 1,000,000 rows to 1,000, and 1,000,000,000 to 31,622.
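As a quick sanity check of that formula (a standalone Python sketch, not part of the tool itself):
import math

def row_target(n, f):
    # Row target for a table of n rows when -l is set and f is the fraction.
    return int(math.pow(10, math.log10(n) * f))

for n in (10, 1000000, 1000000000):
    print(n, "->", row_target(n, 0.5))  # roughly 3, 1000, and 31622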
Rows are selected randomly, but for tables with a single primary key column, you can force rdbms-subsetter to include specific rows (and their dependencies) with force=<tablename>:<primary key value>. The immediate children of these rows are also exempted from the --children limit.
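For example, assuming a table named events with an integer primary key, a specific row (and everything it depends on) can be pulled in like this; the table name and key value are illustrative, and the flag spelling is assumed to be --force:
rdbms-subsetter --force=events:42 postgresql://:@/bigdb postgresql://:@/littledb 0.05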
rdbms-subsetter only performs the INSERTs; it's your responsibility to set up the target database first, with its foreign key constraints. The easiest way to do this is with your RDBMS's dump utility; for example, for PostgreSQL:
pg_dump --schema-only -f schemadump.sql bigdb
createdb littledb
psql -f schemadump.sql littledb
You can pull rows from a non-default schema by passing --schema=<name>. Currently the target database must contain the corresponding tables in a schema of the same name (moving between schemas of different names is not yet supported).
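For example, to subset tables that live in a (hypothetical) schema named sales present in both databases:
rdbms-subsetter --schema=sales postgresql://:@/bigdb postgresql://:@/littledb 0.05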
Installation:
pip install rdbms-subsetter
Then install the DB-API2 module for your RDBMS; for example, for PostgreSQL:
pip install psycopg2
rdbms-subsetter will consume memory roughly equal to the size of the extracted database (not the size of the source database!).
https://github.com/18F/rdbms-subsetter