Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MariaDB to PostgreSQL data migration #60

Closed
amyfromandi opened this issue Jul 17, 2024 · 6 comments
Closed

MariaDB to PostgreSQL data migration #60

amyfromandi opened this issue Jul 17, 2024 · 6 comments
Assignees

Comments

@amyfromandi
Copy link
Collaborator

amyfromandi commented Jul 17, 2024

This issue covers the current progress with the data migration initiative and outlines detailed next steps. Our goal is to eventually migrate all of the data in MariaDB into a PostgreSQL production database, ensuring the entire Macrostrat infrastructure operates off a single server.

Current milestones

  • We have taken Macrostrat production data from MariaDB, normalized it using pre-scripts and post-scripts, and successfully cloned the data into a PostgreSQL database: macrostrat_two, within the schema: macrostrat_temp.
  • This migration is performed within the Macrostrat's development environment.
  • We now have a fully functioning instance of MariaDB data in PostgreSQL, which is a big step towards managing all data in PostgreSQL.

We are currently in the process of updating a v2 API clone (possibly to become v3) by redirecting all endpoints from the MariaDB database to the PostgreSQL macrostrat_two clone. This step is to ensure that we can have a fully functioning API that works smoothly with PostgreSQL data only. We are almost complete with this step, and only have 7 more endpoints to redirect.

Next steps

After all endpoints are repointed, we need to do some in-depth testing to ensure that this API functions identically to production. As of now, we can currently say that the endpoints execute and successfully run when repointed to the PostgreSQL macrostrat_two clone. However, we need to ensure that data is being returned in the same format as production.

Code updates

Migration status reports

MariaDB to macrostrat_two (standalone PostgreSQL database)

Below are some stats that show our progress with cloning the MariaDB data into the PostgreSQL macrostrat_two database. These results (from compare_data_counts()) show the table, row count, and column count comparisons between any two databases. Stats are executed from utils.py in MariaDB Migration to PostgreSQL

  • Data Comparison: The comparison between the MariaDB production database and its clone on the PostgreSQL macrostrat_two database shows that all tables, row counts, and column counts are identical after the migration.
MARIADB (db1) comparison to PG MACROSTRAT_TWO (db2). These should be clones. 

Success! All tables exist in both MariaDB and PG Macrostrat_Two. Checking row counts....

Success! All row counts in all tables are the same in both MariaDB and PG Macrostrat_Two!

Success! All column counts in all tables are the same in both MariaDB and PG Macrostrat_Two!

macrostrat_two to existing Macrostrat database, macrostrat schema in PostgreSQL

We conducted a data comparison between the cloned PostgreSQL macrostrat_two database and the existing PostgreSQL macrostrat database in the Macrostrat development environment. The results indicate discrepancies in tables, row counts, and column counts.

Next Steps

We'll need to decide how to resolve these discrepancies. Potential solutions could be merging the macrostrat_two data into the macrostrat database, or vice versa. We could also maintain the data in separate databases as is. Let me know your thoughts on how to proceed.

PG MACROSTRAT_TWO (db1 maria db clone) comparison to PG MACROSTRAT (db2). This will show what data needs to be moved over from Maria to PG prod.

ERROR: PG Macrostrat_Two tables that are not in PG Macrostrat:
 ['canada_lexicon_dump', 'colors', 'col_areas_6april2016', 'col_equiv', 'col_notes', 'interval_boundaries', 'interval_boundaries_scratch',
'lookup_measurements', 'measuremeta_cols', 'minerals', 'offshore_hole_ages', 'offshore_sections', 'offshore_sites', 'pbdb_matches', 
rockd_features', 'ronov_sediment', 'stats', 'strat_names_lookup', 'structures', 'structure_atts', 'tectonics', 'temp_areas',
'uniquedatafiles2', 'units_datafiles', 'unit_boundaries_backup', 'unit_boundaries_scratch', 'unit_boundaries_scratch_old', 'unit_contacts',
'unit_equiv', 'unit_measures_pbdb', 'unit_seq_strat', 'unit_tectonics', 'offshore_baggage', 'offshore_fossils', 'unit_dates', 'unit_liths_atts',
'unit_notes']

ERROR: PG Macrostrat tables that are not in PG Macrostrat_Two: 
 ['strat_name_footprints', 'grainsize', 'pbdb_collections', 'pbdb_collections_strat_names', 'unit_lith_atts', 'temp_rocks', 'temp_names']

ERROR: Row count differences for 25 tables in both PG Macrostrat_Two and PG Macrostrat databases:
Table Name: (PG Macrostrat_Two Rows, PG Macrostrat Rows)
{'autocomplete': (58361, 58488), 'cols': (5647, 5651), 'col_refs': (5709, 5713), 'intervals': (1510, 1626), 'liths': (200, 207),
'lookup_strat_names': (51211, 51214), 'lookup_units': (49812, 119462), 'lookup_unit_attrs_api': (49811, 119461), 'lookup_unit_intervals':
(49812, 119462), 'measuremeta': (129739, 129744), 'measures': (3418720, 3419018), 'projects': (11, 12), 'refs': (205, 210), 'sections'
(11586, 12713), 'strat_names': (51211, 51214), 'strat_names_meta': (43278, 43281), 'strat_tree': (29452, 29459), 'timescales': (25, 31),
'timescales_intervals': (1830, 1971), 'units_sections': (50386, 119762), 'unit_measures': (105302, 105048), 'unit_strat_names': (28315,
28319), 'units': (50134, 119510), 'unit_environs': (55578, 125140), 'unit_liths': (99062, 173515)}

ERROR: Column count differences for 14 tables in both PG Macrostrat_Two and PG Macrostrat databases:
Table Name: (PG Macrostrat_Two Columns, PG Macrostrat Columns)
{'cols': (13, 15), 'col_groups': (4, 3), 'environs': (6, 5), 'lith_atts': (5, 4), 'lookup_strat_names': (28, 22), 'lookup_unit_intervals': (19, 20),
'measuremeta': (14, 15), 'measures': (13, 12), 'sections': (6, 2), 'strat_names': (9, 5), 'strat_tree': (6, 4), 'unit_strat_names': (4, 3), 'units':
(15, 14), 'unit_environs': (7, 5)}
@davenquinn davenquinn changed the title MariaDB to PostgreSQL Data Migration MariaDB to PostgreSQL data migration Jul 17, 2024
@davenquinn
Copy link
Member

Some small questions @amyfromandi

There seem to be more rows in macrostrat vs. macrostrat_two in almost all cases. This seems surprising. We should expect to see more rows in macrostrat_two in almost all cases.

  • Additionally, with few exceptions, macrostrat should have a strict subset of primary keys that are in macrostrat_two. Perhaps we could add a quick check for that for certain tables?
  • We'll have to focus on what's going on with 'strat_name_footprints', 'grainsize', 'pbdb_collections', 'pbdb_collections_strat_names', and 'unit_lith_atts'. Where do they come from? Are they used? Are they really not present in MariaDB? etc.
  • 'temp_rocks' and 'temp_names' are not important, presumably

@davenquinn
Copy link
Member

Also, one other comment — the v2 API here is meant to stay stable as v2, and not evolve to v3. So ensuring parity of responses from MariaDB to PostgreSQL is critical.

@davenquinn
Copy link
Member

davenquinn commented Jul 24, 2024

We have to look into the following tables to understand how they were migrated into PostgreSQL

  • strat_name_footprints
  • grainsize
  • pbdb_collections
  • pbdb_collections_strat_names
  • unit_lith_atts

The following tables are not important

  • temp_rocks
  • temp_names

@davenquinn
Copy link
Member

davenquinn commented Jul 24, 2024

  • strat_name_footprints is created by a script in the PostgreSQL database, as a sort of look-up table.
  • Seems bad that we're missing grainsize
  • unit_lith_atts is not properly present in MariaDB (maybe this is due to an improper dump). Edit: this has been renamed to unit_liths_atts

@davenquinn
Copy link
Member

davenquinn commented Jul 24, 2024

The following tables are not found in MariaDB but are important to preserve in PostgreSQL. We should copy their contents out of the macrostrat schema before doing a restore.

  • macrostrat.grainsize -> macrostrat_defs.grainsize
  • macrostrat.pbdb_collections -> macrostrat_pbdb.collections
  • macrostrat.pbdb_collections_strat_names -> macrostrat_pbdb.collections_strat_names

The tables should be "mirrored" into the macrostrat schema using view definitions so that APIs that refer to them don't break, e.g.

CREATE VIEW IF NOT EXISTS macrostrat.pbdb_collections AS
SELECT * FROM macrostrat_pbdb.collections;

@amyfromandi
Copy link
Collaborator Author

Some small questions @amyfromandi

There seem to be more rows in macrostrat vs. macrostrat_two in almost all cases. This seems surprising. We should expect to see more rows in macrostrat_two in almost all cases.

  • Additionally, with few exceptions, macrostrat should have a strict subset of primary keys that are in macrostrat_two. Perhaps we could add a quick check for that for certain tables?
  • We'll have to focus on what's going on with 'strat_name_footprints', 'grainsize', 'pbdb_collections', 'pbdb_collections_strat_names', and 'unit_lith_atts'. Where do they come from? Are they used? Are they really not present in MariaDB? etc.
  • 'temp_rocks' and 'temp_names' are not important, presumably

Here are the latest variances after MariaDB was restored into the dev environment.

ERROR: Macrostrat_Two_Maria tables that are not in Macrostrat:
 ['stats', 'strat_names_lookup', 'structures', 'structure_atts', 'tectonics', 'temp_areas', 'uniquedatafiles2', 
'units_datafiles', 'unit_boundaries_backup', 'unit_boundaries_scratch', 'unit_boundaries_scratch_old', 
'unit_contacts', 'unit_dates', 'unit_equiv', 'unit_notes', 'unit_measures_pbdb', 'unit_seq_strat', 
'unit_tectonics', 'unit_liths_atts', 'canada_lexicon_dump', 'colors', 'col_areas_6april2016', 'col_equiv', 
'col_notes', 'interval_boundaries', 'interval_boundaries_scratch', 'lookup_measurements', 'offshore_baggage', 
'measuremeta_cols', 'minerals', 'offshore_fossils', 'offshore_baggage_units', 'offshore_hole_ages', 'offshore_sections', 
'offshore_sites', 'pbdb_intervals', 'pbdb_liths', 'pbdb_matches', 'rockd_features', 'ronov_sediment']
ERROR: Macrostrat tables that are not in Macrostrat_Two_Maria: 
 ['strat_name_footprints', 'grainsize', 'pbdb_collections', 'pbdb_collections_strat_names', 'unit_lith_atts', 
'temp_rocks', 'temp_names']
ERROR: Row count differences for 26 tables in both Macrostrat_Two_Maria and Macrostrat databases:
Table Name: (Macrostrat_Two_Maria Rows, Macrostrat Rows)
{'sections': (12813, 12713), 'strat_names': (51229, 51210), 'strat_names_places': (50321, 50288), 
'strat_tree': (29467, 17519), 'units': (127229, 119508), 'timescales': (37, 31), 'timescales_intervals': (2194, 1970),
 'units_sections': (127481, 119762), 'unit_boundaries': (135708, 53799), 'unit_econs': (3157, 3148), 
'unit_measures': (105307, 105048), 'unit_strat_names': (28322, 28316), 'unit_environs': (134617, 125138),
 'unit_liths': (189907, 173513), 'autocomplete': (58599, 58488), 'cols': (5654, 5651), 'col_areas': (5357, 5351),
 'col_groups': (352, 350), 'col_refs': (5716, 5645), 'intervals': (1715, 1626), 'liths': (212, 207), 
'lookup_strat_names': (51229, 51214), 'lookup_units': (127229, 119462), 'lookup_unit_attrs_api': (127228, 119461), 
lookup_unit_intervals': (127229, 119462), 'refs': (213, 210)}
ERROR: Column count differences for 13 tables in both Macrostrat_Two_Maria and Macrostrat databases:
Table Name: (Macrostrat_Two_Maria Columns, Macrostrat Columns)
{'sections': (6, 2), 'strat_names': (9, 5), 'strat_tree': (6, 4), 'units': (15, 13), 'unit_strat_names': (4, 3),
 'unit_environs': (7, 5), 'cols': (13, 16), 'environs': (6, 5), 'lith_atts': (5, 4), 'lookup_strat_names': (28, 22),
 'lookup_unit_intervals': (19, 20), 'measuremeta': (14, 15), 'measures': (13, 12)}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants