
Create suite of sanity tests across DBs and types for equivalency #39

Closed
chadlwilson opened this issue Oct 25, 2021 · 1 comment
Labels: size:M (medium items), task (General technical task)

Comments

chadlwilson commented Oct 25, 2021

Context / Goal

Since the tool is designed to help validate migration of data across different schema types and even across (relational) database implementations, and since it works primarily from hashed data representing each row in a dataset, we want a way to validate that it actually works correctly for this purpose.

Since drivers potentially differ in how they handle things such as character encodings, number types, and timestamp/date types, we want to ensure that the hashed data representing a given data type in one database is considered hash-identical to that of another. If it is not, there should either be a simple way (e.g. in SQL) to make the values equivalent, or we should probably change our implementation.

Expected Outcome

  • Modify/refactor the MultiDataSourceConnectivityIntegrationTest from Configure integration test tooling to allow testing across DB types #28 so that, instead of just testing R2DBC connectivity via Micronaut, it runs a suite of simple scenarios focused on real integration testing of type differences
    • It is likely that the hashing implementation in HashedRow is not going to work correctly. An int with value 10 in one DB will likely not be considered equal to a long with value 10 in another DB, and similarly for other types. We will have to make some decisions about how values should be canonicalized, and how configurable that needs to be.
    • Should a string of "10" be considered equal to a bigint of 10?
  • Run simple, but real reconciliations that focus on ensuring that hashed values from one DB of a given type are equivalent to that of a different DB
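To illustrate the type problem described above: a hash built over a value's raw JVM byte representation (a hypothetical sketch, not the actual HashedRow implementation) distinguishes an Integer 10 from a Long 10, even though a migrated dataset would consider them the same value.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.HexFormat;

public class StrictHashDemo {
    // Hash the raw big-endian bytes of a value, as a naive strict strategy might.
    static String sha256(byte[] bytes) throws Exception {
        return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(bytes));
    }

    public static void main(String[] args) throws Exception {
        byte[] asInt = ByteBuffer.allocate(Integer.BYTES).putInt(10).array();  // 4 bytes
        byte[] asLong = ByteBuffer.allocate(Long.BYTES).putLong(10L).array(); // 8 bytes
        // Same logical value, different hashes: an INT column in one DB and a
        // BIGINT column in another would reconcile as "mismatched".
        System.out.println(sha256(asInt).equals(sha256(asLong))); // prints "false"
    }
}
```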

Out of Scope

Additional context / implementation notes

  • At time of writing we are using the Exposed framework in test code to generate schemas for testing with. This may not give us the level of control over data types in the databases that we require, and may need to be re-evaluated.
  • Possibly these tests could run in a matrix style with simple data set queries on each side like SELECT id as MigrationKey, test_type_column FROM testdata, creating a single table with a single test data column per test
    • Dimension 1: DB (mysql, postgres, mssql)
    • Dimension 2: DB Type under test (CHAR/VARCHAR, INT/INTEGER, BIGINT, NUMERIC/DECIMAL/REAL/FLOAT, DATETIME/DATE/TIME/TIMESTAMP etc)
  • Types
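The matrix idea above could be sketched roughly as follows. The DB list comes from the notes; the type list is an illustrative subset, and the query shape follows the SELECT example given, but none of this is the project's actual test harness.

```java
import java.util.ArrayList;
import java.util.List;

public class TypeMatrix {
    // Dimension 1: databases under test (from the notes above).
    static final List<String> DBS = List.of("mysql", "postgres", "mssql");
    // Dimension 2: column types under test (illustrative subset).
    static final List<String> TYPES =
        List.of("VARCHAR(255)", "INTEGER", "BIGINT", "DECIMAL(10,2)", "TIMESTAMP");

    // One query per (db, type) cell, each test using a single-column table.
    static List<String> queries() {
        List<String> out = new ArrayList<>();
        for (String db : DBS) {
            for (String type : TYPES) {
                out.add("[" + db + "] SELECT id AS MigrationKey, test_type_column FROM testdata"
                        + " -- column type: " + type);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 3 DBs x 5 types = 15 test cells in this sketch
        queries().forEach(System.out::println);
    }
}
```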
@chadlwilson chadlwilson added the task General technical task label Oct 25, 2021
@chadlwilson chadlwilson added the size:M medium items label Nov 9, 2021
@chadlwilson chadlwilson self-assigned this Nov 22, 2021
chadlwilson added a commit that referenced this issue Nov 29, 2021
… types

In practice, different drivers return different types at runtime. The previous strategy is strict, i.e. if one driver/DB column returns a 32-bit int and another returns a 64-bit long, they will be considered unequal regardless of their values. This is probably not desirable. This introduces a more lenient hashing strategy, which is most likely to be desirable, especially when comparing across DB types and across schemas which might change types at the DB level for efficiency.
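A lenient strategy along these lines might canonicalize any JVM numeric type to a common decimal form before hashing. This is a sketch of the idea only; the class and method names are assumptions, not the project's actual implementation.

```java
import java.math.BigDecimal;

public class LenientCanonicalizer {
    // Canonicalize any Number to a plain decimal string, so that Integer 10,
    // Long 10 and Double 10.0 all contribute identical bytes to the hash.
    static String canonicalize(Number value) {
        return new BigDecimal(value.toString()).stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(canonicalize(10));   // prints "10"
        System.out.println(canonicalize(10L));  // prints "10"
        System.out.println(canonicalize(10.0)); // prints "10"
    }
}
```

Whether a VARCHAR "10" should also canonicalize to the same value as a BIGINT 10 (the question raised above) is a policy decision this sketch deliberately leaves open.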
chadlwilson added a commit that referenced this issue Nov 29, 2021
chadlwilson added a commit that referenced this issue Nov 29, 2021
chadlwilson added a commit that referenced this issue Nov 29, 2021
Allows us to start running real reconciliations against the two databases
chadlwilson added a commit that referenced this issue Nov 29, 2021
…column

See r2dbc/r2dbc-mssql#235 - this should be removed with r2dbc-mssql `0.8.8` or `0.9`+
chadlwilson added a commit that referenced this issue Nov 29, 2021
…ypes

Rather than running test/dummy connectivity queries with R2DBC, run actual reconciliations here
chadlwilson added a commit that referenced this issue Nov 30, 2021
This is more flexible as we will likely need to define the schema programmatically and dynamically to make the tests super-flexible and easy to debug. Exposed is more of a compile-time thing, so likely not really what we need for this type of testing of data types.
chadlwilson added a commit that referenced this issue Nov 30, 2021
Currently we only test with the types which can be declared with a common type alias across the four databases.
Challenges
* MSSQL/SQL server does not have a BOOL/BOOLEAN type
* BIT type is handled differently across R2DBC drivers. Some return a BitSet or Boolean, some a ByteBuffer
* Comparing floating point numbers doesn't really make sense. They tend to differ and fail the test. Might need to recommend against trying to compare such types, or introduce a "warning" system that will help people debug these.
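One way to smooth over the BIT handling differences above is to normalize whatever runtime type the driver returns to a Boolean before hashing. A hedged sketch under the assumption that drivers return Boolean, BitSet, ByteBuffer or a numeric type; this is not the project's actual code.

```java
import java.nio.ByteBuffer;
import java.util.BitSet;

public class BitNormalizer {
    // Normalize the various runtime types R2DBC drivers use for BIT columns.
    static Boolean normalizeBit(Object driverValue) {
        if (driverValue instanceof Boolean b) return b;
        if (driverValue instanceof BitSet bits) return !bits.isEmpty(); // any bit set => true
        if (driverValue instanceof ByteBuffer buf) {
            ByteBuffer copy = buf.asReadOnlyBuffer();
            while (copy.hasRemaining()) {
                if (copy.get() != 0) return true; // any non-zero byte => true
            }
            return false;
        }
        if (driverValue instanceof Number n) return n.longValue() != 0L;
        throw new IllegalArgumentException("Unexpected BIT representation: " + driverValue.getClass());
    }

    public static void main(String[] args) {
        System.out.println(normalizeBit(Boolean.TRUE));                   // prints "true"
        System.out.println(normalizeBit(ByteBuffer.wrap(new byte[]{1}))); // prints "true"
        System.out.println(normalizeBit(BitSet.valueOf(new byte[]{0})));  // prints "false"
    }
}
```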
chadlwilson added a commit that referenced this issue Nov 30, 2021
This avoids them hanging around longer than currently needed
chadlwilson added a commit that referenced this issue Nov 30, 2021
Use of Exposed was not really helping that much. For now, easier and more consistent to use Flyway everywhere, and move to something
else that allows SQL generation later if necessary. One less thing for people to learn for now.
@chadlwilson (Collaborator, Author) commented
This is done now.

As part of this

Things observed and still to consider

  • BOOLEAN or BIT types can still hash unequally across DBs due to driver and DB type differences (comes back as ByteBuffer on some drivers when using BIT type, and Boolean on others). This seems a hard problem in general. Unless we hash everything as raw bytes and add to the hash "minimal bytes to represent the number of type x" instead of the current approach, we will have these issues.
  • It's generally nonsense to try to hash and compare REAL/DECIMAL etc., just as it is nonsense to compare them for equality in regular code. While the code can hash them, maybe it shouldn't, or should warn you, because the hashes are unlikely to match due to precision differing slightly inside the DB, or differing slightly when the values are represented as Java types in memory.
  • Might be good to add ability to compare DB1(value A of type X) with DB2(value A of type Y)
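The "minimal bytes to represent the number" idea in the first bullet can be sketched with BigInteger, whose two's-complement encoding already drops redundant sign-extension bytes, so an INT and a BIGINT holding the same value yield identical bytes for hashing. A sketch of the idea only, not a committed design.

```java
import java.math.BigInteger;
import java.util.Arrays;

public class MinimalBytes {
    // BigInteger.toByteArray() returns the minimal two's-complement encoding,
    // independent of whether the source column was 32-bit or 64-bit.
    static byte[] minimalBytes(long value) {
        return BigInteger.valueOf(value).toByteArray();
    }

    public static void main(String[] args) {
        int asInt = 10;
        long asLong = 10L;
        System.out.println(Arrays.equals(minimalBytes(asInt), minimalBytes(asLong))); // prints "true"
        System.out.println(Arrays.toString(minimalBytes(10L)));                       // prints "[10]"
    }
}
```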
