
Create suite of sanity tests across DBs and types for equivalency #39

Closed
chadlwilson opened this issue Oct 25, 2021 · 1 comment
Labels: size:M (medium items), task (General technical task)

Comments

chadlwilson commented Oct 25, 2021

Context / Goal

Since the tool is designed to help validate migration of data across different schema types and even across (relational) database implementations, and since it works primarily from hashed data representing each row in a dataset, we want a way to validate that it actually works correctly for this purpose.

Since drivers potentially differ in how they handle things such as character encodings, number types, and timestamp/date types, we want to ensure that the hashed data representing a given data type in one database is considered hash-identical to that of another. If it is not, there should either be a simple way (e.g. in SQL) to make the values equivalent, or we should probably change our implementation.

Expected Outcome

  • Modify/refactor the MultiDataSourceConnectivityIntegrationTest from Configure integration test tooling to allow testing across DB types #28 so that, instead of just testing R2DBC connectivity via Micronaut, it runs a suite of simple scenarios focused on real integration testing of type differences
    • It is likely that the hashing implementation in HashedRow is not going to work correctly. An int with value 10 in one DB will likely not be considered equal to a long with value 10 in another DB, and similarly for other types. We will have to make some decisions about how values should be canonicalized, and how configurable that needs to be.
    • Should a string of "10" be considered equal to a bigint of 10?
  • Run simple, but real reconciliations that focus on ensuring that hashed values from one DB of a given type are equivalent to that of a different DB
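To illustrate the type problem described above: a hash built over a value's raw JVM byte representation (a hypothetical sketch, not the actual HashedRow implementation) distinguishes an Integer 10 from a Long 10, even though a migrated dataset would consider them the same value.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.HexFormat;

public class StrictHashDemo {
    // Hash the raw big-endian bytes of a value, as a naive strict strategy might.
    static String sha256(byte[] bytes) throws Exception {
        return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(bytes));
    }

    public static void main(String[] args) throws Exception {
        byte[] asInt = ByteBuffer.allocate(Integer.BYTES).putInt(10).array();  // 4 bytes
        byte[] asLong = ByteBuffer.allocate(Long.BYTES).putLong(10L).array(); // 8 bytes
        // Same logical value, different hashes: an INT column in one DB and a
        // BIGINT column in another would reconcile as "mismatched".
        System.out.println(sha256(asInt).equals(sha256(asLong))); // prints "false"
    }
}
```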

Out of Scope

Additional context / implementation notes

  • At time of writing we are using the Exposed framework in test code to generate schemas for testing with. This may not give us the level of control over data types in the databases that we require, and may need to be re-evaluated.
  • Possibly these tests could run in a matrix style with simple data set queries on each side like SELECT id as MigrationKey, test_type_column FROM testdata, creating a single table with a single test data column per test
    • Dimension 1: DB (mysql, postgres, mssql)
    • Dimension 2: DB Type under test (CHAR/VARCHAR, INT/INTEGER, BIGINT, NUMERIC/DECIMAL/REAL/FLOAT, DATETIME/DATE/TIME/TIMESTAMP etc)
  • Types
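The matrix idea above could be sketched roughly as follows. The DB list comes from the notes; the type list is an illustrative subset, and the query shape follows the SELECT example given, but none of this is the project's actual test harness.

```java
import java.util.ArrayList;
import java.util.List;

public class TypeMatrix {
    // Dimension 1: databases under test (from the notes above).
    static final List<String> DBS = List.of("mysql", "postgres", "mssql");
    // Dimension 2: column types under test (illustrative subset).
    static final List<String> TYPES =
        List.of("VARCHAR(255)", "INTEGER", "BIGINT", "DECIMAL(10,2)", "TIMESTAMP");

    // One query per (db, type) cell, each test using a single-column table.
    static List<String> queries() {
        List<String> out = new ArrayList<>();
        for (String db : DBS) {
            for (String type : TYPES) {
                out.add("[" + db + "] SELECT id AS MigrationKey, test_type_column FROM testdata"
                        + " -- column type: " + type);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 3 DBs x 5 types = 15 test cells in this sketch
        queries().forEach(System.out::println);
    }
}
```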
@chadlwilson chadlwilson added the task General technical task label Oct 25, 2021
@chadlwilson chadlwilson added the size:M medium items label Nov 9, 2021
@chadlwilson chadlwilson self-assigned this Nov 22, 2021
chadlwilson added a commit that referenced this issue Nov 29, 2021
… types

In practice, different drivers return different types at runtime. The previous strategy is strict, i.e. if one driver/DB column returns a 32-bit int and another returns a 64-bit long, they will be considered unequal regardless of their values. This is probably not desirable. This introduces a more lenient hashing strategy, which is most likely to be desirable, especially when comparing across DB types and across schemas which might change types at the DB level for efficiency.
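A lenient strategy along these lines might canonicalize any JVM numeric type to a common decimal form before hashing. This is a sketch of the idea only; the class and method names are assumptions, not the project's actual implementation.

```java
import java.math.BigDecimal;

public class LenientCanonicalizer {
    // Canonicalize any Number to a plain decimal string, so that Integer 10,
    // Long 10 and Double 10.0 all contribute identical bytes to the hash.
    static String canonicalize(Number value) {
        return new BigDecimal(value.toString()).stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(canonicalize(10));   // prints "10"
        System.out.println(canonicalize(10L));  // prints "10"
        System.out.println(canonicalize(10.0)); // prints "10"
    }
}
```

Whether a VARCHAR "10" should also canonicalize to the same value as a BIGINT 10 (the question raised above) is a policy decision this sketch deliberately leaves open.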
chadlwilson added a commit that referenced this issue Nov 29, 2021
chadlwilson added a commit that referenced this issue Nov 29, 2021
chadlwilson added a commit that referenced this issue Nov 29, 2021
Allows us to start running real reconciliations against the two databases
chadlwilson added a commit that referenced this issue Nov 29, 2021
…column

See r2dbc/r2dbc-mssql#235 - this should be removed with r2dbc-mssql `0.8.8` or `0.9`+
chadlwilson added a commit that referenced this issue Nov 29, 2021
…ypes

Rather than running test/dummy connectivity queries with R2DBC, run actual reconciliations here
chadlwilson added a commit that referenced this issue Nov 30, 2021
This is more flexible as we will likely need to define the schema programmatically and dynamically to make the tests super-flexible and easy to debug. Exposed is more of a compile-time thing, so likely not really what we need for this type of testing of data types.
chadlwilson added a commit that referenced this issue Nov 30, 2021
Currently we only test with the types which can be declared with a common type alias across the four databases.
Challenges
* MSSQL/SQL server does not have a BOOL/BOOLEAN type
* BIT type is handled differently across R2DBC drivers. Some return a BitSet or Boolean, some a ByteBuffer
* Comparing floating point numbers doesn't really make sense. They tend to differ and fail the test. Might need to recommend against trying to compare such types, or introduce a "warning" system that will help people debug these.
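One way to smooth over the BIT handling differences above is to normalize whatever runtime type the driver returns to a Boolean before hashing. A hedged sketch under the assumption that drivers return Boolean, BitSet, ByteBuffer or a numeric type; this is not the project's actual code.

```java
import java.nio.ByteBuffer;
import java.util.BitSet;

public class BitNormalizer {
    // Normalize the various runtime types R2DBC drivers use for BIT columns.
    static Boolean normalizeBit(Object driverValue) {
        if (driverValue instanceof Boolean b) return b;
        if (driverValue instanceof BitSet bits) return !bits.isEmpty(); // any bit set => true
        if (driverValue instanceof ByteBuffer buf) {
            ByteBuffer copy = buf.asReadOnlyBuffer();
            while (copy.hasRemaining()) {
                if (copy.get() != 0) return true; // any non-zero byte => true
            }
            return false;
        }
        if (driverValue instanceof Number n) return n.longValue() != 0L;
        throw new IllegalArgumentException("Unexpected BIT representation: " + driverValue.getClass());
    }

    public static void main(String[] args) {
        System.out.println(normalizeBit(Boolean.TRUE));                   // prints "true"
        System.out.println(normalizeBit(ByteBuffer.wrap(new byte[]{1}))); // prints "true"
        System.out.println(normalizeBit(BitSet.valueOf(new byte[]{0})));  // prints "false"
    }
}
```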
chadlwilson added a commit that referenced this issue Nov 30, 2021
This avoids them hanging around longer than currently needed
chadlwilson added a commit that referenced this issue Nov 30, 2021
Use of Exposed was not really helping that much. For now, easier and more consistent to use Flyway everywhere, and move to something
else that allows SQL generation later if necessary. One less thing for people to learn for now.
@chadlwilson (Collaborator, Author) commented
This is done now.

As part of this

Things observed and still to consider

  • BOOLEAN or BIT types can still hash unequally across DBs due to driver and DB type differences (comes back as ByteBuffer on some drivers when using BIT type, and Boolean on others). This seems a hard problem in general. Unless we hash everything as raw bytes and add to the hash "minimal bytes to represent the number of type x" instead of the current approach, we will have these issues.
  • It's generally nonsense to try to hash and compare REAL/DECIMAL etc., just as it is nonsense to compare them for equality in regular code. While the code can hash them, maybe it shouldn't, or should warn you, because the hashes are unlikely to match due to precision differing slightly inside the DB, or differing slightly when the values are represented as Java types in memory.
  • Might be good to add ability to compare DB1(value A of type X) with DB2(value A of type Y)
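The "minimal bytes to represent the number" idea in the first bullet can be sketched with BigInteger, whose two's-complement encoding already drops redundant sign-extension bytes, so an INT and a BIGINT holding the same value yield identical bytes for hashing. A sketch of the idea only, not a committed design.

```java
import java.math.BigInteger;
import java.util.Arrays;

public class MinimalBytes {
    // BigInteger.toByteArray() returns the minimal two's-complement encoding,
    // independent of whether the source column was 32-bit or 64-bit.
    static byte[] minimalBytes(long value) {
        return BigInteger.valueOf(value).toByteArray();
    }

    public static void main(String[] args) {
        int asInt = 10;
        long asLong = 10L;
        System.out.println(Arrays.equals(minimalBytes(asInt), minimalBytes(asLong))); // prints "true"
        System.out.println(Arrays.toString(minimalBytes(10L)));                       // prints "[10]"
    }
}
```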
