
Fail fast or warn user if queries have mismatched column numbers #56

Open

chadlwilson opened this issue Nov 9, 2021 · 4 comments

Labels: enhancement (New feature or request), size:XS (Extra small items)

Comments

@chadlwilson (Collaborator) commented Nov 9, 2021

Context / Goal

If the two queries expressed in a dataset have different numbers of columns, they cannot possibly produce matches when doing a hash-based comparison.

We should ideally fail fast, or at the very least warn the user clearly in the results somehow.

Currently we do not do any query parsing at startup, as this is entirely delegated to the runtime drivers for the relevant databases. We also do not want to introduce a startup connectivity dependency on a given datasource.

Expected Outcome

Evaluate and implement an approach to addressing this.

  • is there a simple way to do basic parsing of SQL queries at startup to at least count the number of columns?
    • if people get fancy with database specific things like pivots and collations (not sure why) this would likely break down
  • easiest solution (but least performant) is to just run the whole rec and check at the point of calculating/storing metadata. Warnings could then be inferred and attached to the run returned from the API, e.g. something like the below (or at top level)
    "summary": {
        "bothMatched": 0,
        "bothMismatched": 3,
        "sourceOnly": 0,
        "sourceTotal": 3,
        "targetOnly": 0,
        "targetTotal": 3,
        "total": 3,
        "warnings": [ "source query has different number of columns to target query. This is guaranteed to produce 100% mismatches." ]
    },
  • next easiest would be to abort the rec after checking the first row of the target dataset. This would have implications for #41 (Interleave persistence of source + target rows), however, and makes things a bit more complex, as the target parsing needs to understand something about the source parsing.
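The "check at metadata time" idea above can be sketched as follows. This is a minimal illustration, not Recce's actual API: the function and field names are hypothetical, modelled on the summary shape shown above.

```python
# Hypothetical sketch: after both datasets have been loaded, compare the
# recorded column counts and attach a warning to the run summary.
# Names are illustrative, not the project's real API.

def build_summary(source_cols, target_cols, counts):
    """Build a run summary dict, appending a warning on column-count mismatch."""
    summary = dict(counts)
    warnings = []
    if source_cols != target_cols:
        warnings.append(
            f"source query has {source_cols} columns but target query has "
            f"{target_cols}. This is guaranteed to produce 100% mismatches."
        )
    summary["warnings"] = warnings
    return summary

summary = build_summary(
    source_cols=4,
    target_cols=3,
    counts={"bothMatched": 0, "bothMismatched": 3, "sourceOnly": 0,
            "sourceTotal": 3, "targetOnly": 0, "targetTotal": 3, "total": 3},
)
```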

Out of Scope

Additional context / implementation notes

@jiawen-tw (Contributor)

To decide which option we would like to go with:

Option 1: Basic parsing of SQL query

  • There are SQL parser libraries like https://github.com/JSQLParser/JSqlParser (though use of PIVOT would still fail)
  • Pros
    • Fastest failure; no need to establish any connection to the source and target databases before failing
  • Cons
    • Support for various vendor-specific syntax is dependent on what is supported by the parser library
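To make the trade-off concrete, here is a deliberately naive sketch of "basic parsing": counting top-level commas in the select list while ignoring commas nested inside parentheses. This is my own toy illustration, not a real SQL parser, and it is exactly the kind of approach that breaks down on vendor-specific syntax (PIVOT, aliases containing "from", subqueries), which is why a library like JSqlParser would be needed in practice.

```python
# Naive column counter: find the SELECT ... FROM span and count commas at
# parenthesis depth 0. Assumes a single, simple, top-level SELECT and FROM;
# this sketch exists only to show both the feasibility and the fragility
# of startup-time query parsing.

def count_select_columns(sql: str) -> int:
    lowered = sql.lower()
    start = lowered.index("select") + len("select")
    end = lowered.index(" from ", start)  # fragile: first " from " wins
    select_list = sql[start:end]
    depth, columns = 0, 1
    for ch in select_list:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            columns += 1  # only count commas separating top-level columns
    return columns

count_select_columns("SELECT id, name, COALESCE(a, b) FROM users")  # 3 columns
```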

Option 2: Run whole reconciliation run

  • After running the whole reconciliation run, compare the number of columns in metadata and append the mismatch as a warning
  • Pros
    • Easy to implement
  • Cons
    • The entire run has to complete before it can fail, which can waste a lot of time for large datasets

Option 3: After completing the run on source database, do the comparison with first row of target database

  • Finish all the steps on the source database (including saving of hashed data)
  • Pros
    • Earlier failure than option 2
  • Cons
    • Later failure than option 1
    • Hashing and saving unused data for the source database

Option 4: Get the first element from both sources for validation, then all elements for reconciliation

  • Get the first element from both sources to compare their metadata, and continue with the reconciliation if the number of columns matches.
  • Pros
    • Fastest failure
    • No saving of hashed data to database yet
  • Cons
    • It might require a blocking step in order to retrieve the first element from both sources for comparison (is there a better way?)
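Option 4 can be sketched with plain iterators standing in for the reactive row streams (in Reactor this would be something in the spirit of `switchOnFirst`): peek at the first row from each source, compare column counts, and only then continue consuming. The names here are hypothetical, not the project's.

```python
# Sketch of Option 4: compare the first row from each source before doing
# any hashing or persistence. Plain iterators stand in for reactive streams.

from itertools import chain

def check_and_stream(source_rows, target_rows):
    source_rows, target_rows = iter(source_rows), iter(target_rows)
    first_source = next(source_rows, None)
    first_target = next(target_rows, None)
    if (first_source is not None and first_target is not None
            and len(first_source) != len(first_target)):
        raise ValueError(
            f"source has {len(first_source)} columns, "
            f"target has {len(first_target)}"
        )
    # Re-attach the peeked rows so the full reconciliation sees every row.
    all_source = chain([] if first_source is None else [first_source], source_rows)
    all_target = chain([] if first_target is None else [first_target], target_rows)
    return list(all_source), list(all_target)
```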

Option 5: Build a separate API for validation of the query

  • User can run the validation API to ensure their queries are correct before inserting them into the application config
  • Pros:
    • Fast feedback on query’s correctness
  • Cons:
    • Need to build a separate API
    • Up to the user to check and ensure their query is valid before running the reconciliation run

@tommi-lew (Contributor) commented Mar 24, 2022

Sharing my thoughts here:

  • Options 4 and 5 appear to be more elegant solutions compared to Options 2 and 3. Inclined to explore Options 1, 4 and 5 further.

  • Do we want to strike out Option 1, or is it still worth evaluating? It appears to be a good solution if vendor-specific syntax does not cause much complexity.

  • Option 5 seems like a good candidate too. We could add a flag to the DB to indicate that the validation has executed, and during the recon run, verify the state of the validation: if the validation was not executed, do not continue; if it did execute, continue to recon. With this, issue #41 (Interleave persistence of source + target rows) should no longer be a concern.

  • Option 4 still looks worth evaluating, but I suspect I'm missing something critical given my lack of understanding of reactive programming. There are also the complexities with issue #41 (Interleave persistence of source + target rows) as mentioned.
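The "validation flag" idea from the comment above can be gated very simply. A minimal sketch, assuming hypothetical names; a `set` stands in for a DB table recording which datasets have passed validation.

```python
# Sketch of the Option 5 flag: record that a dataset's queries were
# validated, and refuse to start a reconciliation run until the flag is set.
# The in-memory set stands in for persistent DB state; names are illustrative.

validated_datasets = set()

def mark_validated(dataset_id: str) -> None:
    # Would be written by the separate validation API on success.
    validated_datasets.add(dataset_id)

def start_run(dataset_id: str) -> str:
    if dataset_id not in validated_datasets:
        raise RuntimeError(f"dataset '{dataset_id}' has not been validated yet")
    return f"run started for {dataset_id}"
```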

@tommi-lew commented Apr 11, 2022

Suggestion

  • Building on top of Option 4, add a condition to limit the query to 1 row, as suggested by @ThaoDang

@tommi-lew commented Apr 14, 2022

How databases supported by R2DBC specify the number of records to return:

  1. Google Cloud Spanner - LIMIT. Refer to Google Standard SQL.
  2. Oracle - no LIMIT keyword.
    1. Pre-version 12c: ROWNUM together with WHERE
    2. From version 12c onwards: Row Limiting Clause (FETCH FIRST … ROWS ONLY)
  3. H2 - LIMIT
  4. MariaDB - LIMIT
  5. MS SQL - no LIMIT keyword; uses SELECT TOP
  6. PostgreSQL - LIMIT
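The vendor differences above could be handled with a small dialect mapping. A rough sketch with simplified clause placement (naive string manipulation, so real queries would need proper handling); dialect names are illustrative.

```python
# Illustrative dialect mapping: given a base SELECT, produce a one-row
# variant per database family, reflecting the list above. Clause placement
# is simplified for the sketch.

def limit_to_one_row(sql: str, dialect: str) -> str:
    if dialect in ("postgresql", "h2", "mariadb", "spanner"):
        return f"{sql} LIMIT 1"
    if dialect == "mssql":
        # SELECT TOP goes right after SELECT, not at the end of the query.
        return sql.replace("SELECT", "SELECT TOP 1", 1)
    if dialect == "oracle":
        # Oracle 12c+ row-limiting clause; pre-12c needs ROWNUM in a WHERE.
        return f"{sql} FETCH FIRST 1 ROWS ONLY"
    raise ValueError(f"unknown dialect: {dialect}")

limit_to_one_row("SELECT id FROM t", "mssql")  # 'SELECT TOP 1 id FROM t'
```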

