Comparison set of tools #6918

Open
simonaubertbd opened this issue Oct 16, 2024 · 0 comments
Hello,

Something we do frequently is comparing two tables (e.g., one in dev, one in prod, to make sure there is no regression).

What's your use case?
1/ I'm developing a pipeline in TEST and I want to compare its output to PROD. There may be some expected differences, but overall the data should be the same.

2/ I'm migrating my server. I want to compare the whole old repository database to the new one.

3/ I have a spec with field names and types, and I want to be sure I didn't forget any fields and that they are correctly typed.

Objectives:
-accurate and actionable
-fast
-works for any comparison: within the same database, but also across files or separate databases
-ability to deal with data quality issues: duplicates, last-decimal rounding, etc.

There are some existing tools, such as SAS PROC COMPARE or the Alteryx Expect Equal tool, but they are less detailed.

What's your proposed solution?

-two modes: dataframe versus batch

*dataframe: two inputs, two dataframes.
*batch: one input with several rows; each row describes one pair to compare.
Detailed below.

===========> probably 2 different tools, but they can share some common code

-two main algorithms (both sketched in code after this section):

*IUGF (Identify/Union/Group-by/Filter)
Step 1: union of the two tables/dataframes with a field identifying the origin (the origin field name must be configurable). We use an integer/byte with a 0/1 value.
[screenshot: union with origin flag]

Step 2: group by on every field (except the origin field), with min/max/count aggregation on the origin field (count is optional)

Step 3: filter to keep the rows where min = max (a row present in both sources has min = 0 and max = 1, so min = max means it exists in only one source) and, to catch duplicates, the rows where count > 2

We then have all the anomalies:
[screenshot: anomaly rows]

*a mix of hashing each row (with a warning label about hash collisions) + IUGF on the hashes

"join" algo is excluded for different reasons :
• Requires a key for join, and you don't always have a key
• bad detection of duplicate
• not easy to spot the difference
• risk of doing a cartesian product if not controlled.

====> This algorithm can be applied at most levels.
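A minimal pandas sketch of the two algorithms, assuming two in-memory dataframes with identical columns; the function names and the default `origin` column are illustrative, not a proposed API:

```python
import pandas as pd

def iugf_compare(df_a: pd.DataFrame, df_b: pd.DataFrame,
                 origin_col: str = "origin") -> pd.DataFrame:
    # Step 1: union with an origin flag (0 = first source, 1 = second)
    stacked = pd.concat([df_a.assign(**{origin_col: 0}),
                         df_b.assign(**{origin_col: 1})],
                        ignore_index=True)
    # Step 2: group by every field except origin; min/max/count on origin
    value_cols = [c for c in stacked.columns if c != origin_col]
    agg = (stacked.groupby(value_cols, dropna=False)[origin_col]
                  .agg(["min", "max", "count"])
                  .reset_index())
    # Step 3: keep the anomalies -- rows present in only one source
    # (min == max) or duplicated within a source (count > 2)
    return agg[(agg["min"] == agg["max"]) | (agg["count"] > 2)]

def hashed_iugf_compare(df_a: pd.DataFrame, df_b: pd.DataFrame) -> pd.DataFrame:
    # Hash variant: reduce each row to one hash, then run IUGF on the hashes.
    # pd.util.hash_pandas_object is not collision-free, hence the warning
    # label suggested above.
    ha = pd.DataFrame({"row_hash": pd.util.hash_pandas_object(df_a, index=False)})
    hb = pd.DataFrame({"row_hash": pd.util.hash_pandas_object(df_b, index=False)})
    return iugf_compare(ha, hb)
```

Note that neither variant needs a join key, consistent with the exclusion of "join" algorithms above.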

-ability to optionally provide keys

Providing keys allows deeper analysis:
for batch : in the batch format
for dataframe : on the tool

-several levels of comparison

The user can choose several. Since a comparison takes time and you don't always need the full detail, this is a good way to adapt.
*table/query/view or file level (batch only): in batch mode, it checks that the tables or the files are the same
*partition level (batch only, in-db only)
*metadata level: field name, optionally type
*table/file aggregation level: checks that some statistics are the same (min/max/count/count distinct of each field)
*row level (example above)
*field level (if keys are specified) => by pivoting on the keys + casting all fields to string/text/varchar; see the sketch after this list. Two outputs: one with the detail and one with an aggregation (a count of rows and a count distinct of the keys concerned)
==> chosen on the tool
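A possible sketch of the field-level comparison, reusing `iugf_compare` from above; the melt-to-long approach and the summary column names are one interpretation of the pivot described in the field-level item:

```python
import pandas as pd

def field_level_diff(df_a: pd.DataFrame, df_b: pd.DataFrame, keys: list):
    # Pivot each table to one row per (keys, field) pair, with every
    # value cast to string so all types compare uniformly.
    def to_long(df):
        m = df.melt(id_vars=keys, var_name="field", value_name="value")
        m["value"] = m["value"].astype(str)
        return m

    # Detailed output: one row per differing (keys, field, value)
    detail = iugf_compare(to_long(df_a), to_long(df_b))

    # Aggregated output: row count and distinct keys per differing field
    # (distinct count taken on the first key, as an approximation)
    summary = (detail.groupby("field")
                     .agg(rows=("value", "size"),
                          distinct_keys=(keys[0], "nunique"))
                     .reset_index())
    return detail, summary
```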

-ability to select AND exclude fields

for batch : in the batch format
for dataframe : on the tool

-ability to force a number of decimals for float and double (see the sketch below)

for batch : in the batch format
for dataframe : on the tool
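For example, the rounding option could simply normalize every float column before the comparison runs (a sketch; the decimal count is a parameter, 2 here):

```python
import pandas as pd

def round_floats(df: pd.DataFrame, decimals: int = 2) -> pd.DataFrame:
    # Force all float columns to a fixed number of decimals so that
    # last-decimal noise is not reported as a difference.
    out = df.copy()
    float_cols = out.select_dtypes(include="float").columns
    out[float_cols] = out[float_cols].round(decimals)
    return out
```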

-ability to force some formulas

(especially useful in batch mode; less useful in dataframe mode, where it can be done before the comparison)

batch format (to be completed)

path (a file or a folder) or db connection+table data 1|where clause 1|path (a file or a folder) or db connection+table data 2|where clause 2|fields to select (* accepted)|fields to exclude in 1|fields to exclude in 2|rounding decimal (optional)|some formula for 1 (in json/xml..)|some formula for 2|key fields (optional)

On the batch tool, the user can map their "batch dataframe" to the expected format.
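To illustrate, a hypothetical batch row matching the column layout above (every path, connection name, where clause, and formula here is an invented placeholder):

```
/data/prod/customers.parquet|year=2024|dwh_test.customers|year=2024|*|updated_at|updated_at,load_id|2|{"amount":"round(amount,2)"}|{"amount":"round(amount,2)"}|customer_id
```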

### Output
There are several outputs, one per level of comparison.

In dataframe mode => one dataframe per output.
In batch mode => it takes the "path for result" and writes the detailed results into it. It can be a folder path (we would write one file per level of comparison),
+ a dataframe taking the input dataframe and adding one column per control with a status: OK, KO, or N/A (if not computed); see the example below.
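For instance, the batch status dataframe could look like this (hypothetical tables and a subset of the controls):

```
table 1        |table 2        |metadata|aggregation|rows|fields
prod.customers |test.customers |OK      |OK         |KO  |N/A
prod.orders    |test.orders    |OK      |KO         |KO  |KO
```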

some examples of output with an Alteryx draft in dataframe mode

rows difference:
[screenshot: row-level differences]

metadata differences (note: the origin flag should be 0, not true)
[screenshot: metadata differences]

field detailed difference (the colA field is the key)
[screenshot: field-level detail]

the field aggregated difference (there should also be a count of rows)
[screenshot: field-level aggregation]

some other examples from a Qlik Sense draft in batch mode
[screenshot: batch mode draft]

Are there any alternative solutions?
Trying to reproduce this manually with existing tools, but it's not easy.

Best regards

Simon
