Comparison set of tools #6918

Open
simonaubertbd opened this issue Oct 16, 2024 · 0 comments
Hello,

Something we do frequently is comparing two tables (e.g., one in dev, one in prod, to make sure there is no regression).

What's your use case?
1/ I'm developing a pipeline in TEST and I want to compare its output to PROD. There may be some expected differences, but overall the data should be the same.

2/ I'm migrating my server. I want to compare the whole old repository database to the new one.

3/ I have a spec with field names and types, and I want to be sure I didn't forget any fields and that they are correctly typed.

Objectives:
-accurate and actionable
-fast
-works for any comparison: within the same database, but also across files or separate databases
-ability to deal with data quality issues: duplicates, last-decimal rounding, etc.

There are some existing tools, such as SAS PROC COMPARE or the Alteryx Expect Equal tool, but they are less detailed.

What's your proposed solution?

-two modes: dataframe versus batch

*dataframe: two inputs, two dataframes.
*batch: one input with several rows; each row describes one pair to compare.
Detailed below.

===========> probably 2 different tools, but they can share some common code

-two main algorithms (both sketched in code after this section):

*IUGF (Identify/Union/Group-by/Filter)
Step 1: union of the two tables/dataframes with a field identifying the origin (the origin field name must be configurable). We use an integer/byte with a 0/1 value.
[screenshot: union with origin flag]

Step 2: group by on every field (except the origin field), with min/max/count aggregation on the origin field (count is optional)

Step 3: filter to keep the rows where min = max (a row present in both sources has min = 0 and max = 1, so min = max means it exists in only one source) and, to catch duplicates, the rows where count > 2

We then have all the anomalies:
[screenshot: anomaly rows]

*a mix of hashing each row (with a warning label about hash collisions) + IUGF on the hashes

"join" algo is excluded for different reasons :
• Requires a key for join, and you don't always have a key
• bad detection of duplicate
• not easy to spot the difference
• risk of doing a cartesian product if not controlled.

====> This algorithm can be applied at most levels.
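A minimal pandas sketch of the two algorithms, assuming two in-memory dataframes with identical columns; the function names and the default `origin` column are illustrative, not a proposed API:

```python
import pandas as pd

def iugf_compare(df_a: pd.DataFrame, df_b: pd.DataFrame,
                 origin_col: str = "origin") -> pd.DataFrame:
    # Step 1: union with an origin flag (0 = first source, 1 = second)
    stacked = pd.concat([df_a.assign(**{origin_col: 0}),
                         df_b.assign(**{origin_col: 1})],
                        ignore_index=True)
    # Step 2: group by every field except origin; min/max/count on origin
    value_cols = [c for c in stacked.columns if c != origin_col]
    agg = (stacked.groupby(value_cols, dropna=False)[origin_col]
                  .agg(["min", "max", "count"])
                  .reset_index())
    # Step 3: keep the anomalies -- rows present in only one source
    # (min == max) or duplicated within a source (count > 2)
    return agg[(agg["min"] == agg["max"]) | (agg["count"] > 2)]

def hashed_iugf_compare(df_a: pd.DataFrame, df_b: pd.DataFrame) -> pd.DataFrame:
    # Hash variant: reduce each row to one hash, then run IUGF on the hashes.
    # pd.util.hash_pandas_object is not collision-free, hence the warning
    # label suggested above.
    ha = pd.DataFrame({"row_hash": pd.util.hash_pandas_object(df_a, index=False)})
    hb = pd.DataFrame({"row_hash": pd.util.hash_pandas_object(df_b, index=False)})
    return iugf_compare(ha, hb)
```

Note that neither variant needs a join key, consistent with the exclusion of "join" algorithms above.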

-ability to optionally provide keys

Providing keys allows deeper analysis:
for batch : in the batch format
for dataframe : on the tool

-several levels of comparison

The user can choose several. Since a comparison takes time and you don't always need the full detail, this is a good way to adapt.
*table/query/view or file level (batch only): in batch mode, it checks that the tables or the files are the same
*partition level (batch only, in-db only)
*metadata level: field name, optionally type
*table/file aggregation level: checks that some statistics are the same (min/max/count/count distinct of each field)
*row level (example above)
*field level (if keys are specified) => by pivoting on the keys + casting all fields to string/text/varchar; see the sketch after this list. Two outputs: one with the detail and one with an aggregation (a count of rows and a count distinct of the keys concerned)
==> chosen on the tool
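A possible sketch of the field-level comparison, reusing `iugf_compare` from above; the melt-to-long approach and the summary column names are one interpretation of the pivot described in the field-level item:

```python
import pandas as pd

def field_level_diff(df_a: pd.DataFrame, df_b: pd.DataFrame, keys: list):
    # Pivot each table to one row per (keys, field) pair, with every
    # value cast to string so all types compare uniformly.
    def to_long(df):
        m = df.melt(id_vars=keys, var_name="field", value_name="value")
        m["value"] = m["value"].astype(str)
        return m

    # Detailed output: one row per differing (keys, field, value)
    detail = iugf_compare(to_long(df_a), to_long(df_b))

    # Aggregated output: row count and distinct keys per differing field
    # (distinct count taken on the first key, as an approximation)
    summary = (detail.groupby("field")
                     .agg(rows=("value", "size"),
                          distinct_keys=(keys[0], "nunique"))
                     .reset_index())
    return detail, summary
```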

-ability to select AND exclude fields

for batch : in the batch format
for dataframe : on the tool

-ability to force a number of decimals for float and double (see the sketch below)

for batch : in the batch format
for dataframe : on the tool
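For example, the rounding option could simply normalize every float column before the comparison runs (a sketch; the decimal count is a parameter, 2 here):

```python
import pandas as pd

def round_floats(df: pd.DataFrame, decimals: int = 2) -> pd.DataFrame:
    # Force all float columns to a fixed number of decimals so that
    # last-decimal noise is not reported as a difference.
    out = df.copy()
    float_cols = out.select_dtypes(include="float").columns
    out[float_cols] = out[float_cols].round(decimals)
    return out
```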

-ability to force some formulas

(especially useful in batch mode; less useful in dataframe mode, where it can be done before the comparison)

batch format (to be completed)

path (a file or a folder) or db connection+table data 1|where clause 1|path (a file or a folder) or db connection+table data 2|where clause 2|fields to select (* accepted)|fields to exclude in 1|fields to exclude in 2|rounding decimal (optional)|some formula for 1 (in json/xml..)|some formula for 2|key fields (optional)

On the batch tool, the user can map their "batch dataframe" to the expected format.
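To illustrate, a hypothetical batch row matching the column layout above (every path, connection name, where clause, and formula here is an invented placeholder):

```
/data/prod/customers.parquet|year=2024|dwh_test.customers|year=2024|*|updated_at|updated_at,load_id|2|{"amount":"round(amount,2)"}|{"amount":"round(amount,2)"}|customer_id
```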

### Output
There are several outputs, one per level of comparison.

In dataframe mode => one dataframe per output.
In batch mode => it takes the "path for result" and writes the detailed results into it. It can be a folder path (we would write one file per level of comparison),
+ a dataframe taking the input dataframe and adding one column per control with a status: OK, KO, or N/A (if not computed); see the example below.
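For instance, the batch status dataframe could look like this (hypothetical tables and a subset of the controls):

```
table 1        |table 2        |metadata|aggregation|rows|fields
prod.customers |test.customers |OK      |OK         |KO  |N/A
prod.orders    |test.orders    |OK      |KO         |KO  |KO
```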

some examples of output with an Alteryx draft in dataframe mode

rows difference:
[screenshot: row-level differences]

metadata differences (note: the origin flag should be 0, not true)
[screenshot: metadata differences]

field detailed difference (the colA field is the key)
[screenshot: field-level detail]

the field aggregated difference (there should also be a count of rows)
[screenshot: field-level aggregation]

some other examples from a Qlik Sense draft in batch mode
[screenshot: batch mode draft]

Are there any alternative solutions?
Trying to reproduce this manually with existing tools, but it's not easy.

Best regards

Simon
