Something we do frequently is comparing two tables (for example one in dev and one in prod, to make sure there is no regression).
What's your use case?
1/ I'm developing a pipeline in TEST and I want to compare its output to PROD. There may be some expected differences, but overall the results should be the same.
2/ I'm migrating my server. I want to compare the whole old repository database to the new one.
3/ I have some specs with field names and types, and I want to be sure I didn't forget any fields and that they are correctly typed.
Objectives:
- accurate and actionable
- fast
- works for any comparison: on the same database, but also with files or across separate databases
- ability to deal with data quality issues: duplicates, last-decimal rounding, etc.
There are some existing tools, such as SAS PROC COMPARE or the Alteryx Expect Equal tool, but they are less detailed.
What's your proposed solution?
- two modes: dataframe versus batch
* dataframe: two inputs, two dataframes.
* batch: one input with several rows, each row describing one pair to compare (detailed below).
===========> probably 2 different tools, but they can share some common code
- two main algorithms:
* IUGF (Identify/Union/Group-by/Filter)
Step 1: union of the two tables/dataframes, with an added field identifying the origin (the name of this field must be configurable). We use an integer/byte with 0/1 values.
Step 2: group by on every field (except the origin field), with min/max/count aggregation on the origin field (count is optional).
Step 3: filter where min = max (and, for duplicates, where count > 2).
We then have all the anomalies; see the sketch after this list.
* a variant mixing a hash of each row (with a warning label about hash collisions) with IUGF applied to the hashed fields.
"join" algo is excluded for different reasons :
• Requires a key for join, and you don't always have a key
• bad detection of duplicate
• not easy to spot the difference
• risk of doing a cartesian product if not controlled.
====> This algo can been applied on most level.
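A minimal pandas sketch of IUGF, assuming two dataframes with identical columns (the function name and the `origin` column name are illustrative, not a fixed API):

```python
import pandas as pd

def iugf_diff(df1: pd.DataFrame, df2: pd.DataFrame,
              origin_col: str = "origin") -> pd.DataFrame:
    """Compare two dataframes with Identify/Union/Group-by/Filter."""
    # Step 1: union of both inputs with an origin flag (0 = first, 1 = second)
    both = pd.concat([df1.assign(**{origin_col: 0}),
                      df2.assign(**{origin_col: 1})],
                     ignore_index=True)

    # Step 2: group by every field except the origin flag,
    # aggregating min/max/count of the flag
    fields = [c for c in both.columns if c != origin_col]
    agg = (both.groupby(fields, dropna=False)[origin_col]
               .agg(["min", "max", "count"])
               .reset_index())

    # Step 3: keep the anomalies -- rows present on one side only
    # (min == max) or duplicated rows (count > 2)
    return agg[(agg["min"] == agg["max"]) | (agg["count"] > 2)]
```

The hash variant would apply the same three steps to a single per-row hash column (for example via `pd.util.hash_pandas_object`) instead of all the fields, trading collision risk for speed on wide tables.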
- ability to optionally provide keys
Providing keys allows some deeper analysis.
for batch: in the batch format
for dataframe: in the tool
- several levels of comparison
The user can choose several. Since comparisons take time and you don't always need the detail, it's a good way to adapt the cost to the need.
* table/query/view or file level (batch only): in batch mode, it checks that the tables or the files are the same
* partition level (batch only, in-db only)
* metadata level: field names and, optionally, types
* table/file aggregation level: checks that some statistics are the same (min/max/count/count distinct of each field); see the sketch after this list
* row level (the IUGF example above)
* field level (if keys are specified) => by pivoting on the keys and casting all fields to string/text/varchar. Two outputs: one for the detail and one with an aggregation (a count of rows and a count distinct of the keys concerned); see the sketch after this list
==> chosen in the tool
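Hedged sketches for the aggregation level and the field level, in pandas; all names (`field_stats`, `field_diff`, the key handling) are illustrative assumptions, not a settled design:

```python
import pandas as pd

def field_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregation level: per-field statistics to compare between sides
    (assumes every column holds comparable values)."""
    return pd.DataFrame({
        "min": df.min(),
        "max": df.max(),
        "count": df.count(),
        "count_distinct": df.nunique(),
    })

def field_diff(df1: pd.DataFrame, df2: pd.DataFrame, keys: list[str]):
    """Field level: pivot on the keys, cast all values to string,
    and keep only the cells that differ (assumes keys are unique per row)."""
    def unpivot(df, side):
        m = df.melt(id_vars=keys, var_name="field", value_name=side)
        m[side] = m[side].astype(str)
        return m

    merged = unpivot(df1, "value_1").merge(
        unpivot(df2, "value_2"), on=keys + ["field"], how="outer")
    detail = merged[merged["value_1"] != merged["value_2"]]

    # aggregated output: rows and distinct key values concerned, per field
    # (first key used as a proxy when there are several)
    summary = detail.groupby("field").agg(
        rows=("field", "size"),
        keys_concerned=(keys[0], "nunique"))
    return detail, summary
```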
- ability to select AND exclude fields
for batch: in the batch format
for dataframe: in the tool
- ability to force a number of decimals for floats and doubles (see the sketch below)
for batch: in the batch format
for dataframe: in the tool
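A minimal sketch of the decimal forcing, assuming `df` is one of the inputs and the precision comes from the user's configuration:

```python
# Round every float column to a fixed number of decimals before comparing,
# so last-decimal noise is not reported as a difference.
# PRECISION is a hypothetical user setting; 2 is only an example.
PRECISION = 2
float_cols = df.select_dtypes(include="float").columns
df[float_cols] = df[float_cols].round(PRECISION)
```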
- ability to force some formulas
(especially useful in batch mode; less useful in dataframe mode, where it can be done before the comparison)
batch format (to be completed; an example row follows the field list):
path (a file or a folder) or db connection+table data 1|where clause 1|path (a file or a folder) or db connection+table data 2|where clause 2|fields to select (* accepted)|fields to exclude in 1|fields to exclude in 2|rounding decimal (optional)|some formula for 1 (in json/xml..)|some formula for 2|key fields (optional)
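A hypothetical row in that format (every value below is invented for illustration):

```
/data/old/sales.parquet|year=2023|PRODDB.sales|year=2023|*|tmp_id|tmp_id|2|{"amount":"round(amount,2)"}|{}|customer_id,order_id
```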
In the batch tool, the user can map their "batch dataframe" to this expected format.
### Output
There are several outputs, one per level of comparison.
In dataframe mode => one dataframe per output.
In batch mode => it takes the "path for result" and writes the detailed results there. It can be a folder path (we would write one file per level of comparison).
+ a dataframe taking the input batch dataframe and adding one column per control with a status: OK, KO, or N/A (if not computed); a sketch follows.
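A sketch of that status dataframe, assuming the batch input is itself a dataframe (the column and pair names are illustrative):

```python
from typing import Optional

import pandas as pd

def status(ok: Optional[bool]) -> str:
    """Translate a control result into the OK / KO / N/A convention."""
    if ok is None:
        return "N/A"  # control not computed for this pair
    return "OK" if ok else "KO"

# hypothetical batch input: one row per pair of tables to compare
batch = pd.DataFrame({"pair_id": ["sales", "customers", "orders"]})
# hypothetical results of the metadata control for each pair
batch["metadata_check"] = [status(r) for r in (True, False, None)]
print(batch)
```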
Some examples of output from an Alteryx draft in dataframe mode (screenshots):
- row differences [screenshot]
- metadata differences (note: the origin flag should be 0, not true) [screenshot]
- field detailed differences (the colA field is the key) [screenshot]
- field aggregated differences (there should also be a count of rows) [screenshot]
Some other examples from a Qlik Sense draft in batch mode (screenshots).
Are there any alternative solutions?
Trying to reproduce this manually with existing tools, but it's not easy.
Best regards
Simon