Datalineage is not correct using joins in gold layer(sparklistener) #11772

lks-lks · 2024-10-31T20:14:28Z

Describe the bug
A clear and concise description of what the bug is.

Using spark listener, when are performed a JOIN in gold layer the data lineage become 1:*

It's a little complex to explain but let's go.

Imagine the follow scenario:
is there a script in your gold layer like this:

from pyspark.sql.functions import expr
from spark.camadas import silverTempViews, writeGold, writeOracleDW
from spark.session import session

source = "db-name" # database_name
silverTempViews(source, [
'clientes','wifi_business','clientes_novo' #table_names
])

query = """
SELECT
CLIENTES.ID_CLIENTE AS ID_CLIENTE_ID,
CLIENTES.CPF_CNPJ AS CPF_CNPJ_CPF,
NOV.ID_BASE AS IDBASE_NOVO,
CLIENTES.NOME AS NOME_CLIENTE_NOME,
NEGOCIOS.ID_COD_PLANOPK AS COD_PK,
NEGOCIOS.ID_COD_SOLICITACAO AS ID_COD_SOL_WIFI,
CASE
WHEN CLIENTES.IS_PESSOA_FISICA = 1 THEN 'PF'
WHEN CLIENTES.ID_RAMO_ATIVIDADE = 4 THEN 'GOVERNAMENTAL'
WHEN CLIENTES.IS_PESSOA_FISICA = 0 THEN 'PJ'
ELSE ''
END AS CLASSIFICACAO_CLIENTE_CLASS,
CASE
WHEN CLIENTES.SITUACAO = '0' THEN 'ATIVO'
WHEN CLIENTES.SITUACAO = '1' THEN 'ARQUIVADO'
WHEN CLIENTES.SITUACAO = '2' THEN 'BLOQUEADO'
WHEN CLIENTES.SITUACAO = '3' THEN 'PENDENTE'
WHEN CLIENTES.SITUACAO = '4' THEN 'MANUTENÇÃO'
WHEN CLIENTES.SITUACAO = '5' THEN 'BLOQUADO (PROTESTADO)'
WHEN CLIENTES.SITUACAO = '6' THEN 'INVIABILIDADE'
WHEN CLIENTES.SITUACAO = '7' THEN 'PROSPECTO'
ELSE CLIENTES.SITUACAO
END AS SITUACAO_CLIENTE_SIT

    FROM CLIENTES

LEFT JOIN WIFI_BUSINESS NEGOCIOS
    ON NEGOCIOS.ID_COD_CLIENTE = CLIENTES.ID_CLIENTE
LEFT JOIN CLIENTES_NOVO NOV
    ON CLIENTES.ID_CLIENTE = NOV.ID_CLIENTE

"""

__session = session.get()
df = __session.sql(query)

indicador = "gold_teste_datahub"

writeOracleDW(df, indicador)

Note:

This sql reflects to the image about the issue.

Lets focus on the real part of problem:

LEFT JOIN WIFI_BUSINESS NEGOCIOS
    ON NEGOCIOS.ID_COD_CLIENTE = CLIENTES.ID_CLIENTE
LEFT JOIN CLIENTES_NOVO NOV
    ON CLIENTES.ID_CLIENTE = NOV.ID_CLIENTE

In this step we perform a left join usin the keys "ID_CLIENTE, ID_COD_CLIENTE,ID_CLIENTE"

After the join we can use the table fields to "call" the field we need and perfor the tranformation like this.

    CLIENTES.ID_CLIENTE                                         					        AS ID_CLIENTE_ID,
    CLIENTES.CPF_CNPJ                                           					        AS CPF_CNPJ_CPF,
    NOV.ID_BASE                                                                                                  AS IDBASE_NOVO,
    CLIENTES.NOME                                                 					        AS NOME_CLIENTE_NOME,
    NEGOCIOS.ID_COD_PLANOPK                                                                       AS COD_PK,
    NEGOCIOS.ID_COD_SOLICITACAO                                                                 AS ID_COD_SOL_WIFI,

So we expect who in the data lineage the field ID_CLIENTE(from table cliente silver_layer ) become ID_CLIENTE_ID ( gold_layer)

But observes the follow image:

The expectation is ONLY ONE LINE from ID_CLIENTE to ID_CLIENTE_ID

Like the IS_PESSOA_FISICA to CLASSIFICACAO_CLIENTE_CLASS
FIG 2:

Is there a way to fix this?

For knowledge this is the spark config:

    datahub_configs = {
        'spark.extraListeners': 'datahub.spark.DatahubSparkListener',
        'spark.datahub.rest.server': f"http://{__DATAHUB_SERVER_IP}:{__DATAHUB_SERVER_PORT}",
        'spark.datahub.flow_name': app,
        'spark.datahub.metadata.include_scheme' : 'false',
        'spark.datahub.metadata.dataset.materialize' : 'true',
        'spark.datahub.metadata.dataset.experimental_include_schema_metadata' :'true',
        'spark.datahub.lineage.captureColumnLevel' : 'true',
        'spark.datahub.coalesce_jobs' : 'true',
        'spark.datahub.file_partition_regexp' : '.*_delta_log.*', 
    }

Jar version:

Datahub cli version:

To Reproduce
Steps to reproduce the behavior:

Have a script in gold layer
Perform a join in script
After running the task (sparklistener - airflow) verify the datalineage from the tables using the field in the JOIN
Note the lines from datalineage become 1:*
Now A---A,A---B,A---C
Expected A---A,B---B,C---C

Expected behavior
Its expect a datalineage using JOIN but 1:1 like FIG 2.
Joins do not become datalineage ambiguous or datalineage 1:*.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

NAME="Oracle Linux Server"
VERSION="9.4"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="9.4"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Oracle Linux Server 9.4"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:9:4:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://github.com/oracle/oracle-linux"

ORACLE_BUGZILLA_PRODUCT="Oracle Linux 9"
ORACLE_BUGZILLA_PRODUCT_VERSION=9.4
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=9.4

Additional context

A full data lineage is expected using join and 1:1

The text was updated successfully, but these errors were encountered:

lks-lks added the bug Bug report label Oct 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Datalineage is not correct using joins in gold layer(sparklistener) #11772

Datalineage is not correct using joins in gold layer(sparklistener) #11772

lks-lks commented Oct 31, 2024

Datalineage is not correct using joins in gold layer(sparklistener) #11772

Datalineage is not correct using joins in gold layer(sparklistener) #11772

Comments

lks-lks commented Oct 31, 2024