Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datalineage is not correct using joins in gold layer(sparklistener) #11772

Open
lks-lks opened this issue Oct 31, 2024 · 0 comments
Open

Datalineage is not correct using joins in gold layer(sparklistener) #11772

lks-lks opened this issue Oct 31, 2024 · 0 comments
Labels
bug Bug report

Comments

@lks-lks
Copy link

lks-lks commented Oct 31, 2024

Describe the bug
A clear and concise description of what the bug is.

Using spark listener, when are performed a JOIN in gold layer the data lineage become 1:*

It's a little complex to explain but let's go.

Imagine the follow scenario:
is there a script in your gold layer like this:

from pyspark.sql.functions import expr
from spark.camadas import silverTempViews, writeGold, writeOracleDW
from spark.session import session

source = "db-name" # database_name
silverTempViews(source, [
'clientes','wifi_business','clientes_novo' #table_names
])

query = """
SELECT
CLIENTES.ID_CLIENTE AS ID_CLIENTE_ID,
CLIENTES.CPF_CNPJ AS CPF_CNPJ_CPF,
NOV.ID_BASE AS IDBASE_NOVO,
CLIENTES.NOME AS NOME_CLIENTE_NOME,
NEGOCIOS.ID_COD_PLANOPK AS COD_PK,
NEGOCIOS.ID_COD_SOLICITACAO AS ID_COD_SOL_WIFI,
CASE
WHEN CLIENTES.IS_PESSOA_FISICA = 1 THEN 'PF'
WHEN CLIENTES.ID_RAMO_ATIVIDADE = 4 THEN 'GOVERNAMENTAL'
WHEN CLIENTES.IS_PESSOA_FISICA = 0 THEN 'PJ'
ELSE ''
END AS CLASSIFICACAO_CLIENTE_CLASS,
CASE
WHEN CLIENTES.SITUACAO = '0' THEN 'ATIVO'
WHEN CLIENTES.SITUACAO = '1' THEN 'ARQUIVADO'
WHEN CLIENTES.SITUACAO = '2' THEN 'BLOQUEADO'
WHEN CLIENTES.SITUACAO = '3' THEN 'PENDENTE'
WHEN CLIENTES.SITUACAO = '4' THEN 'MANUTENÇÃO'
WHEN CLIENTES.SITUACAO = '5' THEN 'BLOQUADO (PROTESTADO)'
WHEN CLIENTES.SITUACAO = '6' THEN 'INVIABILIDADE'
WHEN CLIENTES.SITUACAO = '7' THEN 'PROSPECTO'
ELSE CLIENTES.SITUACAO
END AS SITUACAO_CLIENTE_SIT

    FROM CLIENTES

LEFT JOIN WIFI_BUSINESS NEGOCIOS
    ON NEGOCIOS.ID_COD_CLIENTE = CLIENTES.ID_CLIENTE
LEFT JOIN CLIENTES_NOVO NOV
    ON CLIENTES.ID_CLIENTE = NOV.ID_CLIENTE

"""

__session = session.get()
df = __session.sql(query)

indicador = "gold_teste_datahub"

writeOracleDW(df, indicador)

Note:

This sql reflects to the image about the issue.

Lets focus on the real part of problem:

LEFT JOIN WIFI_BUSINESS NEGOCIOS
    ON NEGOCIOS.ID_COD_CLIENTE = CLIENTES.ID_CLIENTE
LEFT JOIN CLIENTES_NOVO NOV
    ON CLIENTES.ID_CLIENTE = NOV.ID_CLIENTE

In this step we perform a left join usin the keys "ID_CLIENTE, ID_COD_CLIENTE,ID_CLIENTE"

After the join we can use the table fields to "call" the field we need and perfor the tranformation like this.

    CLIENTES.ID_CLIENTE                                         					        AS ID_CLIENTE_ID,
    CLIENTES.CPF_CNPJ                                           					        AS CPF_CNPJ_CPF,
    NOV.ID_BASE                                                                                                  AS IDBASE_NOVO,
    CLIENTES.NOME                                                 					        AS NOME_CLIENTE_NOME,
    NEGOCIOS.ID_COD_PLANOPK                                                                       AS COD_PK,
    NEGOCIOS.ID_COD_SOLICITACAO                                                                 AS ID_COD_SOL_WIFI,

So we expect who in the data lineage the field ID_CLIENTE(from table cliente silver_layer ) become ID_CLIENTE_ID ( gold_layer)

But observes the follow image:

image

The expectation is ONLY ONE LINE from ID_CLIENTE to ID_CLIENTE_ID

Like the IS_PESSOA_FISICA to CLASSIFICACAO_CLIENTE_CLASS
FIG 2:
image

Is there a way to fix this?

For knowledge this is the spark config:

    datahub_configs = {
        'spark.extraListeners': 'datahub.spark.DatahubSparkListener',
        'spark.datahub.rest.server': f"http://{__DATAHUB_SERVER_IP}:{__DATAHUB_SERVER_PORT}",
        'spark.datahub.flow_name': app,
        'spark.datahub.metadata.include_scheme' : 'false',
        'spark.datahub.metadata.dataset.materialize' : 'true',
        'spark.datahub.metadata.dataset.experimental_include_schema_metadata' :'true',
        'spark.datahub.lineage.captureColumnLevel' : 'true',
        'spark.datahub.coalesce_jobs' : 'true',
        'spark.datahub.file_partition_regexp' : '.*_delta_log.*', 
    }

Jar version:

image

Datahub cli version:

image

To Reproduce
Steps to reproduce the behavior:

  1. Have a script in gold layer
  2. Perform a join in script
  3. After running the task (sparklistener - airflow) verify the datalineage from the tables using the field in the JOIN
  4. Note the lines from datalineage become 1:*
  5. Now A---A,A---B,A---C
  6. Expected A---A,B---B,C---C

Expected behavior
Its expect a datalineage using JOIN but 1:1 like FIG 2.
Joins do not become datalineage ambiguous or datalineage 1:*.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

NAME="Oracle Linux Server"
VERSION="9.4"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="9.4"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Oracle Linux Server 9.4"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:9:4:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://github.com/oracle/oracle-linux"

ORACLE_BUGZILLA_PRODUCT="Oracle Linux 9"
ORACLE_BUGZILLA_PRODUCT_VERSION=9.4
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=9.4

Additional context

A full data lineage is expected using join and 1:1

image

@lks-lks lks-lks added the bug Bug report label Oct 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report
Projects
None yet
Development

No branches or pull requests

1 participant