Add dataframe validation before stage execution #197

Nitnelav · 2023-09-29T09:57:03Z

I think it would be a good idea to use Pandera to describe and check the input dataframes of a given stage at runtime.

It has the following benefits:

- it describes what the stage expects as inputs
- it makes the code easier to read
- it makes it easier to edit or replace a stage, knowing what the dependent stages expect
- it makes it clear when contributions change the expected format

I don't think it can or should be imposed on every existing stage, but it can be strongly encouraged by the community.

For example:

import pandas as pd
import pandera as pa
import numpy as np
import data.hts.hts as hts

"""
This stage cleans the Loire Atlantique EDGT.
"""

def configure(context):
    context.stage("data.hts.edgt_44.raw")

PURPOSE_MAP = {
    "home": [1, 2],
    "work": [11, 12, 13, 81],
    "education": [21, 22, 23, 24, 25, 26, 27, 28, 29],
    "shop": [30, 31, 32, 33, 34, 35, 82],
    "leisure": [51, 52, 53, 54],
    "other": [41, 42, 43, 44, 45, 61, 62, 63, 64, 71, 72, 73, 74, 91]
}

MODES_MAP = {
    "car": [13, 15, 21, 81],
    "car_passenger": [14, 16, 22, 82],
    "pt": [30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 51, 52, 53, 61, 71, 72, 73, 91, 92, 94, 95],
    "bike": [11, 17, 12, 18, 93, 19],
    "walk": [1, 2] # Actually, 2 is not really explained, but we assume it is walk
}

# expected input formats
HOUSEHOLDS_SCHEMA = pa.DataFrameSchema({
    'MTIR': pa.Column(object),
    'MP2': pa.Column(object),
    'ECH': pa.Column(object),
    'M5': pa.Column(np.int32),
    'M6': pa.Column(np.int32),
    'M7': pa.Column(np.int32),
    'COEM': pa.Column(float)
})
PERSONS_SCHEMA = pa.DataFrameSchema({
    "PTIR": pa.Column(object),
    "PP2": pa.Column(object),
    "ECH": pa.Column(object),
    "PER": pa.Column(np.int32),
    "P1": pa.Column(np.int32),
    "P2": pa.Column(np.int32),
    "P3": pa.Column(np.int32),
    "P4": pa.Column(np.int32),
    "P5": pa.Column(object, nullable=True),
    "P7": pa.Column(object, nullable=True),
    "P9": pa.Column(object, nullable=True),
    "P12": pa.Column(object, nullable=True),
    "COEP": pa.Column(float),
    "COEQ": pa.Column(float)
})
TRIPS_SCHEMA = pa.DataFrameSchema({
    "DTIR": pa.Column(object),
    "DP2": pa.Column(object),
    "ECH": pa.Column(object),
    "PER": pa.Column(np.int32),
    "NDEP": pa.Column(np.int32),
    "D2A": pa.Column(np.int32),
    "D3": pa.Column(object),
    "D4A": pa.Column(np.int32),
    "D4B": pa.Column(np.int32),
    "D5A": pa.Column(np.int32),
    "D7": pa.Column(object),
    "D8A": pa.Column(np.int32),
    "D8B": pa.Column(np.int32),
    "D8C": pa.Column(np.int32),
    "MODP": pa.Column(np.int32),
    "DOIB": pa.Column(np.int32),
    "DIST": pa.Column(np.int32)
})

def execute(context):
    df_households, df_persons, df_trips = context.stage("data.hts.edgt_44.raw")

    # check expected input formats
    df_households = HOUSEHOLDS_SCHEMA.validate(df_households)
    df_persons = PERSONS_SCHEMA.validate(df_persons)
    df_trips = TRIPS_SCHEMA.validate(df_trips)
   
    ...

    return df_households, df_persons, df_trips

sebhoerl · 2023-09-29T13:42:05Z

Very nice. I'm currently experimenting with snakemake to see whether it would be worth switching to a pipeline tool with a large user base. It would be interesting to see if there is an integration that can check the format.
Independent of that, we could even think of having some code somewhere that generates the schemas, like schemas.create_persons(additional = "income").validate(df_persons), with some standard attributes that need to be there plus optional ones if needed.

Nitnelav · 2023-10-02T08:33:25Z

O_o snakemake looks quite interesting indeed! Joining a broader "pipeline" community would make a lot of sense.

Regarding the second point, I think I would prefer defining everything inside the script, but I see how that might lead to a certain amount of code duplication (if the df_persons structure doesn't change much across many scripts, for example).

Nitnelav · 2024-11-13T11:57:22Z

FYI, I'm using pandera right now in another pipeline, and I find it very verbose if you want to validate the whole dataframe at every stage. I'll have a better opinion in a few weeks.

Nitnelav mentioned this issue Sep 29, 2023

Use pandera dataframe validation eqasim-org/synpp#83

Closed
