Framework for reporting data issues #14

Closed
stefpiatek opened this issue Mar 1, 2023 · 1 comment

stefpiatek commented Mar 1, 2023

We should create a framework for checking the merged data before running the reports. It should log to stdout and also generate an HTML report table that helps identify errors.
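A minimal sketch of what one such check could look like, assuming the merged data is a pandas DataFrame. The function name `check_columns`, the logger name and the result wording are hypothetical; the result columns follow the report table proposed in this issue, with the priority column left out because how priorities get assigned is not decided here.

```python
import logging
import sys

import pandas as pd

# Send log records to stdout, as the issue asks for (the default stream is stderr).
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("data_checks")  # hypothetical logger name


def check_columns(data: pd.DataFrame, expected_columns: list[str]) -> pd.DataFrame:
    """Compare the columns of `data` with `expected_columns`; return one row per check."""
    missing = sorted(set(expected_columns) - set(data.columns))
    extra = sorted(set(data.columns) - set(expected_columns))
    results = [
        {
            "test description": "check that input data columns match",
            "expected result": "expected columns should be found",
            "actual result": f"not found: {', '.join(missing)}" if missing else "all found",
            "pass/fail": "FAIL" if missing else "PASS",
        },
        {
            "test description": "check that no extra columns are present",
            "expected result": "no novel columns should exist",
            "actual result": f"unexpected: {', '.join(extra)}" if extra else "none",
            "pass/fail": "FAIL" if extra else "PASS",
        },
    ]
    # Log every result: failures at ERROR level, passes at INFO level.
    for result in results:
        log = logger.error if result["pass/fail"] == "FAIL" else logger.info
        log("%s: %s", result["test description"], result["actual result"])
    return pd.DataFrame(results)
```

The returned DataFrame can be passed straight to `to_html`, so the same results feed both the log and the HTML report.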

The report tables should look something like this, maybe with a high-level and a low-level version:

| | test description | expected result | actual result | pass/fail | priority |
|---|---|---|---|---|---|
| 0 | check that input data columns match | expected columns should be found | expected `rred_example_column` not found | FAIL | HIGH |
| 1 | check that no extra columns are present | no novel columns should exist | unexpected column `RRED_example_column` present | PASS | |

| | pupil_no | school | field | issue |
|---|---|---|---|---|
| 0 | AS82827_1 | Walden Road Primary | date_of_birth | improbable value of `2022-01-18` |
| 1 | JK92817_2 | Bigginson Primary School | exit_date | missing value |
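The low-level table could be produced by a per-field check along these lines. This is only a sketch: `find_row_issues` is a hypothetical helper, the plausible date range is an assumption the caller has to supply, and values that cannot be parsed as dates are reported as missing for simplicity.

```python
import pandas as pd


def find_row_issues(data, date_field, earliest, latest, id_fields=("pupil_no", "school")):
    """Return one row per problem value in `date_field`.

    A value is flagged when it is missing (or cannot be parsed as a date)
    or when it falls outside the range [earliest, latest].
    """
    earliest, latest = pd.Timestamp(earliest), pd.Timestamp(latest)
    # errors="coerce" turns unparseable values into NaT instead of raising.
    dates = pd.to_datetime(data[date_field], errors="coerce")
    issues = []
    for index, value in dates.items():
        if pd.isna(value):
            issue = "missing value"
        elif not (earliest <= value <= latest):
            issue = f"improbable value of `{value.date()}`"
        else:
            continue  # value looks fine, nothing to report
        row = {name: data.at[index, name] for name in id_fields}
        row.update({"field": date_field, "issue": issue})
        issues.append(row)
    return pd.DataFrame(issues, columns=[*id_fields, "field", "issue"])
```

Running one such check per field and concatenating the results with `pd.concat` would give a single low-level table.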

We may be able to just use pandas with an HTML template rather than messing around with jinja2 templating. Here is a prototype, which we would need to make more production ready, but it gives an idea:

import pandas as pd

# Dataset-level checks: one row per test
top_level_checks = pd.DataFrame({
    "test description": [
        "check that input data columns match",
        "check that no extra columns are present"
    ],
    "expected result": [
        "expected columns should be found",
        "no novel columns should exist"
    ],
    "actual result": [
        "expected `rred_example_column` not found",
        "unexpected column `RRED_example_column` present"
    ],
    "pass/fail": [
        "FAIL",
        "PASS"
    ],
    "priority": [
        "HIGH",
        ""
    ]
})


# Record-level issues: one row per problem value
low_level_checks = pd.DataFrame({
    "pupil_no": [
        "AS82827_1",
        "JK92817_2",
    ],
    "school": [
        "Walden Road Primary",
        "Bigginson Primary School"
    ],
    "field": [
        "date_of_birth",
        "exit_date"
    ],
    "issue": [
        "improbable value of `2022-01-18`",
        "missing value"
    ],
})


html_template = """
<!doctype html>
<html lang="en">
  <head>
    <!-- Required meta tags -->
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
    <!-- Bootstrap CSS -->
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@4.3.1/dist/css/bootstrap.min.css"
          integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
    <title>{header}</title>
  </head>
  <body>
    <h1>Top level issues</h1>
    {top_level}
    <h1>Low level issues</h1>
    {low_level}
  </body>
</html>
"""

# Write the report; the encoding matches the charset declared in the template
with open("text.html", "w", encoding="utf-8") as handle:
    handle.write(html_template.format(
        header="RRED UAT report",
        top_level=top_level_checks.to_html(classes="table table-striped"),
        low_level=low_level_checks.to_html(classes="table table-striped")
    ))
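One thing to watch when making the prototype more production ready: `str.format` treats every `{` and `}` in the template as a placeholder, so adding inline CSS or JavaScript would require doubling the braces. A possible alternative, sketched here under the assumption that we stay with the standard library, is `string.Template`, which uses `$name` placeholders. `REPORT_TEMPLATE`, `render_report` and the example `<style>` rule are hypothetical.

```python
from string import Template

import pandas as pd

# `$name` placeholders leave literal `{` and `}` (for example inline CSS) untouched.
REPORT_TEMPLATE = Template("""\
<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <style>td { vertical-align: top; }</style>
    <title>$header</title>
  </head>
  <body>
    <h1>Top level issues</h1>
    $top_level
    <h1>Low level issues</h1>
    $low_level
  </body>
</html>
""")


def render_report(top_level: pd.DataFrame, low_level: pd.DataFrame, header: str) -> str:
    """Return the report as an HTML string so the caller decides where to write it."""
    return REPORT_TEMPLATE.substitute(
        header=header,
        top_level=top_level.to_html(classes="table table-striped"),
        low_level=low_level.to_html(classes="table table-striped"),
    )
```

Returning a string rather than writing a file directly also makes the rendering easy to unit test.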


stefpiatek changed the title from "Data validation test with tests cases" to "Framework for reporting data issues" on Mar 1, 2023
stefpiatek commented

Used a different approach in #47
