Added new invalid_email rule spec (#87)
yulric authored Dec 5, 2023
2 parents 28d80ea + b3098bc commit 44a61d8
Showing 12 changed files with 171 additions and 184 deletions.
11 changes: 9 additions & 2 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,17 @@
# Contributing

The PHES-ODM validation tool kit is an open source and community-driven. You can make suggestions for new validation rules or comment on existing rules on the PHES-ODM [discussion board](https://odm.discourse.group) or GitHub [Issues](https://github.com/Big-Life-Lab/PHES-ODM-Validation/issues).
The PHES-ODM validation tool kit is open source and community-driven. You can
make suggestions for new validation rules or comment on existing rules on the
PHES-ODM [discussion board](https://odm.discourse.group) or GitHub
[Issues](https://github.com/Big-Life-Lab/PHES-ODM-Validation/issues).

## Adding a new rule

New validation rules can be requested by anyone ODM user. Instructions on how to add a new rule is found in [/docs/validation-rules/README.md](/docs/validation-rules/README.md). The validation rules README.md is a good source of additional information about how rules work.
New validation rules can be requested by any ODM user. Instructions on how to
add a new rule can be found in the
[documentation](/rules.html#adding-a-new-rule). The validation rules
[README.md](/rules.html) is a good source of additional information about
how rules work.

## Code style

18 changes: 18 additions & 0 deletions assets/validation-rules/invalid-email/error-report-1.json
@@ -0,0 +1,18 @@
{
  "errors": [
    {
      "errorType": "invalid_email",
      "tableName": "contacts",
      "columnName": "email",
      "rowNumber": 1,
      "row": {
        "contactID": "1",
        "email": "john.doe"
      },
      "invalidValue": "john.doe",
      "validationRuleFields": [],
      "message": "Invalid email john.doe found in row 1 for column email in table contacts"
    }
  ],
  "warnings": []
}
2 changes: 2 additions & 0 deletions assets/validation-rules/invalid-email/invalid-dataset-1.csv
@@ -0,0 +1,2 @@
contactID,email
1,john.doe
5 changes: 5 additions & 0 deletions assets/validation-rules/invalid-email/parts.csv
@@ -0,0 +1,5 @@
partID,partType,sites,contacts,version1Location,version1Table,version1Variable
sites,tables,NA,NA,tables,Site,NA
geoLat,attributes,header,NA,variables,Site,Latitude
contacts,tables,NA,NA,tables,Contact,NA
email,attributes,NA,header,variables,Contact,contactEmail
22 changes: 22 additions & 0 deletions assets/validation-rules/invalid-email/schema-v1.yaml
@@ -0,0 +1,22 @@
schemaVersion: '1.0.0'
schema:
  Contact:
    type: list
    schema:
      type: dict
      schema:
        contactEmail:
          is_email: true
          meta:
          - ruleID: invalid_email
            meta:
            - partID: email
              partType: attributes
              contacts: header
              version1Location: variables
              version1Table: Contact
              version1Variable: contactEmail
    meta:
    - partID: contacts
      partType: tables
      version1Location: tables
      version1Table: Contact
17 changes: 17 additions & 0 deletions assets/validation-rules/invalid-email/schema-v2.yaml
@@ -0,0 +1,17 @@
schemaVersion: '2.0.0'
schema:
  contacts:
    type: list
    schema:
      type: dict
      schema:
        email:
          is_email: true
          meta:
          - ruleID: invalid_email
            meta:
            - partID: email
              partType: attributes
              contacts: header
    meta:
    - partID: contacts
      partType: tables
2 changes: 2 additions & 0 deletions assets/validation-rules/invalid-email/valid-dataset-1.csv
@@ -0,0 +1,2 @@
contactID,email
1,[email protected]
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
- partID: contactEmail
table: Lab
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
- partID: email
table: contacts
1 change: 1 addition & 0 deletions assets/validation-rules/validation-rules-list.csv
@@ -11,3 +11,4 @@ less_than_min_length,Validates the minimum length of a string type,The minLength
greater_than_max_length,Validates the maximum length of a string type,The maxLength column in the ODM dictionary documents the maximum length that a part should have. This validation implements it.,error,Value <invalid_value> in row <row_index> in column <column_name> in table <table_name> has length <invalid_length> which is greater than the max length of <max_length>,active,v1.0.0,,all,,
invalid_type,Validates type of a value,Uses the dataType column to check if a value is the correct type or can be coerced into the correct type,error,Value <invalid_value> in row <row_index> in column <column_name> in table <table_name> has type <invalid_value_type> but should be of type <valid_type> or coercable into a <valid_type>.,,,,,,
invalid_type,Validates type of a value,Uses the dataType column to check if a value is the correct type or can be coerced into the correct type,error,Row <row_index> in column <column_name> in table <table_name> is a boolean but has value <invalid_value>. Allowed values are <boolean_categories>,,,,,,
invalid_email,Validates an email column,The dictionary does not contain any metadata to describe if a column is an email or not. This rule hardcodes the email column within it.,error,Invalid email <invalidValue> found in row <rowIndex> for column <columnName> in table <tableName>,active,v1.0.0,,all,,
91 changes: 91 additions & 0 deletions docs/validation-rules/invalid_email.qmd
@@ -0,0 +1,91 @@
# invalid_email

{{< include _setup.qmd >}}

```{python}
#| echo: false
ASSET_DIR = get_rule_asset_dir('invalid_email')
```

This rule validates that varchar columns that represent an email have a valid email address. For example, consider the `email` column in the `contacts` table. The following dataset snippet should fail validation,

```{python}
pprint_csv_file(asset("invalid-dataset-1.csv"), "Invalid dataset")
```

whereas the following should pass,

```{python}
pprint_csv_file(asset("valid-dataset-1.csv"), "Valid dataset")
```

## Error report

The error report will have the following fields:

* **errorType**: invalid_email
* **tableName**: The name of the table whose row has the invalid email
* **columnName**: The name of the column with the invalid email
* **rowNumber**: The index of the table row with the error
* **row**: The row in the data that failed this validation rule
* **invalidValue**: The invalid email value
* **validationRuleFields**: The ODM data dictionary rule fields violated by this row
* **message**: Invalid email <invalidValue> found in row <rowIndex> for column <columnName> in table <tableName>

An example error report for the invalid dataset above is shown below,

```{python}
pprint_json_file(asset("error-report-1.json"))
```

## Rule metadata

The dictionary currently has no metadata indicating whether a column is an email. Instead, this rule is hardcoded to a predetermined set of email columns. For version 2 the email columns are:

```{python}
pprint_yaml_file(asset("version-2-email-columns.yaml"))
```

In the above file,

* The `partID` field contains the name of the email column and
* The `table` field contains the name of the table that the part is a column in.

If a parts sheet contains any of the above-mentioned columns, this validation rule should be added to them. For example, in the following parts sheet snippet this rule should be added only to the `email` column, and not to `geoLat`.

```{python}
pprint_csv_file(asset("parts.csv"), title = "Parts v2", ignore_prefix = "version1")
```
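
The selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the helper `find_email_parts` is hypothetical, the parts sheet is inlined without its version 1 columns, and a part is treated as a column of a table when that table's cell is anything other than `NA`.

```python
import csv
import io

# Hardcoded version 2 email columns, mirroring version-2-email-columns.yaml.
EMAIL_COLUMNS = [{"partID": "email", "table": "contacts"}]

# Parts sheet snippet from this page, without the version 1 columns.
PARTS_CSV = """partID,partType,sites,contacts
sites,tables,NA,NA
geoLat,attributes,header,NA
contacts,tables,NA,NA
email,attributes,NA,header
"""


def find_email_parts(parts_rows, email_columns):
    """Return (table, partID) pairs from the parts sheet that need the rule."""
    matches = []
    for part in parts_rows:
        for col in email_columns:
            # Assumption: a non-NA cell means the part is a column in that table.
            in_table = part.get(col["table"], "NA") != "NA"
            if part["partID"] == col["partID"] and in_table:
                matches.append((col["table"], part["partID"]))
    return matches


parts = list(csv.DictReader(io.StringIO(PARTS_CSV)))
print(find_email_parts(parts, EMAIL_COLUMNS))
```

Here `geoLat` is skipped because it is not in the hardcoded list, which leaves `email` in the `contacts` table as the only match.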

## Cerberus schema

We will add a custom rule called `is_email` to each email column. Alternative approaches and the reasons for not using them are:

1. `type` rule: We would prefer to keep the value of this rule the same as the `dataType` column in the ODM.
2. `regex` rule: Better than `type`, but it is less clear to a user of the schema what the regex is actually trying to validate.

Under the hood, the `is_email` rule will use a regex to validate the column value. An example of the regex can be seen in this [Stack Overflow thread](https://stackoverflow.com/a/201378/1950599). For the parts snippet above the following schema should be generated,

```{python}
pprint_yaml_file(asset("schema-v2.yaml"))
```
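
The check behind `is_email` could look roughly like the sketch below. The pattern is a deliberately simplified assumption, much looser than the regex in the linked thread, and the standalone function is illustrative only. In Cerberus, a custom rule like this is typically written as a `_validate_is_email` method on a `Validator` subclass, which is not shown here.

```python
import re

# Simplified pattern, assumed for illustration; not the regex from the thread.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")


def is_email(value):
    """Return True when the whole string looks like an email address."""
    return isinstance(value, str) and EMAIL_RE.fullmatch(value) is not None


print(is_email("john.doe"))              # value from the invalid dataset
print(is_email("jane.doe@example.com"))  # hypothetical valid address
```

Using `fullmatch` rather than `search` means a value is rejected if it only contains an email-like substring.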

## Version 1

For version 1 schemas, we add this rule to the version 1 equivalents of the above-mentioned version 2 email columns. In addition, this rule should also be added to the following version 1 only columns:

```{python}
pprint_yaml_file(asset("version-1-email-columns.yaml"))
```

For example, for the following version 1 parts snippet,

```{python}
pprint_csv_file(asset("parts.csv"), title = "Parts v1")
```

the following validation schema should be generated,

```{python}
pprint_yaml_file(asset("schema-v1.yaml"))
```
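
Finding the version 1 equivalent of a version 2 email column amounts to reading the `version1Table` and `version1Variable` cells of its parts row. A minimal sketch, where the helper name `to_version1` is hypothetical and the row is copied from the parts snippet on this page:

```python
def to_version1(part):
    """Map a parts-sheet row to its version 1 (table, column) names."""
    return part["version1Table"], part["version1Variable"]


# The `email` row from the parts sheet, reduced to its version 1 columns.
email_part = {
    "partID": "email",
    "version1Location": "variables",
    "version1Table": "Contact",
    "version1Variable": "contactEmail",
}
print(to_version1(email_part))
```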