diff --git a/CHANGELOG.md b/CHANGELOG.md index dedf557..cb4f299 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,21 +7,26 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 This changelog was started for release 0.0.3. -## [0.0.3] - Unreleased +## [0.0.3] - 21/11/2022 ### Added -- empty_ok_if key for validator -- empty_ok_unless key for validator +- empty_ok_if key for validator & templates +- empty_ok_unless key for validator & templates - readme key for validator - unique key for validator - expected_rows key for templates - logs parameters for templates +- na_ok key for validators & templates +- skip_generation key for validators & templates +- skip_validation key for validators & templates ### Fixed - Bug for setValidator when using number values +- Regex for GPS coordinates ### Changed - Better validation for integers +- Refactored validation in excel for most validators (to include unique & na_ok) diff --git a/README.md b/README.md index 7305b61..442b835 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,8 @@ # Checkcel Checkcel is a generation & validation tool for CSV/ODS/XLSX/XLS files. -Basic validations (sets, whole, decimals, unicity, emails, dates) are included, but also ontologies validation. -(Using the [OLS API](https://www.ebi.ac.uk/ols/index)) +Basic validations (sets, whole, decimals, unicity, emails, dates, regex) are included, as well as ontology validation. +(Using the [OLS API](https://www.ebi.ac.uk/ols/index), and the [INRAE thesaurus](https://consultation.vocabulaires-ouverts.inrae.fr)) Checkcel works with either python templates or json/yml files for the generation and validation. Examples are available [here](https://github.com/mboudet/checkcel_templates) or in the [example folder](examples/). @@ -98,6 +98,7 @@ Checkcel( sheet="0" ).load_from_json_file(your_json_template_file).validate() +# You can access the logs from python with the 'logs' attribute of the Checkcel class ``` # Templates @@ -108,8 +109,12 @@ In all cases, you will need to at least include a list of validators and associa * *metadata*: A list of column names. This will create a metadata sheet with these columns, without validation on them * *expected_rows*: (Default 0): Number of *data* rows expected * *empty_ok* (Default False): Whether to accept empty values as valid -* *ignore_space* (Default False): whether to trim the values for spaces before checking validity -* *ignore_case* (Default False): whether to ignore the case +* *na_ok* (Default False): whether to allow NA (or n/a) values as valid +* *ignore_space* (Default False): whether to trim the values for spaces before checking validity in python +* *ignore_case* (Default False): whether to ignore the case (when relevant) before checking validity in python +* *skip_generation* (Default False): whether to skip the excel validation generation (for file generation) for all validators +* *skip_validation* (Default False): whether to skip the python validation for all validators +* *unique* (Default False): whether to require unicity for all validators The last 3 parameters will affect all the validators (when relevant), but can be overriden at the validator level (eg, you can set 'empty_ok' to True for all, but set it to False for a specific validator). @@ -155,7 +160,10 @@ All validators (except NoValidator) have these options available. If relevant, t * The dict keys must be column names, and the values lists of 'rejected values'. 
The current column will accept empty values if the related column's value is **not** in the list of reject values * *ignore_space* (Default False): whether to trim the values for spaces before checking validity * *ignore_case* (Default False): whether to ignore the case -* *unique* (Default False): whether to enforce unicity for this column. (Not enforced in excel yet, except if there are not other validation (ie TextValidator and RegexValidator in some cases)) +* *unique* (Default False): whether to enforce unicity for this column. (Not enforced in excel for 'Set-type' validators (set, linked-set, ontology, vocabulaireOuvert)) +* *na_ok* (Default False): whether to allow NA (or n/a) values as valid. +* *skip_generation* (Default False): whether to skip the excel validation for this validator (for file generation) +* *skip_validation* (Default False): whether to skip the python validation for this validator *As excel validation for non-empty values is unreliable, the non-emptiness cannot be properly enforced in excel files* @@ -163,58 +171,58 @@ All validators (except NoValidator) have these options available. If relevant, t * NoValidator (always True) * **No in-file validation generated** -* TextValidator(empty_ok=False) +* TextValidator(**kwargs) * **No in-file validation generated** (unless *unique* is set) -* IntValidator(min="", max="", empty_ok=False) +* IntValidator(min="", max="", **kwargs) * Validate that a value is an integer * *min*: Minimal value allowed * *max*: Maximal value allowed -* FloatValidator(min="", max="", empty_ok=False) +* FloatValidator(min="", max="", **kwargs) * Validate that a value is an float * *min*: Minimal value allowed * *max*: Maximal value allowed -* SetValidator(valid_values=[], empty_ok=False) +* SetValidator(valid_values=[], **kwargs) * Validate that a value is part of a set of allowed values * *valid_values*: list of valid values -* LinkedSetValidator(linked_column="", valid_values={}, empty_ok=False) +* LinkedSetValidator(linked_column="", valid_values={}, **kwargs) * Validate that a value is part of a set of allowed values, in relation to another column value. * Eg: Valid values for column C will be '1' or '2' if column B value is 'Test', else '3' or '4' * *linked_column*: Linked column name * *valid_values*: Dict with the *linked_column* values as keys, and list of valid values as values * Ex: {"Test": ['1', '2'], "Test2": ['3', '4']} -* EmailValidator(empty_ok=False) -* DateValidator(day_first=True, empty_ok=False, before=None, after=None) +* EmailValidator(**kwargs) +* DateValidator(day_first=True, before=None, after=None, **kwargs) * Validate that a value is a date. * *day_first* (Default True): Whether to consider the day as the first part of the date for ambiguous values. * *before* Latest date allowed * *after*: Earliest date allowed -* TimeValidator(empty_ok=False, before=None, after=None) +* TimeValidator(before=None, after=None, **kwargs) * Validate that a value is a time of the day * *before* Latest value allowed * *after*: Earliest value allowed -* UniqueValidator(unique_with=[], empty_ok=False) +* UniqueValidator(unique_with=[], **kwargs) * Validate that a column has only unique values. * *unique_with*: List of column names if you need a tuple of column values to be unique. 
* Ex: *I want the tuple (value of column A, value of column B) to be unique* -* OntologyValidator(ontology, root_term="", empty_ok=False) +* OntologyValidator(ontology, root_term="", **kwargs) * Validate that a term is part of an ontology, using the [OLS API](https://www.ebi.ac.uk/ols/index) for validation * *ontology* needs to be a short-form ontology name (ex: ncbitaxon) * *root_term* can be used if you want to make sure your terms are *descendants* of a specific term * (Should be used when generating validated files using big ontologies) -* VocabulaireOuvertValidator(root_term="", lang="en", labellang="en", vocab="thesaurus-inrae", empty_ok=False) +* VocabulaireOuvertValidator(root_term="", lang="en", labellang="en", vocab="thesaurus-inrae", **kwargs) * Validate that a term is part of the INRAE(default) or IRSTEA thesaurus * **No in-file validation generated** *unless using root_term* * *root_term*: Same as OntologyValidator. * *lang*: Language for the queried terms *(en or fr)* * *labellang*: Language for the queries returns (ie, the generated validation in files). Default to *lang* values. * *vocab*: Vocabulary used. Either 'thesaurus-inrae' or 'thesaurus-irstea'. -* GPSValidator(empty_ok=False, format="DD", only_long=False, only_lat=False) +* GPSValidator(format="DD", only_long=False, only_lat=False, **kwargs) * Validate that a term is a valid GPS cordinate * **No in-file validation generated** * *format*: Expected GPS format. Valid values are *dd* (decimal degrees, default value) or *dms* (degree minutes seconds) * *only_long*: Expect only a longitude * *only_lat*: Expect only a latitude -* RegexValidator(regex, excel_formulat="", empty_ok=False) +* RegexValidator(regex, excel_formula="", **kwargs) * Validate that a term match a specific regex * **No in-file validation generated** *unless using excel_formula* * *excel_formula*: Custom rules for in-file validation. [Examples here](http://www.contextures.com/xlDataVal07.html). 
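To make the template-level keys above concrete, here is a minimal sketch of a python template using the options added in 0.0.3. It assumes the usual pattern of subclassing `Checkplate` and declaring the options as class attributes; the class name, column names and values are illustrative only.

```python
# Minimal sketch of a python template using the 0.0.3 template-level keys.
# Class name, column names and example values are illustrative, not part of the library.
from checkcel.checkplate import Checkplate
from checkcel.validators import DateValidator, IntValidator, SetValidator


class MyTemplate(Checkplate):
    metadata = ["Project", "Contact"]
    expected_rows = 10
    empty_ok = False
    na_ok = True              # template-wide default, overridden below for one column
    unique = False
    skip_generation = False   # keep the excel validation in generated files
    skip_validation = False   # keep the python validation
    freeze_header = True      # freeze the header row in generated files
    validators = {
        "Sample ID": IntValidator(min=1, unique=True),
        # validator-level keys take precedence over the template-level defaults
        "Condition": SetValidator(valid_values=["control", "treated"], na_ok=False),
        "Sampling date": DateValidator(after="01/01/2020"),
    }
```

Such a template can then be loaded with `load_from_python_file()` for both generation and validation, exactly like the json/yml templates; the per-validator keys (eg the `na_ok=False` above) win over the template-level defaults.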
diff --git a/checkcel/checkerator.py b/checkcel/checkerator.py index 3936b00..92bd3dd 100644 --- a/checkcel/checkerator.py +++ b/checkcel/checkerator.py @@ -42,7 +42,7 @@ def generate(self): if isinstance(validator, OntologyValidator) or isinstance(validator, VocabulaireOuvertValidator): if not ontology_sheet: ontology_sheet = wb.create_sheet(title="Ontologies") - data_validation = validator.generate(get_column_letter(current_data_column), get_column_letter(current_ontology_column), ontology_sheet) + data_validation = validator.generate(get_column_letter(current_data_column), column_name, get_column_letter(current_ontology_column), ontology_sheet) current_ontology_column += 1 elif isinstance(validator, SetValidator): # Total size, including separators must be < 256 @@ -52,18 +52,18 @@ def generate(self): data_validation = validator.generate(get_column_letter(current_data_column), column_name, get_column_letter(current_set_column), set_sheet) current_set_column += 1 else: - data_validation = validator.generate(get_column_letter(current_data_column)) + data_validation = validator.generate(get_column_letter(current_data_column), column_name) set_columns[column_name] = get_column_letter(current_data_column) elif isinstance(validator, LinkedSetValidator): if not set_sheet: set_sheet = wb.create_sheet(title="Sets") - data_validation = validator.generate(get_column_letter(current_data_column), set_columns, column_name, get_column_letter(current_set_column), set_sheet, wb) + data_validation = validator.generate(get_column_letter(current_data_column), column_name, set_columns, get_column_letter(current_set_column), set_sheet, wb) current_set_column += 1 set_columns[column_name] = get_column_letter(current_data_column) elif isinstance(validator, UniqueValidator): - data_validation = validator.generate(get_column_letter(current_data_column), column_dict) + data_validation = validator.generate(get_column_letter(current_data_column), column_name, column_dict) else: - data_validation = validator.generate(get_column_letter(current_data_column)) + data_validation = validator.generate(get_column_letter(current_data_column), column_name) if data_validation: data_sheet.add_data_validation(data_validation) current_data_column += 1 @@ -71,6 +71,9 @@ def generate(self): for column_cells in sheet.columns: length = (max(len(self.as_text(cell.value)) for cell in column_cells) + 2) * 1.2 sheet.column_dimensions[get_column_letter(column_cells[0].column)].width = length + + if self.freeze_header: + data_sheet.freeze_panes = "A2" wb.save(filename=self.output) def as_text(self, value): diff --git a/checkcel/checkplate.py b/checkcel/checkplate.py index 6985166..473cebc 100644 --- a/checkcel/checkplate.py +++ b/checkcel/checkplate.py @@ -15,19 +15,24 @@ class Checkplate(object): """ Base class for templates """ - def __init__(self, validators={}, empty_ok=False, ignore_case=False, ignore_space=False, metadata=[], expected_rows=None): + def __init__(self, validators={}, empty_ok=False, ignore_case=False, ignore_space=False, metadata=[], expected_rows=None, na_ok=False, unique=False, skip_generation=False, skip_validation=False, freeze_header=False): self.metadata = metadata self.logger = logs.logger self.validators = validators or getattr(self, "validators", {}) self.logs = [] # Will be overriden by validators config self.empty_ok = empty_ok + self.na_ok = na_ok + self.unique = unique + self.skip_generation = skip_generation + self.skip_validation = skip_validation self.ignore_case = ignore_case self.ignore_space = 
ignore_space self.expected_rows = expected_rows + self.freeze_header = freeze_header # self.trim_values = False for validator in self.validators.values(): - validator._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space) + validator._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space, self.na_ok, self.unique, self.skip_generation, self.skip_validation) def debug(self, message): self.logger.debug(message) @@ -69,9 +74,14 @@ def load_from_python_file(self, file_path): self.metadata = getattr(custom_class, 'metadata', []) self.validators = deepcopy(custom_class.validators) self.empty_ok = getattr(custom_class, 'empty_ok', False) + self.na_ok = getattr(custom_class, 'na_ok', False) + self.unique = getattr(custom_class, 'unique', False) + self.skip_generation = getattr(custom_class, 'skip_generation', False) + self.skip_validation = getattr(custom_class, 'skip_validation', False) self.ignore_case = getattr(custom_class, 'ignore_case', False) self.ignore_space = getattr(custom_class, 'ignore_space', False) self.expected_rows = getattr(custom_class, 'expected_rows', 0) + self.freeze_header = getattr(custom_class, 'freeze_header', False) try: self.expected_rows = int(self.expected_rows) except ValueError: @@ -80,7 +90,7 @@ def load_from_python_file(self, file_path): ) for key, validator in self.validators.items(): - validator._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space) + validator._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space, self.na_ok, self.unique, self.skip_generation, self.skip_validation) return self def load_from_json_file(self, file_path): @@ -136,9 +146,14 @@ def _load_from_dict(self, data): return exits.UNAVAILABLE self.empty_ok = data.get("empty_ok", False) + self.na_ok = data.get("na_ok", False) self.ignore_case = data.get('ignore_case', False) self.ignore_space = data.get('ignore_space', False) self.expected_rows = data.get('expected_rows', 0) + self.unique = data.get('unique', False) + self.skip_generation = data.get('skip_generation', False) + self.skip_validation = data.get('skip_validation', False) + self.freeze_header = data.get('freeze_header', False) try: self.expected_rows = int(self.expected_rows) except ValueError: @@ -161,7 +176,7 @@ def _load_from_dict(self, data): try: validator_class = getattr(validators, validator['type']) val = validator_class(**options) - val._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space) + val._set_attributes(self.empty_ok, self.ignore_case, self.ignore_space, self.na_ok, self.unique, self.skip_generation, self.skip_validation) except AttributeError: self.error( "{} is not a valid Checkcel Validator".format(validator['type']) diff --git a/checkcel/validators.py b/checkcel/validators.py index f9c8e5c..a76817f 100644 --- a/checkcel/validators.py +++ b/checkcel/validators.py @@ -18,11 +18,12 @@ class Validator(object): """ Generic Validator class """ - def __init__(self, empty_ok=None, ignore_case=None, ignore_space=None, empty_ok_if=None, empty_ok_unless=None, readme=None, unique=False): + def __init__(self, empty_ok=None, ignore_case=None, ignore_space=None, empty_ok_if=None, empty_ok_unless=None, readme=None, unique=None, na_ok=None, skip_generation=None, skip_validation=None): self.logger = logs.logger self.invalid_dict = defaultdict(set) self.fail_count = 0 self.empty_ok = empty_ok + self.na_ok = na_ok self.ignore_case = ignore_case self.ignore_space = ignore_space self.empty_ok_if = empty_ok_if @@ -31,6 +32,8 @@ def __init__(self, empty_ok=None, 
ignore_case=None, ignore_space=None, empty_ok_ self.readme = readme self.unique = unique self.unique_values = set() + self.skip_generation = skip_generation + self.skip_validation = skip_validation if empty_ok_if: if not (isinstance(empty_ok_if, dict) or isinstance(empty_ok_if, list) or isinstance(empty_ok_if, str)): @@ -91,7 +94,7 @@ def validate(self, field, row_number, row): """ Validate the given field. Also is given the row context """ raise NotImplementedError - def generate(self, column): + def generate(self, column, column_name): """ Generate an openpyxl Datavalidation entity. Pass the column for custom formulas""" raise NotImplementedError @@ -99,14 +102,45 @@ def describe(self, column_name): """ Return a line of text describing allowed values""" raise NotImplementedError - def _set_attributes(self, empty_ok_template, ignore_case_template, ignore_space_template): + def _set_attributes(self, empty_ok_template=False, ignore_case_template=False, ignore_space_template=False, na_ok_template=False, unique=False, skip_generation=False, skip_validation=False): # Override with template value if it was not set (default to None) if self.empty_ok is None: self.empty_ok = empty_ok_template + if self.na_ok is None: + self.na_ok = na_ok_template if self.ignore_case is None: self.ignore_case = ignore_case_template if self.ignore_space is None: self.ignore_space = ignore_space_template + if self.unique is None: + self.unique = unique + if self.skip_generation is None: + self.skip_generation = skip_generation + if self.skip_validation is None: + self.skip_validation = skip_validation + + def _format_formula(self, parameter_list, column): + formula = "" + + if self.unique: + internal_value = "${0}2:${0}1048576,{0}2".format(column) + parameter_list.append('COUNTIF({})<2'.format(internal_value)) + + if len(parameter_list) == 0: + return "" + + if len(parameter_list) == 1: + formula = parameter_list[0] + + if len(parameter_list) > 1: + formula = "AND({})".format(",".join(parameter_list)) + + if self.na_ok: + na_form = 'OR(LOWER(${}2)="na", LOWER(${}2)="n/a")'.format(column, column) + formula = 'OR({},{})'.format(na_form, formula) + + formula = "={}".format(formula) + return formula class NoValidator(Validator): @@ -118,7 +152,7 @@ def __init__(self, **kwargs): def validate(self, field, row_number, row={}): pass - def generate(self, column): + def generate(self, column, column_name): return None def describe(self, column_name): @@ -138,9 +172,15 @@ def __init__(self, **kwargs): super(TextValidator, self).__init__(**kwargs) def validate(self, field, row_number, row): + if self.skip_validation: + return None + if not self.empty_check: self._precheck_empty_ok_if(row) + if self.na_ok and field.lower() in ['na', 'n/a']: + return + if not field and not self._can_be_empty(row): raise ValidationException( "Field cannot be empty" @@ -155,13 +195,16 @@ def validate(self, field, row_number, row): def bad(self): return self.invalid_dict - def generate(self, column): + def generate(self, column, column_name): + if self.skip_generation: + return None + if self.unique: params = {"type": "custom", "allow_blank": self.empty_ok} - internal_value = "${0}:${0},{0}2".format(column) - params["formula1"] = '=COUNTIF({})<2'.format(internal_value) + formula = self._format_formula([], column) + params["formula1"] = formula dv = DataValidation(**params) - dv.error = 'Value must be unique' + dv.error = self.describe(column_name) dv.add("{}2:{}1048576".format(column, column)) return dv @@ -180,6 +223,9 @@ def __init__(self, 
min=None, max=None, **kwargs): self.max = max def validate(self, field, row_number, row): + if self.skip_validation: + return None + if not self.empty_check: self._precheck_empty_ok_if(row) @@ -188,6 +234,9 @@ def validate(self, field, row_number, row): try: if field or not self._can_be_empty(row): + if self.na_ok and str(field).lower() in ['na', 'n/a']: + return + field = float(field) if self.type == "whole" and not (field).is_integer(): raise ValueError @@ -215,19 +264,29 @@ def validate(self, field, row_number, row): def bad(self): return self.invalid_dict - def generate(self, column): - params = {"type": self.type, "allow_blank": self.empty_ok} - if (self.min is not None and self.max is not None): - params["formula1"] = self.min - params["formula2"] = self.max - params["operator"] = "between" - elif self.min is not None: - params["formula1"] = self.min - params["operator"] = "greaterThanOrEqual" - elif self.max is not None: - params["formula1"] = self.max - params["operator"] = "lessThanOrEqual" + def generate(self, column, column_name): + if self.skip_generation: + return None + + params = {"type": "custom", "allow_blank": self.empty_ok} + formulas = [] + if self.type == "whole": + formulas.append("IFERROR(MOD({}2,1)=0,FALSE)".format(column)) + else: + formulas.append("ISNUMBER({}2)".format(column)) + + if self.min is not None: + formulas.append("{}2>={}".format(column, self.min)) + + if self.max is not None: + formulas.append("{}2<={}".format(column, self.max)) + + formula = self._format_formula(formulas, column) + + params['formula1'] = formula + dv = DataValidation(**params) + dv.error = self.describe(column_name) dv.add("{}2:{}1048576".format(column, column)) return dv @@ -276,8 +335,13 @@ def __init__(self, valid_values=set(), **kwargs): self.valid_values = set([str(val) for val in valid_values]) if self.empty_ok: self.valid_values.add("") + if self.na_ok: + self.valid_values.add("N/A") def validate(self, field, row_number, row): + if self.skip_validation: + return None + if not self.empty_check: self._precheck_empty_ok_if(row) @@ -292,6 +356,8 @@ def validate(self, field, row_number, row): raise ValidationException( "'{}' is invalid".format(field) ) + if self.na_ok and field.lower() in ['na', 'n/a']: + return if field not in self.valid_values: self.invalid_dict["invalid_set"].add(field) @@ -304,26 +370,29 @@ def validate(self, field, row_number, row): raise ValidationException("'{}' is already in the column".format(field)) self.unique_values.add(field) - def _set_attributes(self, empty_ok_template, ignore_case_template, ignore_space_template): + def _set_attributes(self, empty_ok_template, ignore_case_template, ignore_space_template, na_ok_template, unique_template, skip_generation_template, skip_validation_template): # Override with template value if it was not set (default to None) - if self.empty_ok is None: - self.empty_ok = empty_ok_template + super()._set_attributes(empty_ok_template, ignore_case_template, ignore_space_template, na_ok_template, unique_template, skip_generation_template, skip_validation_template) + if self.empty_ok: self.valid_values.add("") - if self.ignore_case is None: - self.ignore_case = ignore_case_template + if self.na_ok: + self.valid_values.add("N/A") + if self.ignore_case: self.valid_values = set([value.lower() for value in self.valid_values]) - if self.ignore_space is None: - self.ignore_space = ignore_space_template + if self.ignore_space: + self.valid_values = set([value.strip() for value in self.valid_values]) @property def bad(self): return 
self.invalid_dict - def generate(self, column, column_name="", additional_column=None, additional_worksheet=None): + def generate(self, column, column_name, additional_column=None, additional_worksheet=None): + if self.skip_generation: + return None # If total length > 256 : need to use cells on another sheet if additional_column and additional_worksheet: params = {"type": "list", "allow_blank": self.empty_ok} @@ -357,12 +426,17 @@ def __init__(self, linked_column="", valid_values={}, **kwargs): self.linked_column = linked_column self.column_check = False + self._clean_values() + def _precheck_unique_with(self, row): if self.linked_column not in row.keys(): raise BadValidatorException("Linked column {} is not in file columns".format(self.linked_column)) self.column_check = True def validate(self, field, row_number, row): + if self.skip_validation: + return None + if self.ignore_case: field = field.lower() if self.ignore_space: @@ -374,6 +448,9 @@ def validate(self, field, row_number, row): if not field and self.empty_ok: return + if self.na_ok and field.lower() in ['na', 'n/a']: + return + related_column_value = row[self.linked_column] if not related_column_value: self.invalid_dict["invalid_rows"].add(row_number) @@ -397,7 +474,9 @@ def validate(self, field, row_number, row): def bad(self): return self.invalid_dict - def generate(self, column, set_columns, column_name, additional_column, additional_worksheet, workbook): + def generate(self, column, column_name, set_columns, additional_column, additional_worksheet, workbook): + if self.skip_generation: + return None if self.linked_column not in set_columns: # TODO raise warning return None @@ -426,6 +505,26 @@ def describe(self, column_name): column_name += " ({})".format(self.readme) return "{} : Linked values to column {} {}{}".format(column_name, self.linked_column, "(required)" if not self.empty_ok else "", "(unique)" if self.unique else "") + def _set_attributes(self, empty_ok_template, ignore_case_template, ignore_space_template, na_ok_template, unique_template, skip_generation_template, skip_validation_template): + # Override with template value if it was not set (default to None) + super()._set_attributes(empty_ok_template, ignore_case_template, ignore_space_template, na_ok_template, unique_template, skip_generation_template, skip_validation_template) + self._clean_values() + + def _clean_values(self): + for key, values in self.valid_values.items(): + cleaned_values = set() + for value in values: + if self.ignore_case: + value = value.lower() + if self.ignore_space: + value = value.strip() + cleaned_values.add(value) + if self.empty_ok: + cleaned_values.add("") + if self.na_ok: + cleaned_values.add("N/A") + self.valid_values[key] = cleaned_values + class DateValidator(Validator): """ Validates that a field is a Date """ @@ -450,6 +549,9 @@ def __init__(self, day_first=True, before=None, after=None, **kwargs): self.after = after def validate(self, field, row_number, row): + if self.skip_validation: + return None + if not self.empty_check: self._precheck_empty_ok_if(row) @@ -458,6 +560,8 @@ def validate(self, field, row_number, row): try: if field or not self._can_be_empty(row): + if self.na_ok and field.lower() in ['na', 'n/a']: + return # Pandas auto convert fields into dates (ignoring the parse_dates=False) field = str(field) date = parser.parse(field, dayfirst=self.day_first).date() @@ -486,24 +590,30 @@ def validate(self, field, row_number, row): def bad(self): return self.invalid_dict - def generate(self, column, 
additional_column=None, additional_worksheet=None): + def generate(self, column, column_name, additional_column=None, additional_worksheet=None): + if self.skip_generation: + return None # GreaterThanOrEqual for validity with ODS. - params = {"type": "date", "allow_blank": self.empty_ok} - if (self.before is not None and self.after is not None): - params["formula1"] = parser.parse(self.after).strftime("%Y/%m/%d") - params["formula2"] = parser.parse(self.before).strftime("%Y/%m/%d") - params["operator"] = "between" - elif self.before is not None: - params["formula1"] = parser.parse(self.before).strftime("%Y/%m/%d") - params["operator"] = "lessThanOrEqual" - elif self.after is not None: - params["formula1"] = parser.parse(self.after).strftime("%Y/%m/%d") - params["operator"] = "greaterThanOrEqual" - else: - params["formula1"] = "01/01/1900" - params["operator"] = "greaterThanOrEqual" + params = {"type": "custom", "allow_blank": self.empty_ok} + formulas = [] + + formulas.append("ISNUMBER({}2)".format(column)) + if self.before is not None: + if parser.parse(self.before) < parser.parse("01/01/1900"): + self.warn("Before date is before 01/01/1900: Validation will not work in excel, skipping") + else: + formulas.append('{}2<=DATEVALUE("{}")'.format(column, parser.parse(self.before).strftime("%Y/%m/%d"))) + if self.after is not None: + if parser.parse(self.after) < parser.parse("01/01/1900"): + self.warn("After date is before 01/01/1900: Validation will not work in excel, skipping") + else: + formulas.append('{}2>=DATEVALUE("{}")'.format(column, parser.parse(self.after).strftime("%Y/%m/%d"))) + + formula = self._format_formula(formulas, column) + params['formula1'] = formula dv = DataValidation(**params) + dv.error = self.describe(column_name) dv.add("{}2:{}1048576".format(column, column)) return dv @@ -548,6 +658,9 @@ def __init__(self, before=None, after=None, **kwargs): self.after = after def validate(self, field, row_number, row): + if self.skip_validation: + return None + if not self.empty_check: self._precheck_empty_ok_if(row) @@ -555,6 +668,8 @@ def validate(self, field, row_number, row): field = field.strip() try: if field or not self._can_be_empty(row): + if self.na_ok and field.lower() in ['na', 'n/a']: + return # Pandas auto convert fields into dates (ignoring the parse_dates=False) field = str(field) time = parser.parse(field).time() @@ -583,22 +698,24 @@ def validate(self, field, row_number, row): def bad(self): return self.invalid_dict - def generate(self, column, additional_column=None, additional_worksheet=None): + def generate(self, column, column_name, additional_column=None, additional_worksheet=None): + if self.skip_generation: + return None # GreaterThanOrEqual for validity with ODS. 
+ params = {"type": "custom", "allow_blank": self.empty_ok} + formulas = [] - params = {"type": "time", "allow_blank": self.empty_ok} - if (self.before is not None and self.after is not None): - params["formula1"] = parser.parse(self.after).strftime("%H:%M:%S") - params["formula2"] = parser.parse(self.before).strftime("%H:%M:%S") - params["operator"] = "between" - elif self.before is not None: - params["formula1"] = parser.parse(self.before).strftime("%H:%M:%S") - params["operator"] = "lessThanOrEqual" - elif self.after is not None: - params["formula1"] = parser.parse(self.after).strftime("%H:%M:%S") - params["operator"] = "greaterThanOrEqual" + formulas.append("IsNumber({}2)".format(column)) + if self.before is not None: + formulas.append('{}2<=TIMEVALUE("{}")'.format(column, parser.parse(self.before).time())) + if self.after is not None: + formulas.append('{}2>=TIMEVALUE("{}")'.format(column, parser.parse(self.after).time())) + + formula = self._format_formula(formulas, column) + params['formula1'] = formula dv = DataValidation(**params) + dv.error = self.describe(column_name) + " (using ':' as separators)" dv.add("{}2:{}1048576".format(column, column)) return dv @@ -628,12 +745,17 @@ def __init__(self, **kwargs): super(EmailValidator, self).__init__(**kwargs) def validate(self, field, row_number, row): + if self.skip_validation: + return None + if not self.empty_check: self._precheck_empty_ok_if(row) if self.ignore_space: field = field.strip() if field or not self._can_be_empty(row): + if self.na_ok and field.lower() in ['na', 'n/a']: + return try: validate_email(field) except EmailNotValidError as e: @@ -649,9 +771,14 @@ def validate(self, field, row_number, row): def bad(self): return self.invalid_dict - def generate(self, column, ontology_column=None): + def generate(self, column, column_name, ontology_column=None): + if self.skip_generation: + return None params = {"type": "custom", "allow_blank": self.empty_ok} - params["formula1"] = '=ISNUMBER(MATCH("*@*.?*",{}2,0))'.format(column) + formulas = ['ISNUMBER(MATCH("*@*.?*",{}2,0))'.format(column)] + formula = self._format_formula(formulas, column) + params['formula1'] = formula + dv = DataValidation(**params) dv.error = 'Value must be an email' dv.add("{}2:{}1048576".format(column, column)) @@ -680,15 +807,24 @@ def __init__(self, ontology, root_term="", **kwargs): raise BadValidatorException("'{}' is not a valid root term for ontology {}".format(self.root_term, self.ontology)) def validate(self, field, row_number, row): + if self.skip_validation: + return None + if not self.empty_check: self._precheck_empty_ok_if(row) if self.ignore_space: field = field.strip() + if self.ignore_case: + field = field.lower() + if field == "" and self._can_be_empty(row): return + if self.na_ok and field.lower() in ['na', 'n/a']: + return + if field in self.invalid_dict["invalid_set"]: self.invalid_dict["invalid_rows"].add(row_number) raise ValidationException("{} is not an ontological term".format(field)) @@ -709,8 +845,15 @@ def validate(self, field, row_number, row): def bad(self): return self.invalid_dict - def generate(self, column, additional_column, additional_worksheet): + def generate(self, column, column_name, additional_column, additional_worksheet): + if self.skip_generation: + return None terms = self._get_ontological_terms() + if self.empty_ok: + terms.add("") + if self.na_ok: + terms.add("N/A") + cell = additional_worksheet.cell(column=column_index_from_string(additional_column), row=1, value=self.ontology) cell.font = 
Font(color="FF0000", bold=True) row = 2 @@ -799,6 +942,8 @@ def __init__(self, unique_with=[], **kwargs): self.unique_values = set() self.unique_with = unique_with self.unique_check = False + # Disable this value just in case + self.unique = False def _precheck_unique_with(self, row): extra = set(self.unique_with) - set(row.keys()) @@ -807,12 +952,18 @@ def _precheck_unique_with(self, row): self.unique_check = True def validate(self, field, row_number, row): + if self.skip_validation: + return None + if not self.empty_check: self._precheck_empty_ok_if(row) if self.ignore_space: field = field.strip() + if self.ignore_case: + field = field.lower() + if not field: if self._can_be_empty(row): return @@ -821,6 +972,9 @@ def validate(self, field, row_number, row): "Field cannot be empty" ) + if self.na_ok and field.lower() in ['na', 'n/a']: + return + if self.unique_with and not self.unique_check: self._precheck_unique_with(row) @@ -843,16 +997,24 @@ def validate(self, field, row_number, row): def bad(self): return self.invalid_dict - def generate(self, column, column_dict): + def generate(self, column, column_name, column_dict): + if self.skip_generation: + return None if self.unique_with and not all([val in column_dict for val in self.unique_with]): raise BadValidatorException("Using unique_with, but the related column was not defined before") params = {"type": "custom", "allow_blank": self.empty_ok} - internal_value = "${0}:${0},{0}2".format(column) + internal_value = "${0}2:${0}1048576,{0}2".format(column) if self.unique_with: for col in self.unique_with: - internal_value += ",${0}:${0},{0}2".format(column_dict[col]) - params["formula1"] = '=COUNTIF({})<2'.format(internal_value) + internal_value += ",${0}2:${0}1048576,{0}2".format(column_dict[col]) + + formulas = [] + + formulas.append('COUNTIFS({})<2'.format(internal_value)) + formula = self._format_formula(formulas, column) + + params["formula1"] = formula dv = DataValidation(**params) dv.error = 'Value must be unique' dv.add("{}2:{}1048576".format(column, column)) @@ -892,15 +1054,24 @@ def __init__(self, root_term="", lang="en", labellang="en", vocab="thesaurus-inr raise BadValidatorException("'{}' is not a valid root term. 
Make sure it is a concept, and not a microthesaurus or group".format(self.root_term)) def validate(self, field, row_number, row): + if self.skip_validation: + return None + if not self.empty_check: self._precheck_empty_ok_if(row) if self.ignore_space: field = field.strip() + if self.ignore_case: + field = field.lower() + if field == "" and self._can_be_empty(row): return + if self.na_ok and field.lower() in ['na', 'n/a']: + return + if field in self.invalid_dict["invalid_set"]: self.invalid_dict["invalid_rows"].add(row_number) raise ValidationException("{} is not an ontological term".format(field)) @@ -922,7 +1093,9 @@ def validate(self, field, row_number, row): def bad(self): return self.invalid_dict - def generate(self, column, additional_column, additional_worksheet): + def generate(self, column, column_name, additional_column, additional_worksheet): + if self.skip_generation: + return None # No point in loading 15000 terms # No easy way to do it anyway if not self.root_term_iri: @@ -932,6 +1105,10 @@ def generate(self, column, additional_column, additional_worksheet): return None terms = self._get_vo_terms() + if self.empty_ok: + terms.add("") + if self.na_ok: + terms.add("N/A") if not terms: self.logger.warning( @@ -1027,15 +1204,24 @@ def __init__(self, regex, excel_formula="", **kwargs): raise BadValidatorException("'{}' is not a valid regular expression".format(self.regex)) def validate(self, field, row_number, row): + if self.skip_validation: + return None + if not self.empty_check: self._precheck_empty_ok_if(row) if self.ignore_space: field = field.strip() + if self.ignore_case: + field = field.lower() + if field == "" and self._can_be_empty(row): return + if self.na_ok and field.lower() in ['na', 'n/a']: + return + matches = re.findall(self.regex, field) if not len(matches) == 1: self.invalid_dict["invalid_set"].add(field) @@ -1051,26 +1237,26 @@ def validate(self, field, row_number, row): def bad(self): return self.invalid_dict - def generate(self, column): + def generate(self, column, column_name): + if self.skip_generation: + return None # Difficult to use regex in Excel without a VBA macro - if not self.excel_formula: - if self.unique: - params = {"type": "custom", "allow_blank": self.empty_ok} - internal_value = "${0}:${0},{0}2".format(column) - params["formula1"] = '=COUNTIF({})<2'.format(internal_value) - dv = DataValidation(**params) - dv.error = 'Value must be unique' - dv.add("{}2:{}1048576".format(column, column)) - return dv + params = {"type": "custom", "allow_blank": self.empty_ok} + formulas = [] + if self.excel_formula: + formulas.append(self.excel_formula.replace("{CNAME}", column)) + formula = self._format_formula(formulas, column) + + if not formula: self.logger.warning( "Warning: RegexValidator does not generate a validated column" ) return None - + params["formula1"] = formula - params = {"type": "custom"} - params["formula1"] = self.excel_formula.replace("{CNAME}", column) + dv = DataValidation(**params) - dv.error = 'Value must match validation' + dv.error = self.describe(column_name) dv.add("{}2:{}1048576".format(column, column)) return dv @@ -1102,6 +1288,9 @@ def __init__(self, format="DD", only_long=False, only_lat=False, **kwargs): self.only_lat = only_lat def validate(self, field, row_number, row): + if self.skip_validation: + return None + if not self.empty_check: self._precheck_empty_ok_if(row) @@ -1111,6 +1300,9 @@ def validate(self, field, row_number, row): if field == "" and self._can_be_empty(row): return + if self.na_ok and field.lower() in 
['na', 'n/a']: + return + if self.format == "DD": regex_lat = r"[-+]?((90(\.0+)?)|([1-8]?\d(\.\d+)?))[NSns]?" regex_long = r"[-+]?((180(\.0+)?)|(((1[0-7]\d)|([1-9]?\d))(\.\d+)?))[wWeE]?" @@ -1140,20 +1332,27 @@ def validate(self, field, row_number, row): def bad(self): return self.invalid_dict - def generate(self, column): + def generate(self, column, column_name): + if self.skip_generation: + return None # Difficult to use regex in Excel without a VBA macro - if self.unique: - params = {"type": "custom", "allow_blank": self.empty_ok} - internal_value = "${0}:${0},{0}2".format(column) - params["formula1"] = '=COUNTIF({})<2'.format(internal_value) - dv = DataValidation(**params) - dv.error = 'Value must be unique' - dv.add("{}2:{}1048576".format(column, column)) - return dv - self.logger.warning( - "Warning: GPSValidator does not generate a validated column" - ) - return None + formulas = [] + formula = self._format_formula(formulas, column) + params = {"type": "custom", "allow_blank": self.empty_ok} + + if not formula: + self.logger.warning( + "Warning: GPSValidator does not generate a validated column" + ) + return None + + params["formula1"] = formula + + dv = DataValidation(**params) + dv.error = self.describe(column_name) + dv.add("{}2:{}1048576".format(column, column)) + return dv def describe(self, column_name): if self.readme: diff --git a/setup.py b/setup.py index d41d29c..ee05258 100644 --- a/setup.py +++ b/setup.py @@ -5,7 +5,7 @@ setup( name="checkcel", - version='0.0.2', + version='0.0.3', description="Generate and validate tabulated/spreadsheet files", author="Mateo Boudet", author_email="mateo.boudet@inrae.fr", diff --git a/tests/test_validate_datetime.py b/tests/test_validate_datetime.py new file mode 100644 index 0000000..34e168a --- /dev/null +++ b/tests/test_validate_datetime.py @@ -0,0 +1,160 @@ +import pandas as pd + +from checkcel import Checkcel +from checkcel.validators import DateValidator, TimeValidator + + +class TestCheckcelValidateDate(): + + def test_invalid(self): + data = {'my_column': ['thisisnotadate', '1991/01/1991']} + validators = {'my_column': DateValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 2 + + def test_invalid_before(self): + data = {'my_column': ['01/01/2000', '10/10/2010']} + validators = {'my_column': DateValidator(before="05/05/2005")} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_after(self): + data = {'my_column': ['01/01/2000', '10/10/2010']} + validators = {'my_column': DateValidator(after="05/05/2005")} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_empty(self): + data = {'my_column': ['01/01/1970', '']} + validators = {'my_column': DateValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_na(self): + data = {'my_column': ['01/01/1970', 'na']} + validators = {'my_column': 
DateValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_unique(self): + data = {'my_column': ['01/01/1970', '01/01/1970']} + validators = {'my_column': DateValidator(unique=True)} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_valid_empty(self): + data = {'my_column': ['', '01/01/1970', '']} + validators = {'my_column': DateValidator(unique=True)} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, empty_ok=True, validators=validators) + assert val.validate() + + def test_valid_na(self): + data = {'my_column': ['01/01/1970', 'na', 'n/a']} + validators = {'my_column': DateValidator(na_ok=True)} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, empty_ok=True, validators=validators) + assert val.validate() + + def test_valid(self): + data = {'my_column': ['01/01/1970', '01-01-1970', '1970/01/01', '01 01 1970']} + validators = {'my_column': DateValidator()} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, validators=validators) + assert val.validate() + + +class TestCheckcelValidateTime(): + + def test_invalid(self): + data = {'my_column': ['thisisnotatime', '248:26']} + validators = {'my_column': TimeValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 2 + + def test_invalid_before(self): + data = {'my_column': ['14h23', '16h30']} + validators = {'my_column': TimeValidator(before="15h00")} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_after(self): + data = {'my_column': ['14h23', '16h30']} + validators = {'my_column': TimeValidator(after="15h00")} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_empty(self): + data = {'my_column': ['13h10', '']} + validators = {'my_column': TimeValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_na(self): + data = {'my_column': ['13h10', 'na']} + validators = {'my_column': TimeValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_unique(self): + data = {'my_column': ['13h10', '13h10']} + validators = {'my_column': TimeValidator(unique=True)} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_valid_empty(self): + data = {'my_column': ['', '13h10', '']} + validators = {'my_column': 
TimeValidator(unique=True)} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, empty_ok=True, validators=validators) + assert val.validate() + + def test_valid_na(self): + data = {'my_column': ['13h10', 'na', 'n/a']} + validators = {'my_column': TimeValidator(na_ok=True)} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, empty_ok=True, validators=validators) + assert val.validate() + + def test_valid(self): + data = {'my_column': ['13h10', '2h36PM']} + validators = {'my_column': TimeValidator()} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, validators=validators) + assert val.validate() diff --git a/tests/test_validate_gps.py b/tests/test_validate_gps.py new file mode 100644 index 0000000..37cc1bc --- /dev/null +++ b/tests/test_validate_gps.py @@ -0,0 +1,93 @@ +import pandas as pd + +from checkcel import Checkcel +from checkcel.validators import GPSValidator + + +class TestCheckcelValidateGPS(): + + def test_invalid_dd(self): + data = {'my_column': ['invalidvalue', '46.174181N 14.801100E']} + validators = {'my_column': GPSValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_dms(self): + data = {'my_column': ['invalidvalue', '45°45\'32.4"N 09°23\'39.9"E']} + validators = {'my_column': GPSValidator(format="DMS")} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + print(validation.failures['my_column']) + assert len(validation.failures['my_column']) == 1 + + def test_invalid_lat(self): + data = {'my_column': ['46.174181N', '46.174181N 14.801100E']} + validators = {'my_column': GPSValidator(only_lat=True)} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_long(self): + data = {'my_column': ['140.801100E', '46.174181N 14.801100E']} + validators = {'my_column': GPSValidator(only_long=True)} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + print(validation.failures['my_column']) + assert len(validation.failures['my_column']) == 1 + + def test_invalid_empty(self): + data = {'my_column': ['46.174181N 14.801100E', '']} + validators = {'my_column': GPSValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_na(self): + data = {'my_column': ['46.174181N 14.801100E', 'na']} + validators = {'my_column': GPSValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_unique(self): + data = {'my_column': ['46.174181N 14.801100E', '46.174181N 14.801100E']} + validators = {'my_column': GPSValidator(unique=True)} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert 
len(validation.failures['my_column']) == 1 + + def test_valid_empty(self): + data = {'my_column': ['', '46.174181N 14.801100E', '']} + validators = {'my_column': GPSValidator(unique=True)} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, empty_ok=True, validators=validators) + assert val.validate() + + def test_valid_na(self): + data = {'my_column': ['46.174181N 14.801100E', 'na', 'n/a']} + validators = {'my_column': GPSValidator(na_ok=True)} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, empty_ok=True, validators=validators) + assert val.validate() + + def test_valid(self): + data = {'my_column': ['46.174181N 14.801100E', '+87.174181 -140.801100E']} + validators = {'my_column': GPSValidator()} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, validators=validators) + assert val.validate() diff --git a/tests/test_validate_mail.py b/tests/test_validate_mail.py new file mode 100644 index 0000000..6871932 --- /dev/null +++ b/tests/test_validate_mail.py @@ -0,0 +1,64 @@ +import pandas as pd + +from checkcel import Checkcel +from checkcel.validators import EmailValidator + + +class TestCheckcelValidateMail(): + + def test_invalid(self): + data = {'my_column': ['invalidemail.emailprovider.com', 'invalidemail@emailprovidercom']} + validators = {'my_column': EmailValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 2 + + def test_invalid_empty(self): + data = {'my_column': ['', 'validemail@emailprovider.com']} + validators = {'my_column': EmailValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_na(self): + data = {'my_column': ['na', 'validemail@emailprovider.com']} + validators = {'my_column': EmailValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_unique(self): + data = {'my_column': ['validemail@emailprovider.com', 'validemail@emailprovider.com']} + validators = {'my_column': EmailValidator(unique=True)} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_valid_empty(self): + data = {'my_column': ['', 'validemail@emailprovider.com', '']} + validators = {'my_column': EmailValidator(unique=True)} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, empty_ok=True, validators=validators) + assert val.validate() + + def test_valid_na(self): + data = {'my_column': ['validemail@emailprovider.com', 'na', 'n/a']} + validators = {'my_column': EmailValidator(na_ok=True)} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, empty_ok=True, validators=validators) + assert val.validate() + + def test_valid(self): + data = {'my_column': ['validemail@emailprovider.com', 'valid2email@emailprovider.com']} + validators = {'my_column': EmailValidator()} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, validators=validators) + assert val.validate() diff --git a/tests/test_validate_number.py b/tests/test_validate_number.py new 
file mode 100644 index 0000000..a855f9f --- /dev/null +++ b/tests/test_validate_number.py @@ -0,0 +1,187 @@ +import pandas as pd + +from checkcel import Checkcel +from checkcel.validators import IntValidator, FloatValidator + + +class TestCheckcelValidateFloat(): + + def test_invalid_string(self): + data = {'my_column': ['notanumber']} + validators = {'my_column': FloatValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_empty(self): + data = {'my_column': ['', 6]} + validators = {'my_column': FloatValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_na(self): + data = {'my_column': ['na', 6]} + validators = {'my_column': FloatValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_unique(self): + data = {'my_column': [1, 1]} + validators = {'my_column': FloatValidator(unique=True)} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def invalid_min(self): + data = {'my_column': [6, 4]} + validators = {'my_column': FloatValidator(min=5)} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def invalid_max(self): + data = {'my_column': [6, 4]} + validators = {'my_column': FloatValidator(max=5)} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def invalid_both(self): + data = {'my_column': [8, 6.1, 5]} + validators = {'my_column': FloatValidator(max=7.5, min=5.5)} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 2 + + def test_valid_empty(self): + data = {'my_column': ['', 6, '']} + validators = {'my_column': FloatValidator(unique=True)} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, empty_ok=True, validators=validators) + assert val.validate() + + def test_valid_na(self): + data = {'my_column': ['na', 6, 'n/a']} + validators = {'my_column': FloatValidator(na_ok=True)} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, empty_ok=True, validators=validators) + assert val.validate() + + def test_valid(self): + data = {'my_column': [6, 4, "9.0"]} + validators = {'my_column': FloatValidator()} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, validators=validators) + assert val.validate() + + +class TestCheckcelValidateInt(): + + def test_invalid_string(self): + data = {'my_column': ['notanumber']} + validators = {'my_column': IntValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert 
len(validation.failures['my_column']) == 1 + + def test_invalid_float(self): + data = {'my_column': ['4.8']} + validators = {'my_column': IntValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_unique(self): + data = {'my_column': [1, 1]} + validators = {'my_column': IntValidator(unique=True)} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_empty(self): + data = {'my_column': ['', 6]} + validators = {'my_column': IntValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def test_invalid_na(self): + data = {'my_column': ['na', 6]} + validators = {'my_column': IntValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, empty_ok=False, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def invalid_min(self): + data = {'my_column': [6, 4]} + validators = {'my_column': IntValidator(min=5)} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def invalid_max(self): + data = {'my_column': [6, 4]} + validators = {'my_column': IntValidator(max=5)} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 1 + + def invalid_both(self): + data = {'my_column': [8, 6, 4]} + validators = {'my_column': IntValidator(max=7, min=5)} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, validators=validators) + val = validation.validate() + assert val is False + assert len(validation.failures['my_column']) == 2 + + def test_valid_empty(self): + data = {'my_column': ['', 6, '']} + validators = {'my_column': IntValidator(unique=True)} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, empty_ok=True, validators=validators) + assert val.validate() + + def test_valid_na(self): + data = {'my_column': ['na', 6, 'n/a']} + validators = {'my_column': IntValidator(na_ok=True)} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, empty_ok=True, validators=validators) + assert val.validate() + + def test_valid(self): + data = {'my_column': [6, 4, "9"]} + validators = {'my_column': IntValidator()} + df = pd.DataFrame.from_dict(data) + val = Checkcel(data=df, validators=validators) + assert val.validate() diff --git a/tests/test_validate_params.py b/tests/test_validate_params.py new file mode 100644 index 0000000..a04469e --- /dev/null +++ b/tests/test_validate_params.py @@ -0,0 +1,107 @@ +import pandas as pd + +from checkcel import Checkcel +from checkcel.validators import TextValidator + + +class TestCheckcelClass(): + + def test_invalid_rows_below(self): + data = {'my_column': ['myvalue', 'my_value2']} + validators = {'my_column': TextValidator()} + df = pd.DataFrame.from_dict(data) + validation = Checkcel(data=df, expected_rows=1, validators=validators) + val = 
+        assert val is False
+        assert len(validation.logs) == 2
+        assert validation.logs[1] == "Error: Length issue: Expecting 1 row(s), found 2"
+
+    def test_invalid_rows_above(self):
+        data = {'my_column': ['myvalue']}
+        validators = {'my_column': TextValidator()}
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, expected_rows=2, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.logs) == 2
+        assert validation.logs[1] == "Error: Length issue: Expecting 2 row(s), found 1"
+
+
+class TestCheckcelValidateEmpty_if():
+
+    def test_invalid_string(self):
+        data = {'my_column': ["", "not_empty"], "another_column": ["", ""]}
+        validators = {
+            'my_column': TextValidator(empty_ok=True),
+            'another_column': TextValidator(empty_ok_if="my_column")
+        }
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['another_column']) == 1
+
+    def test_invalid_list(self):
+        data = {'my_column': ["", "", "not_empty", "not_empty"], 'my_column2': ["", "not_empty", "", "not_empty"], "another_column": ["", "", "", ""]}
+        validators = {
+            'my_column': TextValidator(empty_ok=True),
+            'my_column2': TextValidator(empty_ok=True),
+            'another_column': TextValidator(empty_ok_if=["my_column", "my_column2"])
+        }
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['another_column']) == 3
+
+    def test_invalid_dict(self):
+        data = {'my_column': ["", "invalid_value", "valid_value"], "another_column": ["", "", ""]}
+        validators = {
+            'my_column': TextValidator(empty_ok=True),
+            'another_column': TextValidator(empty_ok_if={"my_column": ["valid_value"]})
+        }
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['another_column']) == 2
+
+
+class TestCheckcelValidateEmpty_unless():
+
+    def test_invalid_string(self):
+        data = {'my_column': ["", "not_empty"], "another_column": ["", ""]}
+        validators = {
+            'my_column': TextValidator(empty_ok=True),
+            'another_column': TextValidator(empty_ok_unless="my_column")
+        }
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['another_column']) == 1
+
+    def test_invalid_list(self):
+        data = {'my_column': ["", "", "not_empty", "not_empty"], 'my_column2': ["", "not_empty", "", "not_empty"], "another_column": ["", "", "", ""]}
+        validators = {
+            'my_column': TextValidator(empty_ok=True),
+            'my_column2': TextValidator(empty_ok=True),
+            'another_column': TextValidator(empty_ok_unless=["my_column", "my_column2"])
+        }
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['another_column']) == 3
+
+    def test_invalid_dict(self):
+        data = {'my_column': ["", "invalid_value", "valid_value"], "another_column": ["", "", ""]}
+        validators = {
+            'my_column': TextValidator(empty_ok=True),
+            'another_column': TextValidator(empty_ok_unless={"my_column": ["invalid_value"]})
+        }
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['another_column']) == 1
diff --git a/tests/test_validate_regex.py b/tests/test_validate_regex.py
new file mode 100644
index 0000000..50611fd
--- /dev/null
+++ b/tests/test_validate_regex.py
@@ -0,0 +1,64 @@
+import pandas as pd
+
+from checkcel import Checkcel
+from checkcel.validators import RegexValidator
+
+
+class TestCheckcelValidateRegex():
+
+    def test_invalid(self):
+        data = {'my_column': ['ABC', 'AFX123']}
+        validators = {'my_column': RegexValidator(regex="AFX.*")}
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['my_column']) == 1
+
+    def test_invalid_empty(self):
+        data = {'my_column': ['', 'AFX123']}
+        validators = {'my_column': RegexValidator(regex="AFX.*")}
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['my_column']) == 1
+
+    def test_invalid_na(self):
+        data = {'my_column': ['na', 'AFX123']}
+        validators = {'my_column': RegexValidator(regex="AFX.*")}
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['my_column']) == 1
+
+    def test_invalid_unique(self):
+        data = {'my_column': ['AFX123', 'AFX123']}
+        validators = {'my_column': RegexValidator(unique=True, regex="AFX.*")}
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['my_column']) == 1
+
+    def test_valid_empty(self):
+        data = {'my_column': ['', 'AFX123', '']}
+        validators = {'my_column': RegexValidator(unique=True, regex="AFX.*")}
+        df = pd.DataFrame.from_dict(data)
+        val = Checkcel(data=df, empty_ok=True, validators=validators)
+        assert val.validate()
+
+    def test_valid_na(self):
+        data = {'my_column': ['na', 'AFX123', 'n/a']}
+        validators = {'my_column': RegexValidator(na_ok=True, regex="AFX.*")}
+        df = pd.DataFrame.from_dict(data)
+        val = Checkcel(data=df, empty_ok=True, validators=validators)
+        assert val.validate()
+
+    def test_valid(self):
+        data = {'my_column': ['AFX123', 'AFX456']}
+        validators = {'my_column': RegexValidator(regex="AFX.*")}
+        df = pd.DataFrame.from_dict(data)
+        val = Checkcel(data=df, validators=validators)
+        assert val.validate()
diff --git a/tests/test_validate_set.py b/tests/test_validate_set.py
new file mode 100644
index 0000000..fa2e09e
--- /dev/null
+++ b/tests/test_validate_set.py
@@ -0,0 +1,145 @@
+import pandas as pd
+
+from checkcel import Checkcel
+from checkcel.validators import SetValidator, LinkedSetValidator
+
+
+class TestCheckcelValidateSet():
+
+    def test_invalid(self):
+        data = {'my_column': ['invalid_value', 'valid_value']}
+        validators = {'my_column': SetValidator(valid_values=["valid_value"])}
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['my_column']) == 1
+
+    def test_invalid_empty(self):
+        data = {'my_column': ['valid_value', '']}
+        validators = {'my_column': SetValidator(valid_values=["valid_value"])}
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['my_column']) == 1
+
+    def test_invalid_na(self):
+        data = {'my_column': ['valid_value', 'na']}
+        validators = {'my_column': SetValidator(valid_values=["valid_value"])}
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['my_column']) == 1
+
+    def test_invalid_unique(self):
+        data = {'my_column': ['valid_value', 'valid_value']}
+        validators = {'my_column': SetValidator(unique=True, valid_values=["valid_value"])}
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['my_column']) == 1
+
+    def test_valid_empty(self):
+        data = {'my_column': ['', 'valid_value', '']}
+        validators = {'my_column': SetValidator(unique=True, valid_values=["valid_value"])}
+        df = pd.DataFrame.from_dict(data)
+        val = Checkcel(data=df, empty_ok=True, validators=validators)
+        assert val.validate()
+
+    def test_valid_na(self):
+        data = {'my_column': ['na', 'valid_value', 'n/a']}
+        validators = {'my_column': SetValidator(na_ok=True, valid_values=["valid_value"])}
+        df = pd.DataFrame.from_dict(data)
+        val = Checkcel(data=df, empty_ok=True, validators=validators)
+        assert val.validate()
+
+    def test_valid(self):
+        data = {'my_column': ["valid_value1", "valid_value2"]}
+        validators = {'my_column': SetValidator(valid_values=["valid_value1", "valid_value2"])}
+        df = pd.DataFrame.from_dict(data)
+        val = Checkcel(data=df, validators=validators)
+        assert val.validate()
+
+
+class TestCheckcelValidateLinkedSet():
+
+    def test_invalid(self):
+        data = {'my_column': ['value_1', 'value_2'], "another_column": ["valid_value", "invalid_value"]}
+        validators = {
+            'my_column': SetValidator(valid_values=['value_1', 'value_2']),
+            'another_column': LinkedSetValidator(linked_column="my_column", valid_values={"value_1": ["valid_value"], "value_2": ["another_valid_value"]})
+        }
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['another_column']) == 1
+
+    def test_invalid_empty(self):
+        data = {'my_column': ['value_1', 'value_2', 'value_2'], "another_column": ["valid_value", "another_valid_value", ""]}
+        validators = {
+            'my_column': SetValidator(valid_values=['value_1', 'value_2']),
+            'another_column': LinkedSetValidator(linked_column="my_column", valid_values={"value_1": ["valid_value"], "value_2": ["another_valid_value"]})
+        }
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['another_column']) == 1
+
+    def test_invalid_na(self):
+        data = {'my_column': ['value_1', 'value_2', 'value_2'], "another_column": ["valid_value", "another_valid_value", "na"]}
+        validators = {
+            'my_column': SetValidator(valid_values=['value_1', 'value_2']),
+            'another_column': LinkedSetValidator(linked_column="my_column", valid_values={"value_1": ["valid_value"], "value_2": ["another_valid_value"]})
+        }
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['another_column']) == 1
+
+    def test_invalid_unique(self):
+        data = {'my_column': ['value_1', 'value_2', 'value_2'], "another_column": ["valid_value", "another_valid_value", "another_valid_value"]}
+        validators = {
+            'my_column': SetValidator(valid_values=['value_1', 'value_2']),
+            'another_column': LinkedSetValidator(unique=True, linked_column="my_column", valid_values={"value_1": ["valid_value"], "value_2": ["another_valid_value"]})
+        }
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['another_column']) == 1
+
+    def test_valid_empty(self):
+        data = {'my_column': ['value_1', 'value_2', 'value_2', 'value_2'], "another_column": ["valid_value", "another_valid_value", "", ""]}
+        validators = {
+            'my_column': SetValidator(valid_values=['value_1', 'value_2']),
+            'another_column': LinkedSetValidator(unique=True, linked_column="my_column", valid_values={"value_1": ["valid_value"], "value_2": ["another_valid_value"]})
+        }
+        df = pd.DataFrame.from_dict(data)
+        val = Checkcel(data=df, empty_ok=True, validators=validators)
+        assert val.validate()
+
+    def test_valid_na(self):
+        data = {'my_column': ['value_1', 'value_2', 'value_2', 'value_2'], "another_column": ["valid_value", "another_valid_value", "na", "n/a"]}
+        validators = {
+            'my_column': SetValidator(valid_values=['value_1', 'value_2']),
+            'another_column': LinkedSetValidator(na_ok=True, linked_column="my_column", valid_values={"value_1": ["valid_value"], "value_2": ["another_valid_value"]})
+        }
+        df = pd.DataFrame.from_dict(data)
+        val = Checkcel(data=df, empty_ok=True, validators=validators)
+        assert val.validate()
+
+    def test_valid(self):
+        data = {'my_column': ['value_1', 'value_2', 'value_2'], "another_column": ["valid_value", "another_valid_value", "another_valid_value"]}
+        validators = {
+            'my_column': SetValidator(valid_values=['value_1', 'value_2']),
+            'another_column': LinkedSetValidator(linked_column="my_column", valid_values={"value_1": ["valid_value"], "value_2": ["another_valid_value"]})
+        }
+        df = pd.DataFrame.from_dict(data)
+        val = Checkcel(data=df, validators=validators)
+        assert val.validate()
diff --git a/tests/test_validate_unique.py b/tests/test_validate_unique.py
new file mode 100644
index 0000000..e948d55
--- /dev/null
+++ b/tests/test_validate_unique.py
@@ -0,0 +1,71 @@
+import pandas as pd
+
+from checkcel import Checkcel
+from checkcel.validators import UniqueValidator, NoValidator
+
+
+class TestCheckcelValidateUnique():
+
+    def test_invalid(self):
+        data = {'my_column': ['notunique', 'notunique']}
+        validators = {'my_column': UniqueValidator()}
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['my_column']) == 1
+
+    def test_invalid_empty(self):
+        data = {'my_column': ['unique', '']}
+        validators = {'my_column': UniqueValidator()}
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['my_column']) == 1
+
+    def test_invalid_na(self):
+        data = {'my_column': ['unique', 'na', 'na']}
+        validators = {'my_column': UniqueValidator()}
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['my_column']) == 1
+
+    def test_invalid_multiple(self):
+        data = {'my_column': ['unique1', 'unique1'], 'another_column': ['val2', 'val2']}
+        validators = {'my_column': UniqueValidator(unique_with=["another_column"]), 'another_column': NoValidator()}
+        df = pd.DataFrame.from_dict(data)
+        validation = Checkcel(data=df, empty_ok=False, validators=validators)
+        val = validation.validate()
+        assert val is False
+        assert len(validation.failures['my_column']) == 1
+
+    def test_valid_empty(self):
+        data = {'my_column': ['', 'unique']}
+        validators = {'my_column': UniqueValidator()}
+        df = pd.DataFrame.from_dict(data)
+        val = Checkcel(data=df, empty_ok=True, validators=validators)
+        assert val.validate()
+
+    def test_valid_na(self):
+        data = {'my_column': ['na', 'unique', 'na']}
+        validators = {'my_column': UniqueValidator(na_ok=True)}
+        df = pd.DataFrame.from_dict(data)
+        val = Checkcel(data=df, empty_ok=True, validators=validators)
+        assert val.validate()
+
+    def test_valid(self):
+        data = {'my_column': ['unique1', 'unique2']}
+        validators = {'my_column': UniqueValidator()}
+        df = pd.DataFrame.from_dict(data)
+        val = Checkcel(data=df, validators=validators)
+        assert val.validate()
+
+    def test_valid_multiple(self):
+        data = {'my_column': ['unique1', 'unique1'], 'another_column': ['val1', 'val2']}
+        validators = {'my_column': UniqueValidator(unique_with=["another_column"]), 'another_column': NoValidator()}
+        df = pd.DataFrame.from_dict(data)
+        val = Checkcel(data=df, validators=validators)
+        assert val.validate()
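
For quick reference outside the patch, the sketch below shows the programmatic flow these tests exercise: a pandas DataFrame, one validator per column, then `failures` and `logs` inspected after `validate()`. The column names and values are illustrative only, not taken from the repository.

```python
# Minimal sketch of the validation API used throughout the new tests.
# Assumption: column names ('age', 'status') and data are made up for illustration.
import pandas as pd

from checkcel import Checkcel
from checkcel.validators import IntValidator, SetValidator

data = {
    'age': [25, 'notanumber', 42],
    'status': ['valid_value', 'valid_value', 'invalid_value'],
}
validators = {
    'age': IntValidator(min=0, max=120),
    'status': SetValidator(valid_values=['valid_value']),
}

validation = Checkcel(data=pd.DataFrame.from_dict(data), empty_ok=False, validators=validators)
if not validation.validate():
    # failures maps column names to the offending cells; logs holds the messages
    print(validation.failures)
    print(validation.logs)
```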