Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SchemaValidationError relevance heuristic is flawed #3752

Open
dangotbanned opened this issue Jan 6, 2025 · 0 comments
Open

SchemaValidationError relevance heuristic is flawed #3752

dangotbanned opened this issue Jan 6, 2025 · 0 comments
Labels
bug has-repro Issues with a Minimal, Reproducible Example maintenance

Comments

@dangotbanned
Copy link
Member

dangotbanned commented Jan 6, 2025

What happened?

Note

Originally wrote as comment in #2913, but this will require more work to fix

While debugging some tests for (#3750), I've begun to think the underlying assumption here is flawed:

...validated against multiple schemas and its parent is a common anyOf validator.
The error messages produced from these cases are usually
very similar and we just take the shortest one.

schemapi._deduplicate_additional_properties_errors

def _deduplicate_additional_properties_errors(
errors: ValidationErrorList,
) -> ValidationErrorList:
"""
If there are multiple additional property errors it usually means that the offending element was validated against multiple schemas and its parent is a common anyOf validator.
The error messages produced from these cases are usually
very similar and we just take the shortest one. For example,
the following 3 errors are raised for the `unknown` channel option in
`alt.X("variety", unknown=2)`:
- "Additional properties are not allowed ('unknown' was unexpected)"
- "Additional properties are not allowed ('field', 'unknown' were unexpected)"
- "Additional properties are not allowed ('field', 'type', 'unknown' were unexpected)".
"""
if len(errors) > 1:
# Test if all parent errors are the same anyOf error and only do
# the prioritization in these cases. Can't think of a chart spec where this
# would not be the case but still allow for it below to not break anything.
parent = errors[0].parent
if (
parent is not None
and parent.validator == "anyOf"
# Use [1:] as don't have to check for first error as it was used
# above to define `parent`
and all(err.parent is parent for err in errors[1:])
):
errors = [min(errors, key=lambda x: len(x.message))]
return errors

This assumption conflicts directly with the channel wrappers, since the "anyOf" is how they are differentiated.

def generate_vegalite_channel_wrappers(fp: Path, /) -> ModuleDef[list[str]]:

Examples

These are all tested against (8fb2661)

import altair as alt

chart = alt.Chart().mark_point()

one_err = alt.value(1, bin=True)
two_err = alt.value(1, bin=True, axis=1)
two_err_2 = alt.value(1, bin=True, aggregate="sum")
three_err = alt.value(1, bin=True, axis=1, aggregate="sum")

We're trying to match alt.YValue.
The most helpful thing would be identifying all of the invalid properties.

Naively picking the shortest message is an unfit heuristic.
This can be observed by seeing how the final message changes in an unpredicatble way, simply by adding more invalid properties:

>>> chart.encode(y=one_err)
SchemaValidationError: `YValue` has no parameter named 'bin'
Existing parameter names are:
value

>>> chart.encode(y=two_err)
SchemaValidationError: '1' is an invalid value for `axis`. Valid values are of type `Mapping[str, Any] | None`.

>>> chart.encode(y=two_err_2)
SchemaValidationError: `YValue` has no parameter named 'value'
Existing parameter names are:
value

>>> chart.encode(y=three_err)
SchemaValidationError: '1' is an invalid value for `axis`. Valid values are of type `Mapping[str, Any] | None`.

Exploring the error tree

Diving into the error tree for two_err_2 gives us some more clarity:

Capturing an error

import altair as alt
from altair.utils.schemapi import SchemaValidationError

two_err_2 = alt.value(1, bin=True, aggregate="sum")
try:
    alt.Chart().mark_point().encode(y=two_err_2).to_dict()
except SchemaValidationError as err:
    some = err

Spec being validated

>>> some.instance
{'value': 1, 'bin': True, 'aggregate': 'sum'}

altair transformed message

Since (ea647eb) we now at least get YValue included:

>>> some.message
"`YValue` has no parameter named 'value'\n\nExisting parameter names are:\nvalue   \n\nSee the help for `YValue` to read the full description of these parameters"

The first error

Backtracking to what the message originally looked like from jsonschema.

I think it is pretty clear this is not the error we want:

>>> some._original_message 
"Additional properties are not allowed ('value' was unexpected)"

Parent message is more useful

Going back one level (.parent) starts to make more sense:

>>> some.parent.message
"{'value': 1, 'bin': True, 'aggregate': 'sum'} is not valid under any of the given schemas"

Significant backtracking

It is still possible to undo all of the work from validate_jsonschema

schemapi.validate_jsonschema

def validate_jsonschema(
spec,
schema: dict[str, Any],
rootschema: dict[str, Any] | None = None,
*,
raise_error: bool = True,
) -> jsonschema.exceptions.ValidationError | None:
"""
Validates the passed in spec against the schema in the context of the rootschema.
If any errors are found, they are deduplicated and prioritized
and only the most relevant errors are kept. Errors are then either raised
or returned, depending on the value of `raise_error`.
"""
errors = _get_errors_from_spec(spec, schema, rootschema=rootschema)
if errors:
leaf_errors = _get_leaves_of_error_tree(errors)
grouped_errors = _group_errors_by_json_path(leaf_errors)
grouped_errors = _subset_to_most_specific_json_paths(grouped_errors)
grouped_errors = _deduplicate_errors(grouped_errors)
# Nothing special about this first error but we need to choose one
# which can be raised
main_error: Any = next(iter(grouped_errors.values()))[0]
# All errors are then attached as a new attribute to ValidationError so that
# they can be used in SchemaValidationError to craft a more helpful
# error message. Setting a new attribute like this is not ideal as
# it then no longer matches the type ValidationError. It would be better
# to refactor this function to never raise but only return errors.
main_error._all_errors = grouped_errors
if raise_error:
raise main_error
else:
return main_error
else:
return None

I wouldn't want to solve the problem this way, but an example of backtracking to a useful context:

tree_nav = {
    err.parent.schema["anyOf"][idx]["$ref"].removeprefix("#/definitions/"): err.message
    for idx, err in enumerate(some.parent.context)
}
>>> tree_nav
{'PositionFieldDef': "Additional properties are not allowed ('value' was unexpected)",
 'PositionDatumDef': "Additional properties are not allowed ('aggregate', 'bin', 'value' were unexpected)",
 'PositionValueDef': "Additional properties are not allowed ('aggregate', 'bin' were unexpected)"}

So the hacky way to do this would be replacing the error we got from 'PositionFieldDef' with the one for 'PositionValueDef'.
For our {'value': 1, 'bin': True, 'aggregate': 'sum'}, we identify the two extra invalid properties "...('aggregate', 'bin' were unexpected)"

What would you like to happen instead?

I think #3750 offers an improvement, but really we need to fix this before raising in validate_jsonschema.
By that stage, we've dropped errors that are better candidates which makes fixing it later much more difficult.

Maybe this would be something to pick up when revisiting #3547, as that PR contains a lot of refactoring/redocumenting in tools.schemapi.schemapi.py

Which version of Altair are you using?

5.6.0dev

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug has-repro Issues with a Minimal, Reproducible Example maintenance
Projects
None yet
Development

No branches or pull requests

1 participant