Support foreign keys property #84

ChasNelson1990 · 2023-01-31T11:19:15Z

Context:

We have multiple CKAN datasets that are composed of a set of resources.
In each of these datasets, one of these resources is a list of geographical regions.
All other resources in these datasets include a column for geographical region.
We want to use the foreignKeys prop (https://specs.frictionlessdata.io/table-schema/#foreign-keys) in frictionless to ensure that the choices in said column are present, and only present, in the geographical regions resource.

Good news:

This is supported upstream by frictionless 🎉

Less good news:

It requires treating two tables and two schemas as one Data Package

Idea No. 1:

One way to implement this is that the ckanext-validation could check schemas for foreignKeys and, if found, bundle both tables and both Table Schemas and send it all to frictionless's validate with type=package, e.g. we could have a _validate_data_package to match https://github.com/frictionlessdata/ckanext-validation/blob/9c23581d34289536139cc81f7b7ddb756dab64f6/ckanext/validation/jobs.py#L139

Idea No. 2:

An alternative is to have an entirely parallel set of package actions, e.g. https://github.com/frictionlessdata/ckanext-validation/blob/9c23581d34289536139cc81f7b7ddb756dab64f6/ckanext/validation/logic.py#L68 becomes package_validation_run and in these actions we treat the whole CKAN dataset (plus schemas) as a frictionless Data Package

We have a small amount of time to dedicate to this and would like to make a change that can be merged here (rather than maintaining our own fork) and so we'd really appreciate thoughts and ideas from existing maintainers and contributors.

All thoughts welcome 😄

The text was updated successfully, but these errors were encountered:

aivuk · 2023-01-31T15:53:29Z

Hi @ChasNelson1990, thanks for your idea and it makes sense to have this feature. We are currently working in a new update for the ckanext-validation that is going to be released still this week, then I would suggest you to first wait for the PR to be merged before starting working on it.

Regarding how to implement it, I still need to think a bit on the best approach but I'm prone to the idea no. 1. Let's see what @amercader has to say about it.

amercader · 2023-02-01T10:24:27Z

@ChasNelson1990 this certainly sounds interesting and I'm also leaning towards option 1, but I'm a bit unclear on how we would bundle both tables. Off the top of my head maybe something like this could work:

Looking at the foreignKeys spec, the outer resource is referenced by the name of an existing resource in the data package. In the context of CKAN I guess this could be the id of the resource that holds eg the geographic descriptor (or even a dataset_id:resource_id variant). Down the line there could be an UI to assign these in the schema editor but I wouldn't worry about it now.
Before running the validation job, we check if there are foreign keys and we get the metadata details of the resources needed (we need at least the url, is there any other property from the outer resource we need?)
We construct a virtual datapackage that include both resources (the original being validated and the referenced one(s)) and run the validation with type=package as you suggest

Things to consider:

What happens if the reference does not work (resource was deleted, wrong fields defined, etc)
Does validating FKs has an impact on the generated reports? can we still use the same current component?

@ChasNelson1990 let me know if this makes sense or I missed something

ChasNelson1990 · 2023-02-02T11:42:44Z

Thanks for the replies guys. I think everything makes sense and it feels like we are on the same page.

For some greater context @amercader here is what Fjelltopp do in our forks.

We have something like the following in our schema:

    "foreignKeys": [
        {
            "fields": "area_id",
            "reference": {
                "resource": "3_geojson_inputs",
                "fields": "area_id"
            }
        }
    ],

Where resource is the schema name of the other resource within the dataset.

If I am remembering correctly this may depend on some changes in our fork of ckanext-scheming allowing us to define a package schema with required resources and point to unique schemas for each resource, e.g. https://github.com/fjelltopp/unaids_data_specifications/blob/eb97a7d956936e3bfa6f3fe36d869249a8c4c525/package_schemas/country_estimates/3_country_estimates_23.json#L125. Those fork changes should not be necessary to enable foreign key validation here though.

We then check if the schema has foreignKeys and then get the appropriate data and pass it to (here) goodtables, e.g. https://github.com/fjelltopp/ckanext-validation/blob/8672ad5fa42405f0e8aaa40204cdec8676df13c9/ckanext/validation/jobs.py#L155. We know that the way we are doing this is very slow and we have never optimised this (and it still uses goodtables).

As you say, we could just provide a repeatable identifier that allows us to infer the resource and pass a 'virtual' data package to frictionless.

What happens if the reference does not work (resource was deleted, wrong fields defined, etc)

When we do a resource validation with frictionless it does a foreign key check and throws a few errors we could catch: https://github.com/frictionlessdata/framework/blob/ec2d8336dfa32f2396b38e4f33e3a7e9e4ff415a/frictionless/resource/resource.py#L718. This would cover a resource not being in the package, e.g. because it has been deleted? We would then want to return that as a "Warning" I would guess as that seems to be where we keep framework issues as opposed to validation issues?

And, while I am not sure I can find the line, I would presume it validates the schema as part of validating a resource? Because schema validation throws a known error if wrong fields are defined: https://github.com/frictionlessdata/framework/blob/ec2d8336dfa32f2396b38e4f33e3a7e9e4ff415a/frictionless/schema/schema.py#L350. Again this could be caught and returned as a warning.

Does validating FKs has an impact on the generated reports? can we still use the same current component?

This I have no idea about - is there an easy way to test / explore this?

tomeksabala · 2023-02-23T08:53:52Z

Hello @amercader and @aivuk . I'm working togheter with @ChasNelson1990 on the issue. I'd like to bump the issue a little bit and make sure we're on the same page about:

@amercader so would you agree that presented option 1 is a sensible solution? We're happy to create a draft PR with it for further review to continue a more detailed discussion. Adria, could you also confirm Chas is properly adressing your comments from earlier?
@aivuk What's the current timeline for finalizing Feature/80 new upload widget #83 ? We'd like to branch our work from this PR and would like to avoid possible rebase conflicts.

amercader · 2023-02-23T14:01:19Z

@tomeksabala Yes, option 1 is the more sensible and @ChasNelson1990 comments make sense, although the part that is less clear to me is the "get the appropriate data and pass it to goodtables". IIUC the proposed approach, we would somehow create a virtual datapackage that links to all tables involved as resources and let frictionless deal with the data loading.

But as you say all this would be much easier to discuss with actual code so a draft PR would be great.

With regards to #83 we are actively working on it so it's hard to give a timeline but the goal is to finish the uploader work in the next couple of weeks, merge to master and then carry out further work (schema editor ec) as new feature branches. In any case most of the work there affects the UI and custom endpoints, we are not touching the jobs module which is where I expect most of your work to be done so you can work of current master if you want.

kforenc · 2023-03-24T12:24:45Z

Hi @amercader and @aivuk

I'm one of the Fjelltopp team that is working on this feature.

We've spent sometime thinking this through and we'd like to get your opinion before we write any code. The plan is as follows:

The user can specify the foreignKey reference as a schema in one of 3 ways: URL, JSON object or the filename of a schema from a known directory. For the latter, we would introduce 'validation schema files' residing in some directory (e.g. "ckanext.validation.validation_schemas_directory"). This also allows reusing of schemas and is something we used heavily in our old fork.
To find the referenced resource we look at all resources in parent dataset for one that has the referenced schema (but only as a filename or a URL) - exactly as @ChasNelson1990 described above. Finding none or more than 1 resource would return an error.
We substitute all filename / URL references within the original schema with actual content and modify foreignKey to point to the generated name of the found resource so frictionless gets all the schema inlined. We pack everything into a virtual Package, validate it in frictionless and return the report.

We'd be really grateful to know if it makes sense for you :).

ChasNelson1990 · 2023-05-30T08:51:55Z

Hi @amercader @aivuk - have you guys seen the PR I created for this - #86?

Would be great to get some feedback and thoughts.

amercader · 2023-05-30T11:41:37Z

Sorry @ChasNelson1990 , we have very limited bandwidth right now but I'll try to look at this probably next week

ChasNelson1990 · 2023-05-30T11:47:48Z

No worries @amercader - we appreciate bandwidth issues! Next week would be great :-)

ChasNelson1990 mentioned this issue May 12, 2023

ADX-960 foreign key validation fjelltopp/ckanext-validation#98

Closed

17 tasks

ChasNelson1990 linked a pull request May 24, 2023 that will close this issue

Support foreign key validation #86

Open

ChasNelson1990 mentioned this issue Aug 8, 2023

ADX-949 ADX-960 ckanext-validation upgrade fjelltopp/ckanext-validation#103

Merged

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support foreign keys property #84

Support foreign keys property #84

ChasNelson1990 commented Jan 31, 2023

aivuk commented Jan 31, 2023

amercader commented Feb 1, 2023 •

edited

Loading

ChasNelson1990 commented Feb 2, 2023

tomeksabala commented Feb 23, 2023

amercader commented Feb 23, 2023

kforenc commented Mar 24, 2023 •

edited

Loading

ChasNelson1990 commented May 30, 2023

amercader commented May 30, 2023

ChasNelson1990 commented May 30, 2023

Support foreign keys property #84

Support foreign keys property #84

Comments

ChasNelson1990 commented Jan 31, 2023

aivuk commented Jan 31, 2023

amercader commented Feb 1, 2023 • edited Loading

ChasNelson1990 commented Feb 2, 2023

tomeksabala commented Feb 23, 2023

amercader commented Feb 23, 2023

kforenc commented Mar 24, 2023 • edited Loading

ChasNelson1990 commented May 30, 2023

amercader commented May 30, 2023

ChasNelson1990 commented May 30, 2023

amercader commented Feb 1, 2023 •

edited

Loading

kforenc commented Mar 24, 2023 •

edited

Loading