
As a user, I want custom feature correlation to happen by default #277

Open

epag opened this issue Aug 21, 2024 · 24 comments

Comments


epag commented Aug 21, 2024


Author Name: James (James)
Original Redmine Issue: 96602, https://vlab.noaa.gov/redmine/issues/96602
Original Date: 2021-09-23


Given an evaluation where the feature correlation dimension is unknown
When the evaluation proceeds to feature correlation
Then it should assume a `custom` feature dimension
So that a user's evaluation can succeed without the awkward declaration, `featureDimension="custom"`


Redmine related issue(s): 97188, 98836, 104397, 104482



epag commented Aug 21, 2024


Original Redmine Comment
Author Name: James (James)
Original Date: 2021-09-23T10:47:02Z


Discussion from #95870-15 - #95870-21.

Should be sufficient to qualify in the declaration schema, I think, since `featureDimension` is an XML attribute, which admits a default.

        <xs:attribute name="featureDimension" type="featureDimension" default="custom" />


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: James (James)
Original Date: 2022-05-04T14:03:52Z


This is probably a simple change that we should make soon. I don't know whether it will break other things, though. I hope not, but I have a side-eye on #98836, just based on the words that ticket contains rather than any real analysis of it. (Edit: in other words, perhaps it is orthogonal, perhaps not.)


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: James (James)
Original Date: 2022-05-04T16:41:55Z


This should be a simple thing. I guess we'll find out. But my expectation is that #96602-2 will allow scenario003 to succeed when declared like this:

    <inputs>
        <right>
            <type>ensemble forecasts</type>
            <source>/wres/systests/dist/data/drrc2ForecastsOneMonth</source>
        </right>
        <left>
            <type>observations</type>
            <source>/wres/systests/dist/data/DRRC2QINE.xml</source>
        </left>
    </inputs>

    <pair label="Pairs by location and lead time">
        <feature left="DRRC2HSF"/>
        <leadHours minimum="1" maximum="24" />
        <leadTimesPoolingWindow>
            <period>0</period>
            <frequency>1</frequency>
            <unit>hours</unit>
        </leadTimesPoolingWindow>
    </pair>

Instead of this:

2022-05-04T16:41:16.215+0000 ERROR Main Operation 'execute' completed unsuccessfully
wres.pipeline.InternalWresException: Could not complete project execution
	at wres.pipeline.Evaluator.evaluate(Evaluator.java:323)
	at wres.pipeline.Evaluator.evaluate(Evaluator.java:182)
	at wres.MainFunctions.execute(MainFunctions.java:133)
	at wres.MainFunctions.call(MainFunctions.java:95)
	at wres.Main.main(Main.java:129)
	at wres.MainLocal.main(MainLocal.java:446)
Caused by: wres.pipeline.WresProcessingException: Encountered an error while processing evaluation 'MJgBD7L3AO-SfH43p71NPZntFx0': 
	at wres.pipeline.ProcessorHelper.processEvaluation(ProcessorHelper.java:274)
	at wres.pipeline.Evaluator.evaluate(Evaluator.java:298)
	... 5 common frames omitted
Caused by: wres.pipeline.WresProcessingException: Project failed to complete with the following error: 
	at wres.pipeline.ProcessorHelper.processProjectConfig(ProcessorHelper.java:527)
	at wres.pipeline.ProcessorHelper.processEvaluation(ProcessorHelper.java:212)
	... 6 common frames omitted
Caused by: wres.config.ProjectConfigException: Near line 10, column 15: Unable to determine what geographic feature dimension to use for source, cannot ask service without a feature dimension inferred from source interface (e.g. WRES knows  interface usgs_nwis uses  usgs_site_code dimension) or explicitly declared in the 'featureDimension' attribute of left/right/baseline. Valid values include: 'nws_lid', 'usgs_site_code', 'nwm_feature_id', 'custom'
	at wres.io.geography.FeatureFinder.determineDimension(FeatureFinder.java:1009)
	at wres.io.geography.FeatureFinder.fillFeatures(FeatureFinder.java:129)
	at wres.pipeline.ProcessorHelper.processProjectConfig(ProcessorHelper.java:373)
	... 7 common frames omitted


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: James (James)
Original Date: 2022-05-04T16:53:09Z


Makes #96602-7 work, as expected. Builds too, so whatever unit/integration tests are there also pass. Trying the system tests.


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: James (James)
Original Date: 2022-05-04T17:09:44Z


System tests pass too. It is possible that this might break some other stuff (e.g., WRDS interaction) for which our test coverage is poor, but I am going to risk it because it makes a simple thing simple, which is a win.


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: James (James)
Original Date: 2022-05-04T17:14:05Z


commit:wres|ba09840947a34ece85d2d3fd714408ac21cbe421.

We should probably push through some evaluations that use WRDS features and/or thresholds to make sure it didn't break something else, but that can be UAT.


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: Jesse (Jesse)
Original Date: 2022-05-05T18:55:40Z


The more I think about the approach of "set the default value to custom" the more problems I think there can be.

I agree with the original post's goal that the software

assume a custom feature dimension

However, that is not the same thing as saying "set a default value to custom."

The presence of a value specified by a user in that attribute can/should override the feature dimension detected by the software. If the attribute is null or blank, then the software looks at the `interface` for a clue. If it cannot get it from there, then it should assume it is a custom dimension. But if the user specifies "custom" then that overrides (edit: does it really, or should it?) whatever is detected. I think what is happening is that by setting a default value of "custom" on the declaration, the unintended consequence is that the software interprets this as the user explicitly specifying custom, thus overriding any other kind of feature dimension detection.
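
A minimal sketch of that precedence in Java, with hypothetical names rather than the actual WRES API:

    // Hypothetical sketch of the precedence described above; the enum and
    // method names are illustrative, not the actual WRES API.
    enum FeatureDimension { NWS_LID, USGS_SITE_CODE, NWM_FEATURE_ID, CUSTOM }

    final class FeatureDimensionResolver
    {
        static FeatureDimension resolve( FeatureDimension declared,
                                         FeatureDimension inferredFromInterface )
        {
            // 1. An explicit user declaration overrides detection.
            if ( declared != null )
            {
                return declared;
            }

            // 2. If the attribute is null or blank, look at the source
            //    interface for a clue (e.g., usgs_nwis implies USGS_SITE_CODE).
            if ( inferredFromInterface != null )
            {
                return inferredFromInterface;
            }

            // 3. Only when nothing else is available, assume a custom dimension.
            //    A schema-level default of "custom" breaks this order, because
            //    the software cannot distinguish a defaulted value from an
            //    explicit one, so step 2 is never reached.
            return FeatureDimension.CUSTOM;
        }
    }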


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: Jesse (Jesse)
Original Date: 2022-05-05T18:58:07Z


In other words, it is the software that needs to assume "custom" when nothing else is available rather than making it appear like the user said "this is custom!"


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: Hank (Hank)
Original Date: 2022-05-05T19:09:30Z


Yeah, I think this is going to break some evaluations. Certainly my own evaluations, but also perhaps those of users. I'm in the midst of trying to figure out what I did wrong in #104482, but, after that, I'll look at some production runs to see if any of those will break as a result.

Thanks,

Hank


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: Jesse (Jesse)
Original Date: 2022-05-05T19:34:13Z


So the method in the netCDF writer does not follow the rules I stated a few moments ago; it looks like it would prioritize the auto-detected feature dimension if there is exactly one unanimous interface declaration for the whole dataset, then fall back on the declared value, or else not be able to discern it at all.

The "I want to join multiple datasets from multiple data sources with multiple feature dimensions into a single left dataset" use case may be very difficult, given that WRES assumes a single feature dimension for a given dataset. So you would have to mash it into a custom dataset, more than likely. Or use a NESDIS/HADS id or something.


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: James (James)
Original Date: 2022-05-05T19:41:20Z


Jesse wrote:

The more I think about the approach of "set the default value to custom" the more problems I think there can be.

I agree with the original post's goal that the software

assume a custom feature dimension

However, that is not the same thing as saying "set a default value to custom."

The presence of a value specified by a user in that attribute can/should override the feature dimension detected by the software. If the attribute is null or blank, then the software looks at the `interface` for a clue. If it cannot get it from there, then it should assume it is a custom dimension. But if the user specifies "custom" then that overrides whatever is detected. I think what is happening is that by setting a default value of "custom" on the declaration, the unintended consequence is that the software interprets this as the user explicitly specifying custom, thus overriding any other kind of feature dimension detection.

That may be the case, but I think something smells in there somewhere, beyond the fact that no automated tests failed.

I think it is perhaps the "looking for a clue" aspect and what we should regard as canonical. For example, declaration of time scale information should not be regarded as canonical when the source says something different; rather, the source is canonical. If there is a better source of information than what a user declared, then that source of information should be canonical.

But I accept that this may have broken stuff, sure. Again, I was relying purely on the automated tests here, as noted earlier, and they are somewhat lacking, probably.

The first thing I would advocate is that we try to write the expected behavior into automated tests (edit: which requires us to immediately address/agree the behavior in concrete terms, which really helps in this sort of complicated scenario) and then we try to make the software implement that behavior, i.e. this is a good case for TDD. But I am still wondering about what is canonical here and I sense a smell, or a thread to be pulled.


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: James (James)
Original Date: 2022-05-05T19:44:58Z


So, yeah, I am open to reverting this trivial change and acknowledging this as more complex. I am not terribly surprised, either that tests didn't fail or that it is more complex. But the complexity is a problem too, I think.


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: Jesse (Jesse)
Original Date: 2022-05-05T19:50:01Z


Yes, well, we have not told users much about `featureDimension` because we have inconsistency in the software regarding its use, inconsistency in what we want it to be, assumptions that are required for it to work, etc. And now we're talking about totally heterogeneous data in one dataset, which will upend a lot of stuff too.


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: Jesse (Jesse)
Original Date: 2022-05-05T19:51:10Z


James wrote:

So, yeah, I am open to reverting this trivial change and acknowledging this as more complex. I am not terribly surprised, either that tests didn't fail or that it is more complex. But the complexity is a problem too, I think.

I think for now it is safest to undo the one liner.

The assumptions underneath the ability to fill in that one value are not as strong as they used to be. (Edit: I mean for `FeatureFinder` to be able to say, ah, you said "FEATURE_X" on the left, I can safely assume you meant exactly one feature and safely assume there is exactly one feature dimension on the left and exactly one feature dimension on the right, and those feature dimensions are identical, therefore I can fill in "FEATURE_X" for you on the right.)
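
A sketch of that fill-in assumption, with a hypothetical method name; the real logic lives in wres.io.geography.FeatureFinder and is more involved:

    // Hypothetical sketch of the fill-in assumption, building on the
    // FeatureDimension enum above.
    final class FeatureFillSketch
    {
        /** Returns the right-side name, or null when no safe guess exists. */
        static String fillRightFeature( String leftName,
                                        FeatureDimension leftDimension,
                                        FeatureDimension rightDimension )
        {
            // Safe only when both sides demonstrably share one feature
            // dimension: then a bare left="FEATURE_X" can be copied across.
            if ( leftDimension != null && leftDimension == rightDimension )
            {
                return leftName;
            }

            // Otherwise the correlation must come from an explicit declaration
            // or a feature service, so do not guess.
            return null;
        }
    }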


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: James (James)
Original Date: 2022-05-05T19:53:00Z


Right, right, it's all good - I was kinda hoping that it was going to be a quick win, but I am not surprised that it turned out to be a dumpster fire.

At this point, probably the simplest thing is to revert the change and then pull on the thread properly, but I am not volunteering to do that right now because: a) dumpster fire; and b) there are more important dumpsters on fire elsewhere :-)


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: Jesse (Jesse)
Original Date: 2022-05-05T19:53:22Z


It is simplest to communicate the one rule to users: be explicit every time, even if it means a bit of repetition in some cases.


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: James (James)
Original Date: 2022-05-05T19:54:01Z


Undoing it and then placing this back on hold/backlog...


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: Jesse (Jesse)
Original Date: 2022-05-05T19:54:47Z


If you think it's a dumpster fire now, you should look at the pre-WRES 5.0 feature handling, haha. In any case, I don't think it's a dumpster fire now, nor do I think it was a dumpster fire then. It was an incremental improvement on what was there, but in order to provide automation or convenience, you must have strong invariants available to work with. If you want to be all loosey-goosey with your assumptions, you can't provide automation. Again, the safest thing is to tell users to be explicit.


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: James (James)
Original Date: 2022-05-05T20:01:52Z


I am not talking about the declaration or treatment of features in general; the big picture has vastly improved. I am talking about the bird's nest of inconsistencies across the codebase and the lack of automated testing that can provide immediate feedback along the lines of "that ain't going to work". Dumpster fire is probably too strong, but it is frustrating when the software cannot provide feedback and we fall back on tribal knowledge (that even I don't have as a tribe member, to some extent).


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: James (James)
Original Date: 2022-05-05T20:05:14Z


commit:wres|75ea16bb198e13a1d2be702383bb693eae66367d.


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: Jesse (Jesse)
Original Date: 2022-05-05T20:28:56Z


I am trying to find your proposed rules for how to deal with M:N:O feature matching. E.g., when only "A" is declared on the left, but "A" maps to "B" and "C" on the right; furthermore, "B" maps to "D" and "E" on the baseline, and "C" on the right maps to "F" and "G" on the baseline; and suppose "B" is declared on the right elsewhere but nothing on the left for that one. Should "A" be filled in there? Or should the "B" be dropped because it is already taken care of with the map of "A" to "B"? Or should "A" be filled in with both "D" and "E" on the baseline? Wouldn't that result in a duplicate? Etc.
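
To make the shape of that question concrete, a sketch with purely illustrative data:

    import java.util.List;
    import java.util.Map;

    // Purely illustrative data for the M:N:O example above.
    final class CorrelationExample
    {
        public static void main( String[] args )
        {
            // "A" maps to "B" and "C" on the right.
            Map<String, List<String>> leftToRight =
                    Map.of( "A", List.of( "B", "C" ) );

            // "B" maps to "D" and "E" on the baseline; "C" maps to "F" and "G".
            Map<String, List<String>> rightToBaseline =
                    Map.of( "B", List.of( "D", "E" ),
                            "C", List.of( "F", "G" ) );

            // Naively expanding "A" yields four candidate tuples, before even
            // considering the separately declared right="B":
            leftToRight.forEach( ( left, rights ) -> rights.forEach( right ->
                    rightToBaseline.getOrDefault( right, List.of() )
                                   .forEach( baseline -> System.out.println(
                                           "(" + left + ", " + right + ", "
                                           + baseline + ")" ) ) ) );
            // Prints: (A, B, D), (A, B, E), (A, C, F), (A, C, G)
        }
    }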


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: Jesse (Jesse)
Original Date: 2022-05-05T20:36:05Z


Found it, I think, in #82867.


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: James (James)
Original Date: 2022-05-05T21:51:04Z


Perhaps we need to start even before this.

I think the starting point is to distinguish between feature identification and feature correlation.

Feature identification means that a feature has an identity. An identity means that it has a unique name in a naming system. We define a naming system as an "authority", but an authority could be responsible for more than one naming system, strictly speaking. Anyway, I proceed with the assumption that one authority has one naming system and hence that a name and an authority is sufficient for identification (for what it's worth, this is why I sometimes refer to both an "authority" and a "type", where authority might be "NWS" and type might be "handbook 5 identifier"). As an aside, our database data model does not engage in feature identification (edit: or, more precisely, the `feature_id` is not based on a name/authority pairing).

The main application of feature identification is to speak in authoritative terms with other services and to communicate about statistics in authoritative terms, geospatially speaking.

Feature correlation means that one feature is paired with another feature or tripled with two other features (or, forgive us our sins, tupled with N others if we allow more sides in future, like a multivariate evaluation...).

The main application of feature correlation is to produce pairs of time-series events.

Feature identification and feature correlation are orthogonal. You can correlate features with unknown authorities. I can declare that ("A","B") is a feature correlation. Of course, to be useful we need more. Perhaps feature identification lends confidence to feature correlation, but they are otherwise orthogonal.

The crucial thing about a feature correlation is that it has an order or orientation. This is crucial because we use a feature correlation to pair time-series data and our time-series data is sided. This is how a declared feature correlation becomes useful. A feature tuple is not a set, it is a tuple. There is a `left` side, a `right` side and a `baseline` side. This orientation is explicitly part of the declaration. We declare a name of `left="A"` and so on. A declaration of `left="A"` has the meaning that "there is a feature named 'A' whose naming authority is the naming authority used by the `left` data sources". Thus, we can use a feature correlation to pair time-series without feature identification. We know that "A" follows the naming authority of the `left` data (and likewise for the other sides), even if we don't know what that naming authority is. We can be sure that there is exactly one `left` authority and one meaning of "A" in that context, even if there are different opinions about some of the other attributes of "A", like its WGS84 coordinates (which could be a problem in qualifying statistics, but not in conducting pairing of time-series). Of course, `left="A"` can be part of many feature correlations too.

Designating something as "custom" is interesting in relation to the above. It is potentially ambiguous in two ways. First, it could mean that the authority is known by the user, but unspoken (e.g., because no suitable authority is enumerated) or that it is unknown by the user. Second, it could mean that there is one authority or N authorities. For example, authorities "foo" and "bar" are both custom authorities, but they are not the same authority. My assumption about the intended meaning is that: "the custom authority is one thing per evaluation and it is known to the user, but it is not named".

What does a designation of "custom" help to disambiguate, then? What does it support, not otherwise supported by an interpretation that the left feature names have a left feature authority, the right names have a right authority and so on? It doesn't help with feature identification, of course, so I am assuming it is intended to help with feature correlation. Or is there something else besides identification and correlation that I am missing?
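
To summarize the distinction in code, a sketch with hypothetical type names (these are not the WRES data model):

    // Feature identification: a unique name within a naming system, i.e. an
    // "authority" (assuming one naming system per authority, as above).
    record FeatureIdentity( String authority, String name ) {}

    // Feature correlation: an ordered, sided tuple of names. The names need no
    // known authority; "left" is implicitly interpreted against the naming
    // authority of the left data sources, and likewise for the other sides.
    // The baseline may be null when absent.
    record FeatureCorrelation( String left, String right, String baseline ) {}

    final class OrthogonalityExample
    {
        public static void main( String[] args )
        {
            // Correlation without identification: this is enough to pair
            // time-series, because each name follows its own side's authority.
            System.out.println( new FeatureCorrelation( "A", "B", null ) );

            // Identification lends confidence to a correlation and supports
            // speaking in authoritative terms with external services.
            System.out.println( new FeatureIdentity( "usgs_site_code", "09165000" ) );
        }
    }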


epag commented Aug 21, 2024


Original Redmine Comment
Author Name: James (James)
Original Date: 2022-05-05T21:57:37Z


( And yes that is very long and I am not yet sure whether it will be useful or is even in the right ticket, but I am now (re)engaging with this topic and perhaps it is good to start with the basics and to see how much common ground exists there. )
