Update natural language parser #48

ColeDCrawford · 2024-05-13T17:42:31Z

This PR updates the natural language parser to work with the 2018 spec. As noted in the EDTF docs,

This specification differs from the earlier draft as follows:

the unspecified date character (formerly lower case ‘u’) is superseded by the character (upper case) 'X';

Masked precision is eliminated;

the uncertain and approximate qualifiers, '?' and '~', when applied together, are combined into a single qualifier character '%';

“qualification from the left” is introduced and replaces the grouping mechanism using parentheses;

the extended interval syntax keywords 'unknown' and 'open' have been replaced with null and the double-dot notation ['..'] respectively;

the year prefix 'y' and the exponential indicator 'e', both previously lowercase, are now 'Y' and 'E' (uppercase); and

the significant digit indicator 'p' is now 'S' (uppercase).

Checklist for the PR:

Replace the unspecified date character u with X
Remove masked precision
Combine ?~ into % for uncertain + approximate
Use qualification from the left rather than parentheses for grouping
Replace unknown with null syntax in intervals
Replace open with .. syntax in intervals
Replace y, e, and p with Y, E, and S syntax for exponential years and significant digits

In the parser:
- Update regular expressions for SHORT_YEAR_RE and LONG_YEAR_RE to use X instead of x and u, and Y instead of y
- Replaced `unknown` with null as per the 2018 spec. It does not look like python-edtf currently has open intervals (`open` before, `..` now)?
- Replaced `?~` with `%`

In the tests:
- eliminate masked precision
- no u/x, just X for unknown regardless of why the data is missing
- replace unknown with null
- replace ~? with %
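Most of the checklist substitutions are mechanical, so here is a rough sketch of a draft-to-2018 string translation that might help when updating old test fixtures. `draft_to_2018` is a hypothetical helper, not part of python-edtf. It assumes well-formed draft input, treats masked precision `x` the same as `u`, and does not try to rewrite parenthesised grouping, which has no one-to-one replacement.

```python
import re


def draft_to_2018(edtf):
    """Translate a draft-spec EDTF string to 2018 syntax (illustrative only)."""
    parts = edtf.split("/")
    if len(parts) == 2:
        # Interval keywords: 'unknown' becomes null (empty), 'open' becomes '..'
        parts = ["" if p == "unknown" else ".." if p == "open" else p for p in parts]

    converted = []
    for part in parts:
        # Uncertain + approximate collapse into a single '%'
        part = part.replace("?~", "%").replace("~?", "%")
        # Unspecified digits (and formerly masked digits) become 'X'
        part = re.sub(r"[ux]", "X", part)
        # Year prefix, exponent, and significant-digit markers are uppercased
        part = re.sub(r"^y", "Y", part)
        part = re.sub(r"(?<=\d)e(?=\d)", "E", part)
        part = re.sub(r"(?<=\d)p(?=\d)", "S", part)
        converted.append(part)
    return "/".join(converted)
```

For example, `draft_to_2018("unknown/2006")` gives `"/2006"` and `draft_to_2018("y17101e4p3")` gives `"Y17101E4S3"`.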

ColeDCrawford · 2024-05-13T19:25:12Z

@aweakley do you want to remove the old parentheses grouping examples from BAD_EXAMPLES here? Those are from the old spec.

Do we have any examples of the difference between null and .. for extended intervals? It doesn't look like the current version of python-edtf has any open syntax?

aweakley · 2024-05-16T07:40:13Z

Do the parentheses ones make things more complex? I quite like that it's clear from the tests that inputs like that will raise an error, but if it'll make our lives easier to get rid of them then I'm happy with that.

aweakley · 2024-05-16T11:18:00Z

edtf/parser/tests.py

+    # Group qualification: a qualification character to the immediate right of a component applies
+    # to that component as well as to all components to the left.
+    # year, month, and day are uncertain and approximate
+    ('2004-06-11%', ('2004-06-11', '2004-06-09', '2004-06-13')),


Shouldn't this one be a broader range, given the year and month are also uncertain and approximate?

ColeDCrawford · 2024-05-16

I think you are right. This points at some other potential implementation issues. In the upper_fuzzy() / lower_fuzzy() docs there is this example:

>>> e = parse_edtf('1912-04~')
>>> e.lower_fuzzy()[:3]  # padding is 100% of a month
(1912, 3, 1)
>>> e.upper_fuzzy()[:3]
(1912, 5, 30)

1912-04~ would imply that the whole date should be padded ("year-month approximate"), right? Ending up with bounds of "1911-04-01" and "1913-05-31"? Whereas 1912-~04 (~ qualification to the left of 04) would be "month approximate" and we would pad by 100% of the month for bounds of "1912-3-1" and "1912-5-30".
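To make the two readings concrete, here is a minimal sketch of qualification from the left for a single date: a qualifier to the right of a component covers that component and everything to its left, while a qualifier to the left of a component covers only that component. `qualified_components` is a hypothetical name and this is not how the python-edtf grammar is implemented; it ignores negative years, intervals, and seasons.

```python
import re

COMPONENTS = ("year", "month", "day")
# Optional qualifier, digits (or X for unspecified), optional qualifier
PART_RE = re.compile(r"^([?~%]?)([0-9X]+)([?~%]?)$")


def qualified_components(edtf):
    """Map each date component to the set of qualifiers that apply to it."""
    parts = edtf.split("-")
    names = COMPONENTS[: len(parts)]
    quals = {name: set() for name in names}
    for i, part in enumerate(parts):
        match = PART_RE.match(part)
        if match is None:
            raise ValueError(f"unsupported component: {part!r}")
        left, _, right = match.groups()
        if left:
            # Qualifier on the left: this component only
            quals[names[i]].add(left)
        if right:
            # Qualifier on the right: this component and all to its left
            for name in names[: i + 1]:
                quals[name].add(right)
    return quals
```

With this sketch, `1912-04~` marks both year and month as approximate, whereas `1912-~04` marks only the month.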

1912-04~ and 2004-06-11% both are parsed to the L1 UncertainOrApproximate class, since the only qualification symbol is on the far right and applies to the entire EDTF object. 1912-~04 instead parses to the L2 PartialUncertainOrApproximate class. I think the problem with UncertainOrApproximate is that it currently uses the date.precision to determine the bounds. For 1912-04~, it determines there is month precision and applies 100% * 1 month of padding:

>>> from edtf.parser.grammar import parse_edtf as parse
>>> e = parse('1912-04~')
>>> e
UncertainOrApproximate: '1912-04~'
>>> e.ua
UA: '~'
>>> e.date.precision
'month'
>>> e.ua._get_multiplier()
1.0
>>> e.lower_fuzzy()[:3]
(1912, 3, 1)
>>> d = parse('2004-06-11%')
>>> d
UncertainOrApproximate: '2004-06-11%'
>>> d.date.precision
'day'
>>> d.lower_fuzzy()[:3]
(2004, 6, 9)

I think it's clear that the _get_fuzzy_padding() method needs to be updated to account for the fact that the fuzziness is applied to the entire date, not just the rightmost part.

A remaining question: when the qualification symbol applies to multiple parts of the date, how should we express that in terms of fuzziness? 100% of the largest part (e.g. year) or also fuzz the other parts?

For example, with 2004-06-11%, fuzzing just the year gives bounds of "2002-06-11" and "2006-06-11", whereas fuzzing all parts of the date gives bounds of "2002-04-09" and "2006-08-13" (200% of year, month, and day, given that it is both uncertain and approximate).

aweakley · 2024-05-16

You can say 2004%-06-11 instead if you want the narrow range. So I think we should fuzz all parts to the left of the % symbol, and we'd get the broader "2002-04-09" - "2006-08-13" range.
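A sketch of the two options for 2004-06-11%, using only the standard library. `fuzzy_bounds`, `shift`, and the `MULTIPLIER` table are illustrative assumptions rather than the python-edtf implementation. The multipliers mirror the 100% padding seen for `~` and the 200% used for `%` in this thread; the value for `?` is assumed.

```python
import calendar
from datetime import date, timedelta

# Assumed padding multipliers: 100% per qualifier, 200% when combined
MULTIPLIER = {"?": 1, "~": 1, "%": 2}


def shift(d, years, months, days):
    """Move a date by whole years and months (clamping the day), then by days."""
    total = (d.year + years) * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day) + timedelta(days=days)


def fuzzy_bounds(d, qualifier, all_parts=True):
    """Return (lower, upper) bounds for a fully qualified year-month-day date."""
    m = MULTIPLIER[qualifier]
    # Pad every component, or only the largest one (the year)
    pad = (m, m, m) if all_parts else (m, 0, 0)
    lower = shift(d, -pad[0], -pad[1], -pad[2])
    upper = shift(d, pad[0], pad[1], pad[2])
    return lower, upper
```

`fuzzy_bounds(date(2004, 6, 11), "%")` returns 2002-04-09 and 2006-08-13, the broader range, while passing `all_parts=False` returns 2002-06-11 and 2006-06-11.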

ColeDCrawford · 2024-05-16T13:29:56Z

> Do the parentheses ones make things more complex? I quite like that it's clear from the tests that inputs like that will raise an error, but if it'll make our lives easier to get rid of them then I'm happy with that.

Nope, they fail just fine as they currently are so I'm also happy to leave them.

Apply it to the entire date when a date is parsed as UncertainOrApproximate (L1 qualified)

ColeDCrawford · 2024-05-21T13:42:53Z

Not sure why the tests are hanging but they look like they are passing if you click in.

I've updated the tests, specifically EDTF level 1 "Qualification of a date (complete)" dates. These parse as UncertainOrApproximate classes.

aweakley · 2024-05-22T05:01:46Z

This looks great, thank you.

ColeDCrawford added 4 commits May 9, 2024 18:39

More updates of tests and English parser

4fd0782

Remove masked precision and unspecified from README

d23ff7b

Better grouping of group qualification tests

f1cd472

ColeDCrawford force-pushed the natural-language branch from 9c87fa1 to f1cd472 Compare May 13, 2024 19:18

aweakley reviewed May 16, 2024

ColeDCrawford added 4 commits May 16, 2024 09:46

Update year prefix in docs

e8b6433

Merge branch 'v5' into natural-language

ed2a3f6

Linting fixes

f74ae80

Fix qualification (complete) for L1 qualification

26b0afb

Apply it to the entire date when a date is parsed as UncertainOrApproximate (L1 qualified)

ColeDCrawford marked this pull request as ready for review May 21, 2024 13:49

aweakley merged commit 077eac5 into ixc:v5 May 22, 2024
5 checks passed
