-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update natural language parser #48
Conversation
In the parser: - Update regular expressions for SHORT_YEAR_RE and LONG_YEAR_RE to use X instead of x and u and Y instead of y - Replaced`unknown` with null as per the 2018 spec. It does not look like python-edtf currently has open intervals (`open` before, `..` now)? - Replaced `?~` with `%` In the tests: - eliminate masked precision - no u/x just X for unknown regardless of why the data is missing - replace unknown with null - replace ~? with %
9c87fa1
to
f1cd472
Compare
@aweakley do you want to remove the old parentheses grouping examples from Do we have any examples of the difference between |
Do the parentheses ones make things more complex? I quite like that it's clear from the tests that inputs like that will raise an error, but if it'll make our lives easier to get rid of them then I'm happy with that. |
edtf/parser/tests.py
Outdated
# Group qualification: a qualification character to the immediate right of a component applies | ||
# to that component as well as to all components to the left. | ||
# year, month, and day are uncertain and approximate | ||
('2004-06-11%', ('2004-06-11', '2004-06-09', '2004-06-13')), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this one be a broader range, given the year and month are also uncertain and approximate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are right. This points at some other potential implementation issues. In the upper_fuzzy() / lower_fuzzy()
docs there is this example:
>>> e = parse_edtf('1912-04~')
>>> e.lower_fuzzy()[:3] # padding is 100% of a month
(1912, 3, 1)
>>> e.upper_fuzzy()[:3]
(1912, 5, 30)
1912-04~
would imply that the whole date should be padded ("year-month approximate"), right? Ending up with bounds of "1911-04-01" and "1913-05-31"? Whereas 1912-~04
(~
qualification to the left of 04
) would be "month approximate" and we would pad by 100% of the month for bounds of "1912-3-1" and "1912-5-30".
1912-04~
and 2004-06-11%
both are parsed to the L1 UncertainOrApproximate
class, since the only qualification symbol is on the far right and applies to the entire EDTF object. 1912-~04
instead parses to the L2 PartialUncertainOrApproximate
class. I think the problem with UncertainOrApproximate
is that it currently uses the date.precision
to determine the bounds. For 1912-04~
, it determines there is month precision and applies 100% * 1 month of padding:
>>> from edtf.parser.grammar import parse_edtf as parse
>>> e = parse('1912-04~')
>>> e
UncertainOrApproximate: '1912-04~'
>>> e.ua
UA: '~'
>>> e.date.precision
'month'
>>> e.ua._get_multiplier()
1.0
>>> e.lower_fuzzy()[:3]
(1912, 3, 1)
>>> d = parse('2004-06-11%')
>>> d
UncertainOrApproximate: '2004-06-11%'
>>> d.date.precision
'day'
>>> d.lower_fuzzy()[:3]
(2004, 6, 9)
I think it's clear that _get_fuzzy_padding()
method needs to be updated to account for the fact that the fuzziness is applied to the entire date, not just the rightmost part.
A remaining question: when the qualification symbol applies to multiple parts of the date, how should we express that in terms of fuzziness? 100% of the largest part (e.g. year) or also fuzz the other parts?
- Ex:
2004-06-11%
fuzzing just the year gives bounds of "2002-06-11" and "2006-06-11", vs fuzzing on all parts of the date gives bounds of "2002-04-09" and "2006-08-13" (200% of y m d given that it is both uncertain and approximate)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can say 2004%-06-11
instead if you want the narrow range. So I think we should fuzz all parts to the left of the % symbol, and we'd get the broader "2002-04-09" - "2006-08-13" range.
Nope, they fail just fine as they currently are so I'm also happy to leave them. |
Apply it to the entire date when a date is parsed as UncertainOrApproximate (L1 qualified)
Not sure why the tests are hanging but they look like they are passing if you click in. I've updated the tests, specifically EDTF level 1 "Qualification of a date (complete)" dates. These parse as |
This looks great, thank you. |
This PR updates the natural language parser to work with the 2018 spec. As noted in the EDTF docs,
Checklist for the PR:
u
withX
?~
into%
for uncertain + approximateunknown
withnull
syntax in intervalsopen
with..
syntax syntax in intervalsy
,e
, andp
withY
,E
, andS
syntax for exponential years and significant digits