Improve handling of Roman numerals #1

waldoj · 2013-06-12T18:53:21Z

@twneale points out a use case (https://twitter.com/twneale/status/306080682491396096) that is not allowed for, but that should be:

(h) Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    (i) Integer tincidunt, sem eu pretium condimentum.
    (ii) Sed dui justo, euismod nec mattis a, aliquet quis ante.
(i) Nulla dapibus sem et ligula consectetur vitae sagittis arcu varius.
(j) Proin a mauris sit amet enim ullamcorper ultricies vitae id lectus.

This is a non-trivial modification, because it requires statefulness—an understanding, upon “realizing” that it’s in the midst of a list of Roman numerals, that it must backtrack, reevaluate where that list began, and modify the ancestry of those subsections accordingly. If it encountered only a single subsection of (i), that's especially problematic, because it’s two “i”s in a row, and there’s no hint available that one of them should be a Roman numeral and, thus, a child of (h). That requires an understanding of order (alphabetic, numeric, and Roman numeric) that is not currently present in this, but that seems conceptually straightforward to add.

Thom has found the example problem within the U.S. Code, so it’s not merely hypothetical.

Realistically, this is two problems. The first is the ability to recognize and handle Roman numerals properly, which is to say, to understand that the letter "i" isn't necessarily the same as the Roman numeral "i". The second is the ability to look ahead and handle the unusual-but-extant case of the Roman numeral "i" immediately following the letter "h."
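
To make the first problem concrete, here is a minimal Python sketch (an illustration only, not code from this project, which is PHP). The names `to_roman`, `ROMAN_ORDINALS`, and `readings` are hypothetical. It lists every ordinal reading a lowercase prefix could have, which shows why "i" on its own cannot be resolved without context.

```python
import string

# Value/symbol pairs for building lowercase Roman numerals, largest first.
_PAIRS = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
          (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
          (5, "v"), (4, "iv"), (1, "i")]


def to_roman(n):
    """Convert a positive integer to a lowercase Roman numeral."""
    out = []
    for value, symbol in _PAIRS:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


# Lookup of every well-formed numeral from 1 to 3999 (an arbitrary cap).
ROMAN_ORDINALS = {to_roman(n): n for n in range(1, 4000)}


def readings(prefix):
    """Return each way a prefix could be read, mapped to its ordinal value."""
    found = {}
    if len(prefix) == 1 and prefix in string.ascii_lowercase:
        found["letter"] = string.ascii_lowercase.index(prefix) + 1
    if prefix in ROMAN_ORDINALS:
        found["roman"] = ROMAN_ORDINALS[prefix]
    if prefix.isdigit():
        found["number"] = int(prefix)
    return found


print(readings("h"))   # {'letter': 8}
print(readings("i"))   # {'letter': 9, 'roman': 1}
print(readings("ii"))  # {'roman': 2}
```

Any prefix with more than one reading ("i", "v", "x", "l", "c", "d", "m") is a candidate for the second problem, where the neighboring prefixes have to break the tie.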


twneale · 2013-06-12T19:36:33Z

This is an interesting puzzle for sure. If you're at all inclined to work
in Python on this one, I have some oldish code lying around that does a
reasonably effective job modeling these ("enumerations", I'm in the habit
of calling them):
https://github.com/unitedstates/uscode/blob/master/uscode/schemes.py

It needs some upkeep, which I'd be happy to provide, but its basic use is
to model an enumeration like "a" or "1" or "i" and also "a-3" or "ccc" or
"dd-3", etc. It tries to break them into tokens ("a-1" --> ["a", "-", "1"])
that can be ordered, or at least the order of which can be guessed, so that
it's possible to say things like "a-3 precedes a-5." But I sense this is
for State Decoded and you'll need PHP. But for what it's worth, this is an
issue that seems to keep coming up for me too in random ways, so I'm in if
you want to collaborate.
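
As a rough illustration of the tokenizing idea described above, here is a short Python sketch. It is not taken from schemes.py; `tokenize` and `sort_key` are hypothetical names, and the ordering rule for repeated letters ("z" before "aa", "dd" before "ccc") is an assumption about how such enumerations usually run.

```python
import re

# Runs of letters, runs of digits, or a single separator character.
TOKEN_RE = re.compile(r"[a-z]+|[0-9]+|[^a-z0-9\s]")


def tokenize(enumeration):
    """Split an enumeration such as "a-1" into ["a", "-", "1"]."""
    return TOKEN_RE.findall(enumeration.lower())


def sort_key(enumeration):
    """Build a comparable key, one tuple per token."""
    key = []
    for token in tokenize(enumeration):
        if token.isdigit():
            # Compare numbers by value, so "10" sorts after "9".
            key.append((1, int(token), ""))
        elif token.isalpha():
            # Assumption: shorter letter runs come first ("z" < "aa").
            key.append((2, len(token), token))
        else:
            # Separators carry no ordering information of their own.
            key.append((0, 0, token))
    return key


print(tokenize("a-1"))                                   # ['a', '-', '1']
print(sort_key("a-3") < sort_key("a-5"))                 # True
print(sorted(["a-10", "a-5", "a-3"], key=sort_key))      # ['a-3', 'a-5', 'a-10']
```

This sketch deliberately ignores Roman numerals; deciding whether "i" is a letter or a numeral needs the surrounding context, which is the harder part of this issue.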

One slightly creative way to come at this might be to just detect ambiguity
in your parser. If so, optionally have the program write out a
human-editable markup file that can be deserialized back into a tree. Then
most cases would be covered, but truly weird things could still be flagged
and given manual attention?
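
A sketch of what that round trip could look like, in Python. Everything here is hypothetical: the indented text format, the `# REVIEW` flag, and the `dump` and `load` names are made up for illustration and do not describe any existing tool.

```python
REVIEW_FLAG = "  # REVIEW"
INDENT = "    "


def dump(items):
    """Write (depth, prefix, text, ambiguous) tuples as indented lines."""
    lines = []
    for depth, prefix, text, ambiguous in items:
        flag = REVIEW_FLAG if ambiguous else ""
        lines.append(f"{INDENT * depth}({prefix}) {text}{flag}")
    return "\n".join(lines)


def load(markup):
    """Rebuild a tree of dicts from the (possibly hand-edited) markup."""
    root = {"prefix": None, "text": None, "children": []}
    stack = [root]  # stack[d + 1] is the latest node seen at depth d
    for line in markup.splitlines():
        if not line.strip():
            continue
        body = line.split(REVIEW_FLAG)[0]
        stripped = body.lstrip(" ")
        depth = (len(body) - len(stripped)) // len(INDENT)
        prefix, _, text = stripped.partition(") ")
        node = {"prefix": prefix.lstrip("("), "text": text, "children": []}
        del stack[depth + 1:]          # drop anything deeper than the parent
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


# The parser's flat first guess, with both "(i)" lines flagged as ambiguous.
guess = [(0, "h", "Lorem", False), (0, "i", "Integer", True),
         (0, "ii", "Sed", False), (0, "i", "Nulla", True),
         (0, "j", "Proin", False)]
markup_lines = dump(guess).splitlines()

# A human editor indents the two Roman-numeral lines under (h).
markup_lines[1] = INDENT + markup_lines[1]
markup_lines[2] = INDENT + markup_lines[2]
tree = load("\n".join(markup_lines))
```

After the edit, the top level of `tree` holds (h), (i), and (j), and (h) holds the Roman-numeral children (i) and (ii).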


waldoj · 2013-06-12T19:45:46Z

Because this is to be used within The State Decoded, unfortunately it really should be PHP. The good news is that I've solved this conceptually, and it only remains to execute it. I'm going to break up what's now one pass into two, with the second pass looking both back and forward to see whether the identified structural unit is preceded and followed by the expected identifiers, giving special attention to any Roman numerals that could plausibly be letters, and vice versa. "x" should have been preceded by a "w," and followed by a "y" (if, indeed, the document continues to that point). If "x" is preceded by an "ix," then we know that it's actually a Roman numeral. That's why I'm storing the list of viable identifiers in order, which I'm barely using at this point. All of which sounds a lot like what you've already done in schemes.py, which seems like a good sign. :)
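
The second pass described above could be sketched roughly as follows. This is Python for illustration only (the real implementation would be PHP), `classify` and its helpers are hypothetical names, and it assumes a flat list of prefixes in document order with no deeper nesting in between. One further assumption: for a prefix whose Roman value is 1, the preceding prefix is ignored, because a new Roman list can start after anything.

```python
import string

_PAIRS = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
          (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
          (5, "v"), (4, "iv"), (1, "i")]


def to_roman(n):
    """Convert a positive integer to a lowercase Roman numeral."""
    out = []
    for value, symbol in _PAIRS:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


LETTER_ORD = {ch: i for i, ch in enumerate(string.ascii_lowercase, 1)}
ROMAN_ORD = {to_roman(n): n for n in range(1, 4000)}


def letter_at(n):
    """Return the n-th letter (1-based), or None if out of range."""
    return string.ascii_lowercase[n - 1] if 1 <= n <= 26 else None


def classify(prefixes):
    """Label each prefix "letter", "roman", "other", or "ambiguous"."""
    labels = []
    for idx, prefix in enumerate(prefixes):
        letter, roman = LETTER_ORD.get(prefix), ROMAN_ORD.get(prefix)
        if letter is None or roman is None:
            # Only one reading (or none) is possible; no context needed.
            labels.append("letter" if letter is not None
                          else "roman" if roman is not None else "other")
            continue
        prev = prefixes[idx - 1] if idx > 0 else None
        nxt = prefixes[idx + 1] if idx + 1 < len(prefixes) else None
        # Look forward, and (except for Roman value 1) look back.
        roman_fits = nxt == to_roman(roman + 1) or (
            roman > 1 and prev == to_roman(roman - 1))
        letter_fits = (nxt is not None and nxt == letter_at(letter + 1)) or (
            roman > 1 and prev is not None and prev == letter_at(letter - 1))
        if roman_fits and not letter_fits:
            labels.append("roman")
        elif letter_fits and not roman_fits:
            labels.append("letter")
        else:
            labels.append("ambiguous")
    return labels


print(classify(["h", "i", "ii", "i", "j"]))
# ['letter', 'roman', 'roman', 'letter', 'letter']
print(classify(["h", "i", "i", "j"]))
# ['letter', 'ambiguous', 'letter', 'letter']
```

The second call is the single-subsection case from the original report: the first "i" has no neighbor that settles it, so the sketch flags it rather than guessing.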

The trick is going to be recognizing that hierarchical documents don't necessarily proceed properly, and being able to deal with that. Mistakes happen, as I'm sure you've seen in the structures of laws. Having a human have to touch it would be a worst-case scenario (as you can imagine, that could be a real mess when importing 40,000 laws), but I think you're right, and it's inevitable that such circumstances are possible.

waldoj added a commit that referenced this issue Jun 12, 2013

Store a list of prefixes, in order

392452e

This is a step towards solving issue #1, which requires stateful knowledge of the ordinal values of prefixes.

waldoj added a commit that referenced this issue Jun 12, 2013

Determine whether we're starting a list of Roman numerals

69f359f

Per #1. This is a lazy solution, but it's also very achievable, so it has that going for it.

waldoj mentioned this issue Jun 12, 2013

Fix improperly parsed subsections statedecoded/statedecoded#238

Closed
