Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FullCaseCitation.group have badly formatted reporter #147

Open
overmode opened this issue Mar 24, 2023 · 12 comments
Open

FullCaseCitation.group have badly formatted reporter #147

overmode opened this issue Mar 24, 2023 · 12 comments

Comments

@overmode
Copy link
Contributor

overmode commented Mar 24, 2023

Hey,
Thanks again for the library, we're using it everyday :)

We're currently trying to come up with a unique string representation for citations, one that would be unique for all citations being semantically the same, but would still be human-readable.

We've tried to use the corrected_citation(), but it fails in some cases, here are a few examples where the output is the same as matched_text() : 5 U.S.C. §702, 5 U.S.C. § 702, 5 U.S.C. §§702, 5 U.S.C. §702
Note that the corrected_citation_full function also fails in mapping these examples to the same string.

Because these limitations seem to be hard to fix, we gave up on the human-readable requirement and just want to have a unique representation, so exactly what comparison_hash does.

This issue is about fixing a corner case that we spotted :

Cit1 : FullCaseCitation('506 U.S. 534', groups={'volume': '506', 'reporter': 'U.S.', 'page': '534'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite='539', year='1993', court='scotus', plaintiff='STATES', defendant='STINSONStates', extra=None))
Cit2 : FullCaseCitation('506 U. S. 534', groups={'volume': '506', 'reporter': 'U. S.', 'page': '534'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year=None, court='scotus', plaintiff='STATES', defendant='STINSONStates', extra=None))]
Cit1.corrected_citation() : 506 U.S. 534
Cit2.corrected_citation() : 506 U.S. 534

Here the corrected citation for the second string has the corrected reporter, but this was not propagated to the groups attribute, thus the comparison_hash is different.
Ideally, I guess we would expect the reporter to be corrected before writing it in the groups.

I'm happy to open a PR if you think it makes sense

@mlissner
Copy link
Member

Ideally, I guess we would expect the reporter to be corrected before writing it in the groups.

I'm actually not sure how groups is used these days, so I'm not sure how to respond to this without looking at the code. If groups is used to re-create the linkified text, then you don't want corrections in there. We did that once years ago, and it was a mistake because sometimes the corrections are imperfect, and that can be a real problem if you take a good citation with a bad reporter and convert it to something else (with a valid, but wrong reporter).

Can you remind me how groups is used? Depending on that, it's either working correctly, or broken, like you say.

@overmode
Copy link
Contributor Author

  • The following insights were derived from looking for '.groups' in the repo
    • The reporter is not used in resolve for shortcase citations, nor in any other resolve method
    • Seems like you explicitly wanted groups['reporter'] to store the reporter_found field, as shown in this update from 2021-06-14 (search for 'groups')
  • based on a search for 'reporter' in the repo, there seem to be no reason why we would need to store the original reporter in groups
  • About annotating citations, this is done using the span, which is stored in the Token class and based on what was matched by the regex. it seems like you never reuse the reporter or any other field in groups to recover the span. Thus, changing the content of reporter in groups seems to be safe.

@mattdahl
Copy link
Contributor

mattdahl commented Mar 28, 2023

Ideally, I guess we would expect the reporter to be corrected before writing it in the groups.

The Citation.groups dict is set on init to the content of the Token.groups dict, which is itself generated from calling groupdict() on the matched regex object during parsing. (Hence the variable name "groups" bubbles up, though it is a bit of a misnomer at the Citation object level.) So there is no way to "correct" the reporter prior to writing it into the groups dict; indeed, the correction process presupposes that the matched reporter key is already in the groups dict.

Would it work instead for you to change how CitationBase objects are hashed here? https://github.com/freelawproject/eyecite/blob/main/eyecite/models.py#L117

I.e., instead of making a tuple of self.groups.items(), make a tuple of self.groups['volume'], self.corrected_reporter(), self.groups['page']? N.b. this might have to be overridden at the ResourceCitation level instead because not all citation object types have a corrected_reporter() method.

@overmode
Copy link
Contributor Author

overmode commented Mar 28, 2023

@mattdahl thanks for the explanation. I don't see why it would not be possible to "correct" the reporter in the post_init function just after self.groups is set, but modifying the hash should indeed be sufficient to fix this comparison_hash, I can do the change if you think it's the way to go

@mlissner
Copy link
Member

I'm over my head and explicitly bowing out, if you guys have a sense of the direction to go, but if the conversation goes stale or you need a BDFL, please feel free to flag it for me and I can try to understand what y'all are talking about. :)

@mattdahl
Copy link
Contributor

mattdahl commented Mar 28, 2023

@mattdahl thanks for the explanation. I don't see why it would not be possible to "correct" the reporter in the post_init function just after self.groups is set, but modifying the hash should indeed be sufficient to fix this comparison_hash, I can do the change if you think it's the way to go

Yeah, I agree it would be technically possible there, but the idea of the groups dict is to store the actual results from calling m.groupdict(). I can't give a specific example right now of why we need to retain this, but conceptually it seems dangerous to modify the dict contents post-hoc to be something that wasn't actually matched from the regex. So I would suggest changing the hash function instead, if that's all right with you!

@overmode
Copy link
Contributor Author

Noted 👍 By the way, what do you think of the corner cases of the corrected_citation ? Any chance that we get them fixed in the near future ?

@overmode
Copy link
Contributor Author

Created the PR

@overmode
Copy link
Contributor Author

Whoops, it seems like we have another problem with the comparison hash.
My colleague @alessandro-ciffo found that the comparison hash changes every time you run a new script.

The following code :

frm eyecite import get_citations
cit_str = "482 S.E.2d 805"
cit = get_citations(cit_str)[0]
print(cit.comparison_hash())

Will print a new value every time you run it, because python adds randomness to the hash function (as stated here).

What do you think of using the hashlib library instead ?
We could use something like

from hashlib import sha1
import json
def comparison_hash(self) :
    tup = (str(type(self)), tuple(self.groups["volume"], self.corrected_reporter(), self.groups["page"]))
    return sha1(json.dumps(tup).encode('utf-8'))

(Probably not a valid syntax but you get the idea)

@mattdahl
Copy link
Contributor

Ah, interesting! It does seem to make sense to me that the hashes should be reproducible across runs. The hashlib module (part of the standard library, right?) seems reasonable to me, but I would summon @mlissner back for this

@mattdahl
Copy link
Contributor

By the way, what do you think of the corner cases of the corrected_citation ? Any chance that we get them fixed in the near future ?

You mean the FullLawCitation corner cases with 5 U.S.C. §702, 5 U.S.C. § 702, 5 U.S.C. §§702, 5 U.S.C. §702, etc.? I agree that it does seem like those should all be normalized to a common string (rather than just returning the result of matched_text()). The spacing issue could be fixed easily, but the double § symbol might be more complicated. We would have to make sure that we only normalized §§ to § in the case where there is indeed only one section listed (as in the example). In any case, the trick would be to override the corrected_citation() method on the FullLawCitation class here: https://github.com/freelawproject/eyecite/blob/main/eyecite/models.py#L280

@mlissner
Copy link
Member

mlissner commented Mar 29, 2023

Sure, using the hashlib seems fine to me. We have some simple wrappers for it in CL:

https://github.com/freelawproject/courtlistener/blob/main/cl/lib/crypto.py

I think I'd suggest SHA256 for this, though md5 would produce shorter hashes and be negligibly faster. It's less secure, but that doesn't matter here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants