Discuss test data sets #87

Open
mlissner opened this issue Aug 19, 2021 · 5 comments

Comments

@mlissner
Member

Also, apart from running the included tests, do you have a test dataset you can recommend?

Originally posted by @step21 in #86 (comment)

@mlissner
Member Author

(Just splitting this off here, so we can keep the other issue narrow.)

@step21, if you're just looking for some simple datasets to play with, you could use the API at https://api.case.law or the one at https://www.courtlistener.com/api/. Alternatively, if you are just experimenting for the sake of the JOSS review, you could copy/paste some legal text and throw eyecite at it. For example, you could grab some text from a recent SCOTUS opinion here:

https://www.supremecourt.gov/opinions/slipopinion/20

Does that help?

@jcushman
Contributor

Here's an example of extracting cites from all of the case.law cases for New Mexico, if it helps to have a larger dataset to play with:

# pip install eyecite requests

import shutil
import zipfile
import lzma
import json
import requests
from pathlib import Path
from eyecite import get_citations

# download data file (66MB) if not already downloaded
download_url = "https://case.law/download/bulk_exports/latest/by_jurisdiction/case_text_open/nm/nm_text.zip"
output_path = "nm_text.zip"
if not Path(output_path).exists():
    print("Downloading to %s ..." % output_path)
    with open(output_path, 'wb') as out_file:
        shutil.copyfileobj(requests.get(download_url, stream=True).raw, out_file)
    print("Done.")

# yield case texts from data file
def get_case_texts():
    with zipfile.ZipFile(output_path, 'r') as zip_archive:
        xz_path = next(path for path in zip_archive.namelist() if path.endswith('/data.jsonl.xz'))
        with zip_archive.open(xz_path) as xz_archive, lzma.open(xz_archive) as jsonlines:
            for line in jsonlines:
                record = json.loads(str(line, 'utf-8'))
                case_body = record['casebody']['data']
                case_text = "\n".join([case_body['head_matter']]+[opinion['text'] for opinion in case_body['opinions']])
                yield record['frontend_url'], case_text

# extract citations
for url, case_text in get_case_texts():
    cites = get_citations(case_text)
    print(url, [c.corrected_citation() for c in cites])

@step21

step21 commented Sep 29, 2021

Thanks! It's doing things, so that's a good start. This was mostly for the review, and I am mostly satisfied, but just to be sure I ran this anyway, and it raised a KeyError. As this key is not in your code, it must be something else...?

Downloading to nm_text.zip ...
Done.
https://cite.case.law/nmca/2013/039/4191100/ ['2013-NMCA-039', '107 N.M. 236', '755 P.2d 80', '2007-NMSC-002', '141 N.M. 21', '150 P.3d 971', '1998-NMSC-046', '126 N.M. 396', '970 P.2d 582', '2009-NMCA-081', '146 N.M. 717', '213 P.3d 1146', 'Id.', '2009-NMCA-015', '145 N.M. 533', '202 P.3d 126', '99 N.M. 302', '657 P.2d 629', '2010-NMCA-060', '148 N.M. 367', '237 P.3d 111', '2010-NMCA-085', '148 N.M. 627', '241 P.3d 628', '2010-NMCA-060', '534 U.S. 19', '111 N.M. 319', '805 P.2d 88', '2005-NMCA-061', '137 N.M. 420', '112 P.3d 281', '2000-NMCA-010', '128 N.M. 648', '996 P.2d 911', '1999-NMCA-011', '126 N.M. 460', '971 P.2d 851', '839 F. Supp. 80', '498 P.2d 1240', '2010-NMCA-060', '2008-NMSC-022', '143 N.M. 740', '182 P.3d 121', '2005-NMCA-061', '137 N.M. 420', '112 P.3d 281', '847 F.2d 435', '186 F.2d 683', '388 So. 2d 128', '2006-NMCA-015', '139 N.M. 48', '128 P.3d 476', '107 N.M. at 237', '755 P.2d at 81', 'Id.', '755 P.2d at 82', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', '755 P.2d at 84', 'Id.', 'Id.', '755 P.2d at 83', 'Id.', '755 P.2d at 84', 'Id.', 'Id.', 'Id.', 'Id.', '2007-NMSC-014', '141 N.M. 413', '156 P.3d 704', '2004-NMCA-136', '136 N.M. 658', '103 P.3d 582', '2010-NMSC-035', '148 N.M. 713', '242 P.3d 280', '115 N.M. 159', '848 P.2d 1086', '107 N.M. at 240', '755 P.2d at 84', '107 N.M. at 238', '755 P.2d at 82', '89 F.3d 1423', '956 F.2d 738', 'Id.', '543 P.2d 108', '106 N.M. 492', '745 P.2d 727', '143 N.M. 274', '175 P.3d 942', '2010-NMCA-052', '148 N.M. 277', '234 P.3d 929']
https://cite.case.law/nmca/2013/048/4190470/ ['2013-NMCA-048', '§§', '26 U.S.C. § 501', '§', '2003-NMSC-005', '133 N.M. 97', '61 P.3d 806', '2006-NMCA-095', '140 N.M. 198', '141 P.3d 542', '2008-NMCA-065', '144 N.M. 132', '184 P.3d 444', '2009-NMCA-009', '145 N.M. 494', '200 P.3d 544', '1999-NMCA-156', '128 N.M. 398', '993 P.2d 112', '1999-NMSC-021', '127 N.M. 120', '978 P.2d 327', '2010-NMCA-096', '148 N.M. 934', '242 P.3d 501', '1998-NMSC-050', '126 N.M. 413', '970 P.2d 599', '121 N.M. 764', '918 P.2d 350', '2006-NMSC-004', '139 N.M. 24', '127 P.3d 1111', '2009-NMSC-036', '146 N.M. 473', '212 P.3d 361', '§', '§', '93 N.M. 42', '596 P.2d 255', '2005-NMCA-029', '137 N.M. 103', '107 P.3d 543', '2009-NMCA-009', '2000-NMCA-074', '129 N.M. 413', '9 P.3d 657', '2001-NMCA-042', '130 N.M. 543', '28 P.3d 531']
https://cite.case.law/nmca/2012/116/4190761/ ['2012-NMCA-116', '2011-NMSC-014', '150 N.M. 84', '257 P.3d 904', 'Id.', '2011-NMSC-014', '411 U.S. 778', 'Id.', '408 U.S. 471', '2011-NMSC-014', '91 N.M. 749', '643 P.2d 618', '2011-NMSC-014', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'id.', 'Id.', '612 A.2d 288', 'Id.']
https://cite.case.law/nmca/2013/041/4190492/ ['2013-NMCA-041', '§§', '§]', '2012-NMSC-028', '285 P.3d 595', '2007-NMCA-098', '142 N.M. 319', '164 P.3d 1018', '2004-NMSC-010', '135 N.M. 397', '89 P.3d 69', '121 N.M. 764', '918 P.2d 350', '2009-NMSC-050', '147 N.M. 182', '218 P.3d 868', 'Id.', '2009-NMSC-049', '147 N.M. 177', '218 P.3d 863', '118 N.M. 234', '880 P.2d 845', '77 N.M. 742', '427 P.2d 258', '§', '§', '§§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '2009-NMCA-097', '147 N.M. 6', '216 P.3d 256', '2007-NMCA-069', '141 N.M. 686', '160 P.3d 595', '113 N.M. 231', '824 P.2d 1033', '113 N.M. at 236', '824 P.2d at 1038', '2005-NMCA-128', '138 N.M. 588', '124 P.3d 566', '2012-NMSC-026', '283 P.3d 853', '863 A.2d 976', '2010-NMCA-053', '148 N.M. 322', '236 P.3d 41', '106 N.M. 613', '747 P.2d 259', '2011-NMCA-016', '149 N.M. 420', '249 P.3d 1243']
https://cite.case.law/nmca/2013/025/4190553/ ['2013-NMCA-025', '§', '§', '2005-NMCA-120', '138 N.M. 466', '122 P.3d 50', '2006-NMCA-106', '140 N.M. 230', '141 P.3d 1284', '2004-NMCA-104', '136 N.M. 240', '96 P.3d 801', '39 Duq. L. Rev. 567', '§', '§', '§§', '§§', '§', '§', 'Id.', '2012-NMSC-029', '285 P.3d 622', 'Id.', 'Id.', 'Id.', '1999-NMCA-018', '126 N.M. 579', '973 P.2d 256', 'Id.', 'Id.', 'Id.', 'Id.', '906 P.2d 122', '717 N.E.2d 322', 'Ohio Rev. Code Ann. § 3103.04', 'Id.', '717 N.E.2d at 326', 'Id.', 'Id.', '44 Cal. Rptr. 330', '20 Cal. Rptr. 2d 582', 'Cal. Code § 5102', '44 Cal. Rptr. at 336', 'Id.', 'Id.', '268 Cal. Rptr. 501', 'Cal. Code § 5102', 'Id.', '268 Cal. Rptr. at 503', 'Id.', '2012-NMSC-029', '44 Cal. Rptr. at 336', '119 N.M. 609', '894 P.2d 386', '721 N.E.2d 73', '44 Cal. Rptr. at 333', '39 Duq. L. Rev. 567', '94 N.M. 706', '616 P.2d 419', '2011-NMSC-041', '150 N.M. 654', '265 P.3d 705', 'Id.', 'Id.', '94 N.M. at 708', '616 P.2d at 421', 'Id.', '§', '2012-NMCA-084', '284 P.3d 410', 'Id.', '1999-NMSC-001', '126 N.M. 438', '971 P.2d 829', '2012-NMCA-017', '2012-NMCERT-001', 'Id.', 'Id.', '2005-NMCA-045', '137 N.M. 339', '110 P.3d 1076', '1999-NMCA-152', '128 N.M. 345', '992 P.2d 896', 'id.', 'Id.', '2004-NMSC-019', '135 N.M. 621', '92 P.3d 633', 'Id.', '2005-NMSC-031', '138 N.M. 365', '120 P.3d 447', '2006-NMSC-001', '138 N.M. 700', '126 P.3d 516', '2009-NMSC-004', '145 N.M. 513', '201 P.3d 844', 'Id.', 'Id.', '94 N.M. 17', '606 P.2d 1111', '82 N.M. 333', '481 P.2d 412']
https://cite.case.law/nmca/2013/047/4191281/ ['2013-NMCA-047', '§', '§', '2004-NMCA-111', '136 N.M. 301', '97 P.3d 633', '2009-NMCA-110', '147 N.M. 127', '217 P.3d 613', '101 N.M. 694', '688 P.2d 12', 'Id.', '688 P.2d at 20', 'Id.', '688 P.2d at 15', '120 N.M. 734', '906 P.2d 266', 'Id.', '1999-NMCA-143', '128 N.M. 371', '993 P.2d 85', '1999-NMCA-143', '1999-NMCA-143', '2004-NMCA-111', '115 N.M. 710', '858 P.2d 86', 'Id.', '858 P.2d at 92', 'Id.', '858 P.2d at 92', 'Id.', 'Id.', '101 N.M. at 699', '688 P.2d at 17']
https://cite.case.law/nmca/2013/028/4190584/ ['2013-NMCA-028', '1997-NMSC-044', '123 N.M. 778', '945 P.2d 996', '2001-NMCA-094', '131 N.M. 195', '34 P.3d 139', '121 N.M. 38', '908 P.2d 731', 'Id.', '2000-NMCA-085', '129 N.M. 547', '10 P.3d 871', 'Id.', 'Id.', '2002-NMSC-007', '131 N.M. 758', '42 P.3d 1207', '117 N.M. 11', '868 P.2d 656', 'Id.', '§', '§', '80 N.M. 340', '455 P.2d 844', '121 N.M. at 44', '908 P.2d at 737', 'Id.', '2007-NMCA-035', '141 N.M. 328', '154 P.3d 703', '2000-NMSC-002', '128 N.M. 482', '994 P.2d 728', '2005-NMCA-010', '136 N.M. 723', '104 P.3d 1114', '§', '2007-NMCA-160', '143 N.M. 96', '173 P.3d 18', '2008-NMSC-048', '144 N.M. 663', '191 P.3d 521', '2009-NMSC-025', '146 N.M. 357', '210 P.3d 783', '112 N.M. 3', '810 P.2d 1223', '2007-NMSC-032', '142 N.M. 120', '164 P.3d 1', '112 N.M. at 13', '810 P.2d at 1233', '2008-NMSC-048', 'Id.', '2007-NMSC-032', 'Id.', '§', '2007-NMSC-032', 'Id.', '120 N.M. 486', '903 P.2d 228', '2007-NMSC-032', '119 N.M. 252', '889 P.2d 860', '2011-NMCA-121', '267 P.3d 820', '2012-NMCERT-008', '296 P.3d 491', '2007-NMSC-032', '2007-NMSC-032', '§', '2007-NMSC-032', 'Id.', 'Id.', 'Id.', '§', '§', '2003-NMCA-147', '134 N.M. 705', '82 P.3d 72', '§', '§', '§', '§', 'Kan. Stat. Ann. § 21-5408', '2006-NMSC-011', '131 P.3d 61', '§', '2007-NMSC-032', 'Id.', 'Id.', 'Id.', '§', '112 N.M. 554', '817 P.2d 1196', '2010-NMSC-020', '148 N.M. 381', '237 P.3d 683', '102 N.M. 274', '694 P.2d 922', '§', '§', '112 N.M. at 562', '817 P.2d at 1204', '2007-NMSC-032', 'Id.', '112 N.M. at 14', '810 P.2d at 1234', '949 A.2d 1092', '547 P.2d 720', '459 P.2d 225', '2012-NMCA-112', '289 P.3d 238', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', '2010-NMSC-005', '147 N.M. 557', '226 P.3d 656', '51 A.3d 970', '949 A.2d at 1121', '1999-NMCA-065', '127 N.M. 362', '981 P.2d 295', '2010-NMSC-005', '119 N.M. at 260', '889 P.2d at 868', '2012-NMCA-112', '2011-NMCA-018', '149 N.M. 294', '248 P.3d 336', '2011-NMCERT-001', '150 N.M. 559', '263 P.3d 901', '115 N.M. 
6', '846 P.2d 312', '466 U.S. 668', '2011-NMCA-018', '115 N.M. at 17', '846 P.2d at 323', '2011-NMCA-018', '115 N.M. at 16', '846 P.2d at 322', '2002-NMSC-005', '131 N.M. 709', '42 P.3d 814', '2002-NMSC-027', '132 N.M. 657', '54 P.3d 61', 'id.', 'Id.', '2002-NMSC-027', 'Id.', 'Id.', 'Id.', '2006-NMCA-031', '139 N.M. 147', '130 P.3d 208', '2009-NMSC-018', '146 N.M. 142', '207 P.3d 1119', '2010-NMSC-041', '148 N.M. 747', '242 P.3d 314', '1998-NMCA-034', '124 N.M. 726', '955 P.2d 195', '1997-NMCA-117', '124 N.M. 261', '948 P.2d 1209', '2012-NMSC-008', '275 P.3d 110', '§', '2006-NMCA-110', '140 N.M. 356', '142 P.3d 944', '2006-NMCA-088', '140 N.M. 126', '140 P.3d 547', '98 N.M. 213', '647 P.2d 415', 'Id.', '1997-NMSC-004', '122 N.M. 794', '932 P.2d 484', '98 N.M. at 215', '647 P.2d at 417', 'id.', '2010-NMSC-041', 'Id.', '2010-NMSC-041', '2010-NMSC-041', '2001-NMCA-032', '130 N.M. 319', '24 P.3d 351', '2007-NMSC-057', '143 N.M. 7', '172 P.3d 144', '2006-NMCA-088', '2009-NMCA-102', '147 N.M. 26', '216 P.3d 276', '2000-NMSC-037', '130 N.M. 1', '15 P.3d 491', '2000-NMCA-033', '129 N.M. 47', '1 P.3d 429']
https://cite.case.law/nmca/2013/006/4191483/ ['2013-NMCA-006', '§§', '2011-NMSC-033', '150 N.M. 398', '259 P.3d 803', '2009-NMSC-021', '146 N.M. 256', '208 P.3d 901', 'Id.', '2011-NMSC-033', '2009-NMSC-021', '2013-NMCA-014', '293 P.3d 902', '2009-NMSC-021', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'id.', 'Id.', '2013-NMCA-014', 'Id.', 'Id.', '42 C.F.R. § 483.12', '2011-NMSC-033', '2009-NMSC-021', '2011-NMSC-033', '2009-NMSC-021', '2013-NMCA-014', '2013-NMCA-014', '2013-NMCA-014']
https://cite.case.law/nm/37/48/ ['125 P. 609', '126 Okla. 114', '258 P. 863', 'supra.', '47 Kan. 283', '27 P. 997', '141 Mass. 74', '6 N.E. 757', '§', '132 Mich. 395', '93 N.W. 914', '59 Minn. 111', '60 N.W. 1081', '191 P. 460', '§', '§', '199 P. 373', '14 Cal. App. 250', '111 P. 631', 'supra,', '68 W. Va. 493', '70 S.E. 119']
https://cite.case.law/nm/37/222/ ['202 P. 687', 'supra,', '219 P. 794', '§', '58 P. 393', '88 S.W. 363', '115 Wis. 317', '91 N.W. 107', '79 Wis. 546', '48 N.W. 653', '180 Wis. 577', '193 N.W. 353', '234 P. 311']
https://cite.case.law/nm/37/212/ ['§', '28 Stat. 278', '33 Stat. 811', '§', '§', '236 F. 340', '255 F. 683', '288 F. 187', 'supra,', '236 F. 342', 'supra,', 'supra,', '132 S.E. 800', '81 S.E. 418', '135 P. 553', 'supra,', 'supra,', '116 F. 145', '41 Ind. App. 620', '84 N.E. 555']
https://cite.case.law/nm/37/597/ ['194 P. 862']
https://cite.case.law/nm/37/478/ ['§', '295 P. 424', '218 P. 787', '§']
https://cite.case.law/nm/37/474/ ['§', '§', '236 P. 735', 'supra,', '247 P. 270']
https://cite.case.law/nm/37/101/ ['89 P. 259']
https://cite.case.law/nm/37/312/ ['246 P. 910', '299 P. 1008']
https://cite.case.law/nm/37/91/ ['§', '256 P. 179', 'supra.', '§', 'supra.', '240 P. 469', '298 P. 410', '290 P. 793', '222 P. 912', '256 P. 179', '76 Cal. 624', '18 P. 686', '287 P. 290', '147 P. 916', '249 P. 108', '85 P. 393', '§', '136 F. 168', '69 C.C.A. 80', '49 Ala. 567', '65 Colo. 258', '176 P. 302', '17 Ill. App. 30', '67 F. 384', '106 Wis. 387', '82 N.W. 302', '62 Minn. 498', '65 N.W. 84', '124 Cal. 568', '57 P. 561', '34 Cal. App. 272', '167 P. 299']
https://cite.case.law/nm/37/559/ ['221 Mo. App. 85', '290 S.W. 96', '162 Mo. App. 408', '142 S.W. 757', '178 S.W. 52', '69 Mo. App. 1']
https://cite.case.law/nm/37/226/ []
https://cite.case.law/nm/37/600/ ['§', '22 Cal. 191', '§', '287 P. 64', '44 A. 161', '59 A. 565']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-7520a1175d19> in <module>()
     32 for url, case_text in get_case_texts():
     33     cites = get_citations(case_text)
---> 34     print(url, [c.corrected_citation() for c in cites])

1 frames
/usr/local/lib/python3.7/dist-packages/eyecite/models.py in corrected_citation(self)
    200         if self.edition_guess:
    201             return self.matched_text().replace(
--> 202                 self.groups["reporter"], self.edition_guess.short_name
    203             )
    204         return self.matched_text()

KeyError: 'reporter'
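
Until the library handles this case itself, one possible caller-side workaround is to fall back to the raw matched text whenever `corrected_citation()` raises `KeyError`. The following is only a sketch: `safe_corrected` and the `FakeCite` stand-in are hypothetical names made up for illustration and are not part of eyecite. Only the method names `corrected_citation()` and `matched_text()` are taken from the traceback above.

```python
def safe_corrected(cite):
    """Return the corrected citation, or the raw matched text if correction fails."""
    try:
        return cite.corrected_citation()
    except KeyError:
        # Assumption: the citation has no "reporter" group, so use the text as matched.
        return cite.matched_text()


class FakeCite:
    """Hypothetical stand-in for a citation object, used only to exercise the helper."""

    def __init__(self, text, groups):
        self._text = text
        self.groups = groups

    def matched_text(self):
        return self._text

    def corrected_citation(self):
        # Mimics the failing dictionary lookup seen in the traceback.
        reporter = self.groups["reporter"]
        return self._text.replace(reporter, reporter)


print(safe_corrected(FakeCite("22 Cal. 191", {"reporter": "Cal."})))
print(safe_corrected(FakeCite("§", {})))
```

In the script above, this would mean calling `safe_corrected(c)` in place of `c.corrected_citation()` inside the final loop, so a single problematic citation no longer aborts the whole run.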

@devlux76

Looks like models.py, around line 201, needs some guard code to ensure the "reporter" key is present. Something along the lines of:

if self.edition_guess:
    if "reporter" in self.groups:
        return self.matched_text().replace(
            self.groups["reporter"], self.edition_guess.short_name
        )
    return self.matched_text()

I'd have to look closer at what's calling that section of code to see what assumptions that breaks though.
But would you like me to work on this and get a patch in?
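
To make the proposed guard concrete, here is a self-contained sketch that uses simplified stand-in classes. `CitationStub` and `EditionStub` are hypothetical and exist only for this example; the real class in eyecite's models.py presumably has more fields and behavior. Only the attribute and method names visible in the traceback (`groups`, `edition_guess`, `short_name`, `matched_text`, `corrected_citation`) are taken from it.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class EditionStub:
    """Hypothetical stand-in for whatever object edition_guess holds."""
    short_name: str


@dataclass
class CitationStub:
    """Hypothetical, simplified stand-in for a citation model."""
    text: str
    groups: Dict[str, str] = field(default_factory=dict)
    edition_guess: Optional[EditionStub] = None

    def matched_text(self) -> str:
        return self.text

    def corrected_citation(self) -> str:
        # Substitute only when there is both an edition guess and a "reporter" group.
        if self.edition_guess and "reporter" in self.groups:
            return self.matched_text().replace(
                self.groups["reporter"], self.edition_guess.short_name
            )
        # Otherwise fall back to the text exactly as it was matched.
        return self.matched_text()


with_reporter = CitationStub("22 Cal 191", {"reporter": "Cal"}, EditionStub("Cal."))
without_reporter = CitationStub("§", {}, EditionStub("Cal."))
print(with_reporter.corrected_citation())
print(without_reporter.corrected_citation())
```

With the guard in place, a citation that lacks a "reporter" group returns its matched text unchanged instead of raising `KeyError`. Whether that is the right behavior for every citation type is the open question about caller assumptions.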

@mlissner
Member Author

Yeah, seems like a good one to fix. Worth yanking into its own issue though, if you don't mind.
