Discuss test data sets #87

mlissner · 2021-08-19T15:59:00Z

Also, apart from running the included tests, do you have a test dataset you can recommend?

Originally posted by @step21 in #86 (comment)

mlissner · 2021-08-19T16:02:45Z

(Just splitting this off here, so we can keep the other issue narrow.)

@step21, if you're just looking for some simple datasets to play with, you could use the API at https://api.case.law or the one at https://www.courtlistener.com/api/. Alternatively, if you are just experimenting for the sake of the JOSS review, you could copy and paste some legal text and throw eyecite at it. For example, you could grab some text from a recent SCOTUS opinion here:

https://www.supremecourt.gov/opinions/slipopinion/20

Does that help?

jcushman · 2021-08-19T16:44:24Z

Here's an example of extracting cites from all of the case.law cases for New Mexico, if it helps to have a larger dataset to play with:

# pip install eyecite requests

import shutil
import zipfile
import lzma
import json
import requests
from pathlib import Path
from eyecite import get_citations

# download data file (66MB) if not already downloaded
download_url = "https://case.law/download/bulk_exports/latest/by_jurisdiction/case_text_open/nm/nm_text.zip"
output_path = "nm_text.zip"
if not Path(output_path).exists():
    print("Downloading to %s ..." % output_path)
    with open(output_path, 'wb') as out_file:
        shutil.copyfileobj(requests.get(download_url, stream=True).raw, out_file)
    print("Done.")

# yield case texts from data file
def get_case_texts():
    with zipfile.ZipFile(output_path, 'r') as zip_archive:
        xz_path = next(path for path in zip_archive.namelist() if path.endswith('/data.jsonl.xz'))
        with zip_archive.open(xz_path) as xz_archive, lzma.open(xz_archive) as jsonlines:
            for line in jsonlines:
                record = json.loads(str(line, 'utf-8'))
                case_body = record['casebody']['data']
                case_text = "\n".join([case_body['head_matter']]+[opinion['text'] for opinion in case_body['opinions']])
                yield record['frontend_url'], case_text

# extract citations
for url, case_text in get_case_texts():
    cites = get_citations(case_text)
    print(url, [c.corrected_citation() for c in cites])
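
For anyone who wants to check the reading logic before pulling down the 66MB file, here is a minimal sketch that builds a tiny in-memory archive with the same nesting (zip, then xz, then JSON Lines) and reads it back with the same steps as get_case_texts() above. The archive path, the example URL, and the sample record are made up for illustration; only the keys the script above reads (frontend_url, casebody.data.head_matter, opinions[].text) are mirrored, so treat the layout as an assumption rather than a description of the real export.

```python
import io
import json
import lzma
import zipfile


def build_fixture(records):
    """Return the bytes of a zip holding one xz-compressed JSON Lines file."""
    jsonl = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Hypothetical path; it only needs to end with '/data.jsonl.xz'.
        zf.writestr("nm_text/data/data.jsonl.xz", lzma.compress(jsonl))
    return buf.getvalue()


def read_case_texts(zip_bytes):
    """Yield (url, text) pairs, following the same steps as get_case_texts()."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        xz_path = next(p for p in zf.namelist() if p.endswith("/data.jsonl.xz"))
        with zf.open(xz_path) as xz_file, lzma.open(xz_file) as lines:
            for line in lines:
                record = json.loads(line.decode("utf-8"))
                body = record["casebody"]["data"]
                parts = [body["head_matter"]] + [o["text"] for o in body["opinions"]]
                yield record["frontend_url"], "\n".join(parts)


# Made-up sample record, not taken from the real data set.
sample = [{
    "frontend_url": "https://example.invalid/case/1",
    "casebody": {"data": {
        "head_matter": "Doe v. Roe",
        "opinions": [{"text": "See 107 N.M. 236, 755 P.2d 80."}],
    }},
}]
cases = list(read_case_texts(build_fixture(sample)))
print(cases)
```

The text from each pair could then be passed to get_citations() exactly as in the loop above.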

step21 · 2021-09-29T10:13:39Z

Thanks! It's doing things, so that's a good start. This was mostly for the review, and I am mostly satisfied, but just to be sure I ran this anyway and got a KeyError. As this key is not in your code, it must come from somewhere else...?

Downloading to nm_text.zip ...
Done.
https://cite.case.law/nmca/2013/039/4191100/ ['2013-NMCA-039', '107 N.M. 236', '755 P.2d 80', '2007-NMSC-002', '141 N.M. 21', '150 P.3d 971', '1998-NMSC-046', '126 N.M. 396', '970 P.2d 582', '2009-NMCA-081', '146 N.M. 717', '213 P.3d 1146', 'Id.', '2009-NMCA-015', '145 N.M. 533', '202 P.3d 126', '99 N.M. 302', '657 P.2d 629', '2010-NMCA-060', '148 N.M. 367', '237 P.3d 111', '2010-NMCA-085', '148 N.M. 627', '241 P.3d 628', '2010-NMCA-060', '534 U.S. 19', '111 N.M. 319', '805 P.2d 88', '2005-NMCA-061', '137 N.M. 420', '112 P.3d 281', '2000-NMCA-010', '128 N.M. 648', '996 P.2d 911', '1999-NMCA-011', '126 N.M. 460', '971 P.2d 851', '839 F. Supp. 80', '498 P.2d 1240', '2010-NMCA-060', '2008-NMSC-022', '143 N.M. 740', '182 P.3d 121', '2005-NMCA-061', '137 N.M. 420', '112 P.3d 281', '847 F.2d 435', '186 F.2d 683', '388 So. 2d 128', '2006-NMCA-015', '139 N.M. 48', '128 P.3d 476', '107 N.M. at 237', '755 P.2d at 81', 'Id.', '755 P.2d at 82', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', '755 P.2d at 84', 'Id.', 'Id.', '755 P.2d at 83', 'Id.', '755 P.2d at 84', 'Id.', 'Id.', 'Id.', 'Id.', '2007-NMSC-014', '141 N.M. 413', '156 P.3d 704', '2004-NMCA-136', '136 N.M. 658', '103 P.3d 582', '2010-NMSC-035', '148 N.M. 713', '242 P.3d 280', '115 N.M. 159', '848 P.2d 1086', '107 N.M. at 240', '755 P.2d at 84', '107 N.M. at 238', '755 P.2d at 82', '89 F.3d 1423', '956 F.2d 738', 'Id.', '543 P.2d 108', '106 N.M. 492', '745 P.2d 727', '143 N.M. 274', '175 P.3d 942', '2010-NMCA-052', '148 N.M. 277', '234 P.3d 929']
https://cite.case.law/nmca/2013/048/4190470/ ['2013-NMCA-048', '§§', '26 U.S.C. § 501', '§', '2003-NMSC-005', '133 N.M. 97', '61 P.3d 806', '2006-NMCA-095', '140 N.M. 198', '141 P.3d 542', '2008-NMCA-065', '144 N.M. 132', '184 P.3d 444', '2009-NMCA-009', '145 N.M. 494', '200 P.3d 544', '1999-NMCA-156', '128 N.M. 398', '993 P.2d 112', '1999-NMSC-021', '127 N.M. 120', '978 P.2d 327', '2010-NMCA-096', '148 N.M. 934', '242 P.3d 501', '1998-NMSC-050', '126 N.M. 413', '970 P.2d 599', '121 N.M. 764', '918 P.2d 350', '2006-NMSC-004', '139 N.M. 24', '127 P.3d 1111', '2009-NMSC-036', '146 N.M. 473', '212 P.3d 361', '§', '§', '93 N.M. 42', '596 P.2d 255', '2005-NMCA-029', '137 N.M. 103', '107 P.3d 543', '2009-NMCA-009', '2000-NMCA-074', '129 N.M. 413', '9 P.3d 657', '2001-NMCA-042', '130 N.M. 543', '28 P.3d 531']
https://cite.case.law/nmca/2012/116/4190761/ ['2012-NMCA-116', '2011-NMSC-014', '150 N.M. 84', '257 P.3d 904', 'Id.', '2011-NMSC-014', '411 U.S. 778', 'Id.', '408 U.S. 471', '2011-NMSC-014', '91 N.M. 749', '643 P.2d 618', '2011-NMSC-014', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'id.', 'Id.', '612 A.2d 288', 'Id.']
https://cite.case.law/nmca/2013/041/4190492/ ['2013-NMCA-041', '§§', '§]', '2012-NMSC-028', '285 P.3d 595', '2007-NMCA-098', '142 N.M. 319', '164 P.3d 1018', '2004-NMSC-010', '135 N.M. 397', '89 P.3d 69', '121 N.M. 764', '918 P.2d 350', '2009-NMSC-050', '147 N.M. 182', '218 P.3d 868', 'Id.', '2009-NMSC-049', '147 N.M. 177', '218 P.3d 863', '118 N.M. 234', '880 P.2d 845', '77 N.M. 742', '427 P.2d 258', '§', '§', '§§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '§', '2009-NMCA-097', '147 N.M. 6', '216 P.3d 256', '2007-NMCA-069', '141 N.M. 686', '160 P.3d 595', '113 N.M. 231', '824 P.2d 1033', '113 N.M. at 236', '824 P.2d at 1038', '2005-NMCA-128', '138 N.M. 588', '124 P.3d 566', '2012-NMSC-026', '283 P.3d 853', '863 A.2d 976', '2010-NMCA-053', '148 N.M. 322', '236 P.3d 41', '106 N.M. 613', '747 P.2d 259', '2011-NMCA-016', '149 N.M. 420', '249 P.3d 1243']
https://cite.case.law/nmca/2013/025/4190553/ ['2013-NMCA-025', '§', '§', '2005-NMCA-120', '138 N.M. 466', '122 P.3d 50', '2006-NMCA-106', '140 N.M. 230', '141 P.3d 1284', '2004-NMCA-104', '136 N.M. 240', '96 P.3d 801', '39 Duq. L. Rev. 567', '§', '§', '§§', '§§', '§', '§', 'Id.', '2012-NMSC-029', '285 P.3d 622', 'Id.', 'Id.', 'Id.', '1999-NMCA-018', '126 N.M. 579', '973 P.2d 256', 'Id.', 'Id.', 'Id.', 'Id.', '906 P.2d 122', '717 N.E.2d 322', 'Ohio Rev. Code Ann. § 3103.04', 'Id.', '717 N.E.2d at 326', 'Id.', 'Id.', '44 Cal. Rptr. 330', '20 Cal. Rptr. 2d 582', 'Cal. Code § 5102', '44 Cal. Rptr. at 336', 'Id.', 'Id.', '268 Cal. Rptr. 501', 'Cal. Code § 5102', 'Id.', '268 Cal. Rptr. at 503', 'Id.', '2012-NMSC-029', '44 Cal. Rptr. at 336', '119 N.M. 609', '894 P.2d 386', '721 N.E.2d 73', '44 Cal. Rptr. at 333', '39 Duq. L. Rev. 567', '94 N.M. 706', '616 P.2d 419', '2011-NMSC-041', '150 N.M. 654', '265 P.3d 705', 'Id.', 'Id.', '94 N.M. at 708', '616 P.2d at 421', 'Id.', '§', '2012-NMCA-084', '284 P.3d 410', 'Id.', '1999-NMSC-001', '126 N.M. 438', '971 P.2d 829', '2012-NMCA-017', '2012-NMCERT-001', 'Id.', 'Id.', '2005-NMCA-045', '137 N.M. 339', '110 P.3d 1076', '1999-NMCA-152', '128 N.M. 345', '992 P.2d 896', 'id.', 'Id.', '2004-NMSC-019', '135 N.M. 621', '92 P.3d 633', 'Id.', '2005-NMSC-031', '138 N.M. 365', '120 P.3d 447', '2006-NMSC-001', '138 N.M. 700', '126 P.3d 516', '2009-NMSC-004', '145 N.M. 513', '201 P.3d 844', 'Id.', 'Id.', '94 N.M. 17', '606 P.2d 1111', '82 N.M. 333', '481 P.2d 412']
https://cite.case.law/nmca/2013/047/4191281/ ['2013-NMCA-047', '§', '§', '2004-NMCA-111', '136 N.M. 301', '97 P.3d 633', '2009-NMCA-110', '147 N.M. 127', '217 P.3d 613', '101 N.M. 694', '688 P.2d 12', 'Id.', '688 P.2d at 20', 'Id.', '688 P.2d at 15', '120 N.M. 734', '906 P.2d 266', 'Id.', '1999-NMCA-143', '128 N.M. 371', '993 P.2d 85', '1999-NMCA-143', '1999-NMCA-143', '2004-NMCA-111', '115 N.M. 710', '858 P.2d 86', 'Id.', '858 P.2d at 92', 'Id.', '858 P.2d at 92', 'Id.', 'Id.', '101 N.M. at 699', '688 P.2d at 17']
https://cite.case.law/nmca/2013/028/4190584/ ['2013-NMCA-028', '1997-NMSC-044', '123 N.M. 778', '945 P.2d 996', '2001-NMCA-094', '131 N.M. 195', '34 P.3d 139', '121 N.M. 38', '908 P.2d 731', 'Id.', '2000-NMCA-085', '129 N.M. 547', '10 P.3d 871', 'Id.', 'Id.', '2002-NMSC-007', '131 N.M. 758', '42 P.3d 1207', '117 N.M. 11', '868 P.2d 656', 'Id.', '§', '§', '80 N.M. 340', '455 P.2d 844', '121 N.M. at 44', '908 P.2d at 737', 'Id.', '2007-NMCA-035', '141 N.M. 328', '154 P.3d 703', '2000-NMSC-002', '128 N.M. 482', '994 P.2d 728', '2005-NMCA-010', '136 N.M. 723', '104 P.3d 1114', '§', '2007-NMCA-160', '143 N.M. 96', '173 P.3d 18', '2008-NMSC-048', '144 N.M. 663', '191 P.3d 521', '2009-NMSC-025', '146 N.M. 357', '210 P.3d 783', '112 N.M. 3', '810 P.2d 1223', '2007-NMSC-032', '142 N.M. 120', '164 P.3d 1', '112 N.M. at 13', '810 P.2d at 1233', '2008-NMSC-048', 'Id.', '2007-NMSC-032', 'Id.', '§', '2007-NMSC-032', 'Id.', '120 N.M. 486', '903 P.2d 228', '2007-NMSC-032', '119 N.M. 252', '889 P.2d 860', '2011-NMCA-121', '267 P.3d 820', '2012-NMCERT-008', '296 P.3d 491', '2007-NMSC-032', '2007-NMSC-032', '§', '2007-NMSC-032', 'Id.', 'Id.', 'Id.', '§', '§', '2003-NMCA-147', '134 N.M. 705', '82 P.3d 72', '§', '§', '§', '§', 'Kan. Stat. Ann. § 21-5408', '2006-NMSC-011', '131 P.3d 61', '§', '2007-NMSC-032', 'Id.', 'Id.', 'Id.', '§', '112 N.M. 554', '817 P.2d 1196', '2010-NMSC-020', '148 N.M. 381', '237 P.3d 683', '102 N.M. 274', '694 P.2d 922', '§', '§', '112 N.M. at 562', '817 P.2d at 1204', '2007-NMSC-032', 'Id.', '112 N.M. at 14', '810 P.2d at 1234', '949 A.2d 1092', '547 P.2d 720', '459 P.2d 225', '2012-NMCA-112', '289 P.3d 238', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', '2010-NMSC-005', '147 N.M. 557', '226 P.3d 656', '51 A.3d 970', '949 A.2d at 1121', '1999-NMCA-065', '127 N.M. 362', '981 P.2d 295', '2010-NMSC-005', '119 N.M. at 260', '889 P.2d at 868', '2012-NMCA-112', '2011-NMCA-018', '149 N.M. 294', '248 P.3d 336', '2011-NMCERT-001', '150 N.M. 559', '263 P.3d 901', '115 N.M. 
6', '846 P.2d 312', '466 U.S. 668', '2011-NMCA-018', '115 N.M. at 17', '846 P.2d at 323', '2011-NMCA-018', '115 N.M. at 16', '846 P.2d at 322', '2002-NMSC-005', '131 N.M. 709', '42 P.3d 814', '2002-NMSC-027', '132 N.M. 657', '54 P.3d 61', 'id.', 'Id.', '2002-NMSC-027', 'Id.', 'Id.', 'Id.', '2006-NMCA-031', '139 N.M. 147', '130 P.3d 208', '2009-NMSC-018', '146 N.M. 142', '207 P.3d 1119', '2010-NMSC-041', '148 N.M. 747', '242 P.3d 314', '1998-NMCA-034', '124 N.M. 726', '955 P.2d 195', '1997-NMCA-117', '124 N.M. 261', '948 P.2d 1209', '2012-NMSC-008', '275 P.3d 110', '§', '2006-NMCA-110', '140 N.M. 356', '142 P.3d 944', '2006-NMCA-088', '140 N.M. 126', '140 P.3d 547', '98 N.M. 213', '647 P.2d 415', 'Id.', '1997-NMSC-004', '122 N.M. 794', '932 P.2d 484', '98 N.M. at 215', '647 P.2d at 417', 'id.', '2010-NMSC-041', 'Id.', '2010-NMSC-041', '2010-NMSC-041', '2001-NMCA-032', '130 N.M. 319', '24 P.3d 351', '2007-NMSC-057', '143 N.M. 7', '172 P.3d 144', '2006-NMCA-088', '2009-NMCA-102', '147 N.M. 26', '216 P.3d 276', '2000-NMSC-037', '130 N.M. 1', '15 P.3d 491', '2000-NMCA-033', '129 N.M. 47', '1 P.3d 429']
https://cite.case.law/nmca/2013/006/4191483/ ['2013-NMCA-006', '§§', '2011-NMSC-033', '150 N.M. 398', '259 P.3d 803', '2009-NMSC-021', '146 N.M. 256', '208 P.3d 901', 'Id.', '2011-NMSC-033', '2009-NMSC-021', '2013-NMCA-014', '293 P.3d 902', '2009-NMSC-021', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'Id.', 'id.', 'Id.', '2013-NMCA-014', 'Id.', 'Id.', '42 C.F.R. § 483.12', '2011-NMSC-033', '2009-NMSC-021', '2011-NMSC-033', '2009-NMSC-021', '2013-NMCA-014', '2013-NMCA-014', '2013-NMCA-014']
https://cite.case.law/nm/37/48/ ['125 P. 609', '126 Okla. 114', '258 P. 863', 'supra.', '47 Kan. 283', '27 P. 997', '141 Mass. 74', '6 N.E. 757', '§', '132 Mich. 395', '93 N.W. 914', '59 Minn. 111', '60 N.W. 1081', '191 P. 460', '§', '§', '199 P. 373', '14 Cal. App. 250', '111 P. 631', 'supra,', '68 W. Va. 493', '70 S.E. 119']
https://cite.case.law/nm/37/222/ ['202 P. 687', 'supra,', '219 P. 794', '§', '58 P. 393', '88 S.W. 363', '115 Wis. 317', '91 N.W. 107', '79 Wis. 546', '48 N.W. 653', '180 Wis. 577', '193 N.W. 353', '234 P. 311']
https://cite.case.law/nm/37/212/ ['§', '28 Stat. 278', '33 Stat. 811', '§', '§', '236 F. 340', '255 F. 683', '288 F. 187', 'supra,', '236 F. 342', 'supra,', 'supra,', '132 S.E. 800', '81 S.E. 418', '135 P. 553', 'supra,', 'supra,', '116 F. 145', '41 Ind. App. 620', '84 N.E. 555']
https://cite.case.law/nm/37/597/ ['194 P. 862']
https://cite.case.law/nm/37/478/ ['§', '295 P. 424', '218 P. 787', '§']
https://cite.case.law/nm/37/474/ ['§', '§', '236 P. 735', 'supra,', '247 P. 270']
https://cite.case.law/nm/37/101/ ['89 P. 259']
https://cite.case.law/nm/37/312/ ['246 P. 910', '299 P. 1008']
https://cite.case.law/nm/37/91/ ['§', '256 P. 179', 'supra.', '§', 'supra.', '240 P. 469', '298 P. 410', '290 P. 793', '222 P. 912', '256 P. 179', '76 Cal. 624', '18 P. 686', '287 P. 290', '147 P. 916', '249 P. 108', '85 P. 393', '§', '136 F. 168', '69 C.C.A. 80', '49 Ala. 567', '65 Colo. 258', '176 P. 302', '17 Ill. App. 30', '67 F. 384', '106 Wis. 387', '82 N.W. 302', '62 Minn. 498', '65 N.W. 84', '124 Cal. 568', '57 P. 561', '34 Cal. App. 272', '167 P. 299']
https://cite.case.law/nm/37/559/ ['221 Mo. App. 85', '290 S.W. 96', '162 Mo. App. 408', '142 S.W. 757', '178 S.W. 52', '69 Mo. App. 1']
https://cite.case.law/nm/37/226/ []
https://cite.case.law/nm/37/600/ ['§', '22 Cal. 191', '§', '287 P. 64', '44 A. 161', '59 A. 565']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-7520a1175d19> in <module>()
     32 for url, case_text in get_case_texts():
     33     cites = get_citations(case_text)
---> 34     print(url, [c.corrected_citation() for c in cites])

1 frames
/usr/local/lib/python3.7/dist-packages/eyecite/models.py in corrected_citation(self)
    200         if self.edition_guess:
    201             return self.matched_text().replace(
--> 202                 self.groups["reporter"], self.edition_guess.short_name
    203             )
    204         return self.matched_text()

KeyError: 'reporter'

devlux76 · 2021-12-29T01:36:24Z

Looks like models.py, around line 201, needs some guard code to ensure the "reporter" key is present. Something along the lines of:

if self.edition_guess and "reporter" in self.groups:
    return self.matched_text().replace(
        self.groups["reporter"], self.edition_guess.short_name
    )
return self.matched_text()

I'd have to look closer at what's calling that section of code to see what assumptions that breaks though.
But would you like me to work on this and get a patch in?
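
To make the idea concrete without touching eyecite itself, here is a small self-contained sketch of the guard. StubCitation is a hypothetical stand-in written only for this example, not eyecite's real class, and edition_short_name stands in for edition_guess.short_name from the traceback. It only illustrates the behaviour the guard is meant to give: fall back to the matched text when no "reporter" group was captured.

```python
class StubCitation:
    """Hypothetical stand-in for a citation object (not eyecite's real class)."""

    def __init__(self, text, groups, edition_short_name=None):
        self._text = text
        self.groups = groups
        self.edition_short_name = edition_short_name

    def matched_text(self):
        return self._text

    def corrected_citation(self):
        # Guard: only substitute when a "reporter" group was actually captured.
        reporter = self.groups.get("reporter")
        if self.edition_short_name and reporter is not None:
            return self._text.replace(reporter, self.edition_short_name)
        return self._text


with_reporter = StubCitation("1 F. 2d 3", {"reporter": "F. 2d"}, "F.2d")
without_reporter = StubCitation("§ 501", {}, "U.S.C.")

print(with_reporter.corrected_citation())
print(without_reporter.corrected_citation())
```

Until a fix lands, a caller-side try/except KeyError around corrected_citation() that falls back to matched_text() would be another way to keep the loop running.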

mlissner · 2021-12-31T00:58:55Z

Yeah, seems like a good one to fix. Worth yanking into its own issue though, if you don't mind.
