
metadata #1

Open
legaltextai opened this issue Aug 29, 2024 · 1 comment

Comments

@legaltextai

Thank you for making this research available on GitHub.

"Our design exploits the fact that while LLMs are known to have been trained on the raw text of American case law, which is in the public domain (Henderson et al. 2022), they have likely not been trained on these cases’ attendant metadata, which exist separately from the cases’ textual content and which we have aggregated from disparate sources.
These metadata enable us to construct reference-based queries for the first nine of our tasks (Table 2)."

Do I understand correctly that the authors knew these models had likely not been trained on the metadata, but still proceeded to evaluate these older base models on their knowledge of such data?

What am I missing?

@mattdahl
Member

Yes, that's right. We assumed that the foundation models we tested were not trained on tabular case metadata (but were trained on the corpus of American case law itself), meaning that when a model provided a correct answer, it was evidence of its emergent knowledge/reasoning ability and not simply memorization. You've probably also seen our other paper where we look at some RAG systems that do provide some of this metadata directly to the LLM (https://arxiv.org/abs/2405.20362). Feel free to message me on the FLP Slack if you want to talk more.
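
To make the setup concrete, here is a minimal sketch of how a reference-based query might be assembled from a metadata record. The field names, example values, and the `make_author_query` helper are illustrative assumptions for this sketch, not the paper's actual schema or task definitions.

```python
# Hypothetical sketch of a reference-based query built from case metadata.
# All field names and values below are invented for illustration.

case_metadata = {
    "case_name": "Example v. Illustration",
    "citation": "123 U.S. 456",
    "year": 1900,
    "author": "Justice Example",
}

def make_author_query(meta: dict) -> tuple[str, str]:
    """Return (prompt, gold_answer) for one reference-based task:
    asking which judge authored the opinion."""
    prompt = (
        f"Who wrote the majority opinion in {meta['case_name']}, "
        f"{meta['citation']} ({meta['year']})?"
    )
    # The gold answer is taken from the metadata table, not from the
    # opinion text the model was presumably trained on.
    return prompt, meta["author"]

prompt, gold = make_author_query(case_metadata)
# The prompt is sent to the LLM, and the model's answer is scored against `gold`.
```

The point is that the gold answer comes from the separately aggregated metadata rather than from the opinion text, which is the distinction the comment above relies on.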
