Harvard Caselaw Access Project temporal analysis

The goal of this project is to apply information retrieval techniques to legal text, in particular LDA topic modelling and word embeddings.

The main idea is to study words along the temporal axis, finding trends in their context, frequency and topics.

The dataset in use is the Illinois portion of the Harvard Caselaw Access Project.

Methodology

The overall process can be divided into three phases: preprocessing, topic modelling and word embeddings.

A complete overview of the methodology and the results can be found in the project report.

Preprocessing

This phase uses spaCy to tokenize the text and obtain the lemma of each token.

Topic modelling

The next phase finds topics in the dataset, optimizing the number of components with a halving search. Once a general overview of the data has been obtained, a more refined topic model is fitted on a subset of the topics found.
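The halving search is not detailed in this README (scikit-learn, for instance, offers an experimental HalvingGridSearchCV). As a minimal sketch of the underlying successive-halving idea, assuming a scoring function that evaluates a candidate number of topics on a given number of documents, the plain-Python code below may help. The names halving_search and toy_score are hypothetical and are not taken from the project.

```python
def halving_search(candidates, score, min_budget, max_budget, factor=2):
    """Successive halving over candidate values (illustrative sketch).

    Every surviving candidate is scored with the current budget (e.g. the
    number of documents used for fitting), the best 1/factor are kept, and
    the budget is multiplied by factor until one candidate remains.
    """
    survivors = list(candidates)
    budget = min_budget
    while len(survivors) > 1:
        # Rank survivors from best to worst at the current budget.
        ranked = sorted(survivors, key=lambda c: score(c, budget), reverse=True)
        # Keep the top fraction, but always at least one candidate.
        survivors = ranked[: max(1, len(ranked) // factor)]
        # Give the remaining candidates more resources, up to the cap.
        budget = min(budget * factor, max_budget)
    return survivors[0]


def toy_score(n_topics, n_docs):
    # Hypothetical stand-in for a real model score (e.g. coherence);
    # it simply pretends that 20 topics is the optimum.
    return -abs(n_topics - 20)


best = halving_search([5, 10, 20, 40, 80], toy_score, min_budget=100, max_budget=1600)
print(best)
```

In a real run, the scoring function would fit an LDA model with the given number of components on a sample of the given size and return a quality measure. Cheap early rounds discard poor candidates before the expensive full-data fits.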

Word embeddings

The idea of this part is to train Word2vec models on individual years and epochs of the data. Similar work can be found here; we credit its authors for the model alignment that makes all the analysis in this part possible.
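Embedding spaces trained separately on different time slices are not directly comparable, which is why an alignment step is needed. This README does not describe the alignment used. One common choice is orthogonal Procrustes, sketched below with NumPy as an assumption. The function align_embeddings is a hypothetical name, and both matrices are assumed to hold vectors for the same words in the same row order.

```python
import numpy as np


def align_embeddings(base, other):
    """Rotate `other` onto `base` with orthogonal Procrustes (sketch).

    Both arrays have shape (n_words, dim) and share the same row order.
    The rotation R minimises ||other @ R - base|| over orthogonal matrices.
    """
    # SVD of the cross-covariance between the two spaces.
    u, _, vt = np.linalg.svd(other.T @ base)
    rotation = u @ vt
    return other @ rotation


# Toy check: `other` is `base` under an unknown rotation.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 8))
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
other = base @ q
aligned = align_embeddings(base, other)
print(np.allclose(aligned, base))
```

After alignment, the cosine similarity between a word's vectors from two periods gives a rough measure of how much its usage has shifted.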

Webapp

The project is accessible through a webapp. If no instance is running, expect a loading time of about 1 to 3 minutes.

It is possible to search for single words or groups of words. Queries are separated by a minus sign, and the words within a group are joined with commas. For example, cocaine, cannabis - gun runs two queries: the combination of cocaine and cannabis, and gun.
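The query syntax can be illustrated with a small parser. This is a sketch of the format as described here, not the webapp's actual code, and parse_queries is a hypothetical name.

```python
def parse_queries(raw):
    """Split a search string into queries, each a list of words.

    Queries are separated by "-", words inside a query by ",".
    """
    queries = []
    for part in raw.split("-"):
        # Trim whitespace and drop empty entries such as trailing commas.
        words = [w.strip() for w in part.split(",") if w.strip()]
        if words:
            queries.append(words)
    return queries


print(parse_queries("cocaine, cannabis - gun"))
# [['cocaine', 'cannabis'], ['gun']]
```

Note that under this simple reading a hyphenated word would be split into two queries. How the webapp handles that case is not stated here.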

Here are some screenshots of the semantic shift section, which uses the word embeddings, and of the topic modelling section.

[Screenshots: Semantic shift (2 images), Topic modelling (3 images)]