Add graph dataset #19

Merged · 16 commits · Jun 7, 2024
7 changes: 2 additions & 5 deletions configs/embedding.yaml
@@ -3,17 +3,14 @@ defaults:
- dataset: pl-court-raw
- _self_

# length_adjust_mode: truncate
# truncation_tokens: 4096
# batch_size: 24
length_adjust_mode: chunk
chunk_config:
chunk_size: 512
chunk_size: ${embedding_model.max_seq_length}
min_split_chars: 10
take_n_first_chunks: 16
batch_size: 64

output_dir: data/embeddings/${dataset.name}/${hydra:runtime.choices.embedding_model}
output_dir: data/embeddings/${dataset.name}/${hydra:runtime.choices.embedding_model}/all_embeddings

hydra:
output_subdir: null
4 changes: 4 additions & 0 deletions data/datasets/pl/.gitignore
@@ -1 +1,5 @@
/raw
/graph/data
/graph/metadata.yaml
/graph/README.md
/graph/README_files
137 changes: 137 additions & 0 deletions data/datasets/pl/graph/template_README.md
@@ -0,0 +1,137 @@
---
language: {{language}}
size_categories: {{size_categories}}
source_datasets: {{source_datasets}}
pretty_name: {{pretty_name}}
viewer: {{viewer}}
tags: {{tags}}
---

# Polish Court Judgments Graph

## Dataset description
We introduce a graph dataset of Polish Court Judgments. The dataset is primarily based on [`JuDDGES/pl-court-raw`](https://huggingface.co/datasets/JuDDGES/pl-court-raw). It consists of nodes representing either judgments or legal bases, and edges connecting judgments to the legal bases they refer to, so the resulting graph is bipartite. The graph was also cleaned of small disconnected components, leaving a single giant component. We provide the dataset in both `JSON` and `PyG` formats, each serving a different purpose. While the graphs in these formats are structurally identical, their attributes differ.

The `JSON` format is intended for analysis and contains most of the attributes available in [`JuDDGES/pl-court-raw`](https://huggingface.co/datasets/JuDDGES/pl-court-raw). We excluded some less-useful attributes and text content, which can be easily retrieved from the raw dataset and added to the graph as needed.

The `PyG` format is designed for machine learning applications, such as link prediction on graphs, and is fully compatible with the [`Pytorch Geometric`](https://github.com/pyg-team/pytorch_geometric) framework.

In the following sections, we provide a more detailed explanation and use case examples for each format.

## Dataset statistics

| feature | value |
|----------------------------|----------------------|
| #nodes | {{num_nodes}} |
| #edges | {{num_edges}} |
| #nodes (type=`judgment`) | {{num_src_nodes}} |
| #nodes (type=`legal_base`) | {{num_target_nodes}} |
| avg(degree) | {{avg_degree}} |


![png](assets/degree_distribution.png)



## `JSON` format

The `JSON` format contains graph node types differentiated by the `node_type` attribute. Each `node_type` has its own additional attributes (see [`JuDDGES/pl-court-raw`](https://huggingface.co/datasets/JuDDGES/pl-court-raw) for a detailed description of each attribute):

| node_type | attributes |
|--------------|---------------------------------------------------------------------------------------------------------------------|
| `judgment` | {{judgment_attributes}} |
| `legal_base` | {{legal_base_attributes}} |

### Loading
The graph in `JSON` format is saved in node-link format and can be readily loaded with the `networkx` library:

```python
import json
import networkx as nx
from huggingface_hub import hf_hub_download

DATA_DIR = "<your_local_data_directory>"
JSON_FILE = "data/judgment_graph.json"
hf_hub_download(repo_id="JuDDGES/pl-court-graph", repo_type="dataset", filename=JSON_FILE, local_dir=DATA_DIR)

with open(f"{DATA_DIR}/{JSON_FILE}") as file:
g_data = json.load(file)

g = nx.node_link_graph(g_data)
```

### Example usage
```python
# TBD
```
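
Until the example above is filled in, here is a minimal sketch (not part of the official examples) assuming the graph `g` loaded in the previous snippet and the `node_type` attribute described earlier:

```python
from collections import Counter

# Count nodes per type to confirm the bipartite judgment / legal_base split.
node_type_counts = Counter(attrs["node_type"] for _, attrs in g.nodes(data=True))
print(node_type_counts)

# Legal bases referenced by the largest number of judgments (highest-degree legal_base nodes).
legal_base_nodes = [n for n, attrs in g.nodes(data=True) if attrs["node_type"] == "legal_base"]
top_legal_bases = sorted(legal_base_nodes, key=g.degree, reverse=True)[:10]
print(top_legal_bases)
```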

## `PyG` format

The `PyTorch Geometric` format includes embeddings of the judgment content for judgment nodes, obtained with [{{embedding_method}}](https://huggingface.co/{{embedding_method}}),
and one-hot-vector identifiers for legal-base nodes (note that for efficiency these can be substituted with random-noise identifiers,
as in [(Abboud et al., 2021)](https://arxiv.org/abs/2010.01179)).
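
For illustration only (the exact tensor layout of the released `data` object may differ, so treat the names below as hypothetical), the substitution mentioned above amounts to replacing one-hot rows with low-dimensional random vectors:

```python
import torch

num_legal_base_nodes = 10_000  # hypothetical count; in practice read it from the loaded data
rni_dim = 64  # dimensionality of the random node identifiers

# One-hot identifiers: one dimension per legal-base node (large and sparse).
one_hot_ids = torch.eye(num_legal_base_nodes)

# Random node identifiers (Abboud et al., 2021): far smaller, typically redrawn each run.
random_ids = torch.randn(num_legal_base_nodes, rni_dim)
```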



### Loading
To load the graph as a PyTorch Geometric dataset, one can use the following code snippet:
```python
import os

import torch
from torch_geometric.data import InMemoryDataset, download_url


class PlCourtGraphDataset(InMemoryDataset):
    URL = (
        "https://huggingface.co/datasets/JuDDGES/pl-court-graph/resolve/main/"
        "data/pyg_judgment_graph.pt?download=true"
    )

    def __init__(self, root_dir: str, transform=None, pre_transform=None):
        super().__init__(root_dir, transform, pre_transform)
        data_file, index_file = self.processed_paths
        self.load(data_file)
        self.judgment_idx_2_iid, self.legal_base_idx_2_isap_id = torch.load(index_file).values()

    @property
    def raw_file_names(self) -> str:
        return "pyg_judgment_graph.pt"

    @property
    def processed_file_names(self) -> list[str]:
        return ["processed_pyg_judgment_graph.pt", "index_map.pt"]

    def download(self) -> None:
        os.makedirs(self.root, exist_ok=True)
        # URL already points at the raw .pt file, so it is downloaded as-is into raw_dir.
        download_url(self.URL, self.raw_dir)

    def process(self) -> None:
        dataset = torch.load(self.raw_paths[0])
        data = dataset["data"]

        if self.pre_transform is not None:
            data = self.pre_transform(data)

        data_file, index_file = self.processed_paths
        self.save([data], data_file)

        # Keep the node-index mappings alongside the processed graph.
        torch.save(
            {
                "judgment_idx_2_iid": dataset["judgment_idx_2_iid"],
                "legal_base_idx_2_isap_id": dataset["legal_base_idx_2_isap_id"],
            },
            index_file,
        )

    def __repr__(self) -> str:
        return f"{self.__class__.__name__}({len(self)})"


ds = PlCourtGraphDataset(root_dir="data/datasets/pyg")
print(ds)
```

### Example usage
```python
# TBD
```
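
Until the example above is filled in, a minimal sketch (assuming the `PlCourtGraphDataset` class from the loading snippet and a homogeneous `Data` object with an `edge_index`) could look as follows; `RandomLinkSplit` comes from standard PyG transforms:

```python
from torch_geometric.transforms import RandomLinkSplit

ds = PlCourtGraphDataset(root_dir="data/datasets/pyg")
data = ds[0]
print(data.num_nodes, data.num_edges)

# Hypothetical link-prediction setup: hold out edges for validation and testing,
# with negative samples generated for supervision.
transform = RandomLinkSplit(num_val=0.1, num_test=0.1, add_negative_train_samples=True)
train_data, val_data, test_data = transform(data)
print(train_data)
```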
116 changes: 107 additions & 9 deletions dvc.lock
@@ -203,12 +203,12 @@ stages:
size: 387238698
nfiles: 44
embed@mmlw-roberta-large:
cmd: PYTHONPATH=. python scripts/embed_text.py embedding_model=mmlw-roberta-large
cmd: PYTHONPATH=. python scripts/embed/embed_text.py embedding_model=mmlw-roberta-large
deps:
- path: configs/embedding.yaml
hash: md5
md5: e7515e27e3bb7ddb3a2062a46efaa773
size: 408
md5: 8eb43bec3f5fe10d5c1c5cfefc5d6fe5
size: 379
- path: configs/embedding_model/mmlw-roberta-large.yaml
hash: md5
md5: 22f36cfd196c0fdc3cfd8a036d52b606
@@ -218,15 +218,15 @@
md5: 5dd44be2eea852bcce3d0918ff8b97da.dir
size: 10234880729
nfiles: 17
- path: scripts/embed_text.py
- path: scripts/embed/embed_text.py
hash: md5
md5: 5813b589760b00ce693365a36c519ef0
size: 3384
md5: f3288be3419e01ebc2be904d52cbaab0
size: 3451
outs:
- path: data/embeddings/pl-court-raw/mmlw-roberta-large
- path: data/embeddings/pl-court-raw/mmlw-roberta-large/all_embeddings
hash: md5
md5: b8ab133416880430c1f2d3d8357ffd7f.dir
size: 24368123117
md5: 1a086db46b90b0f3c4c66c3ecefe8adb.dir
size: 24415235644
nfiles: 53
predict@Unsloth-Llama-3-8B-Instruct-fine-tuned:
cmd: PYTHONPATH=. python scripts/sft/predict.py model=Unsloth-Llama-3-8B-Instruct-fine-tuned
@@ -341,6 +341,81 @@ stages:
hash: md5
md5: 091b8888275600052dd2dcdd36a55588
size: 305
aggregate_embeddings@mmlw-roberta-large:
cmd: PYTHONPATH=. python scripts/embed/aggregate_embeddings.py --embeddings-dir
data/embeddings/pl-court-raw/mmlw-roberta-large/all_embeddings
deps:
- path: data/embeddings/pl-court-raw/mmlw-roberta-large/all_embeddings
hash: md5
md5: 1a086db46b90b0f3c4c66c3ecefe8adb.dir
size: 24415235644
nfiles: 53
- path: scripts/embed/aggregate_embeddings.py
hash: md5
md5: 5b47bbdd9476d2a6f2ef43990be156f2
size: 1800
outs:
- path: data/embeddings/pl-court-raw/mmlw-roberta-large/agg_embeddings.pt
hash: md5
md5: 0d84b4da5513feeb6ca9bad70a2ff164
size: 1725566207
generate_graph_dataset:
cmd: PYTHONPATH=. python scripts/dataset/generate_graph_dataset.py --dataset-dir
data/datasets/pl/raw --embeddings-root-dir data/embeddings/pl-court-raw/mmlw-roberta-large/
--target-dir data/datasets/pl/graph
deps:
- path: data/datasets/pl/raw
hash: md5
md5: 5dd44be2eea852bcce3d0918ff8b97da.dir
size: 10234880729
nfiles: 17
- path: data/embeddings/pl-court-raw/mmlw-roberta-large/agg_embeddings.pt
hash: md5
md5: 0d84b4da5513feeb6ca9bad70a2ff164
size: 1725566207
- path: data/embeddings/pl-court-raw/mmlw-roberta-large/all_embeddings/config.yaml
hash: md5
md5: fbb5585b8c3ef28255801d38c9248f8e
size: 502
- path: juddges/data/pl_court_graph.py
hash: md5
md5: 730e3d92be26408bd6dc26606b4c22ff
size: 4974
- path: scripts/dataset/generate_graph_dataset.py
hash: md5
md5: 3561a57587e54d1ed92deae0db8b66a4
size: 1189
outs:
- path: data/datasets/pl/graph/data
hash: md5
md5: f2820796cff4578c11ffcb0fa6cdadd7.dir
size: 1823760294
nfiles: 2
- path: data/datasets/pl/graph/metadata.yaml
hash: md5
md5: 68b09dd0ce741e6ee1fff4e37c954fa6
size: 564
predict@Unsloth-Llama-3-8B-Instruct:
cmd: PYTHONPATH=. python scripts/sft/predict.py model=Unsloth-Llama-3-8B-Instruct
deps:
- path: configs/model/Unsloth-Llama-3-8B-Instruct.yaml
hash: md5
md5: e97bb2e6bf39f75edea7714d6ba58b77
size: 160
- path: configs/predict.yaml
hash: md5
md5: e6b047cf62e612a32381d6221eb99b4e
size: 416
- path: scripts/sft/predict.py
hash: md5
md5: 69e4844a715c9c5c75e1127a06472ad4
size: 3148
outs:
- path:
data/experiments/predict/pl-court-instruct/outputs_Unsloth-Llama-3-8B-Instruct.json
hash: md5
md5: df2f1d464152f87737c8ebb5b0673854
size: 2179383
[email protected]:
cmd: PYTHONPATH=. python scripts/sft/predict.py model=Unsloth-Mistral-7B-Instruct-v0.3-fine-tuned
deps:
@@ -478,3 +553,26 @@
hash: md5
md5: 2d1b6a392152f2e022a33553265e141a
size: 306
graph_dataset_readme:
cmd: jupyter nbconvert --no-input --to markdown --execute nbs/Data/03_Graph_Dataset_Description.ipynb
--output-dir data/datasets/pl/graph --output README
deps:
- path: data/datasets/pl/graph/data
hash: md5
md5: 0fc182cc099217043866ef3c488ce00e.dir
size: 1824126514
nfiles: 2
- path: nbs/Data/03_Graph_Dataset_Description.ipynb
hash: md5
md5: f690f997d78d356fa369f6c548ab0dd7
size: 43107
outs:
- path: data/datasets/pl/graph/README.md
hash: md5
md5: 460453f24ea5c20ea88ac8c11a854138
size: 4155
- path: data/datasets/pl/graph/README_files
hash: md5
md5: cabe6e2cc1195b673b68dcca8fe4705d.dir
size: 25265
nfiles: 1
37 changes: 33 additions & 4 deletions dvc.yaml
@@ -13,16 +13,45 @@ stages:
matrix:
model:
- mmlw-roberta-large
# - e5-mistral-7b-instruct
cmd: >-
PYTHONPATH=. python scripts/embed_text.py embedding_model=${item.model}
PYTHONPATH=. python scripts/embed/embed_text.py embedding_model=${item.model}
deps:
- scripts/embed_text.py
- scripts/embed/embed_text.py
- configs/embedding.yaml
- configs/embedding_model/${item.model}.yaml
- data/datasets/pl/raw
outs:
- data/embeddings/pl-court-raw/${item.model}
- data/embeddings/pl-court-raw/${item.model}/all_embeddings

aggregate_embeddings:
matrix:
model:
- mmlw-roberta-large
cmd: >-
PYTHONPATH=. python scripts/embed/aggregate_embeddings.py
--embeddings-dir data/embeddings/pl-court-raw/${item.model}/all_embeddings
deps:
- scripts/embed/aggregate_embeddings.py
- data/embeddings/pl-court-raw/${item.model}/all_embeddings
outs:
- data/embeddings/pl-court-raw/${item.model}/agg_embeddings.pt


generate_graph_dataset:
cmd: >-
PYTHONPATH=. python scripts/dataset/generate_graph_dataset.py
--dataset-dir data/datasets/pl/raw
--embeddings-root-dir data/embeddings/pl-court-raw/mmlw-roberta-large/
--target-dir data/datasets/pl/graph
deps:
- scripts/dataset/generate_graph_dataset.py
- juddges/data/pl_court_graph.py
- data/datasets/pl/raw
- data/embeddings/pl-court-raw/mmlw-roberta-large/agg_embeddings.pt
- data/embeddings/pl-court-raw/mmlw-roberta-large/all_embeddings/config.yaml
outs:
- data/datasets/pl/graph/data
- data/datasets/pl/graph/metadata.yaml

sft:
matrix: