Add graph dataset (#19)

* Add graph dataset construction
* Add stages with aggregating embeddings and reproduce embedding
* Add script for ingesting embeddings to mongodb
* Add and reproduce stage with graph generation
* Make dataset dump skipping embedding column
* Refine structure of graph-dataset generation and add upload script pushing it to hf-hub
* Extend set of attributes in graph generation, refine dataset card
* Add graph analysis notebook with example use case
pwr-ai · Jun 7, 2024 · af0b99c
1 parent 18d4ecd
commit af0b99c
Showing 19 changed files with 1,319 additions and 26 deletions.
diff --git a/configs/embedding.yaml b/configs/embedding.yaml
@@ -3,17 +3,14 @@ defaults:
   - dataset: pl-court-raw
   - _self_
 
-# length_adjust_mode: truncate
-# truncation_tokens: 4096
-# batch_size: 24
 length_adjust_mode: chunk
 chunk_config:
-  chunk_size: 512
+  chunk_size: ${embedding_model.max_seq_length}
   min_split_chars: 10
   take_n_first_chunks: 16
 batch_size: 64
 
-output_dir: data/embeddings/${dataset.name}/${hydra:runtime.choices.embedding_model}
+output_dir: data/embeddings/${dataset.name}/${hydra:runtime.choices.embedding_model}/all_embeddings
 
 hydra:  
   output_subdir: null  
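
The change above ties `chunk_size` to the embedding model's `max_seq_length` through a `${...}` interpolation instead of a hard-coded `512`. The following is a minimal, standard-library sketch of how such a reference could be resolved. It is not Hydra's or OmegaConf's implementation, and the value `512` for `max_seq_length` is an assumption made only for illustration.

```python
import re

# Hypothetical, trimmed-down config mirroring configs/embedding.yaml.
# The max_seq_length value is assumed for illustration only.
config = {
    "embedding_model": {"max_seq_length": 512},
    "chunk_config": {
        "chunk_size": "${embedding_model.max_seq_length}",
        "min_split_chars": 10,
        "take_n_first_chunks": 16,
    },
}

# Matches values that consist solely of one "${dotted.key}" reference.
_REFERENCE = re.compile(r"^\$\{([\w.]+)\}$")


def lookup(root: dict, dotted_key: str):
    """Follow a dotted key such as 'a.b' through nested dicts."""
    node = root
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(node, root: dict):
    """Recursively replace whole-value '${a.b}' references with their targets."""
    if isinstance(node, dict):
        return {key: resolve(value, root) for key, value in node.items()}
    if isinstance(node, str):
        match = _REFERENCE.match(node)
        if match:
            return resolve(lookup(root, match.group(1)), root)
    return node


resolved = resolve(config, config)
print(resolved["chunk_config"]["chunk_size"])
```

With this pattern, switching to an embedding model with a different sequence length changes the chunk size without touching `configs/embedding.yaml`.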

diff --git a/data/datasets/pl/.gitignore b/data/datasets/pl/.gitignore
@@ -1 +1,5 @@
 /raw
+/graph/data
+/graph/metadata.yaml
+/graph/README.md
+/graph/README_files
diff --git a/data/datasets/pl/graph/assets/degree_distribution.png b/data/datasets/pl/graph/assets/degree_distribution.png
diff --git a/data/datasets/pl/graph/template_README.md b/data/datasets/pl/graph/template_README.md
@@ -0,0 +1,137 @@
+---
+language: {{language}}
+size_categories: {{size_categories}}
+source_datasets: {{source_datasets}}
+pretty_name: {{pretty_name}}
+viewer: {{viewer}}
+tags: {{tags}}
+---
+
+# Polish Court Judgments Graph
+
+## Dataset description
+We introduce a graph dataset of Polish court judgments, based primarily on [`JuDDGES/pl-court-raw`](https://huggingface.co/datasets/JuDDGES/pl-court-raw). Nodes represent either judgments or legal bases, and each edge connects a judgment to a legal base it refers to, so the graph is bipartite. Small disconnected components were removed, leaving a single giant component. We provide the dataset in both `JSON` and `PyG` formats, each serving a different purpose. The graphs in the two formats are structurally identical, but their attributes differ.
+
+The `JSON` format is intended for analysis and contains most of the attributes available in [`JuDDGES/pl-court-raw`](https://huggingface.co/datasets/JuDDGES/pl-court-raw). We excluded some less-useful attributes and text content, which can be easily retrieved from the raw dataset and added to the graph as needed.
+
+The `PyG` format is designed for machine learning applications, such as link prediction on graphs, and is fully compatible with the [`PyTorch Geometric`](https://github.com/pyg-team/pytorch_geometric) framework.
+
+In the following sections, we provide a more detailed explanation and use case examples for each format.
+
+## Dataset statistics
+
+| feature                    | value                |
+|----------------------------|----------------------|
+| #nodes                     | {{num_nodes}}        |
+| #edges                     | {{num_edges}}        |
+| #nodes (type=`judgment`)  | {{num_src_nodes}}    |
+| #nodes (type=`legal_base`) | {{num_target_nodes}} |
+| avg(degree)                | {{avg_degree}}       |
+
+
+![png](assets/degree_distribution.png)
+
+
+
+## `JSON` format
+
+In the `JSON` format, node types are distinguished by the `node_type` attribute. Each node type has its own set of additional attributes (see [`JuDDGES/pl-court-raw`](https://huggingface.co/datasets/JuDDGES/pl-court-raw) for a detailed description of each attribute):
+
+| node_type    | attributes                                                                                                          |
+|--------------|---------------------------------------------------------------------------------------------------------------------|
+| `judgment`   | {{judgment_attributes}}  |
+| `legal_base` | {{legal_base_attributes}}                                                                                                |
+
+### Loading
+The graph in `JSON` format is stored as node-link data and can be loaded directly with the `networkx` library:
+
+```python
+import json
+import networkx as nx
+from huggingface_hub import hf_hub_download
+
+DATA_DIR = "<your_local_data_directory>"
+JSON_FILE = "data/judgment_graph.json"
+hf_hub_download(repo_id="JuDDGES/pl-court-graph", repo_type="dataset", filename=JSON_FILE, local_dir=DATA_DIR)
+
+with open(f"{DATA_DIR}/{JSON_FILE}") as file:
+    g_data = json.load(file)
+
+g = nx.node_link_graph(g_data)
+```
+
+### Example usage
+```python
+# TBD
+```
+
+## `PyG` format
+
+The `PyTorch Geometric` format includes embeddings of the judgment content, obtained with [{{embedding_method}}](https://huggingface.co/{{embedding_method}}) for judgment nodes,
+and one-hot identifier vectors for legal-base nodes (note that, for efficiency, these can be replaced with random-noise identifiers,
+as in [(Abboud et al., 2021)](https://arxiv.org/abs/2010.01179)).
+
+
+
+### Loading
+To load the graph as a PyTorch Geometric dataset, one can use the following code snippet:
+```python
+import torch
+import os
+from torch_geometric.data import InMemoryDataset, download_url
+
+
+class PlCourtGraphDataset(InMemoryDataset):
+    URL = (
+        "https://huggingface.co/datasets/JuDDGES/pl-court-graph/resolve/main/"
+        "data/pyg_judgment_graph.pt?download=true"
+    )
+
+    def __init__(self, root_dir: str, transform=None, pre_transform=None):
+        super(PlCourtGraphDataset, self).__init__(root_dir, transform, pre_transform)
+        data_file, index_file = self.processed_paths
+        self.load(data_file)
+        self.judgment_idx_2_iid, self.legal_base_idx_2_isap_id = torch.load(index_file).values()
+
+    @property
+    def raw_file_names(self) -> str:
+        return "pyg_judgment_graph.pt"
+
+    @property
+    def processed_file_names(self) -> list[str]:
+        return ["processed_pyg_judgment_graph.pt", "index_map.pt"]
+
+    def download(self) -> None:
+        os.makedirs(self.root, exist_ok=True)
+        download_url(self.URL, self.raw_dir)
+
+    def process(self) -> None:
+        dataset = torch.load(self.raw_paths[0])
+        data = dataset["data"]
+
+        if self.pre_transform is not None:
+            data = self.pre_transform(data)
+
+        data_file, index_file = self.processed_paths
+        self.save([data], data_file)
+
+        torch.save(
+            {
+                "judgment_idx_2_iid": dataset["judgment_idx_2_iid"],
+                "legal_base_idx_2_isap_id": dataset["legal_base_idx_2_isap_id"],
+            },
+            index_file,
+        )
+
+    def __repr__(self) -> str:
+        return f"{self.__class__.__name__}({len(self)})"
+
+
+ds = PlCourtGraphDataset(root_dir="data/datasets/pyg")
+print(ds)
+```
+
+### Example usage
+```python
+# TBD
+```
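
The dataset card's statistics table (node and edge counts, per-type node counts, average degree) and its bipartite claim can be checked directly on the node-link data. The sketch below does this with the standard library only, on a tiny hand-made graph with made-up IDs. It assumes an undirected graph and assumes edges are stored under the `links` key (the long-standing `networkx` default); the real file may differ, which is why `edge_key` is a parameter. The function name `graph_stats` is hypothetical.

```python
from collections import Counter

# Tiny hand-made graph in node-link form; IDs are made up for illustration.
# The "links" key is an assumption about how edges are stored.
g_data = {
    "directed": False,
    "multigraph": False,
    "graph": {},
    "nodes": [
        {"id": 0, "node_type": "judgment"},
        {"id": 1, "node_type": "judgment"},
        {"id": 2, "node_type": "legal_base"},
        {"id": 3, "node_type": "legal_base"},
    ],
    "links": [
        {"source": 0, "target": 2},
        {"source": 0, "target": 3},
        {"source": 1, "target": 2},
    ],
}


def graph_stats(data: dict, edge_key: str = "links") -> dict:
    """Compute the statistics listed in the dataset card from node-link data."""
    node_type = {node["id"]: node["node_type"] for node in data["nodes"]}
    edges = data[edge_key]
    type_counts = Counter(node_type.values())
    # Bipartite check: every edge must join one judgment and one legal base.
    only_cross_type_edges = all(
        {node_type[edge["source"]], node_type[edge["target"]]}
        == {"judgment", "legal_base"}
        for edge in edges
    )
    return {
        "num_nodes": len(node_type),
        "num_edges": len(edges),
        "num_judgments": type_counts["judgment"],
        "num_legal_bases": type_counts["legal_base"],
        # Each undirected edge adds one to the degree of two nodes.
        "avg_degree": 2 * len(edges) / len(node_type),
        "only_cross_type_edges": only_cross_type_edges,
    }


stats = graph_stats(g_data)
print(stats)
```

On the real dataset, `g_data` would be the dictionary returned by `json.load` in the loading snippet of the template.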
diff --git a/dvc.lock b/dvc.lock
@@ -203,12 +203,12 @@ stages:
       size: 387238698
       nfiles: 44
   embed@mmlw-roberta-large:
-    cmd: PYTHONPATH=. python scripts/embed_text.py embedding_model=mmlw-roberta-large
+    cmd: PYTHONPATH=. python scripts/embed/embed_text.py embedding_model=mmlw-roberta-large
     deps:
     - path: configs/embedding.yaml
       hash: md5
-      md5: e7515e27e3bb7ddb3a2062a46efaa773
-      size: 408
+      md5: 8eb43bec3f5fe10d5c1c5cfefc5d6fe5
+      size: 379
     - path: configs/embedding_model/mmlw-roberta-large.yaml
       hash: md5
       md5: 22f36cfd196c0fdc3cfd8a036d52b606
@@ -218,15 +218,15 @@ stages:
       md5: 5dd44be2eea852bcce3d0918ff8b97da.dir
       size: 10234880729
       nfiles: 17
-    - path: scripts/embed_text.py
+    - path: scripts/embed/embed_text.py
       hash: md5
-      md5: 5813b589760b00ce693365a36c519ef0
-      size: 3384
+      md5: f3288be3419e01ebc2be904d52cbaab0
+      size: 3451
     outs:
-    - path: data/embeddings/pl-court-raw/mmlw-roberta-large
+    - path: data/embeddings/pl-court-raw/mmlw-roberta-large/all_embeddings
       hash: md5
-      md5: b8ab133416880430c1f2d3d8357ffd7f.dir
-      size: 24368123117
+      md5: 1a086db46b90b0f3c4c66c3ecefe8adb.dir
+      size: 24415235644
       nfiles: 53
   predict@Unsloth-Llama-3-8B-Instruct-fine-tuned:
     cmd: PYTHONPATH=. python scripts/sft/predict.py model=Unsloth-Llama-3-8B-Instruct-fine-tuned
@@ -341,6 +341,81 @@ stages:
       hash: md5
       md5: 091b8888275600052dd2dcdd36a55588
       size: 305
+  aggregate_embeddings@mmlw-roberta-large:
+    cmd: PYTHONPATH=. python scripts/embed/aggregate_embeddings.py  --embeddings-dir
+      data/embeddings/pl-court-raw/mmlw-roberta-large/all_embeddings
+    deps:
+    - path: data/embeddings/pl-court-raw/mmlw-roberta-large/all_embeddings
+      hash: md5
+      md5: 1a086db46b90b0f3c4c66c3ecefe8adb.dir
+      size: 24415235644
+      nfiles: 53
+    - path: scripts/embed/aggregate_embeddings.py
+      hash: md5
+      md5: 5b47bbdd9476d2a6f2ef43990be156f2
+      size: 1800
+    outs:
+    - path: data/embeddings/pl-court-raw/mmlw-roberta-large/agg_embeddings.pt
+      hash: md5
+      md5: 0d84b4da5513feeb6ca9bad70a2ff164
+      size: 1725566207
+  generate_graph_dataset:
+    cmd: PYTHONPATH=. python scripts/dataset/generate_graph_dataset.py --dataset-dir
+      data/datasets/pl/raw  --embeddings-root-dir data/embeddings/pl-court-raw/mmlw-roberta-large/
+      --target-dir data/datasets/pl/graph
+    deps:
+    - path: data/datasets/pl/raw
+      hash: md5
+      md5: 5dd44be2eea852bcce3d0918ff8b97da.dir
+      size: 10234880729
+      nfiles: 17
+    - path: data/embeddings/pl-court-raw/mmlw-roberta-large/agg_embeddings.pt
+      hash: md5
+      md5: 0d84b4da5513feeb6ca9bad70a2ff164
+      size: 1725566207
+    - path: data/embeddings/pl-court-raw/mmlw-roberta-large/all_embeddings/config.yaml
+      hash: md5
+      md5: fbb5585b8c3ef28255801d38c9248f8e
+      size: 502
+    - path: juddges/data/pl_court_graph.py
+      hash: md5
+      md5: 730e3d92be26408bd6dc26606b4c22ff
+      size: 4974
+    - path: scripts/dataset/generate_graph_dataset.py
+      hash: md5
+      md5: 3561a57587e54d1ed92deae0db8b66a4
+      size: 1189
+    outs:
+    - path: data/datasets/pl/graph/data
+      hash: md5
+      md5: f2820796cff4578c11ffcb0fa6cdadd7.dir
+      size: 1823760294
+      nfiles: 2
+    - path: data/datasets/pl/graph/metadata.yaml
+      hash: md5
+      md5: 68b09dd0ce741e6ee1fff4e37c954fa6
+      size: 564
+  predict@Unsloth-Llama-3-8B-Instruct:
+    cmd: PYTHONPATH=. python scripts/sft/predict.py model=Unsloth-Llama-3-8B-Instruct
+    deps:
+    - path: configs/model/Unsloth-Llama-3-8B-Instruct.yaml
+      hash: md5
+      md5: e97bb2e6bf39f75edea7714d6ba58b77
+      size: 160
+    - path: configs/predict.yaml
+      hash: md5
+      md5: e6b047cf62e612a32381d6221eb99b4e
+      size: 416
+    - path: scripts/sft/predict.py
+      hash: md5
+      md5: 69e4844a715c9c5c75e1127a06472ad4
+      size: 3148
+    outs:
+    - path: 
+        data/experiments/predict/pl-court-instruct/outputs_Unsloth-Llama-3-8B-Instruct.json
+      hash: md5
+      md5: df2f1d464152f87737c8ebb5b0673854
+      size: 2179383
   predict@Unsloth-Mistral-7B-Instruct-v0.3-fine-tuned:
     cmd: PYTHONPATH=. python scripts/sft/predict.py model=Unsloth-Mistral-7B-Instruct-v0.3-fine-tuned
     deps:
@@ -478,3 +553,26 @@ stages:
       hash: md5
       md5: 2d1b6a392152f2e022a33553265e141a
       size: 306
+  graph_dataset_readme:
+    cmd: jupyter nbconvert  --no-input  --to markdown  --execute nbs/Data/03_Graph_Dataset_Description.ipynb
+      --output-dir data/datasets/pl/graph --output README
+    deps:
+    - path: data/datasets/pl/graph/data
+      hash: md5
+      md5: 0fc182cc099217043866ef3c488ce00e.dir
+      size: 1824126514
+      nfiles: 2
+    - path: nbs/Data/03_Graph_Dataset_Description.ipynb
+      hash: md5
+      md5: f690f997d78d356fa369f6c548ab0dd7
+      size: 43107
+    outs:
+    - path: data/datasets/pl/graph/README.md
+      hash: md5
+      md5: 460453f24ea5c20ea88ac8c11a854138
+      size: 4155
+    - path: data/datasets/pl/graph/README_files
+      hash: md5
+      md5: cabe6e2cc1195b673b68dcca8fe4705d.dir
+      size: 25265
+      nfiles: 1
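
Each single-file entry in `dvc.lock` records an `md5` and a `size`, which is how a changed dependency or output is detected. The sketch below shows one way to compare a local file against such an entry. It assumes that, for entries marked `hash: md5`, the digest is the MD5 of the file's raw bytes. Directory entries (those whose hash ends in `.dir`) are computed differently and are not covered. The helper names are hypothetical and are not part of DVC.

```python
import hashlib
import os
import tempfile


def file_md5_and_size(path: str, chunk_bytes: int = 1 << 20) -> tuple[str, int]:
    """Return the hex MD5 digest and byte size of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)


def matches_lock_entry(path: str, entry: dict) -> bool:
    """Check a file against a dict shaped like a single-file dvc.lock entry."""
    md5, size = file_md5_and_size(path)
    return md5 == entry["md5"] and size == entry["size"]


# Demonstration on a throwaway file rather than on real pipeline outputs.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "wb") as handle:
        handle.write(b"hello")

    demo_md5, demo_size = file_md5_and_size(demo_path)
    entry = {"md5": demo_md5, "size": demo_size}
    unchanged = matches_lock_entry(demo_path, entry)

    # Appending a byte changes both the digest and the size.
    with open(demo_path, "ab") as handle:
        handle.write(b"!")
    changed = not matches_lock_entry(demo_path, entry)

print(demo_md5, demo_size, unchanged, changed)
```

A mismatch of this kind is what causes a stage to be treated as stale and rerun on the next reproduction of the pipeline.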
diff --git a/dvc.yaml b/dvc.yaml
@@ -13,16 +13,45 @@ stages:
     matrix:
       model:
         - mmlw-roberta-large
-        # - e5-mistral-7b-instruct
     cmd: >-
-      PYTHONPATH=. python scripts/embed_text.py embedding_model=${item.model}
+      PYTHONPATH=. python scripts/embed/embed_text.py embedding_model=${item.model}
     deps:
-      - scripts/embed_text.py
+      - scripts/embed/embed_text.py
       - configs/embedding.yaml
       - configs/embedding_model/${item.model}.yaml
       - data/datasets/pl/raw
     outs:
-      - data/embeddings/pl-court-raw/${item.model}
+      - data/embeddings/pl-court-raw/${item.model}/all_embeddings
+
+  aggregate_embeddings:
+    matrix:
+      model:
+        - mmlw-roberta-large
+    cmd: >-
+      PYTHONPATH=. python scripts/embed/aggregate_embeddings.py 
+      --embeddings-dir data/embeddings/pl-court-raw/${item.model}/all_embeddings
+    deps:
+      - scripts/embed/aggregate_embeddings.py
+      - data/embeddings/pl-court-raw/${item.model}/all_embeddings
+    outs:
+      - data/embeddings/pl-court-raw/${item.model}/agg_embeddings.pt
+
+
+  generate_graph_dataset:
+    cmd: >-
+      PYTHONPATH=. python scripts/dataset/generate_graph_dataset.py
+      --dataset-dir data/datasets/pl/raw 
+      --embeddings-root-dir data/embeddings/pl-court-raw/mmlw-roberta-large/
+      --target-dir data/datasets/pl/graph
+    deps:
+      - scripts/dataset/generate_graph_dataset.py
+      - juddges/data/pl_court_graph.py
+      - data/datasets/pl/raw 
+      - data/embeddings/pl-court-raw/mmlw-roberta-large/agg_embeddings.pt
+      - data/embeddings/pl-court-raw/mmlw-roberta-large/all_embeddings/config.yaml
+    outs:
+      - data/datasets/pl/graph/data
+      - data/datasets/pl/graph/metadata.yaml
 
   sft:
     matrix: