Commit

Fix pl-court-raw dataset by filtering records with empty content from
binkjakub committed Aug 27, 2024
1 parent 7d40fc5 commit 5afd7e3
Showing 3 changed files with 6 additions and 6 deletions.
4 changes: 2 additions & 2 deletions data/datasets/pl/raw.dvc
@@ -1,6 +1,6 @@
 outs:
-- md5: 5dd44be2eea852bcce3d0918ff8b97da.dir
-  size: 10234880729
+- md5: 622ba21868561c26fb6877ad95bfb5c5.dir
+  size: 10234505621
   nfiles: 17
   hash: md5
   path: raw
5 changes: 3 additions & 2 deletions scripts/README.md
@@ -52,8 +52,9 @@ MONGO_DB_NAME="datasets"
 to remote storage:
 ```shell
 PYTHONPATH=. python scripts/dataset/dump_pl_dataset.py \
-    --file-name data/datasets/pl/raw/raw.parquet
-dvc add data/datasets/pl/raw/raw.parquet && dvc push
+    --file-name data/datasets/pl/raw/raw.parquet \
+    --filter-empty-content
+dvc add data/datasets/pl/raw && dvc push
 ```
 7. Generate dataset card for `pl-court-raw`
 ```shell
3 changes: 1 addition & 2 deletions scripts/dataset/dump_pl_dataset.py
@@ -29,12 +29,11 @@ def main(
     collection = get_mongo_collection(mongo_uri=mongo_uri)

     if filter_empty_content:
-        query = {"content": {"$ne": True}}
+        query = {"content": {"$ne": None}}
     else:
         query = {}

     num_docs = collection.count_documents(query)
-
     dumped_data = list(file_name.parent.glob("*.parquet"))
     start_offset = 0
     if dumped_data:

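The `query` change in `dump_pl_dataset.py` is the core of this fix, so a small illustration may help. The sketch below does not talk to MongoDB; `matches_ne` is a hypothetical helper that approximates the `$ne` operator for simple scalar fields, and the three sample records are made up. It is meant only to show why `{"$ne": True}` lets records without content through, while `{"$ne": None}` drops them.

```python
from typing import Any

_MISSING = object()  # sentinel for "field not present in the document"


def matches_ne(doc: dict, field: str, operand: Any) -> bool:
    """Rough, hypothetical emulation of {field: {"$ne": operand}}.

    Simplified to scalar values; real MongoDB comparison rules
    (arrays, numeric type bridging, etc.) are not modelled here.
    """
    value = doc.get(field, _MISSING)
    if operand is None:
        # $ne: null rejects explicit nulls and missing fields alike.
        return value is not _MISSING and value is not None
    # Otherwise a document matches unless the field holds that exact value.
    return not (type(value) is type(operand) and value == operand)


# Made-up sample records, not taken from the real collection.
records = [
    {"_id": 1, "content": "some judgment text"},
    {"_id": 2, "content": None},
    {"_id": 3},  # no "content" field at all
]

old_ids = [d["_id"] for d in records if matches_ne(d, "content", True)]
new_ids = [d["_id"] for d in records if matches_ne(d, "content", None)]

print(old_ids)  # [1, 2, 3]
print(new_ids)  # [1]
```

Note that, under the same assumptions, an empty string would still pass the `None` check, so "empty content" here means null or missing content rather than zero-length text.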