Feat/data augmentation #13

jamnicki · 2024-04-27T14:09:36Z

No description provided.

* Add scripts for downloading data from polish court API
* Refine logging
* Refactor to make mongo bulk writes
* Fix typing errors
* Add missing dependency
* Refine retiries and log warning on invalid pl_court_api params

---------

Co-authored-by: Jakub Binkowski <[email protected]>

* Improve error handling in download_pl_content.py
* Add dataset dump scrip
* Add pl dataset to DVC
* Add simple data analysis notebook
* Extract text from pl judgements
* Refine text extraction and add analysis
* Add addtional details download and ingest
* Refine extraction and ingest extracted data to mongo
* Add script for chunked embeddings

---------

Co-authored-by: Jakub Binkowski <[email protected]>

* first information extraction schema
* mlflow tracking
* add streamlit app
* v1 prompt ready
* nbdev
* get mongo docs
* Parse pl judgements (#4)
* Improve error handling in download_pl_content.py
* Add dataset dump scrip
* Add pl dataset to DVC
* Add simple data analysis notebook
* Extract text from pl judgements
* Refine text extraction and add analysis
* Add addtional details download and ingest
* Refine extraction and ingest extracted data to mongo
* Add script for chunked embeddings

---------

Co-authored-by: Jakub Binkowski <[email protected]>

* first information extraction schema
* mlflow tracking
* add streamlit app
* v1 prompt ready
* nbdev
* get mongo docs
* update nbdev
* small fixes

---------

Co-authored-by: Jakub Binkowski <[email protected]>
Co-authored-by: Jakub Binkowski <[email protected]>

* first information extraction schema
* mlflow tracking
* add streamlit app
* v1 prompt ready
* nbdev
* get mongo docs
* Add chain for transforming user queries into schema
* merge artifact

---------

Co-authored-by: Łukasz Augustyniak <[email protected]>
Co-authored-by: Jakub Binkowski <[email protected]>

laugustyniak · 2024-05-15T11:26:57Z

Makefile

@@ -2,7 +2,7 @@ lint_dirs := juddges scripts dashboards tests
 mypy_dirs := juddges scripts dashboards tests

 fix:
-	ruff check $(lint_dirs) --fix
+	ruff check $(lint_dirs) setup.py --fix


I don't see a reason to exclude this script from the lint/format stages. Some IDEs reformat the script on save after even the smallest change, which can become annoying later.

laugustyniak · 2024-05-15T11:40:47Z

juddges/data/qa_pairs_json_parser.py

+from langchain_core.outputs import Generation
+from langchain_core.utils.json import _parse_json
+
+CUSTOM_PARSE_JSON_MARKDOWN = re.compile(


See https://python.langchain.com/v0.1/docs/modules/model_io/output_parsers/types/json/

The original JsonOutputParser always parses the output successfully as long as the prompt does not contain any other JSON structure. The parse_json_markdown pattern is also greedy, which is not ideal.

https://github.com/langchain-ai/langchain/pull/20305/files#diff-7736e09b3c57a6d2f3803974d6022d46a6c7cb44f8688af5a72bea77d9db5124L139
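To illustrate the concern about a greedy pattern, here is a minimal sketch using only the standard library. The two patterns, the helper `extract_first_json`, and the sample text are hypothetical and are not copied from LangChain or from `qa_pairs_json_parser.py`; they only show how a greedy capture can swallow a second fenced JSON block while a non-greedy one stops at the first closing fence.

```python
import json
import re

# Build the fence marker programmatically so this snippet can sit inside a fenced block.
FENCE = "`" * 3

# Greedy variant (assumed shape): `.*` extends to the LAST closing fence in the text.
GREEDY = re.compile(FENCE + r"(?:json)?(.*)" + FENCE, re.DOTALL)
# Non-greedy variant: `.*?` stops at the FIRST closing fence.
LAZY = re.compile(FENCE + r"(?:json)?(.*?)" + FENCE, re.DOTALL)


def extract_first_json(text: str, pattern: re.Pattern) -> dict:
    """Parse the content captured by `pattern` as JSON (hypothetical helper)."""
    match = pattern.search(text)
    if match is None:
        raise ValueError("no fenced block found")
    return json.loads(match.group(1).strip())


# Hypothetical model output: the answer block followed by a second JSON block.
sample = (
    "Answer:\n"
    + FENCE + "json\n"
    + '{"question": "Q1", "answer": "A1"}\n'
    + FENCE + "\n"
    + "Schema reminder:\n"
    + FENCE + "json\n"
    + '{"question": "string"}\n'
    + FENCE + "\n"
)

# The non-greedy pattern isolates the first block, which is valid JSON.
print(extract_first_json(sample, LAZY))

# The greedy pattern captures both blocks plus the text between them.
try:
    extract_first_json(sample, GREEDY)
except json.JSONDecodeError as err:
    print("greedy capture is not valid JSON:", err.msg)
```

With more than one fenced block in the text, the greedy capture spans everything between the first opening fence and the last closing fence, so `json.loads` rejects it. That is the situation a custom pattern would need to avoid.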

juddges/data/utils.py

asawczyn and others added 16 commits February 28, 2024 11:03

feat/CI (#1)

25fc8aa

* feat: create CI
* fix: add push trigger
* refactor: rename jobs, add dummy script
* fix: run nbdev_prepare
* fix: add ruff exclude
* style: add empty line at the end

feat: add dvc (#3)

e106f62

feat: synthetic data gen nb

d1e5e81

feat: gen test qa pairs sample

201440a

workshop flow and presentation (#8)

5ac08a8

* dashboard reformat
* notebooks noved to nbs
* text analysis
* makefile fix and nbdev
* docker compsoe and streamlit updates
* streamlit update
* dashboard update
* search for judgements works

feat: qa pairs generation script

fb67d5a

fix: mypy errors

ec89909

Merge branch 'master' into feat/data-augmentation

c3575b1

fix: nbdev errors

6456609

chore: tidy up code

24f3d64

feat: create synthetic QA LangSmith dataset

c3fb474

chore: Add langdetect library to requirements.txt

6a5f2f5

jamnicki marked this pull request as ready for review May 7, 2024 08:39

fix: data paths after rebase

e2025a7

jamnicki marked this pull request as draft May 8, 2024 18:17

jamnicki added 11 commits May 8, 2024 19:38

fix: hardcoded paths, missing dependencies

d24a229

chore: enable autostage in dvc

4bf7291

feat: add prompt option in gen synthetic script

9ac171e

refactor: synth gen data related models and scripts

71ceb95

feat: add local postgres url getter

50ebc8b

feat: use metadata to generation from context; refactor

9116ed1

fix: add missing packaging package

d927c9b

feat: use metadata to generation from context; refactor

7251c56

Update requirements.txt

13cf8a1

feat: add translation of qa pairs generated in wrong language

bd1068d

Merge branch 'feat/data-augmentation' of https://github.com/pwr-ai/Ju…

f6f910c

…DDGES into feat/data-augmentation

jamnicki and others added 7 commits May 14, 2024 17:26

chore: add setup.py as linter target

d8cce29

feat: add packaging, setuptools and wheel as setup_requires

e639a8c

fix: python-check ci No module packaging error

22a3403

fix: temporarily remove flash-attn package from requirements; occurs …

f941702

…gh workflow error

fix: deepspeed

bf10de2

feat: model cfg handling to gen synth qa script

92a5b4d

Dashboard updates (#11)

5b66459

* fix non starting postgres after restarts crashes
* make dashboard nicer for judgements
* show only subset
* nbs checkpoints

laugustyniak reviewed May 15, 2024

jamnicki added 3 commits May 15, 2024 17:58

feat: set synth qa pairs translation as optional

612942c

test: add unittests for SyntheticQAPairs model

18f8a5a

Merge branch 'master' into feat/data-augmentation

084a8c0

jamnicki marked this pull request as ready for review May 15, 2024 18:04

jamnicki added 6 commits May 31, 2024 21:50

feat: gen qa with openai api, add notebooks with synthetic data or co…

7b9d796

…ntext insights, create index with number of text tokens for judgements-pl

Merge branch 'feat/data-augmentation' of https://github.com/pwr-ai/Ju…

9199842

…DDGES into feat/data-augmentation

style: fix formatting

d4f28f2

feat: add sentiment comparison to synth qa dataset analysis

c01b57d

feat: add util caching, synth qa formality analysis

da4395e

feat: push synth qa dataset to hub, add tokenizer util caching

b62b5c8

asawczyn force-pushed the feat/data-augmentation branch from 4302ae9 to b62b5c8 Compare September 2, 2024 19:53
