Feat/data augmentation #13

Open
wants to merge 44 commits into
base: master
Conversation

jamnicki
Collaborator

No description provided.

asawczyn and others added 16 commits February 28, 2024 11:03
* feat: create CI

* fix: add push trigger

* refactor: rename jobs, add dummy script

* fix: run nbdev_prepare

* fix: add ruff exclude

* style: add empty line at the end
* Add scripts for downloading data from polish court API

* Refine logging

* Refactor to make mongo bulk writes

* Fix typing errors

* Add missing dependency

* Refine retries and log warning on invalid pl_court_api params

---------

Co-authored-by: Jakub Binkowski <[email protected]>
* Improve error handling in download_pl_content.py

* Add dataset dump script

* Add pl dataset to DVC

* Add simple data analysis notebook

* Extract text from pl judgements

* Refine text extraction and add analysis

* Add additional details download and ingest

* Refine extraction and ingest extracted data to mongo

* Add script for chunked embeddings

---------

Co-authored-by: Jakub Binkowski <[email protected]>
* first information extraction schema

* mlflow tracking

* add streamlit app

* v1 prompt ready

* nbdev

* get mongo docs

* Parse pl judgements (#4)

* Improve error handling in download_pl_content.py

* Add dataset dump script

* Add pl dataset to DVC

* Add simple data analysis notebook

* Extract text from pl judgements

* Refine text extraction and add analysis

* Add additional details download and ingest

* Refine extraction and ingest extracted data to mongo

* Add script for chunked embeddings

---------

Co-authored-by: Jakub Binkowski <[email protected]>

* first information extraction schema

* mlflow tracking

* add streamlit app

* v1 prompt ready

* nbdev

* get mongo docs

* update nbdev

* small fixes

---------

Co-authored-by: Jakub Binkowski <[email protected]>
Co-authored-by: Jakub Binkowski <[email protected]>
* first information extraction schema

* mlflow tracking

* add streamlit app

* v1 prompt ready

* nbdev

* get mongo docs

* Add chain for transforming user queries into schema

* merge artifact

---------

Co-authored-by: Łukasz Augustyniak <[email protected]>
Co-authored-by: Jakub Binkowski <[email protected]>
* dashboard reformat

* notebooks moved to nbs

* text analysis

* makefile fix and nbdev

* docker compose and streamlit updates

* streamlit update

* dashboard update

* search for judgements works
@jamnicki jamnicki marked this pull request as ready for review May 7, 2024 08:39
@jamnicki jamnicki marked this pull request as draft May 8, 2024 18:17
@@ -2,7 +2,7 @@ lint_dirs := juddges scripts dashboards tests
mypy_dirs := juddges scripts dashboards tests

fix:
-	ruff check $(lint_dirs) --fix
+	ruff check $(lint_dirs) setup.py --fix
Contributor

why?

Collaborator Author

@jamnicki jamnicki May 15, 2024

I don't see a reason not to include this script in the lint/format stages. Some IDEs reformat a file on save after even the smallest change, which can get annoying later.

from langchain_core.outputs import Generation
from langchain_core.utils.json import _parse_json

CUSTOM_PARSE_JSON_MARKDOWN = re.compile(
Contributor

Collaborator Author

The original JsonOutputParser always parses the output successfully as long as the prompt does not contain any other JSON structure. The problem is that the pattern used by parse_json_markdown is greedy, so it breaks once a second JSON block is present.

https://github.com/langchain-ai/langchain/pull/20305/files#diff-7736e09b3c57a6d2f3803974d6022d46a6c7cb44f8688af5a72bea77d9db5124L139
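
To illustrate the greedy-pattern concern raised above, here is a minimal sketch. The two patterns, the sample text, and the helper names (`extract_blocks`, `try_parse`) are assumptions made for the example; they are not copied from langchain_core or from `CUSTOM_PARSE_JSON_MARKDOWN` in this PR. It only shows how a greedy `.*` can swallow everything between the first and the last fence when the text holds more than one JSON block, while a lazy `.*?` stops at the nearest closing fence.

```python
import json
import re

# Build the fence marker programmatically so this snippet nests cleanly.
FENCE = "`" * 3

# Hypothetical patterns for illustration only (not the library's or the PR's).
GREEDY = re.compile(FENCE + r"(?:json)?(.*)" + FENCE, re.DOTALL)
LAZY = re.compile(FENCE + r"(?:json)?(.*?)" + FENCE, re.DOTALL)

# Sample text with two fenced JSON blocks: one in the "prompt", one "answer".
TEXT = (
    "Example schema:\n"
    + FENCE + "json\n"
    + '{"example": true}\n'
    + FENCE + "\n"
    + "Model answer:\n"
    + FENCE + "json\n"
    + '{"answer": 42}\n'
    + FENCE + "\n"
)


def extract_blocks(text, pattern):
    """Return the stripped contents of every fenced block the pattern finds."""
    return [body.strip() for body in pattern.findall(text)]


def try_parse(candidate):
    """Parse a candidate as JSON, returning None when it is not valid JSON."""
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


greedy_blocks = extract_blocks(TEXT, GREEDY)
lazy_blocks = extract_blocks(TEXT, LAZY)

# The greedy match spans both blocks, so it is not valid JSON on its own.
print(len(greedy_blocks), try_parse(greedy_blocks[0]))
# The lazy match yields each block separately; the last one is the answer.
print(len(lazy_blocks), try_parse(lazy_blocks[-1]))
```

Taking the last lazily matched block is only one possible design choice; which block should count as the model's answer depends on how the prompt is laid out.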

juddges/data/utils.py (resolved)
@jamnicki jamnicki marked this pull request as ready for review May 15, 2024 18:04
@asawczyn asawczyn force-pushed the feat/data-augmentation branch from 4302ae9 to b62b5c8 Compare September 2, 2024 19:53
4 participants