Install pre-commit, then run: pre-commit run --all-files
There are two ways to download the files:
- URL : use the document href from the document metadata,
  https://api.sustainabilityreportingnavigator.com/api/documents/{doc_id}/download
  and provide the document id in the url (see the download sketch after this list).
- Script folder : srn_scrape
  - srn_scrape/srn_scraping.py : the main script to download the pdf documents, but this will take time.
  - srn_scrape/srn_scrapping_async.py : a faster script that downloads the pdf files using asyncio, but it still needs some error handling; if you run this script, also run srn_scrape/re-download_broken_files.py to re-download the files for which the download failed.
  - srn_scrape/test_pdfs.py : gives a count of good files and bad files (broken pdf files).
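As a minimal sketch of the URL-based download (assuming the requests library; the doc_id value and output directory are placeholders, and the srn_scrape scripts remain the maintained way to download in bulk):

```python
import requests
from pathlib import Path

API_URL = "https://api.sustainabilityreportingnavigator.com/api/documents/{doc_id}/download"

def download_document(doc_id: str, out_dir: str = "pdfs") -> Path:
    """Download a single document by id from the SRN API and save it as a pdf."""
    out_path = Path(out_dir) / f"{doc_id}.pdf"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    response = requests.get(API_URL.format(doc_id=doc_id), timeout=60)
    response.raise_for_status()  # failed downloads surface as HTTP errors here
    out_path.write_bytes(response.content)
    return out_path
```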
This step converts the downloaded pdfs into json format using the viper parser.
- Install the viper parser from here and check out the viper-optimizations branch.
- Set the file paths in viper_config.yaml
- Run the python script :
python viper_parser.py --config /cluster/home/repo/my_llm_experiments/esrs_data_scraping/viper_config.yaml --cuda-ids 0 1 2 3 4 5 6 --num_workers 8
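Before launching the parser, it can help to sanity-check that the paths set in viper_config.yaml actually exist. A minimal sketch, assuming a flat YAML file whose string values include filesystem paths (the real schema is defined by the viper parser):

```python
import sys
from pathlib import Path

import yaml  # pip install pyyaml

def check_config_paths(config_file: str) -> None:
    """Warn about config values that look like absolute paths but are missing on disk."""
    config = yaml.safe_load(Path(config_file).read_text()) or {}
    for key, value in config.items():
        if isinstance(value, str) and value.startswith("/") and not Path(value).exists():
            print(f"WARNING: {key} points to a missing path: {value}")

if __name__ == "__main__":
    check_config_paths(sys.argv[1])  # e.g. viper_config.yaml
```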
- Set the path locations in main.yaml
- Run the python script :
python main.py --config /cluster/home/repo/my_llm_experiments/esrs_data_scraping/main.yaml
Notes:
- All functions with the prefix collect_ in main.py download the data from the api and push it to the sqlite db (see the sketch after this list).
- The model_name = "gpt-3.5-turbo" is a paid api; first run the convert_pdf_est_json function in main.py to find out the cost.
- The json_to_docx function in main.py converts the parsed json files into docx documents with a unique identifier.
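The collect_ pattern looks roughly like the following sketch; the endpoint, table name, and columns here are illustrative assumptions, not the actual schema used in main.py:

```python
import sqlite3
import requests

def collect_companies(db_path: str = "srn.sqlite") -> None:
    """Illustrative collect_* function: fetch records from the API and push them to sqlite."""
    # Hypothetical endpoint and fields; the real ones live in main.py.
    records = requests.get(
        "https://api.sustainabilityreportingnavigator.com/api/companies", timeout=60
    ).json()
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS companies (id TEXT PRIMARY KEY, name TEXT)")
        conn.executemany(
            "INSERT OR REPLACE INTO companies (id, name) VALUES (:id, :name)",
            records,
        )
```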
- Download the html file from here.
- Run python parse_esrs_requirements/main.py to create the parse_esrs_requirements/esrs_requirement_main.json file; esrs_requirement_main.json is the output file generated with this script.
- ar16_table_gpt_generated.json is the parsed AR 16 table from the ESRS requirements.
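A quick way to inspect the generated files, assuming both are plain JSON (their exact structure depends on the parsing scripts, and the ar16 file path here is a guess):

```python
import json
from pathlib import Path

for name in (
    "parse_esrs_requirements/esrs_requirement_main.json",
    "ar16_table_gpt_generated.json",
):
    data = json.loads(Path(name).read_text())
    # Rough size indicator only; keys and nesting depend on the actual output.
    size = len(data) if isinstance(data, (list, dict)) else "?"
    print(f"{name}: {type(data).__name__} with {size} top-level entries")
```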