[wip] bitcoin politician automation #37
Draft
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
#36
How it works currently:
The entry point is the
run.py
file. Run withpython run.py
You can check the current output at
sample_house.md
andsample_senate.md
(should be reproducible when given the api key inmember.py
)run.py
first fetches all members of the current congress fromhttps://api.congress.gov/v3
and creates a folder for each member inside/data
with a json that details that member's descriptions.Then the disclosures from both House and Senate is scraped from (disclosures-clerk.house.gov) and (efdsearch.senate.gov) respectively for a given year defined in
run.py
, and the disclosures are parsed and added into the json files of the members (for the ones that are directly parsable)For the disclosures that contain pdfs/images, the images are saved inside the folder of the member.
The current outputted markdown files ignore image disclosures, hence why
sample_house.md
is empty.There are two TODOs:
For the first TODO, the code is already implemented in
extract.py
. I think a VLM (visual language model) is suited for this task. I've tried plain OCR and that was really bad, and I'm pretty sure SOTA for this stuff is just VLMs. Doing a few shot prompt asking for parsed json should do the trick. Problem is, I don't have an OpenAI API key to test and run this. Not too sure what to do here.For the second TODO, it seems LLMs aren't the best for this... I tried gpt4 by asking if ARKK is related to cryptocurrency and it wasn't sure. (ARKK holds coinbase in it's portfolio), so I need to think of another way....
small note:
The
interchangable_names.json
is because often these members have/use different names when registering for member of congress and when uploading their disclosures.For example, GONZALES, ERNEST
his name from
https://api.congress.gov/v3
is "GONZALES, TONY" while his disclosures use "GONZALES, ERNEST".I decided to get around this by writing a checker in
run.py
that asks if the two names are the same person which requires you to manually input the interchangeable name into that json...