Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] bitcoin politician automation #37

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

geohotstan
Copy link

@geohotstan geohotstan commented May 25, 2024

#36
How it works currently:
The entry point is the run.py file. Run with python run.py
You can check the current output at sample_house.md and sample_senate.md (should be reproducible when given the api key in member.py)

run.py first fetches all members of the current congress from https://api.congress.gov/v3 and creates a folder for each member inside /data with a json that details that member's descriptions.
Then the disclosures from both House and Senate is scraped from (disclosures-clerk.house.gov) and (efdsearch.senate.gov) respectively for a given year defined in run.py, and the disclosures are parsed and added into the json files of the members (for the ones that are directly parsable)
For the disclosures that contain pdfs/images, the images are saved inside the folder of the member.

The current outputted markdown files ignore image disclosures, hence why sample_house.md is empty.
There are two TODOs:

  • extract disclosure in json form from images so that it can be rendered into markdown
  • parse out only cryptocurrency related holdings from json
  • cleanup

For the first TODO, the code is already implemented in extract.py. I think a VLM (visual language model) is suited for this task. I've tried plain OCR and that was really bad, and I'm pretty sure SOTA for this stuff is just VLMs. Doing a few shot prompt asking for parsed json should do the trick. Problem is, I don't have an OpenAI API key to test and run this. Not too sure what to do here.

For the second TODO, it seems LLMs aren't the best for this... I tried gpt4 by asking if ARKK is related to cryptocurrency and it wasn't sure. (ARKK holds coinbase in it's portfolio), so I need to think of another way....


small note:
The interchangable_names.json is because often these members have/use different names when registering for member of congress and when uploading their disclosures.
For example, GONZALES, ERNEST
his name from https://api.congress.gov/v3 is "GONZALES, TONY" while his disclosures use "GONZALES, ERNEST".

I decided to get around this by writing a checker in run.py that asks if the two names are the same person which requires you to manually input the interchangeable name into that json...

@geohotstan geohotstan changed the title [wip] bitchoin politician automation [wip] bitcoin politician automation May 25, 2024
@geohotstan
Copy link
Author

geohotstan commented May 25, 2024

https://gist.github.com/geohotstan/18ceeb2ee6a5fd965ac61c99ee6b7839
Here's the sample_senate.md since github diff doesn't like big files

@jlopp
Copy link
Owner

jlopp commented May 26, 2024

Nice; making good progress!

For the second TODO, it seems LLMs aren't the best for this... I tried gpt4 by asking if ARKK is related to cryptocurrency and it wasn't sure. (ARKK holds coinbase in it's portfolio), so I need to think of another way....

To be clear, I don't think the determination of which tickers / asset names are considered "crypto" needs to be dynamic. I'm perfectly happy with it being a statically set list, because that won't require frequent updates.

Problem is, I don't have an OpenAI API key to test and run this.

I will gladly set up an OpenAI account with API access if you think that will help accomplish this goal. Just let me know the best way to securely share an API key with you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants