[wip] bitcoin politician automation #37

geohotstan · 2024-05-25T06:25:09Z

#36
How it works currently:
The entry point is the run.py file. Run with python run.py
You can check the current output at sample_house.md and sample_senate.md (should be reproducible when given the api key in member.py)

run.py first fetches all members of the current congress from https://api.congress.gov/v3 and creates a folder for each member inside /data with a json that details that member's descriptions.
Then the disclosures from both House and Senate is scraped from (disclosures-clerk.house.gov) and (efdsearch.senate.gov) respectively for a given year defined in run.py, and the disclosures are parsed and added into the json files of the members (for the ones that are directly parsable)
For the disclosures that contain pdfs/images, the images are saved inside the folder of the member.

The current outputted markdown files ignore image disclosures, hence why sample_house.md is empty.
There are two TODOs:

extract disclosure in json form from images so that it can be rendered into markdown
parse out only cryptocurrency related holdings from json
cleanup

For the first TODO, the code is already implemented in extract.py. I think a VLM (visual language model) is suited for this task. I've tried plain OCR and that was really bad, and I'm pretty sure SOTA for this stuff is just VLMs. Doing a few shot prompt asking for parsed json should do the trick. Problem is, I don't have an OpenAI API key to test and run this. Not too sure what to do here.

For the second TODO, it seems LLMs aren't the best for this... I tried gpt4 by asking if ARKK is related to cryptocurrency and it wasn't sure. (ARKK holds coinbase in it's portfolio), so I need to think of another way....

small note:
The interchangable_names.json is because often these members have/use different names when registering for member of congress and when uploading their disclosures.
For example, GONZALES, ERNEST
his name from https://api.congress.gov/v3 is "GONZALES, TONY" while his disclosures use "GONZALES, ERNEST".

I decided to get around this by writing a checker in run.py that asks if the two names are the same person which requires you to manually input the interchangeable name into that json...

geohotstan · 2024-05-25T06:29:12Z

https://gist.github.com/geohotstan/18ceeb2ee6a5fd965ac61c99ee6b7839
Here's the sample_senate.md since github diff doesn't like big files

jlopp · 2024-05-26T09:55:24Z

Nice; making good progress!

For the second TODO, it seems LLMs aren't the best for this... I tried gpt4 by asking if ARKK is related to cryptocurrency and it wasn't sure. (ARKK holds coinbase in it's portfolio), so I need to think of another way....

To be clear, I don't think the determination of which tickers / asset names are considered "crypto" needs to be dynamic. I'm perfectly happy with it being a statically set list, because that won't require frequent updates.

Problem is, I don't have an OpenAI API key to test and run this.

I will gladly set up an OpenAI account with API access if you think that will help accomplish this goal. Just let me know the best way to securely share an API key with you.

geohotstan added 3 commits May 25, 2024 13:07

working

d90807d

rename HoR to house

b14e985

enable name checking

139f5e4

geohotstan changed the title ~~[wip] bitchoin politician automation~~ [wip] bitcoin politician automation May 25, 2024

oops

0e52744

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[wip] bitcoin politician automation #37

[wip] bitcoin politician automation #37

geohotstan commented May 25, 2024 •

edited

Loading

geohotstan commented May 25, 2024 •

edited

Loading

jlopp commented May 26, 2024

[wip] bitcoin politician automation #37

Are you sure you want to change the base?

[wip] bitcoin politician automation #37

Conversation

geohotstan commented May 25, 2024 • edited Loading

geohotstan commented May 25, 2024 • edited Loading

jlopp commented May 26, 2024

geohotstan commented May 25, 2024 •

edited

Loading

geohotstan commented May 25, 2024 •

edited

Loading