A system for scraping, downloading, and analyzing Congressional financial disclosures for reported cryptocurrency holdings.
This toolkit scrapes financial disclosures from both House and Senate websites, downloads the documents, extracts text content, and analyzes them for cryptocurrency holdings using GPT-4o.
- Clone the repository
- Install dependencies:
pip install -r requirements.txt
playwright install chromium
- Install Tesseract OCR:
- Linux: sudo apt-get install tesseract-ocr
- Mac: brew install tesseract
- Windows: Download installer from GitHub
- Set up environment variables in .env:
CONGRESS_API_KEY=your-congress-api-key
OPENAI_API_KEY=your-openai-api-key
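The README does not say how the project reads the .env file, so the following is only a minimal standard-library sketch of checking that both keys are present before a run. The helper names `parse_env_text` and `missing_keys` are hypothetical and are not taken from the repository.

```python
# Hypothetical helpers; the project's actual .env handling may differ.
REQUIRED_KEYS = ("CONGRESS_API_KEY", "OPENAI_API_KEY")


def parse_env_text(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env):
    """Return the required keys that are absent or empty in the mapping."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

A caller could read the .env file, pass its text to `parse_env_text`, and stop with a clear message if `missing_keys` returns anything, rather than failing later in the middle of a scrape.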
Run the complete pipeline:
python main.py
This will sequentially:
- Scrape House disclosures
- Analyze House documents for crypto holdings
- Scrape Senate disclosures
- Analyze Senate documents for crypto holdings
Results are saved to house_disclosures_analyzed.csv and senate_disclosures_analyzed.csv.
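The four sequential steps above could be orchestrated roughly as follows. This is a sketch, not the contents of main.py: `run_pipeline` is a hypothetical name, the step callables are placeholders, and stopping at the first failing step is an assumption.

```python
import logging

logger = logging.getLogger("pipeline")  # hypothetical logger name


def run_pipeline(steps):
    """Run (name, callable) pairs in order and return the completed names.

    An exception from any step propagates, so later steps do not run
    (an assumption about the desired behavior).
    """
    completed = []
    for name, step in steps:
        logger.info("Starting: %s", name)
        step()
        completed.append(name)
        logger.info("Finished: %s", name)
    return completed
```

In a real entry point, the list would pair each label (scrape House, analyze House, scrape Senate, analyze Senate) with the corresponding function from the src/ modules.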
Here's an example of a detected crypto holding from Rep. Mike Collins' Annual Disclosure:
{
  "found": true,
  "assets": [
    "Velodrome"
  ],
  "quotes": [
    "Velodrome [CT] S 06/24/2024 06/24/2024 $1,001 - $15,000"
  ]
}
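A result with this shape can be loaded and sanity-checked before it is written to CSV. The sketch below assumes only the three fields shown in the example; `CryptoFinding` and `parse_finding` are hypothetical names, and the consistency check is an illustrative choice rather than documented behavior.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CryptoFinding:
    """Mirrors the three fields in the example result."""
    found: bool
    assets: list = field(default_factory=list)
    quotes: list = field(default_factory=list)


def parse_finding(raw):
    """Parse a JSON string into a CryptoFinding, rejecting inconsistent results."""
    data = json.loads(raw)
    finding = CryptoFinding(
        found=bool(data.get("found", False)),
        assets=list(data.get("assets", [])),
        quotes=list(data.get("quotes", [])),
    )
    # A positive result with no named asset is treated as malformed.
    if finding.found and not finding.assets:
        raise ValueError("found is true but no assets were listed")
    return finding
```

Validating the model output this way keeps a malformed response from silently becoming a row in the results file.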
src/
├── house_scrape.py # House disclosure website scraper
├── house_analysis.py # House document analyzer
├── senate_scrape.py # Senate disclosure website scraper
├── senate_analysis.py # Senate document analyzer
└── config.py # Crypto asset configurations
main.py # Entry point
.env # Environment variables
- Uses Playwright for web scraping
- Extracts text with PyPDF2, falling back to Tesseract OCR
- Uses GPT-4o for crypto asset detection
- Includes retry logic and rate limiting
- Logs to console (INFO) and logs.txt (DEBUG)
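The logging split and retry behavior noted above can be sketched with the standard library alone. The logger name, the message format, the attempt count, and the exponential backoff schedule are all assumptions; the README only states that console output is at INFO, logs.txt is at DEBUG, and that retries exist.

```python
import logging
import time


def configure_logging(log_path="logs.txt"):
    """Send INFO and above to the console and DEBUG and above to a file."""
    logger = logging.getLogger("disclosures")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.propagate = False

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (console, file_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep, logger=None):
    """Call func, retrying on any exception with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if logger:
                logger.debug("Attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter makes the backoff easy to test without real waiting, and the same delay hook could serve as a crude rate limiter between requests.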