Motivation(s):
- I'm just a bit bored and want to get my hands dirty by doing this kind of stuff.
- Patents are a MASSIVE source of free, high-quality, and easily accessible text data.
download_patents.py: This file sends requests to the USPTO to download patents.
- example usage:
python download_patents.py --from-date="2010-01-01" --to-date="2010-01-03"
(this will download all patents from 2010-01-01 to 2010-01-03)
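A minimal sketch of how the two date flags above could be parsed and expanded into one entry per day. This is not the actual contents of download_patents.py: the helper names `parse_args`, `days_in_range`, and `main` are hypothetical, treating the range as inclusive of both end dates is an assumption, and the USPTO request itself is left as a placeholder.

```python
import argparse
from datetime import date, timedelta


def parse_args(argv=None):
    """Parse --from-date and --to-date as ISO dates (YYYY-MM-DD)."""
    parser = argparse.ArgumentParser(description="Download patents for a date range.")
    parser.add_argument("--from-date", type=date.fromisoformat, required=True)
    parser.add_argument("--to-date", type=date.fromisoformat, required=True)
    return parser.parse_args(argv)


def days_in_range(start, end):
    """Return every date from start to end (assumed inclusive on both ends)."""
    if end < start:
        raise ValueError("--to-date must not be earlier than --from-date")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def main(argv=None):
    args = parse_args(argv)
    for day in days_in_range(args.from_date, args.to_date):
        # Placeholder: the real script would send its USPTO request here.
        print(f"would download patents for {day.isoformat()}")
```

Calling `main(["--from-date=2010-01-01", "--to-date=2010-01-03"])` walks the three days in the example above. Passing `date.fromisoformat` as the argparse `type` means a malformed date is rejected with a usage error before any request is made.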
- example usage:
- conda create -n patentgpt_env python=3.10 -y && conda activate patentgpt_env
- if on macOS: pip uninstall grpcio; conda install -c conda-forge grpcio=1.43.0 (from the SkyPilot docs: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)
- pip install -r requirements.txt
- pip install "skypilot[all]" # (this takes a while)
- pytorch
- huggingface
- skypilot
- download a sample of USPTO
- parse the PDFs so that we can train models on the data
- get skypilot working
- fine-tune a decoder-only model on the sample dataset
- download the entire USPTO
- see how fast we can train a model on the whole thing
- Can we include image info? Patents contain diagrams.
- ... the decision tree of stuff to do has become too large. <EOSTOKEN> for now.