Note: the first time you run generate.py and ask a question, it downloads the model(s). For the 6.9B model this takes about 15 minutes per 3 PyTorch bin files at a 10 MB/s download speed.
If HF transformers has already put all data into ~/.cache, and any GGML files are already downloaded and pointed to (e.g. with --model_path_llama=llama-2-7b-chat.ggmlv3.q8_0.bin), then the following steps that download HF models are not required.
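To decide whether the download steps can be skipped, one can check the cache directly. The following is a minimal sketch, assuming the default Hugging Face hub cache location (~/.cache/huggingface/hub) and its `models--<org>--<name>` folder layout; the helper names `hf_cache_dir_for` and `is_cached` are hypothetical, and setting HF_HOME or HF_HUB_CACHE moves the cache elsewhere.

```python
from pathlib import Path


def hf_cache_dir_for(repo_id, cache_root=None):
    """Return the folder the Hugging Face hub cache would use for repo_id."""
    if cache_root is None:
        # Assumed default location; HF_HOME or HF_HUB_CACHE can relocate it.
        cache_root = Path.home() / ".cache" / "huggingface" / "hub"
    # Cached model repos are stored as "models--<org>--<name>".
    return Path(cache_root) / ("models--" + repo_id.replace("/", "--"))


def is_cached(repo_id, cache_root=None):
    """True if a non-empty snapshots folder exists for repo_id."""
    snapshots = hf_cache_dir_for(repo_id, cache_root) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


if __name__ == "__main__":
    for name in ("h2oai/h2ogpt-oasst1-512-12b", "gpt2"):
        print(name, "cached" if is_cached(name) else "not cached")
```

This only confirms that a snapshot folder exists, not that every file in it finished downloading.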
-
Download model and tokenizer of choice
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'h2oai/h2ogpt-oasst1-512-12b'
model = AutoModelForCausalLM.from_pretrained(model_name)
model.save_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(model_name)
If using GGML files, download them separately by hand and point to the file path, e.g.
--base_model=llama --model_path_llama=llama-2-7b-chat.ggmlv3.q8_0.bin
-
Download the reward model, unless you pass
--score_model='None'
to generate.py:
# and reward model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_model = 'OpenAssistant/reward-model-deberta-v3-large-v2'
model = AutoModelForSequenceClassification.from_pretrained(reward_model)
model.save_pretrained(reward_model)
tokenizer = AutoTokenizer.from_pretrained(reward_model)
tokenizer.save_pretrained(reward_model)
-
For LangChain support, download embedding model:
from langchain.embeddings import HuggingFaceEmbeddings

hf_embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
# model_kwargs must be a dict; 'device' selects where the embedding model runs
model_kwargs = {'device': 'cpu'}
embedding = HuggingFaceEmbeddings(model_name=hf_embedding_model, model_kwargs=model_kwargs)
-
For the HF inference server and OpenAI, download the tokenizers used by the Hugging Face text generation inference server and gpt-3.5-turbo:
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
-
Get gpt-2 tokenizer for summarization token counting
from transformers import AutoTokenizer

model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(model_name)
-
Run generate with transformers in Offline Mode
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 python generate.py --base_model='h2oai/h2ogpt-oasst1-512-12b' --gradio_offline_level=2 --share=False
Code that uploads data outside the user's control is always disabled: Hugging Face telemetry, gradio telemetry, and chromadb posthog.
The additional option
--gradio_offline_level=2
changes fonts to avoid downloading Google fonts. Downloading fonts is less intrusive than uploading data, but it must still be disabled in the air-gapped case. The replacement fonts do not look as nice as Google fonts, but they ensure fully offline behavior. If the front end can still access the internet and only the backend should not, use
--gradio_offline_level=1
for slightly better-looking fonts. Note that gradio attempts to download iframeResizer.contentWindow.min.js, but gradio works without it, so a simple firewall block is sufficient. For more details, see AUTOMATIC1111/stable-diffusion-webui#10324.
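The offline command above can also be assembled from Python, for example in a launcher script. This is a minimal sketch: the helper name `offline_launch` is hypothetical, and it only builds the environment and argument list from the variables and flags shown above rather than starting anything itself.

```python
import os
import shlex
import sys


def offline_launch(base_model, gradio_offline_level=2):
    """Build the environment and argv for an offline generate.py run."""
    env = dict(os.environ)
    # Same two variables as in the shell command above.
    env["HF_DATASETS_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"
    argv = [
        sys.executable,
        "generate.py",
        f"--base_model={base_model}",
        f"--gradio_offline_level={gradio_offline_level}",
        "--share=False",
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = offline_launch("h2oai/h2ogpt-oasst1-512-12b")
    # Print the command instead of running it.
    print(shlex.join(argv))
```

Passing the result to `subprocess.run(argv, env=env)` from the directory containing generate.py would start the server with the same settings as the shell command.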
-
Disable access or port
To ensure nobody can access your gradio server, block the port with a firewall. If that is a hassle, enable authentication by adding to the CLI when running
python generate.py
the option --auth=[('jon','password')]
with no spaces. Run
python generate.py --help
for more details.
-
To fully disable Chroma telemetry, which the documented options still do not disable, run:
sp=`python -c 'import site; print(site.getsitepackages()[0])'`
sed -i 's/posthog\.capture/return\n posthog.capture/' $sp/chromadb/telemetry/posthog.py
or the equivalent for Windows/Mac. Alternatively, edit the file manually so that the
capture
function just returns.
-
To avoid h2oGPT monitoring which elements are clicked in the UI, set the environment variable
H2OGPT_ENABLE_HEAP_ANALYTICS=False
or pass
--enable-heap-analytics=False
to generate.py. Note that no data or user inputs are included, only raw svelte UI element IDs.
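Returning to the Chroma telemetry step: on systems without sed, the same edit can be made from Python. This is a rough sketch, not a tested patch for any particular chromadb release. The helper names `disable_capture` and `patch_chromadb` are hypothetical, and it assumes the `posthog.capture` call sits on its own line, so that a `return` inserted at the same indentation keeps the file valid.

```python
import site
from pathlib import Path


def disable_capture(source):
    """Insert a bare `return` before each line that mentions posthog.capture."""
    out = []
    for line in source.splitlines(keepends=True):
        if "posthog.capture" in line:
            # Reuse the line's own leading whitespace for the inserted return.
            indent = line[: len(line) - len(line.lstrip())]
            already_patched = bool(out) and out[-1].strip() == "return"
            if not already_patched:
                out.append(indent + "return\n")
        out.append(line)
    return "".join(out)


def patch_chromadb():
    """Rewrite the file targeted by the sed command, in place."""
    path = Path(site.getsitepackages()[0]) / "chromadb" / "telemetry" / "posthog.py"
    path.write_text(disable_capture(path.read_text()))
```

Calling `patch_chromadb()` once inside the environment where chromadb is installed applies the edit; running it again leaves an already patched file unchanged.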