Offline Mode:

Note that when you run generate.py and ask your first question, it downloads the model(s). For the 6.9B model this takes about 15 minutes per 3 PyTorch bin files on a 10 MB/s connection.
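To budget for that first-run download, the arithmetic can be sketched as below. The helper name `estimate_download_minutes` and the 9000 MB figure are illustrative assumptions, not measured h2oGPT values; substitute the real size of the files you need.

```python
def estimate_download_minutes(size_mb: float, speed_mb_per_s: float) -> float:
    """Return the approximate download time in minutes."""
    if speed_mb_per_s <= 0:
        raise ValueError("speed_mb_per_s must be positive")
    # size / speed gives seconds; divide by 60 for minutes
    return size_mb / speed_mb_per_s / 60.0


# Hypothetical example: 9000 MB at 10 MB/s
print(estimate_download_minutes(9000, 10))  # 15.0
```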

If HF transformers has already placed all data in ~/.cache, the GGML files have already been downloaded, and you point to them (e.g. with --model_path_llama=llama-2-7b-chat.ggmlv3.q8_0.bin), then the steps below that download HF models are not required.
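One way to check whether a model is already cached is to look for its directory on disk. This is a minimal sketch that assumes the default Hugging Face hub layout (`~/.cache/huggingface/hub/models--<org>--<name>/snapshots/...`); if the cache has been relocated through environment variables, pass the actual root as `cache_root`. The helper names `hub_cache_dir` and `is_cached` are hypothetical.

```python
from pathlib import Path
from typing import Optional


def hub_cache_dir(repo_id: str, cache_root: Optional[Path] = None) -> Path:
    """Return where the hub cache would keep repo_id (assumed default layout)."""
    if cache_root is None:
        cache_root = Path.home() / ".cache" / "huggingface" / "hub"
    return cache_root / ("models--" + repo_id.replace("/", "--"))


def is_cached(repo_id: str, cache_root: Optional[Path] = None) -> bool:
    """True if at least one snapshot directory exists for repo_id."""
    snapshots = hub_cache_dir(repo_id, cache_root) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


if __name__ == "__main__":
    print(is_cached("h2oai/h2ogpt-oasst1-512-12b"))
```

This only confirms that a snapshot directory exists, not that every weight file in it is complete.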

  • Download model and tokenizer of choice

    from transformers import AutoTokenizer, AutoModelForCausalLM
    model_name = 'h2oai/h2ogpt-oasst1-512-12b'
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.save_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.save_pretrained(model_name)

    If using GGML files, download them separately by hand and point to the file path, e.g. --base_model=llama --model_path_llama=llama-2-7b-chat.ggmlv3.q8_0.bin.
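Because a mistyped GGML path only surfaces once the server starts loading the model, it can help to validate the file first. The sketch below builds the two flags mentioned above and fails early if the file is missing; the helper name `ggml_args` is hypothetical.

```python
from pathlib import Path
from typing import List


def ggml_args(model_file: str) -> List[str]:
    """Build generate.py flags for a local GGML file, failing early if it is missing."""
    path = Path(model_file).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"GGML file not found: {path}")
    return ["--base_model=llama", f"--model_path_llama={path}"]
```

The returned list can be appended to a `python generate.py` command line.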

  • Download the reward model, unless you pass --score_model='None' to generate.py

    # and reward model
    reward_model = 'OpenAssistant/reward-model-deberta-v3-large-v2'
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    model = AutoModelForSequenceClassification.from_pretrained(reward_model)
    model.save_pretrained(reward_model)
    tokenizer = AutoTokenizer.from_pretrained(reward_model)
    tokenizer.save_pretrained(reward_model)
  • For LangChain support, download embedding model:

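    from langchain.embeddings import HuggingFaceEmbeddings
    hf_embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
    # model_kwargs must be a dict; 'device' selects where the model is loaded
    model_kwargs = {'device': 'cpu'}
    # instantiating the embedding downloads and caches the model
    embedding = HuggingFaceEmbeddings(model_name=hf_embedding_model, model_kwargs=model_kwargs)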
  • For the HF inference server and OpenAI, download the tokenizers used for the Hugging Face text generation inference server and gpt-3.5-turbo:

    import tiktoken
    encoding = tiktoken.get_encoding("cl100k_base")
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
  • Get gpt-2 tokenizer for summarization token counting

    from transformers import AutoTokenizer
    model_name = 'gpt2'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.save_pretrained(model_name)
  • Run generate with transformers in Offline Mode

    HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 python generate.py --base_model='h2oai/h2ogpt-oasst1-512-12b' --gradio_offline_level=2 --share=False

    Some code that uploads data outside the user's control is always disabled: Hugging Face telemetry, gradio telemetry, and chromadb posthog.

    The additional option --gradio_offline_level=2 changes fonts to avoid downloading Google fonts. Downloading fonts is less intrusive than uploading data, but it must still be disabled in the air-gapped case. The replacement fonts don't look as nice as Google fonts, but they ensure fully offline behavior.

    If the front-end can still access the internet and only the backend should not, use --gradio_offline_level=1 for slightly better-looking fonts.

    Note that gradio attempts to download iframeResizer.contentWindow.min.js, but gradio works without it, so a simple firewall block is sufficient. For more details, see AUTOMATIC1111/stable-diffusion-webui#10324.
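When launching from a script rather than a shell, the same offline run can be expressed by setting the two environment variables explicitly. This is a minimal sketch; the helper name `offline_launch` is hypothetical, and it only assembles the command and environment without starting anything.

```python
import os
import sys
from typing import Dict, List, Tuple


def offline_launch(base_model: str, gradio_offline_level: int = 2) -> Tuple[List[str], Dict[str, str]]:
    """Return (command, environment) for an offline generate.py run."""
    env = dict(os.environ)  # copy so the current process is not modified
    env["HF_DATASETS_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"
    cmd = [
        sys.executable,
        "generate.py",
        f"--base_model={base_model}",
        f"--gradio_offline_level={gradio_offline_level}",
        "--share=False",
    ]
    return cmd, env


# To actually start the server from the h2oGPT checkout:
#   import subprocess
#   cmd, env = offline_launch("h2oai/h2ogpt-oasst1-512-12b")
#   subprocess.run(cmd, env=env, check=True)
```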

  • Disable access or port

    To ensure nobody can access your gradio server, block the port with a firewall. If that is a hassle, enable authentication by adding the following to the CLI when running python generate.py:

    --auth=[('jon','password')]
    

    with no spaces. Run python generate.py --help for more details.
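Since the value must contain no spaces, and Python's default `repr` of a list of tuples inserts one after each comma, the flag can be assembled by hand. A small sketch, with `format_auth_arg` as a hypothetical helper name:

```python
from typing import Iterable, Tuple


def format_auth_arg(pairs: Iterable[Tuple[str, str]]) -> str:
    """Format (user, password) pairs as an --auth flag without any spaces."""
    items = []
    for user, password in pairs:
        if any(ch.isspace() for ch in user + password):
            raise ValueError("credentials must not contain whitespace")
        items.append(f"({user!r},{password!r})")
    return "--auth=[" + ",".join(items) + "]"


print(format_auth_arg([("jon", "password")]))  # --auth=[('jon','password')]
```

When typing the flag in a shell, the brackets and quotes may need quoting of their own; passing it as one element of a `subprocess` argument list avoids that.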

  • To fully disable Chroma telemetry, which the documented options still do not disable, run:

    sp=`python -c 'import site; print(site.getsitepackages()[0])'`
    sed -i 's/posthog\.capture/return\n            posthog.capture/' $sp/chromadb/telemetry/posthog.py

    or the equivalent for Windows/macOS. Alternatively, edit the file manually so that the capture function just returns.
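For platforms where that sed invocation is unavailable or behaves differently, the same text substitution can be sketched in Python. It mirrors the sed expression: on each line, the first `posthog.capture` gets a `return` inserted before it. The helper names are hypothetical, and the target path is the one used in the shell commands above.

```python
from pathlib import Path

TARGET = "posthog.capture"
INDENT = " " * 12  # same indentation the sed replacement inserts


def insert_early_return(source: str) -> str:
    """Insert 'return' before the first TARGET on each line, like the sed command."""
    patched = []
    for line in source.splitlines(keepends=True):
        patched.append(line.replace(TARGET, "return\n" + INDENT + TARGET, 1))
    return "".join(patched)


def patch_file(path: Path) -> None:
    """Rewrite the file in place with the early return added."""
    path.write_text(insert_early_return(path.read_text()))


# Usage, assuming chromadb is installed in the first site-packages directory:
#   import site
#   patch_file(Path(site.getsitepackages()[0]) / "chromadb" / "telemetry" / "posthog.py")
```

Like the sed version, this is not idempotent: running it twice inserts a second, harmless `return`.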

  • To avoid h2oGPT monitoring which elements are clicked in the UI, set the environment variable H2OGPT_ENABLE_HEAP_ANALYTICS=False or pass --enable-heap-analytics=False to generate.py. Note that only raw svelte UI element IDs are recorded; no user inputs or data are included.