download wikitext bug fix
jstzwj committed Jul 8, 2024
1 parent e16f9cb commit 0c3935f
Showing 14 changed files with 511 additions and 126 deletions.
11 changes: 11 additions & 0 deletions .github/dependabot.yml
@@ -0,0 +1,11 @@
# To get started with Dependabot version updates, you'll need to specify which
# package ecosystems to update and where the package manifests are located.
# Please see the documentation for all configuration options:
# https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates

version: 2
updates:
  - package-ecosystem: "pip" # See documentation for possible values
    directory: "/" # Location of package manifests
    schedule:
      interval: "weekly"
38 changes: 38 additions & 0 deletions .github/workflows/dev.yml
@@ -0,0 +1,38 @@
name: Olah GitHub Actions for Development
run-name: Olah GitHub Actions for Development
on:
  push:
    branches: [ "dev" ]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]

    steps:
      - name: Check out repository code
        uses: actions/checkout@v4
      - name: Set up Apache Arrow
        run: |
          sudo apt update
          sudo apt install -y -V ca-certificates lsb-release wget
          wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
          sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
          sudo apt update
          sudo apt install -y -V libarrow-dev libarrow-glib-dev libparquet-dev libparquet-glib-dev
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install Olah
        run: |
          cd ${{ github.workspace }}
          pip install --upgrade pip
          pip install -e .
      - name: Test Olah
        run: |
          cd ${{ github.workspace }}
          python -m unittest discover olah/tests
54 changes: 54 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,54 @@
name: Olah GitHub Actions to release
run-name: Olah GitHub Actions to release
on:
  push:
    tags:
      - "[0-9]+.[0-9]+.[0-9]+"

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.12"]

    steps:
      - name: Check out repository code
        uses: actions/checkout@v4
      - name: Set up Apache Arrow
        run: |
          sudo apt update
          sudo apt install -y -V ca-certificates lsb-release wget
          wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
          sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
          sudo apt update
          sudo apt install -y -V libarrow-dev libarrow-glib-dev libparquet-dev libparquet-glib-dev
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install Olah
        run: |
          cd ${{ github.workspace }}
          pip install --upgrade pip
          pip install -e .
      - name: Test Olah
        run: |
          cd ${{ github.workspace }}
          python -m unittest discover olah/tests
      - name: Build Olah
        run: |
          cd ${{ github.workspace }}
          pip install build
          python -m build
      - name: Release
        uses: "marvinpinto/action-automatic-releases@latest"
        with:
          repo_token: "${{ secrets.GITHUB_TOKEN }}"
          prerelease: true
          files: |
            dist/*.tar.gz
            dist/*.whl
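
The release workflow only runs for pushed tags that match the filter `[0-9]+.[0-9]+.[0-9]+`. GitHub Actions filters are glob-style patterns rather than regular expressions, so the following Python sketch is only an approximation of which tags would trigger a release. The regular expression and the helper name `triggers_release` are assumptions for illustration and are not part of the repository.

```python
import re

# Approximation of the workflow's tag filter "[0-9]+.[0-9]+.[0-9]+".
# Assumption: "." is a literal dot in GitHub's filter syntax, so it is escaped here.
RELEASE_TAG = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+")


def triggers_release(tag: str) -> bool:
    """Return True if a pushed tag would plausibly match the release filter."""
    return RELEASE_TAG.fullmatch(tag) is not None


# Plain numeric versions match; a "v" prefix or a missing component does not.
for tag in ("0.1.0", "10.2.33", "v0.1.0", "0.1"):
    print(tag, triggers_release(tag))
```

Under this reading, a tag such as `0.1.0` would start the release job, while `v0.1.0` would not.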
26 changes: 25 additions & 1 deletion README.md
@@ -3,6 +3,7 @@ Olah is self-hosted lightweight huggingface mirror service. `Olah` means `hello`

Other languages: [中文](README_zh.md)
## Features
* Huggingface Data Cache
* Models mirror
* Datasets mirror
* Spaces mirror
@@ -42,21 +43,42 @@ python -m olah.server
```

Then set the environment variable `HF_ENDPOINT` to the mirror site (here, http://localhost:8090).

Linux:
```bash
export HF_ENDPOINT=http://localhost:8090
```

Windows PowerShell:
```powershell
$env:HF_ENDPOINT = "http://localhost:8090"
```
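
The same variable can also be set from inside a Python process. This is a minimal sketch that assumes the mirror runs at the address used above; `use_mirror` is a hypothetical helper, not part of Olah or `huggingface_hub`. To be safe, call it before importing `huggingface_hub`, since the library reads `HF_ENDPOINT` when it is loaded.

```python
import os


def use_mirror(endpoint: str = "http://localhost:8090") -> str:
    """Point Hugging Face clients started from this process at the mirror."""
    # Strip a trailing slash so URLs built from the endpoint stay well-formed.
    os.environ["HF_ENDPOINT"] = endpoint.rstrip("/")
    return os.environ["HF_ENDPOINT"]


use_mirror()
# Import huggingface_hub only after this point so it picks up the mirror URL.
print(os.environ["HF_ENDPOINT"])
```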

From now on, all download operations in the HuggingFace library will be proxied through this mirror site.
```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id='Qwen/Qwen-7B', repo_type='model',
local_dir='./model_dir', resume_download=True,
max_workers=8)
```

Alternatively, you can download models and datasets with the Hugging Face CLI.
```bash
pip install -U huggingface_hub
```

You can check the path `./repos` which stores all cached datasets and models.
Download GPT2:
```bash
huggingface-cli download --resume-download openai-community/gpt2 --local-dir gpt2
```

Download WikiText:
```bash
huggingface-cli download --repo-type dataset --resume-download Salesforce/wikitext --local-dir wikitext
```

You can check the path `./repos`, where Olah stores all cached datasets and models.

## Start the server
Run the command in the console:
@@ -75,6 +97,8 @@ The default mirror cache path is `./repos`, you can change it by `--repos-path`
python -m olah.server --host localhost --port 8090 --repos-path ./hf_mirrors
```

**Note that cached data cannot be migrated between versions. Delete the cache folder before upgrading to the latest version of Olah.**

## Future Work

* Authentication
27 changes: 26 additions & 1 deletion README_zh.md
@@ -1,8 +1,10 @@
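Olah is a self-hosted, lightweight HuggingFace mirror service. `Olah` means `hello` in Hilichurlian.

## Features
* Data caching to reduce download traffic
* Models mirror
* Datasets mirror
* Spaces mirror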

## Installation

@@ -39,10 +41,16 @@ python -m olah.server
```

Then set the environment variable `HF_ENDPOINT` to the mirror site (here, http://localhost:8090).
Linux:
```bash
export HF_ENDPOINT=http://localhost:8090
```

Windows PowerShell:
```powershell
$env:HF_ENDPOINT = "http://localhost:8090"
```

From now on, all download operations in the HuggingFace library will be proxied through this mirror site.
```python
from huggingface_hub import snapshot_download
@@ -53,7 +61,22 @@ snapshot_download(repo_id='Qwen/Qwen-7B', repo_type='model',

```

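You can check the path `./repos`, which stores all cached datasets and models.
Alternatively, you can download models and datasets directly with the Hugging Face CLI.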
```bash
pip install -U huggingface_hub
```

Download GPT2:
```bash
huggingface-cli download --resume-download openai-community/gpt2 --local-dir gpt2
```

Download WikiText:
```bash
huggingface-cli download --repo-type dataset --resume-download Salesforce/wikitext --local-dir wikitext
```

You can check the path `./repos`, which stores the cache of all datasets and models.

## Start the server
Run the following command in the console:
@@ -72,6 +95,8 @@ python -m olah.server --host localhost --port 8090
python -m olah.server --host localhost --port 8090 --repos-path ./hf_mirrors
```

**Note that cached data cannot be migrated between versions. Delete the cache folder before upgrading Olah.**

## License

Olah is released under the MIT License.
14 changes: 12 additions & 2 deletions olah/configs.py
@@ -6,6 +6,11 @@
import fnmatch

DEFAULT_PROXY_RULES = [
    {
        "repo": "*",
        "allow": True,
        "use_re": False
    },
    {
        "repo": "*/*",
        "allow": True,
@@ -14,6 +19,11 @@
]

DEFAULT_CACHE_RULES = [
    {
        "repo": "*",
        "allow": True,
        "use_re": False
    },
    {
        "repo": "*/*",
        "allow": True,
@@ -87,7 +97,7 @@ def __init__(self, path: Optional[str] = None) -> None:
        self.mirror_lfs_url = "http://localhost:8090"

        # accessibility
        self.offline = True
        self.offline = False
        self.proxy = OlahRuleList.from_list(DEFAULT_PROXY_RULES)
        self.cache = OlahRuleList.from_list(DEFAULT_CACHE_RULES)

@@ -100,7 +110,7 @@ def empty_str(self, s: str) -> Optional[str]:
        else:
            return s

    def read_toml(self, path: str):
    def read_toml(self, path: str) -> None:
        config = toml.load(path)

        if "basic" in config:
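
One plausible reason for adding the `"*"` rule to both default lists: under glob matching, the pattern `*/*` requires a slash, so a repository name without a namespace (for example `gpt2`) does not match it. The sketch below illustrates this with the standard `fnmatch` module, which `olah/configs.py` imports. The `rule_matches` and `is_allowed` helpers, the last-match-wins policy, and the `use_re` value of the `*/*` rule (elided in the diff) are assumptions for illustration; this is not Olah's actual `OlahRuleList` logic.

```python
import fnmatch
import re

# Mirrors the shape of the default rules; "use_re" for "*/*" is assumed False.
RULES = [
    {"repo": "*", "allow": True, "use_re": False},
    {"repo": "*/*", "allow": True, "use_re": False},
]


def rule_matches(rule: dict, repo: str) -> bool:
    """Match a repo name against one rule, as a regex or as a glob."""
    if rule["use_re"]:
        return re.fullmatch(rule["repo"], repo) is not None
    return fnmatch.fnmatchcase(repo, rule["repo"])


def is_allowed(rules: list, repo: str) -> bool:
    """Hypothetical policy: the last matching rule decides; no match means deny."""
    allowed = False
    for rule in rules:
        if rule_matches(rule, repo):
            allowed = rule["allow"]
    return allowed


# Compare the old rule set (only "*/*") with the new one for a bare repo name.
print(is_allowed(RULES[1:], "gpt2"))
print(is_allowed(RULES, "gpt2"))
print(is_allowed(RULES, "Salesforce/wikitext"))
```

With only the `*/*` rule, the bare name `gpt2` is denied under this policy; with the added `*` rule, both bare and namespaced names are allowed.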
3 changes: 1 addition & 2 deletions olah/constants.py
@@ -1,5 +1,4 @@


WORKER_API_TIMEOUT = 15
CHUNK_SIZE = 4096
LFS_FILE_BLOCK = 64 * 1024 * 1024
LFS_FILE_BLOCK = 64 * 1024 * 1024