UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte #66

Open
srolskyi opened this issue Mar 15, 2024 · 7 comments

srolskyi commented Mar 15, 2024

Fresh installation in a newly created environment (Python 3.9.18 or 3.12):

serg: ~ : python3 -m venv new_env
serg: ~ : source new_env/bin/activate
(new_env) serg: ~ : pip install bpemb gensim
Collecting bpemb
Downloading bpemb-0.3.4-py3-none-any.whl.metadata (19 kB)
Collecting gensim
Using cached gensim-4.3.2-cp312-cp312-macosx_10_9_universal2.whl
Collecting numpy (from bpemb)
Downloading numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl.metadata (61 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.1/61.1 kB 949.1 kB/s eta 0:00:00
Collecting requests (from bpemb)
Downloading requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
Collecting sentencepiece (from bpemb)
Downloading sentencepiece-0.2.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (7.7 kB)
Collecting tqdm (from bpemb)
Downloading tqdm-4.66.2-py3-none-any.whl.metadata (57 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.6/57.6 kB 2.6 MB/s eta 0:00:00
Collecting scipy>=1.7.0 (from gensim)
Downloading scipy-1.12.0-cp312-cp312-macosx_12_0_arm64.whl.metadata (217 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 217.9/217.9 kB 3.3 MB/s eta 0:00:00
Collecting smart-open>=1.8.1 (from gensim)
Downloading smart_open-7.0.1-py3-none-any.whl.metadata (23 kB)
Collecting wrapt (from smart-open>=1.8.1->gensim)
Downloading wrapt-1.16.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (6.6 kB)
Collecting charset-normalizer<4,>=2 (from requests->bpemb)
Downloading charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (33 kB)
Collecting idna<4,>=2.5 (from requests->bpemb)
Downloading idna-3.6-py3-none-any.whl.metadata (9.9 kB)
Collecting urllib3<3,>=1.21.1 (from requests->bpemb)
Downloading urllib3-2.2.1-py3-none-any.whl.metadata (6.4 kB)
Collecting certifi>=2017.4.17 (from requests->bpemb)
Downloading certifi-2024.2.2-py3-none-any.whl.metadata (2.2 kB)
Downloading bpemb-0.3.4-py3-none-any.whl (19 kB)
Downloading numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl (13.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.7/13.7 MB 67.8 MB/s eta 0:00:00
Downloading scipy-1.12.0-cp312-cp312-macosx_12_0_arm64.whl (31.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31.4/31.4 MB 59.3 MB/s eta 0:00:00
Downloading smart_open-7.0.1-py3-none-any.whl (60 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.8/60.8 kB 3.6 MB/s eta 0:00:00
Downloading requests-2.31.0-py3-none-any.whl (62 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.6/62.6 kB 4.4 MB/s eta 0:00:00
Downloading sentencepiece-0.2.0-cp312-cp312-macosx_11_0_arm64.whl (1.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 42.6 MB/s eta 0:00:00
Downloading tqdm-4.66.2-py3-none-any.whl (78 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.3/78.3 kB 7.3 MB/s eta 0:00:00
Downloading certifi-2024.2.2-py3-none-any.whl (163 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 163.8/163.8 kB 12.8 MB/s eta 0:00:00
Downloading charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl (119 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 119.4/119.4 kB 10.6 MB/s eta 0:00:00
Downloading idna-3.6-py3-none-any.whl (61 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.6/61.6 kB 3.9 MB/s eta 0:00:00
Downloading urllib3-2.2.1-py3-none-any.whl (121 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.1/121.1 kB 10.1 MB/s eta 0:00:00
Downloading wrapt-1.16.0-cp312-cp312-macosx_11_0_arm64.whl (38 kB)
Installing collected packages: sentencepiece, wrapt, urllib3, tqdm, numpy, idna, charset-normalizer, certifi, smart-open, scipy, requests, gensim, bpemb
Successfully installed bpemb-0.3.4 certifi-2024.2.2 charset-normalizer-3.3.2 gensim-4.3.2 idna-3.6 numpy-1.26.4 requests-2.31.0 scipy-1.12.0 sentencepiece-0.2.0 smart-open-7.0.1 tqdm-4.66.2 urllib3-2.2.1 wrapt-1.16.0

(new_env) serg: ~ : python3 --version
Python 3.12.2

Then I ran python3 -c "from bpemb import BPEmb; bpemb_en = BPEmb(lang='en', dim=100)"

and got this error:

Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/serg/new_env/lib/python3.12/site-packages/bpemb/bpemb.py", line 191, in __init__
self.emb = load_word2vec_file(self.emb_file, add_pad=add_pad_emb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/serg/new_env/lib/python3.12/site-packages/bpemb/util.py", line 78, in load_word2vec_file
vecs = KeyedVectors.load_word2vec_format(word2vec_file, binary=binary)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/serg/new_env/lib/python3.12/site-packages/gensim/models/keyedvectors.py", line 1719, in load_word2vec_format
return _load_word2vec_format(
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/serg/new_env/lib/python3.12/site-packages/gensim/models/keyedvectors.py", line 2058, in _load_word2vec_format
header = utils.to_unicode(fin.readline(), encoding=encoding)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/serg/new_env/lib/python3.12/site-packages/gensim/utils.py", line 365, in any2unicode
return str(text, encoding, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Any idea where I am making a mistake?

@stefan-it
Contributor

Hey @srolskyi and @bheinzerling ,

I debugged that issue and debug-printed the path for self.emb_file:

$ ls -hl /home/stefan/.cache/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin
-rw-rw-r-- 1 stefan stefan 3,7M Mär 15 16:34 /home/stefan/.cache/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin

And it was downloaded from https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz.

However, when I download the archive manually and extract it, it has the following size:

$ ls -hl ~/Downloads/en.wiki.bpe.vs10000.d100.w2v.bin
-rw-r--r-- 1 stefan stefan 3,9M Mär 19  2018 /home/stefan/Downloads/en.wiki.bpe.vs10000.d100.w2v.bin

With this file I can load the vectors without any problem:

In [1]: from gensim.models import KeyedVectors

In [2]: vecs = KeyedVectors.load_word2vec_format("/home/stefan/Downloads/en.wiki.bpe.vs10000.d100.w2v.bin", binary=True)

In [3]: vecs
Out[3]: <gensim.models.keyedvectors.KeyedVectors at 0x71cbac1c0410>

So I strongly suspect that the unpacking routine is currently not working, and that a "broken" word embeddings file is then being loaded, which causes the error.
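One quick way to test that suspicion: gzip data starts with the two magic bytes 0x1f 0x8b, and the traceback complains about byte 0x8b at position 1, which is what you would see if the cached file were still compressed. The helper below is only a minimal sketch; looks_gzipped is a hypothetical name and not part of bpemb or gensim.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC
```

Calling looks_gzipped on the cached .bin file and getting True would suggest the archive was saved without being unpacked.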

@stefan-it
Contributor

stefan-it commented Mar 15, 2024

After some more debugging and reading the code:

stefan@ae-13412:~$ curl -LI https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz
HTTP/1.1 301 Moved Permanently
Date: Fri, 15 Mar 2024 15:43:13 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips PHP/7.2.34
Location: https://bpemb.h-its.org/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz
Content-Type: text/html; charset=iso-8859-1

HTTP/2 200 
server: nginx
date: Fri, 15 Mar 2024 15:43:14 GMT
content-type: application/gzip
content-length: 3784656
last-modified: Mon, 09 Apr 2018 22:27:16 GMT
etag: "39bfd0-56971e878b900"
accept-ranges: bytes
strict-transport-security: max-age=15768000

At the end, you can see that the redirected request has an application/gzip content type.

However, the current code is expecting:

if headers.get("Content-Type") == "application/x-gzip":

an application/x-gzip content type header.

This is the reason why the archive is not properly extracted.

@bheinzerling I think the best option here is to check whether gzip appears in the content type header, e.g.:

if "gzip" in headers.get("Content-Type"):

Then the archive is properly downloaded, extracted and loaded :)
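A slightly more defensive variant of the same idea, as a sketch only: it tolerates a missing Content-Type header (where headers.get would return None) and ignores any parameters after a semicolon. The function name is_gzip_content_type is hypothetical and this is not the code from the actual fix; it also assumes headers behaves like a dict keyed by "Content-Type".

```python
GZIP_MEDIA_TYPES = {"application/gzip", "application/x-gzip"}


def is_gzip_content_type(headers):
    """Return True if the Content-Type header names a gzip media type."""
    # A missing header yields None, so fall back to an empty string.
    content_type = headers.get("Content-Type") or ""
    # Drop parameters such as "; charset=..." and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in GZIP_MEDIA_TYPES
```

This accepts both the old application/x-gzip value and the application/gzip value the server returns now, while rejecting unrelated types such as text/html.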

@srolskyi
Author

Thank you @stefan-it for your investigation!
@bheinzerling can we expect a fix in the near future? It seems to be a global issue, and no one can download these files.

@mahiforu

@bheinzerling @stefan-it, thanks for the investigation. Right now our production is not working because we depend on this package.

  1. I know nothing changed in this package, so it looks like the resource "https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz" that we download the archive from changed its content type to application/gzip, whereas the code checks for application/x-gzip.
    Did something change in the resource we are accessing? I'm just trying to understand what caused this issue so suddenly.

  2. Can you please suggest a temporary workaround?

@stefan-it
Contributor

I created a PR for a fix. In the meantime you should be able to use this fixed version with:

git+https://github.com/stefan-it/bpemb.git@52ceabf4ca8bde1030be43f71f1f3cb292f4beca

in a requirements.txt file or via pip:

pip3 install --upgrade git+https://github.com/stefan-it/bpemb.git@52ceabf4ca8bde1030be43f71f1f3cb292f4beca

Once the fix is merged upstream, @bheinzerling only needs to release a new version.
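Another possible stopgap follows the manual route described earlier in the thread: download the .tar.gz yourself, unpack it, and load the extracted file with gensim directly. The sketch below covers only the unpacking step using the standard library. extract_flat is a hypothetical helper, and it assumes the archive contains the embedding file as a regular tar member; whether BPEmb would pick up a file placed in its cache directory by hand is not confirmed anywhere in this thread.

```python
import tarfile
from pathlib import Path


def extract_flat(archive_path, dest_dir):
    """Extract the regular files of a .tar.gz archive directly into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links and other special entries
            # Keep only the base name so nothing is written outside dest_dir.
            target = dest / Path(member.name).name
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                dst.write(src.read())
            extracted.append(target)
    return extracted
```

The returned paths can then be passed to KeyedVectors.load_word2vec_format(..., binary=True), as in the IPython session shown earlier.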

@bheinzerling
Owner

@srolskyi Thanks for reporting this issue!
@stefan-it Thanks even more for debugging and creating a fix!

My guess is that the admins of the server on which BPEmb is hosted updated or migrated something. In any case, thanks to Stefan's fix everything seems to be working again.

I released a new version on PyPI that includes the fix and should resolve this issue:

pip install --upgrade bpemb

Leaving this issue open a bit for visibility

@psydok

psydok commented Sep 14, 2024

Which version contains the fix? 0.3.5?
I'm using version 0.3.0 and get the same error.
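The thread does not state which release number contains the fix, so the first thing worth confirming is which bpemb version is actually installed in the environment that fails. A minimal sketch using only the standard library; installed_version is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(installed_version("bpemb"))
```

If this prints an older version than the one installed at the start of this thread (0.3.4, which still showed the error), running pip install --upgrade bpemb in that same environment is the upgrade path the maintainer suggested above.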
