Update ingest_service.py to fix issue Error: 'utf-8' codec can't decode #1171

yaziciali · 2023-11-06T05:58:01Z

To fix issue: #1166

Error: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte #1166

pabloogc · 2023-11-06T13:26:26Z

private_gpt/server/ingest/ingest_service.py

@@ -1,4 +1,5 @@
 import tempfile
+import chardet # Chardet must be put in requirements or manually install with pip install chardet


You should add the dependency:

poetry add chardet

pabloogc · 2023-11-06T13:28:36Z

private_gpt/server/ingest/ingest_service.py

@@ -77,7 +78,10 @@ def ingest(self, file_name: str, file_data: AnyStr | Path) -> list[IngestedDoc]:
            # Read as a plain text
            string_reader = StringIterableReader()
            if isinstance(file_data, Path):
-                text = file_data.read_text()
+                with open(file_data, 'rb') as f2:


Try to give these variables proper names. Maybe file_handle instead of f2 and charset instead of result2.

I would go even further: could you also try to avoid reading the file in its entirety just to detect the charset?

More information in chardet documentation: https://chardet.readthedocs.io/en/latest/usage.html#advanced-usage

While the existing implementation works, it is suboptimal in the sense that it reads the file in its entirety to detect the charset, and then reads it again to get the text.
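A minimal sketch of the incremental approach described in the chardet documentation linked above, assuming chardet's UniversalDetector is used. The helper name detect_charset, the chunk_size default and the optional detector parameter are hypothetical and not part of this PR; the detector is injectable only so the sketch can be exercised without chardet installed.

```python
from pathlib import Path
from typing import Optional


def detect_charset(path: Path, chunk_size: int = 4096, detector=None) -> Optional[str]:
    """Guess a file's encoding by feeding it to a detector chunk by chunk.

    Hypothetical helper: stops reading as soon as the detector is confident,
    so large files are not necessarily read in full.
    """
    if detector is None:
        # Imported lazily; requires the chardet dependency to be installed.
        from chardet.universaldetector import UniversalDetector

        detector = UniversalDetector()
    with open(path, "rb") as file_handle:
        while chunk := file_handle.read(chunk_size):
            detector.feed(chunk)
            if detector.done:
                # The detector has seen enough bytes to make its guess.
                break
    detector.close()
    # The encoding may be None if the detector could not decide.
    return detector.result["encoding"]
```

The returned charset could then be passed to file_data.read_text(encoding=charset), which still means a second pass over the file but keeps detection cheap for large inputs.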

One could also do the following: read the file in binary mode, store it in text_bin, and then run chardet on it, so that we have text = text_bin.decode(encoding=chardet.detect(text_bin)["encoding"]) (chardet.detect returns a dict, so the "encoding" entry has to be picked out of it).

This is doing only a single read, and it re-uses the buffer in memory instead of re-reading it.
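The single-read variant could look roughly like the sketch below. The function name read_text_detecting_charset, the injectable detect parameter and the fallback to UTF-8 when no encoding is detected are assumptions made for illustration, not code from this PR.

```python
from pathlib import Path
from typing import Callable, Optional


def read_text_detecting_charset(
    path: Path, detect: Optional[Callable[[bytes], dict]] = None
) -> str:
    """Read a file once as bytes, detect its charset, and decode the same buffer."""
    if detect is None:
        # Imported lazily; requires the chardet dependency to be installed.
        import chardet

        detect = chardet.detect
    text_bin = path.read_bytes()  # the only read of the file
    # detect() returns a dict; "encoding" may be None for undecidable input.
    # Falling back to UTF-8 in that case is an assumption of this sketch.
    charset = detect(text_bin)["encoding"] or "utf-8"
    return text_bin.decode(charset)
```

In ingest(), this would replace the file_data.read_text() call for the Path branch, for example text = read_text_detecting_charset(file_data).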

pabloogc · 2023-11-06T13:33:02Z

Make sure to run the formatter and the linter with make check.

github-actions · 2023-11-22T05:45:41Z

Stale pull request

Update ingest_service.py to fix issue Error: 'utf-8' codec can't decode

f6d80f5

To fix issue: zylon-ai#1166 Error: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte zylon-ai#1166

pabloogc requested changes Nov 6, 2023

github-actions bot added the stale label Nov 22, 2023

github-actions bot closed this Nov 30, 2023
