Will only parse some PDFs once and then fail #208

The-Curious-Geek · 2024-06-02T03:22:43Z

I'm not sure whether this is a bug or whether I'm doing something wrong, so I wanted to raise it here first. I have some PDFs that parse fine on the first run of my simple Python implementation, but every subsequent run throws this error: Error while parsing the file 'C:\blah\file.pdf': Expecting value: line 1 column 1 (char 0)

I can make it work again by opening the original PDF in Calibre and exporting it as another PDF. On the first run the file loads, parses, and sends the data back, and my local LLM answers the query; on subsequent runs it throws that error. I don't get it. The example PDF from the tutorial I'm following works every time.

This is the code:

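from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.embeddings import resolve_embed_model
from llama_parse import LlamaParse

# llm is my local LLM, set up earlier (not shown here)
parser = LlamaParse(result_type="markdown")
file_extractor = {".pdf": parser}
# Raw string so the backslash in the Windows path is not read as an escape
documents = SimpleDirectoryReader(r".\docs", file_extractor=file_extractor).load_data()
embed_model = resolve_embed_model("local:BAAI/bge-m3")
vector_index = VectorStoreIndex.from_documents(documents=documents, embed_model=embed_model)
query_engine = vector_index.as_query_engine(llm=llm)

result = query_engine.query("What are the plant properties of chamomile?")
print(result)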

The error occurs while SimpleDirectoryReader's load_data() function is running. Before failing, it reports that it started parsing the file under job_id 'the id'. The llama-index version is 0.10.25 and llama-parse is 0.4.0. I'm on Windows 10, using VS Code and its PowerShell terminal to run things. I tried clearing all the __pycache__ directories in the lib folder of my virtual environment, but that didn't fix it.
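For what it's worth, the "Expecting value: line 1 column 1 (char 0)" text matches what Python's standard json module raises when it is handed an empty or non-JSON string. Here is a minimal sketch that reproduces the message with plain json.loads. It does not touch LlamaParse at all, and whether the parser is really receiving an empty or non-JSON response is only my assumption; describe_json_failure is a made-up helper name.

```python
import json


def describe_json_failure(payload: str) -> str:
    """Return the JSONDecodeError message for payload, or 'ok' if it decodes."""
    try:
        json.loads(payload)
    except json.JSONDecodeError as exc:
        return str(exc)
    return "ok"


# An empty body and an HTML error page both fail at the very first character.
print(describe_json_failure(""))
print(describe_json_failure("<html>error</html>"))
# A well-formed JSON body decodes without complaint.
print(describe_json_failure('{"job": "done"}'))
```

If that is what is happening, the message says nothing about the PDF itself, only that whatever came back could not be decoded as JSON.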

I also opened a good PDF and a bad one in Notepad++ to compare their headers. The first line, which gives the PDF version, is the same in both. The second line, which looks like gibberish, was different. I changed the bad PDF's second line to match the good one's, saved it, and the parsing worked, but then it failed again on subsequent runs. So it seems any change to the file will make it work once.
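Since any edit to the file seems to reset the behaviour, one guess (unconfirmed, purely my assumption) is that something along the way keys on the file's content. A small sketch for checking whether two copies of a PDF are byte-for-byte identical by comparing SHA-256 digests; sha256_of_file is a hypothetical helper, and the temporary files below are stand-ins for real PDFs.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in data: two files that differ only in their second line.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original.pdf"
    edited = Path(tmp) / "edited.pdf"
    original.write_bytes(b"%PDF-1.4\n%abcd\nrest of file\n")
    edited.write_bytes(b"%PDF-1.4\n%wxyz\nrest of file\n")
    print(sha256_of_file(original))
    print(sha256_of_file(edited))
    print(sha256_of_file(original) == sha256_of_file(edited))
```

Running this against the original PDF and the Calibre export (or the hand-edited copy) would at least show whether the files that work are different content as far as a hash is concerned.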

Thoughts?
