Will only parse some PDFs once and then fail #208
Unanswered
The-Curious-Geek asked this question in Q&A
Replies: 0 comments
I'm not sure whether this is a bug or I'm just doing something wrong, so I wanted to raise it here first. I have some PDFs that parse fine on the first run of my simple Python implementation, but every subsequent run throws this error: Error while parsing the file 'C:\blah\file.pdf': Expecting value: line 1 column 1 (char 0)
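For context, the text "Expecting value: line 1 column 1 (char 0)" is the message Python's standard `json` module produces when asked to decode an empty string or a body that does not start with valid JSON. The sketch below only reproduces that message locally; the helper name `describe_json_failure` is made up for illustration, and the idea that the parser received an empty or non-JSON response on the later runs is an assumption, not something confirmed in this thread.

```python
import json


def describe_json_failure(body: str) -> str:
    """Return the JSONDecodeError message for `body`, or "ok" if it parses.

    Hypothetical helper for illustration; it is not part of llama-parse.
    """
    try:
        json.loads(body)
    except json.JSONDecodeError as exc:
        return str(exc)
    return "ok"


# An empty body and an HTML error page both fail at the very first character.
print(describe_json_failure(""))
print(describe_json_failure("<html>error</html>"))
print(describe_json_failure('{"status": "done"}'))
```

If that assumption holds, the useful thing to capture would be the raw response the parsing job returns on a failing run, rather than the PDF itself.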
I can make it work again by opening the original PDF in Calibre and exporting it as another PDF. On the first run the file loads and parses, the data comes back, and my local LLM answers the query, but on subsequent runs it throws that error. I don't get it. The example PDF I downloaded from the tutorial I'm following works every time.
This is the code
The error occurs while SimpleDirectoryReader's load_data() function is running. The output says it started parsing the file under job_id 'the id'. The llama-index version is 0.10.25 and llama-parse is 0.4.0. I'm on Windows 10, using VS Code and its PowerShell terminal to run things. I tried clearing all the __pycache__ directories in the lib folder of my virtual environment, but that didn't fix it.
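When reporting versions like these, it can help to read them from the active virtual environment rather than from memory. A minimal sketch using only the standard library follows; the helper name `installed_version` is hypothetical, and the package names are the distribution names mentioned above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or "not installed"."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


# Report what the current interpreter's environment actually has installed.
for name in ("llama-index", "llama-parse"):
    print(f"{name}: {installed_version(name)}")
```

Running this from the same VS Code terminal used for the script confirms that the interpreter and the environment match.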
I also opened the good PDF and a bad one in Notepad++ to compare their headers. The first line, which gives the PDF version, is the same. The second line, which looks like gibberish, was different. I changed the bad PDF's second line to match the good one's, saved it, and the parsing worked, but again only once: it failed on subsequent runs. So it seems any change to the file makes it work one more time.
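The header comparison above can be done without a text editor, which avoids accidentally re-encoding the binary bytes on save. The second line of a PDF is typically a comment made of non-ASCII bytes that marks the file as binary, so differences there are usually harmless. The sketch below compares the first two lines and a SHA-256 digest of two files. The helper names and the throwaway sample files are hypothetical, and the idea that the service might treat a file with an unchanged digest as already seen is purely an assumption that would fit the "any edit makes it work once" behaviour.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path) -> str:
    """Return the SHA-256 hex digest of the file's full contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def header_lines(path, count: int = 2) -> list:
    """Return the first `count` lines of the file as raw bytes."""
    with open(path, "rb") as fh:
        return [fh.readline().rstrip(b"\r\n") for _ in range(count)]


# Throwaway stand-ins for a "good" and a "bad" PDF; they differ only in
# the binary comment on the second line.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.pdf"
    bad = Path(tmp) / "bad.pdf"
    good.write_bytes(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\nrest of file\n")
    bad.write_bytes(b"%PDF-1.4\n%\xb5\xb5\xb5\xb5\nrest of file\n")

    print("same version line:", header_lines(good)[0] == header_lines(bad)[0])
    print("same second line: ", header_lines(good)[1] == header_lines(bad)[1])
    print("same digest:      ", file_digest(good) == file_digest(bad))
```

Any byte-level change, including a Calibre re-export or a one-character edit, produces a different digest, so comparing digests before and after an edit is a quick way to check whether the file really changed.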
Thoughts?