I've added a document loader that reads and parses MS Office file types: .doc, .docx, .xls, .xlsx, .ppt, and .pptx. It turned out to be a bit more complicated than I expected, so I didn't extract only the text. Instead, the loader takes all of the document data and puts it in schema.Document.PageContent.
For Excel files, the loader extracts the sheets and numbers them in the metadata. The .docx and .pptx formats are just XML, so rather than extracting the text I dumped the XML into PageContent. At a later date, maybe someone who understands the file formats better than I do can build a decent document structure into schema.Document{}.
At the same time, I think there are some advantages to the LLM having access to the entire document structure, not just the text strings.
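To illustrate the "dump the XML into PageContent" approach for the zip-based formats, here is a minimal sketch. It is not the code in this PR. The `Document` type is a local stand-in for `schema.Document`, and `loadOOXMLPart` and `buildSample` are hypothetical helper names. It assumes the input is an OOXML container (.docx, .pptx, .xlsx), which is a zip archive of XML parts. The legacy .doc, .xls, and .ppt formats are binary and are not covered here.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// Document is a local stand-in for schema.Document, reduced to the
// two fields this sketch needs (an assumption for illustration).
type Document struct {
	PageContent string
	Metadata    map[string]any
}

// loadOOXMLPart (hypothetical helper) opens an OOXML container held in
// memory and returns the raw XML of one named part, for example
// "word/document.xml" in a .docx file, as the PageContent of a Document.
func loadOOXMLPart(data []byte, part string) (Document, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return Document{}, fmt.Errorf("open container: %w", err)
	}
	for _, f := range zr.File {
		if f.Name != part {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return Document{}, fmt.Errorf("open %s: %w", part, err)
		}
		defer rc.Close()
		raw, err := io.ReadAll(rc)
		if err != nil {
			return Document{}, fmt.Errorf("read %s: %w", part, err)
		}
		// The XML is stored untouched; no text extraction is attempted.
		return Document{
			PageContent: string(raw),
			Metadata:    map[string]any{"part": part},
		}, nil
	}
	return Document{}, fmt.Errorf("part %q not found", part)
}

// buildSample (hypothetical helper) creates a tiny in-memory archive with
// a single word/document.xml entry so the sketch can be tried without a
// real Office file.
func buildSample(xml string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("word/document.xml")
	if err != nil {
		return nil, err
	}
	if _, err := w.Write([]byte(xml)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```

Calling `loadOOXMLPart(data, "word/document.xml")` on the bytes of a .docx file would return that part's XML verbatim. For a .pptx the slides typically live under `ppt/slides/`, and for an .xlsx the sheet list is in `xl/workbook.xml`, so a fuller loader would iterate over several parts rather than one.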