index binary documents such as PDF, MS Office. #164

ShubjeetPal · 2021-08-16T12:14:59Z

How can I use MeiliSearch to index binary documents such as PDF, Open Office, and MS Office files in a React static web application where the CMS is Strapi?
Any suggestions or plugins would be helpful.

curquiza (Maintainer) · 2021-08-16T16:07:47Z

Hello!

You cannot push binary files to MeiliSearch at the moment. You have to extract the text from your file and push the content into MeiliSearch!

@gmourier, for the new feature request 😇
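To illustrate "push the content into MeiliSearch", here is a minimal sketch that sends already-extracted text to the documents endpoint (`POST /indexes/{index_uid}/documents`). The host, the index name `files`, the field names, and the helper names are assumptions made for this example. The `Authorization: Bearer` header applies to recent Meilisearch versions; older releases used a different header.

```python
import json
import urllib.request


def build_add_documents_request(documents, index_uid,
                                host="http://127.0.0.1:7700", api_key=None):
    """Build (but do not send) a request that adds documents to an index."""
    url = f"{host}/indexes/{index_uid}/documents"
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: a recent Meilisearch version that accepts Bearer tokens.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps(documents).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def add_documents(documents, index_uid, **options):
    """Send the documents and return the raw JSON response as a string."""
    request = build_add_documents_request(documents, index_uid, **options)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Hypothetical usage, with one document per extracted file:
# add_documents([{"id": "report-1", "title": "report.pdf",
#                 "content": "text extracted from the file"}], "files")
```

Each document needs a unique `id` made of letters, digits, hyphens, or underscores; the other fields are free-form.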

gmourier (Maintainer) · 2021-08-16T16:40:48Z

Hello @ShubjeetPal 👋

Thanks for your feedback. I've moved your initial issue as a product discussion so that other users can vote and interact directly here about this feature proposal.

A possible workaround is to extract the text and index it in MeiliSearch. However, there is a limit on the number of words that can be searched within a single attribute, so a long text would have to be split across several attributes, which is probably not ideal.
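An alternative to spreading one text over several attributes is to split it into several smaller documents, each with its own `content` attribute. The sketch below is one way to do that, not an official recommendation. The chunk size of 500 words, the field names, and the function names are all assumptions chosen for illustration.

```python
def chunk_words(text, max_words=500):
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def to_documents(doc_id, source, text, max_words=500):
    """Turn one extracted text into a list of small, searchable documents."""
    return [
        # The id combines the file's id and the chunk number so it stays unique.
        {"id": f"{doc_id}-{n}", "source": source, "chunk": n, "content": chunk}
        for n, chunk in enumerate(chunk_words(text, max_words), start=1)
    ]
```

A search hit then carries the `source` file and the `chunk` number, which narrows down where in the original file the match occurred.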

Sembiance · 2021-08-20T21:54:20Z

@ShubjeetPal The best open source product I've found to extract text from binary documents is Apache Tika: https://tika.apache.org/

It supports text extraction from tons of binary formats such as PDF, Word, etc: https://tika.apache.org/2.0.0/formats.html

I use the tika-server-standard-2.0.0.jar and run it with:
java -jar tika-server-standard-2.0.0.jar --host 127.0.0.1 --port 9910

Then for every binary file I want to extract text out of I do:

curl -H "Accept: text/plain" -T /path/to/binaryFile.pdf 'http://127.0.0.1:9910'

It spits out plain text which I then feed into meilisearch.
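For anyone scripting this step, the same request can be sketched in Python with only the standard library. This is an illustration, not Sembiance's setup. It assumes the server started above is listening on 127.0.0.1:9910 and targets the `/tika` path, which is the extraction endpoint Tika Server documents; adjust the URL if your setup differs. The function names are hypothetical.

```python
import urllib.request

# Assumption: Tika Server from the command above, using its /tika endpoint.
TIKA_URL = "http://127.0.0.1:9910/tika"


def build_tika_request(data, tika_url=TIKA_URL):
    """Build a PUT request that asks Tika for a plain-text rendering."""
    headers = {
        "Accept": "text/plain",
        # Set explicitly; otherwise urllib adds a form-encoded default type.
        "Content-Type": "application/octet-stream",
    }
    return urllib.request.Request(tika_url, data=data, headers=headers, method="PUT")


def extract_text(path, tika_url=TIKA_URL):
    """Upload a binary file to Tika and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), tika_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Hypothetical usage:
# text = extract_text("/path/to/binaryFile.pdf")
```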

2 replies

smknstd Nov 10, 2021

Hello! Can you tell me more about how you map the text into Meilisearch documents?

majortom64 Aug 7, 2023

@Sembiance Once I have the giant blob of plain text, how do I get back a reference to the point in the PDF, MS Word, etc., document, so that I am not just handing the user a giant blob of unformatted text?

majortom64 · 2023-02-26T17:18:31Z

This is still my biggest issue. I have lots of PDF, MS Word, etc. documents that I need to be able to search and index. I understand how I could convert them to plain text documents, but that leaves me with two problems:

1. How do I import giant blobs of text into Meilisearch?
2. How do I enable the search to return not just the document, but the index into the formatted version (i.e. so that the user is really looking at the actual document, not just a blob of text)?
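One possible approach to both questions, for PDFs at least, is to index one document per page and store a link that opens the original file at that page. The sketch below rests on several assumptions: that the extractor can return XHTML in which each PDF page is wrapped in `<div class="page">` (Apache Tika's PDF output is generally structured this way when HTML is requested), that the PDF is served at a URL the user can open, and that the viewer honours the `#page=N` fragment. Word and similar formats have no such page markers, so this does not cover them. All names are hypothetical.

```python
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Collects the text found inside each <div class="page"> element."""

    def __init__(self):
        super().__init__()
        self.pages = []   # one list of text fragments per page
        self._depth = 0   # <div> nesting depth inside the current page

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            self._depth += 1
        elif dict(attrs).get("class") == "page":
            self.pages.append([])
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.pages[-1].append(data)


def page_documents(xhtml, doc_id, file_url):
    """Build one searchable document per page, each with a deep link."""
    parser = PageSplitter()
    parser.feed(xhtml)
    parser.close()
    documents = []
    for number, parts in enumerate(parser.pages, start=1):
        documents.append({
            "id": f"{doc_id}-p{number}",
            "page": number,
            "url": f"{file_url}#page={number}",  # opens the PDF at this page
            "content": " ".join("".join(parts).split()),
        })
    return documents
```

The search front end can then link each hit to its `url`, so the user lands in the formatted PDF rather than in the extracted text.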

2 replies

gmourier Mar 22, 2023
Maintainer

Hello @majortom64

Have you seen docs-scraper? It's a library that parses HTML pages and creates a semantic structure for the documents within Meilisearch.

I guess you want the same kind of process here. If you can scrape a title field, level 1 and level 2 subtitles, paragraphs, etc. from your PDF and Word documents, you should be good to go!

majortom64 Aug 7, 2023

@gmourier If I had a tool that parsed all the document types, provided the ability to link back to the actual section of the original document, and worked with Meilisearch, I would be good to go. Unfortunately, I know of no such tool, and I would not even have a clue where to start writing one. Without that, Meilisearch does not really solve any problem for me. I suspect that I am not alone in needing this functionality; most people probably just search the Web, realize there are no tools for this, and move on to other options.
