Can we ignore parsing content at all by adding config key in settings? #846
Comments
@dadoonet, any thoughts about this? It would be so helpful and
I looked at it briefly and I'm not sure if Tika can deal with that. That was the initial intention of #498, but we never made it happen. If you know how to deal with that on the Tika side (i.e. which setting to use to ignore such things), please let me know here and I'll see how to implement that.
I saw #498. Actually, I was not talking about a custom parser; I was talking about not parsing at all. The document would get crawled as it is now, but its content would be indexed directly into ES. No parsing.
What should be generated then? Only the file metadata? But no content and no metadata about the document content?
No. We will just read the file:
// Needs java.io.File, java.io.FileReader and java.io.BufferedReader
File file = new File("C:\\Users\\pankaj\\Desktop\\test.txt");
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String st;
    while ((st = br.readLine()) != null) {
        System.out.println(st);
    }
}
My team already found a way to do that, because in your code it's very simple to ignore Tika and just read the file content and index it into ES. The rest (metadata and other stuff) will stay the same.
Ok. So you want to send a PR for this?
Yes, I can. But first let me share what I wanted to achieve. Go to fscrawler/core/src/main/java/fr/pilato/elasticsearch/crawler/fs/FsParserAbstract.java Lines 474 to 478 in 9821622
if (fsSettings.getFs().isJsonSupport()) {
    // https://github.com/dadoonet/fscrawler/issues/5 : Support JSon files
    doc.setObject(DocParser.asMap(read(inputStream)));
} else if (fsSettings.getFs().isXmlSupport()) {
    // https://github.com/dadoonet/fscrawler/issues/185 : Support Xml files
    doc.setObject(XmlDocParser.generateMap(inputStream));
} else if (fsSettings.getFs().ignoreTika()) {
    // Proposed new branch: read the raw content here and skip Tika
} else {
    // Extracting content with Tika
    TikaDocParser.generate(fsSettings, inputStream, filename, fullFilename, doc, messageDigest, filesize);
}
Now I think you got my whole point. Did you?
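To make the proposed branch a bit more concrete, here is a minimal sketch of what "read the content" could look like. It is only an illustration under assumptions: SimpleDoc is a hypothetical stand-in for the crawler's document object (not FSCrawler's real class), readRaw is a made-up helper name, the files are assumed to be UTF-8 text, and the PHP snippet in main is invented sample data.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class RawContentSketch {

    /** Hypothetical stand-in for the document object; NOT FSCrawler's real class. */
    static class SimpleDoc {
        private String content;

        void setContent(String content) {
            this.content = content;
        }

        String getContent() {
            return content;
        }
    }

    /**
     * Reads the whole stream and stores it verbatim as the document content.
     * Assumption: the file is UTF-8 text; no parsing or cleanup is attempted.
     */
    static void readRaw(InputStream inputStream, SimpleDoc doc) throws IOException {
        byte[] bytes = inputStream.readAllBytes(); // available since Java 9
        doc.setContent(new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Invented sample: mixed PHP+HTML of the kind discussed in this issue.
        String php = "<img src=\"<?php echo $logo; ?>\">\n";
        SimpleDoc doc = new SimpleDoc();
        readRaw(new ByteArrayInputStream(php.getBytes(StandardCharsets.UTF_8)), doc);
        System.out.print(doc.getContent());
    }
}
```

Reading the whole stream at once keeps the sketch short; for very large files a streaming read or a size limit would probably be needed.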
But I thought you wanted to ignore Tika just in case of error. So you want to skip Tika for every document actually?
Hahaha, no. Not just in case of error, but skipping it totally. That's why I shared the logic. Now you got my point. Yes, I want to skip Tika for every document. At least it will make fscrawler worth considering for more types of projects. I am currently working on indexing a very large amount of source code (PHP, HTML, CSS, JS), but unfortunately Tika fails to parse files when it finds mixed PHP+HTML. That's why I came up with this idea.
Yes. Absolutely perfect. So do I still need to open a PR? Or will you keep this one for yourself?
It's definitely better if you'd like to contribute to the project.
Yes, of course. I would love to. I always try to contribute to the projects that I love and use frequently. OK, I will find a suitable time to open a PR.
…y default skip_tika: false of course. Relevant Issue dadoonet#846
We are seriously considering fscrawler as our document indexing tool; we process more than 32 million docs every day on average. But in our use case we index content from an aggregated archive source that contains various types of files (PHP scripts, JS scripts, CSS, HTML and non-HTML files), mostly source code. In that scenario, the Tika parser will fail to parse the document most of the time, and ultimately those docs won't be indexed at all.
Use case
For example, Tika failed to parse the file footer.php because it does not contain a valid src value. It can't, because the src value comes from a PHP variable.
This is just one use case. My team and I looked closely into the source code, and it may not be too hard to bypass the parsing functionality with a configuration key. If you want, I can make a PR.
My suggestions
settings.yaml
And then we will just extract raw contents from the file and index that.
I am open to discussion on this topic.
Note: If this is already implemented, then I am sorry for raising this issue. Maybe we just couldn't find it in the documentation.