
Can we ignore parsing content at all by adding config key in settings? #846

Open
shahariaazam opened this issue Nov 17, 2019 · 13 comments

Labels: new (For new features or options)

@shahariaazam
Contributor

We are strongly considering fscrawler as our document indexing tool; we process more than 32 million docs per day on average.

But our use case is indexing content from an aggregated archive source that contains various types of files (PHP, JS, CSS, HTML/non-HTML), mostly source code. In that scenario, the Tika parser will often fail to parse a document, and ultimately that doc won't be indexed at all.

Use case

For example,

14:39:02,423 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [/source_code/november/footer.php]  -> XML parse error -> The value of attribute "src" associated with an element type "img" must not contain the '<' character.

The file footer.php contained:

<img src="{$src}">

So Tika failed to parse it because it's not a valid src attribute. It can't be, since the src value comes from a PHP variable at render time.

This is just one use case. My team and I looked closely into the source code, and it may not be too hard to bypass the parsing functionality with a configuration key. If you want, I can make a PR.

My suggestions

settings.yaml

ignore_tika_parser: true
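
For illustration, a fuller job settings sketch might look like the following; the nesting under fs and the surrounding keys are my assumptions, only the new key itself is the actual proposal:

```yaml
name: "source_code_job"
fs:
  url: "/source_code"
  # hypothetical placement; key name and nesting are not final
  ignore_tika_parser: true
elasticsearch:
  index: "source_code"
```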

And then we would just extract the raw content from the file and index that.

I am open to a discussion on this topic.

Note: if this is already implemented, then I apologize for raising this issue; maybe we just couldn't find it in the documentation.

@shahariaazam
Contributor Author

@dadoonet any thoughts on this? It would be very helpful, and fscrawler could be used in a much wider scope.

@dadoonet
Owner

I looked at it briefly and I'm not sure if Tika can deal with that.
I believe that the only way would be to define a specific parser and make FSCrawler totally flexible on the parsers to use.

Which was the initial intention of #498 but we never made it happen.

If you know how to deal with that on Tika side (ie what is the setting to use to ignore such things), please let me know here and I'll see how to implement that.

@shahariaazam
Contributor Author

I saw #498. Actually, I was not talking about a custom parser; I was talking about not parsing at all. The document would get crawled as it does now, but its content would be indexed directly into ES. No parsing.

@dadoonet
Owner

What should be generated then? Only the file metadata, but no content and no metadata about the document content?

@shahariaazam
Contributor Author

No. We will just read the file.

// Read the file line by line, without any parsing.
File file = new File("C:\\Users\\pankaj\\Desktop\\test.txt");

try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String st;
    while ((st = br.readLine()) != null) {
        System.out.println(st);
    }
}

My team has already found a way to do that, because in your code it's very simple to skip Tika and just read the file content and index it into ES. The rest (metadata and other fields) would stay the same.
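
For what it's worth, a more compact way to grab the raw content in one go; this is a sketch with a made-up class name, not fscrawler code:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class RawContentReader {

    // Read the whole file as UTF-8 text, with no parsing at all.
    public static String readRaw(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Demo: a PHP template that Tika would choke on is returned verbatim.
        Path tmp = Files.createTempFile("footer", ".php");
        Files.write(tmp, "<img src=\"{$src}\">".getBytes(StandardCharsets.UTF_8));
        System.out.println(readRaw(tmp));
    }
}
```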

@dadoonet
Owner

Ok. So you want to send a PR for this?

@shahariaazam
Contributor Author

Yes, I can. But first, let me share what I want to achieve.

Go to this part of the code:

TikaDocParser.generate(fsSettings, inputStream, filename, fullFilename, doc, messageDigest, filesize);
}
// We index the data structure
if (isIndexable(doc.getContent(), fsSettings.getFs().getFilters())) {

doc.getContent() is coming from the Tika output, correct? So right here is where we would implement the ignore logic:

if (fsSettings.getFs().isJsonSupport()) {
    // https://github.com/dadoonet/fscrawler/issues/5 : Support JSon files
    doc.setObject(DocParser.asMap(read(inputStream)));
} else if (fsSettings.getFs().isXmlSupport()) {
    // https://github.com/dadoonet/fscrawler/issues/185 : Support Xml files
    doc.setObject(XmlDocParser.generateMap(inputStream));
} else if (fsSettings.getFs().ignoreTika()) {
    // read the content ***************************
} else {
    // Extracting content with Tika
    TikaDocParser.generate(fsSettings, inputStream, filename, fullFilename, doc, messageDigest, filesize);
}
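
The "read the content" placeholder could be filled by draining the inputStream into a String before setting it on the doc. A self-contained sketch of such a helper follows; the class name is hypothetical, and how the result would be attached to the doc is my assumption about the surrounding code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamToString {

    // Drain an InputStream into a UTF-8 String, buffer by buffer.
    public static String drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        return out.toString(StandardCharsets.UTF_8.name());
    }
}
```

Inside the branch, the result would then go into the existing doc, e.g. something along the lines of doc.setContent(StreamToString.drain(inputStream)).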

Now I think you get my whole point, right?

@dadoonet
Owner

But I thought you wanted to ignore Tika just in case of error. So you actually want to skip Tika for every document?
The change makes sense to me. I'd rather call the option skip_tika: true (defaulting to false), though.
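
Under that naming, the hypothetical settings key would read as follows; the nesting under fs is still my assumption:

```yaml
fs:
  # defaults to false; when true, raw content is indexed as-is
  skip_tika: true
```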

@shahariaazam
Contributor Author

But I thought you wanted to ignore Tika just in case of error. So you actually want to skip Tika for every document?

Haha, no. Not just in case of error; I want to skip it entirely. That's why I shared the logic. Yes, I want to skip Tika for every document.

At least it would make fscrawler a candidate for more types of project. I was working on indexing a very large amount of source code (PHP, HTML, CSS, JS), but unfortunately Tika failed to parse files mixing PHP and HTML.

That's why I came up with this idea.

@shahariaazam
Contributor Author

The change makes sense to me. I'd rather call the option skip_tika: true (defaulting to false), though.

Yes, absolutely. So do I still need to open the PR, or will you take this one yourself?

@dadoonet
Owner

It's definitely better if you contribute it to the project yourself.
I'll guide you through the needed steps while doing the review.

@shahariaazam
Contributor Author

It's definitely better if you contribute it to the project yourself.
I'll guide you through the needed steps while doing the review.

Yes, of course, I would love to. I always try to contribute to the projects that I love and use frequently. I will find a suitable time to open a PR.

shahariaazam added a commit to shahariaazam/fscrawler that referenced this issue Nov 25, 2019
…y default skip_tika: false of course.

Relevant Issue dadoonet#846
@shahariaazam
Contributor Author

@dadoonet can you please review PR #858? If all goes well, I will also submit another patch for the documentation about the new config key skip_tika.

@dadoonet dadoonet added the new For new features or options label Nov 26, 2019
shahariaazam added a commit to shahariaazam/fscrawler that referenced this issue Nov 30, 2019
…y default skip_tika: false of course.

Relevant Issue dadoonet#846