
Can we ignore parsing content at all by adding config key in settings? #846

Open
shahariaazam opened this issue Nov 17, 2019 · 13 comments

Labels: new (For new features or options)

@shahariaazam
Contributor

We are strongly considering fscrawler as our document indexing tool; we process more than 32 million docs per day on average.

But our use case is indexing content from an aggregated archive source that contains various types of files (PHP, JS, CSS, HTML/non-HTML), mostly source code. In that scenario, the Tika parser will often fail to parse a document, and ultimately that doc won't be indexed at all.

Use case

For example,

14:39:02,423 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [/source_code/november/footer.php]  -> XML parse error -> The value of attribute "src" associated with an element type "img" must not contain the '<' character.

The file footer.php contained:

<img src="{$src}">

So Tika failed to parse it because it's not a valid src attribute. It can't be, since the src value comes from a PHP variable at render time.

This is just one use case. My team and I looked closely into the source code, and it may not be too hard to bypass the parsing functionality with a configuration key. If you want, I can make a PR.

My suggestions

settings.yaml

ignore_tika_parser: true
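
For illustration, a fuller job settings sketch might look like the following; the nesting under fs and the surrounding keys are my assumptions, only the new key itself is the actual proposal:

```yaml
name: "source_code_job"
fs:
  url: "/source_code"
  # hypothetical placement; key name and nesting are not final
  ignore_tika_parser: true
elasticsearch:
  index: "source_code"
```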

And then we would just extract the raw content from the file and index that.

I am open to a discussion on this topic.

Note: if this is already implemented, then I apologize for raising this issue; maybe we just couldn't find it in the documentation.

@shahariaazam
Contributor Author

@dadoonet any thoughts on this? It would be very helpful, and fscrawler could be used in a much wider scope.

@dadoonet
Owner

I looked at it briefly and I'm not sure if Tika can deal with that.
I believe that the only way would be to define a specific parser and make FSCrawler totally flexible on the parsers to use.

Which was the initial intention of #498 but we never made it happen.

If you know how to deal with that on Tika side (ie what is the setting to use to ignore such things), please let me know here and I'll see how to implement that.

@shahariaazam
Contributor Author

I saw #498. Actually, I was not talking about a custom parser; I was talking about not parsing at all. The document would get crawled as it does now, but its content would be indexed directly into ES. No parsing.

@dadoonet
Owner

What should be generated then? Only the file metadata, but no content and no metadata about the document content?

@shahariaazam
Contributor Author

No. We will just read the file.

// Read the file line by line, without any parsing.
File file = new File("C:\\Users\\pankaj\\Desktop\\test.txt");

try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String st;
    while ((st = br.readLine()) != null) {
        System.out.println(st);
    }
}

My team has already found a way to do that, because in your code it's very simple to skip Tika and just read the file content and index it into ES. The rest (metadata and other fields) would stay the same.
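
For what it's worth, a more compact way to grab the raw content in one go; this is a sketch with a made-up class name, not fscrawler code:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class RawContentReader {

    // Read the whole file as UTF-8 text, with no parsing at all.
    public static String readRaw(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Demo: a PHP template that Tika would choke on is returned verbatim.
        Path tmp = Files.createTempFile("footer", ".php");
        Files.write(tmp, "<img src=\"{$src}\">".getBytes(StandardCharsets.UTF_8));
        System.out.println(readRaw(tmp));
    }
}
```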

@dadoonet
Owner

Ok. So you want to send a PR for this?

@shahariaazam
Contributor Author

Yes, I can. But first, let me share what I want to achieve.

Go to this part of the code:

TikaDocParser.generate(fsSettings, inputStream, filename, fullFilename, doc, messageDigest, filesize);
}
// We index the data structure
if (isIndexable(doc.getContent(), fsSettings.getFs().getFilters())) {

doc.getContent() is coming from the Tika output, correct? So right here is where we would implement the ignore logic:

if (fsSettings.getFs().isJsonSupport()) {
    // https://github.com/dadoonet/fscrawler/issues/5 : Support JSon files
    doc.setObject(DocParser.asMap(read(inputStream)));
} else if (fsSettings.getFs().isXmlSupport()) {
    // https://github.com/dadoonet/fscrawler/issues/185 : Support Xml files
    doc.setObject(XmlDocParser.generateMap(inputStream));
} else if (fsSettings.getFs().ignoreTika()) {
    // read the content ***************************
} else {
    // Extracting content with Tika
    TikaDocParser.generate(fsSettings, inputStream, filename, fullFilename, doc, messageDigest, filesize);
}
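
The "read the content" placeholder could be filled by draining the inputStream into a String before setting it on the doc. A self-contained sketch of such a helper follows; the class name is hypothetical, and how the result would be attached to the doc is my assumption about the surrounding code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamToString {

    // Drain an InputStream into a UTF-8 String, buffer by buffer.
    public static String drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        return out.toString(StandardCharsets.UTF_8.name());
    }
}
```

Inside the branch, the result would then go into the existing doc, e.g. something along the lines of doc.setContent(StreamToString.drain(inputStream)).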

Now I think you get my whole point, right?

@dadoonet
Owner

But I thought you wanted to ignore Tika just in case of error. So you actually want to skip Tika for every document?
The change makes sense to me. I'd rather call the option skip_tika: true (defaulting to false), though.
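
Under that naming, the hypothetical settings key would read as follows; the nesting under fs is still my assumption:

```yaml
fs:
  # defaults to false; when true, raw content is indexed as-is
  skip_tika: true
```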

@shahariaazam
Contributor Author

But I thought you wanted to ignore Tika just in case of error. So you actually want to skip Tika for every document?

Haha, no. Not just in case of error; I want to skip it entirely. That's why I shared the logic. Yes, I want to skip Tika for every document.

At least it would make fscrawler a candidate for more types of project. I was working on indexing a very large amount of source code (PHP, HTML, CSS, JS), but unfortunately Tika failed to parse files mixing PHP and HTML.

That's why I came up with this idea.

@shahariaazam
Contributor Author

The change makes sense to me. I'd rather call the option skip_tika: true (defaulting to false), though.

Yes, absolutely. So do I still need to open the PR, or will you take this one yourself?

@dadoonet
Owner

It's definitely better if you contribute it to the project yourself.
I'll guide you through the needed steps while doing the review.

@shahariaazam
Contributor Author

It's definitely better if you contribute it to the project yourself.
I'll guide you through the needed steps while doing the review.

Yes, of course, I would love to. I always try to contribute to the projects that I love and use frequently. I will find a suitable time to open a PR.

shahariaazam added a commit to shahariaazam/fscrawler that referenced this issue Nov 25, 2019
…y default skip_tika: false of course.

Relevant Issue dadoonet#846
@shahariaazam
Contributor Author

@dadoonet can you please review PR #858? If all goes well, I will also submit another patch for the documentation about the new config key skip_tika.

@dadoonet dadoonet added the new For new features or options label Nov 26, 2019
shahariaazam added a commit to shahariaazam/fscrawler that referenced this issue Nov 30, 2019
…y default skip_tika: false of course.

Relevant Issue dadoonet#846