-
Notifications
You must be signed in to change notification settings - Fork 299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skip Tika parsing with skip_tika
new option
#858
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks overall good. I left some minor comments.
Could you change that?
Then also please add some tests (I can help if you need) and some documentation for this new setting.
Thanks a lot!
core/src/main/java/fr/pilato/elasticsearch/crawler/fs/FsParserAbstract.java
Outdated
Show resolved
Hide resolved
core/src/main/java/fr/pilato/elasticsearch/crawler/fs/FsParserAbstract.java
Outdated
Show resolved
Hide resolved
core/src/main/java/fr/pilato/elasticsearch/crawler/fs/FsParserAbstract.java
Outdated
Show resolved
Hide resolved
core/src/main/java/fr/pilato/elasticsearch/crawler/fs/FsParserAbstract.java
Outdated
Show resolved
Hide resolved
core/src/main/java/fr/pilato/elasticsearch/crawler/fs/FsParserAbstract.java
Outdated
Show resolved
Hide resolved
core/src/main/java/fr/pilato/elasticsearch/crawler/fs/FsParserAbstract.java
Outdated
Show resolved
Hide resolved
I will change the codes as per your feedback. For Also for the documentation I will send another PR after fixing this one. |
Could you please rebase your code on master? It will most likely remove the first commit as I merged it today and should make Travis-CI happy with your change so we can move forward with tests. |
skip_tika
new option
In your commit message, could you also add:
So when merged, this will close automatically the initial issue. |
Sorry to mention that in my latest changes. |
Could you rebase your branch on master and solve the conflicts? |
Done. Please check. |
I don't think it's a rebase. Did you really rebase on master? |
Yes it was rebase. |
Did any comment seems missing? |
…y default skip_tika: false of course. Relevant Issue dadoonet#846
If there is any confusion, can you see this? https://github.com/shahariaazam/fscrawler/commits/issue-846-v2 ? If it's OK, then may be I can push this to this PR again. |
It looks like to me that you merged somewhat the master branch in your branch but did not rebase on master. |
https://github.com/shahariaazam/fscrawler/commits/issue-846-v2 Did you see this? |
de68a7d
to
515dae0
Compare
I am extremely Sorry @dadoonet . It seems I made a mistake. There has been some merge. I just fixed and rebased on master. It should be perfectly fine now. Please let me know. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last thing to change.
I'm going also to send you a PR to start adding some tests. I'm not sure yet how this will work at the end, specifically because of the integration tests which might fail now. Let see...
@@ -469,6 +469,9 @@ private void indexFile(FileAbstractModel fileAbstractModel, ScanStatistic stats, | |||
} else if (fsSettings.getFs().isXmlSupport()) { | |||
// https://github.com/dadoonet/fscrawler/issues/185 : Support Xml files | |||
doc.setObject(XmlDocParser.generateMap(inputStream)); | |||
} else if (fsSettings.getFs().isSkipTika()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of it today and adding some tests. I found out that we should move this logic in TikaDocParser#generate
method.
Two reasons:
- We can more easily unit test this
UploadAPI
class (Rest API) must also see this setting. UploadAPI callsgenerate
method. That's why we should call the logic from there.
Could you change that please?
Hey @shahariaazam! Would you like to update your PR? |
@shahariaazam any news? |
This weekend I will update this PR. Sorry for the delay. Load of work recently. :) |
No worries! I just wanted to make sure you're still interested in this. 😉 |
@shahariaazam Any spare time to move this forward? |
Hey! Is that something you would like to bring in? |
As we were continuing the discussion on #846. I just made the changes. In my case, it's working as initially it was proposed.
Here is the outcome for this changes.
What happened before this changes?
Tika can't parse this file content as it's not valid HTML. This similar case can happen if you specially work .
Trace (before making this changes)
What benefits from this changes?
We can skip Tika completely by adding config
skip_tika: true|false
. So if we know that our purpose doesn't require Tika parsing then we can go ahead with raw content from the file.Trace (after making this changes)
Summery
In certain case, Tika limits the scope of use of
fscrawler
. But with thisskip_tika: true
we can really usefscrawler
on any kinds of file system indexing.