Skip Tika parsing with skip_tika new option #858

Open
shahariaazam wants to merge 3 commits into master
Conversation

@shahariaazam (Contributor) commented Nov 25, 2019

Following up on the discussion in #846, I have made the changes. In my case, it works as initially proposed.

Here is the outcome of these changes.

<!-- File: footer (copy).html -->
<img src="{$src}">

What happened before this change?
Tika can't parse this file content as it's not valid standalone HTML, so nothing useful is extracted. A similar case can happen if you work with files like this one (template fragments that are not valid HTML on their own).

Trace (before this change)

00:17:23,197 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='footer (copy).html', file=true, directory=false, lastModifiedDate=2019-11-25T23:24:46, creationDate=2019-11-25T23:24:46, accessDate=2019-11-26T00:17:01, path='/tmp/es', owner='shaharia', group='shaharia', permissions=664, extension='html', fullpath='/tmp/es/footer (copy).html', size=20}
00:17:23,198 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/tmp/es, /tmp/es/footer (copy).html) = /footer (copy).html
00:17:23,198 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/footer (copy).html], includes = [null], excludes = [[*/~*]]
00:17:23,198 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/footer (copy).html], excludes = [[*/~*]]
00:17:23,198 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
00:17:23,198 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
00:17:23,198 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/footer (copy).html], includes = [null]
00:17:23,198 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
00:17:23,198 DEBUG [f.p.e.c.f.FsParserAbstract] [/footer (copy).html] can be indexed: [true]
00:17:23,198 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /footer (copy).html
00:17:23,198 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/tmp/es],[footer (copy).html]
00:17:23,198 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/tmp/es, /tmp/es/footer (copy).html) = /footer (copy).html
00:17:23,198 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [/tmp/es/footer (copy).html]
00:17:23,198 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
00:17:23,390 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
00:17:23,390 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
00:17:23,390 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
00:17:23,390 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing fscrawler_tmp_es/eeb62f12e25aa977e021de3beff734?pipeline=null
00:17:23,390 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
  "content" : "\n\n",
  "meta" : { },
  "file" : {
    "extension" : "html",
    "content_type" : "text/html; charset=ISO-8859-1",
    "created" : "2019-11-25T17:24:46.000+0000",
    "last_modified" : "2019-11-25T17:24:46.000+0000",
    "last_accessed" : "2019-11-25T18:17:01.000+0000",
    "indexing_date" : "2019-11-25T18:17:23.198+0000",
    "filesize" : 20,
    "filename" : "footer (copy).html",
    "url" : "file:///tmp/es/footer (copy).html"
  },
  "path" : {
    "root" : "824b64ab42d4b63cda6e747e2b80e5",
    "virtual" : "/footer (copy).html",
    "real" : "/tmp/es/footer (copy).html"
  }
}

What is the benefit of this change?
We can skip Tika completely by adding the config option skip_tika: true|false. So if we know that our use case doesn't require Tika parsing, we can index the raw content of the file instead.
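
For reference, a minimal sketch of how the proposed setting could look in a job's settings file, assuming the YAML format. This is only an illustration: the job name is a placeholder, and placing skip_tika under the fs section is an assumption based on the fsSettings.getFs().isSkipTika() call reviewed later in this thread; the option is proposed in this PR and is not part of a released FSCrawler version.

```yaml
# Hypothetical job settings illustrating the proposed option (not a released FSCrawler feature).
name: "tmp_es"
fs:
  url: "/tmp/es"
  # Proposed in this PR: index the raw file content instead of Tika-extracted text.
  skip_tika: true
```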

Trace (after this change)

00:58:10,950 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='footer (copy).html', file=true, directory=false, lastModifiedDate=2019-11-25T23:24:46, creationDate=2019-11-25T23:24:46, accessDate=2019-11-26T00:17:23, path='/tmp/es', owner='shaharia', group='shaharia', permissions=664, extension='html', fullpath='/tmp/es/footer (copy).html', size=20}
00:58:10,950 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/tmp/es, /tmp/es/footer (copy).html) = /footer (copy).html
00:58:10,950 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/footer (copy).html], includes = [null], excludes = [[*/~*]]
00:58:10,950 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/footer (copy).html], excludes = [[*/~*]]
00:58:10,950 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
00:58:10,950 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
00:58:10,950 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/footer (copy).html], includes = [null]
00:58:10,950 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
00:58:10,950 DEBUG [f.p.e.c.f.FsParserAbstract] [/footer (copy).html] can be indexed: [true]
00:58:10,950 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /footer (copy).html
00:58:10,950 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/tmp/es],[footer (copy).html]
00:58:10,951 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/tmp/es, /tmp/es/footer (copy).html) = /footer (copy).html
00:58:10,951 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
00:58:10,951 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing fscrawler_tmp_es/eeb62f12e25aa977e021de3beff734?pipeline=null
00:58:10,951 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
  "content" : "<img src=\"{$src}\">",
  "meta" : { },
  "file" : {
    "extension" : "html",
    "created" : "2019-11-25T17:24:46.000+0000",
    "last_modified" : "2019-11-25T17:24:46.000+0000",
    "last_accessed" : "2019-11-25T18:17:23.000+0000",
    "indexing_date" : "2019-11-25T18:58:10.951+0000",
    "filesize" : 20,
    "filename" : "footer (copy).html",
    "url" : "file:///tmp/es/footer (copy).html"
  },
  "path" : {
    "root" : "824b64ab42d4b63cda6e747e2b80e5",
    "virtual" : "/footer (copy).html",
    "real" : "/tmp/es/footer (copy).html"
  }
}

Summary
In certain cases, Tika limits how fscrawler can be used. With skip_tika: true we can use fscrawler to index any kind of file.

@dadoonet (Owner) left a comment


The change looks overall good. I left some minor comments.
Could you change that?
Then also please add some tests (I can help if you need) and some documentation for this new setting.

Thanks a lot!

@shahariaazam (Contributor, Author)

> The change looks overall good. I left some minor comments.
> Could you change that?
> Then also please add some tests (I can help if you need) and some documentation for this new setting.
>
> Thanks a lot!

I will change the code as per your feedback. For the tests, I need your help. Can you take care of them after I update this PR?

Also, for the documentation, I will send another PR after fixing this one.

@dadoonet (Owner)

Could you please rebase your code on master?

It will most likely remove the first commit, as I merged it today, and it should make Travis-CI happy with your change so we can move forward with tests.

@dadoonet changed the title from "Issue #846 - Tika Parser can be avoided by adding skip_tika" to "Skip Tika parsing with skip_tika new option" on Nov 26, 2019
@dadoonet added the "new" label (For new features or options) on Nov 26, 2019
@dadoonet added this to the 2.7 milestone on Nov 26, 2019
@dadoonet self-assigned this on Nov 26, 2019
@dadoonet (Owner)

In your commit message, could you also add:

Closes #846  

So when merged, this will automatically close the initial issue.

@shahariaazam (Contributor, Author)

> In your commit message, could you also add:
>
> Closes #846
>
> So when merged, this will automatically close the initial issue.

Sorry, I forgot to mention that in my latest changes.

@dadoonet (Owner)

Could you rebase your branch on master and solve the conflicts?

@shahariaazam (Contributor, Author)

> Could you rebase your branch on master and solve the conflicts?

Done. Please check.

@dadoonet (Owner)

I don't think it's a rebase.

Did you really rebase on master?

@shahariaazam (Contributor, Author)

> I don't think it's a rebase.
>
> Did you really rebase on master?

Yes, it was a rebase.

@shahariaazam (Contributor, Author)

> I don't think it's a rebase.
>
> Did you really rebase on master?

Does any comment seem to be missing?

@shahariaazam (Contributor, Author)

> I don't think it's a rebase.
> Did you really rebase on master?
>
> Does any comment seem to be missing?

If there is any confusion, could you look at this: https://github.com/shahariaazam/fscrawler/commits/issue-846-v2 ? If it's OK, then maybe I can push this to this PR again.

@dadoonet (Owner)

It looks to me like you merged the master branch into your branch somehow, but did not rebase on master.

@shahariaazam (Contributor, Author)

> It looks to me like you merged the master branch into your branch somehow, but did not rebase on master.

Did you see this? https://github.com/shahariaazam/fscrawler/commits/issue-846-v2

@shahariaazam (Contributor, Author)

> It looks to me like you merged the master branch into your branch somehow, but did not rebase on master.

I am extremely sorry, @dadoonet. It seems I made a mistake; there was a merge. I just fixed it and rebased on master. It should be perfectly fine now. Please let me know.

@dadoonet (Owner) left a comment


One last thing to change.
I'm also going to send you a PR to start adding some tests. I'm not sure yet how this will work out, specifically because the integration tests might fail now. Let's see...

@@ -469,6 +469,9 @@ private void indexFile(FileAbstractModel fileAbstractModel, ScanStatistic stats,
    } else if (fsSettings.getFs().isXmlSupport()) {
        // https://github.com/dadoonet/fscrawler/issues/185 : Support Xml files
        doc.setObject(XmlDocParser.generateMap(inputStream));
    } else if (fsSettings.getFs().isSkipTika()) {
@dadoonet (Owner) commented on this diff:

I was thinking about it today while adding some tests. I found out that we should move this logic into the TikaDocParser#generate method.

Two reasons:

  • We can more easily unit test this.
  • The UploadAPI class (REST API) must also see this setting. UploadAPI calls the generate method, so the logic should be called from there.

Could you change that, please?
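
To make the suggestion concrete, here is a rough, self-contained sketch of the kind of raw-content branch that could sit inside TikaDocParser#generate when skip_tika is enabled. It is only an illustration under stated assumptions: the class RawContentReader and the helper readRawContent are hypothetical names, not FSCrawler's actual API, and the real integration would use FSCrawler's own Doc and FsSettings types.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class RawContentReader {

    // Reads the whole stream into a string so the document "content" field can hold
    // the raw bytes of the file instead of text extracted by Tika.
    public static String readRawContent(InputStream inputStream) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = inputStream.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Tiny usage example with an in-memory stream standing in for the crawled file.
        InputStream in = new ByteArrayInputStream(
                "<img src=\"{$src}\">".getBytes(StandardCharsets.UTF_8));
        System.out.println(readRawContent(in)); // prints the raw content unchanged
    }
}
```

Presumably, when the skip_tika setting is enabled, generate would read the raw content like this, put it on the document, and return early, bypassing the Tika extraction path entirely.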

@dadoonet (Owner) commented Jan 7, 2020

Hey @shahariaazam! Would you like to update your PR?

@dadoonet (Owner)

@shahariaazam any news?

@dadoonet added the "wait for feedback" label (Waiting for the user feedback) on Jan 17, 2020
@shahariaazam (Contributor, Author)

> @shahariaazam any news?

I will update this PR this weekend. Sorry for the delay; a lot of work recently. :)

@dadoonet (Owner)

No worries! I just wanted to make sure you're still interested in this. 😉

@dadoonet (Owner)

@shahariaazam Any spare time to move this forward?

@dadoonet (Owner)

Hey! Is that something you would like to bring in?

@dadoonet removed this from the 2.7 milestone on Dec 23, 2020