Skip Tika parsing with `skip_tika` new option #858

shahariaazam · 2019-11-25T19:18:57Z

Following up on the discussion in #846, I have made the changes. In my case, it works as initially proposed.

Here is the outcome of these changes.

<!-- File: footer (copy).html -->
<img src="{$src}">

What happened before these changes?
Tika can't parse this file's content because it is not valid HTML. A similar case can happen if you specially work .

Trace (before making these changes)

00:17:23,197 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='footer (copy).html', file=true, directory=false, lastModifiedDate=2019-11-25T23:24:46, creationDate=2019-11-25T23:24:46, accessDate=2019-11-26T00:17:01, path='/tmp/es', owner='shaharia', group='shaharia', permissions=664, extension='html', fullpath='/tmp/es/footer (copy).html', size=20}
00:17:23,198 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/tmp/es, /tmp/es/footer (copy).html) = /footer (copy).html
00:17:23,198 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/footer (copy).html], includes = [null], excludes = [[*/~*]]
00:17:23,198 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/footer (copy).html], excludes = [[*/~*]]
00:17:23,198 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
00:17:23,198 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
00:17:23,198 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/footer (copy).html], includes = [null]
00:17:23,198 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
00:17:23,198 DEBUG [f.p.e.c.f.FsParserAbstract] [/footer (copy).html] can be indexed: [true]
00:17:23,198 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /footer (copy).html
00:17:23,198 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/tmp/es],[footer (copy).html]
00:17:23,198 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/tmp/es, /tmp/es/footer (copy).html) = /footer (copy).html
00:17:23,198 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [/tmp/es/footer (copy).html]
00:17:23,198 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
00:17:23,390 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
00:17:23,390 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
00:17:23,390 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
00:17:23,390 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing fscrawler_tmp_es/eeb62f12e25aa977e021de3beff734?pipeline=null
00:17:23,390 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
  "content" : "\n\n",
  "meta" : { },
  "file" : {
    "extension" : "html",
    "content_type" : "text/html; charset=ISO-8859-1",
    "created" : "2019-11-25T17:24:46.000+0000",
    "last_modified" : "2019-11-25T17:24:46.000+0000",
    "last_accessed" : "2019-11-25T18:17:01.000+0000",
    "indexing_date" : "2019-11-25T18:17:23.198+0000",
    "filesize" : 20,
    "filename" : "footer (copy).html",
    "url" : "file:///tmp/es/footer (copy).html"
  },
  "path" : {
    "root" : "824b64ab42d4b63cda6e747e2b80e5",
    "virtual" : "/footer (copy).html",
    "real" : "/tmp/es/footer (copy).html"
  }
}

What are the benefits of these changes?
We can skip Tika completely by adding the config option skip_tika: true|false. If we know that our use case doesn't require Tika parsing, we can index the raw content of the file instead.
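
As a rough illustration of what "raw content" means here, the sketch below reads a stream and decodes it without any parsing step. The class name RawContentSketch, the method readRaw, and the choice of UTF-8 are assumptions made for this example; they are not taken from the code in this PR.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not fscrawler's actual implementation.
final class RawContentSketch {

    // Reads the whole stream and decodes it as UTF-8 (an assumption),
    // without running any content extraction on it.
    static String readRaw(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Same markup as the sample file above, held in memory for the demo.
        byte[] file = "<img src=\"{$src}\">".getBytes(StandardCharsets.UTF_8);
        System.out.println(readRaw(new ByteArrayInputStream(file)));
    }
}
```

The markup survives unchanged because nothing tries to interpret it as HTML.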

Trace (after making these changes)

00:58:10,950 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='footer (copy).html', file=true, directory=false, lastModifiedDate=2019-11-25T23:24:46, creationDate=2019-11-25T23:24:46, accessDate=2019-11-26T00:17:23, path='/tmp/es', owner='shaharia', group='shaharia', permissions=664, extension='html', fullpath='/tmp/es/footer (copy).html', size=20}
00:58:10,950 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/tmp/es, /tmp/es/footer (copy).html) = /footer (copy).html
00:58:10,950 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/footer (copy).html], includes = [null], excludes = [[*/~*]]
00:58:10,950 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/footer (copy).html], excludes = [[*/~*]]
00:58:10,950 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
00:58:10,950 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
00:58:10,950 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/footer (copy).html], includes = [null]
00:58:10,950 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
00:58:10,950 DEBUG [f.p.e.c.f.FsParserAbstract] [/footer (copy).html] can be indexed: [true]
00:58:10,950 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /footer (copy).html
00:58:10,950 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/tmp/es],[footer (copy).html]
00:58:10,951 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/tmp/es, /tmp/es/footer (copy).html) = /footer (copy).html
00:58:10,951 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
00:58:10,951 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing fscrawler_tmp_es/eeb62f12e25aa977e021de3beff734?pipeline=null
00:58:10,951 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
  "content" : "<img src=\"{$src}\">",
  "meta" : { },
  "file" : {
    "extension" : "html",
    "created" : "2019-11-25T17:24:46.000+0000",
    "last_modified" : "2019-11-25T17:24:46.000+0000",
    "last_accessed" : "2019-11-25T18:17:23.000+0000",
    "indexing_date" : "2019-11-25T18:58:10.951+0000",
    "filesize" : 20,
    "filename" : "footer (copy).html",
    "url" : "file:///tmp/es/footer (copy).html"
  },
  "path" : {
    "root" : "824b64ab42d4b63cda6e747e2b80e5",
    "virtual" : "/footer (copy).html",
    "real" : "/tmp/es/footer (copy).html"
  }
}

Summary
In certain cases, Tika limits the scope of use of fscrawler. With skip_tika: true, we can use fscrawler to index any kind of file system content.

dadoonet

The change looks overall good. I left some minor comments.
Could you change that?
Then also please add some tests (I can help if you need) and some documentation for this new setting.

Thanks a lot!

core/src/main/java/fr/pilato/elasticsearch/crawler/fs/FsParserAbstract.java

distribution/src/main/scripts/fscrawler

distribution/src/main/scripts/fscrawler.bat

shahariaazam · 2019-11-26T11:26:08Z

The change looks overall good. I left some minor comments.
Could you change that?
Then also please add some tests (I can help if you need) and some documentation for this new setting.

Thanks a lot!

I will change the code as per your feedback. For the tests, I need your help. Can you take care of them after I modify this PR?

Also for the documentation I will send another PR after fixing this one.

dadoonet · 2019-11-26T15:52:41Z

Could you please rebase your code on master?

It will most likely remove the first commit as I merged it today and should make Travis-CI happy with your change so we can move forward with tests.

dadoonet · 2019-11-26T15:55:07Z

In your commit message, could you also add:

Closes #846

So when merged, this will close automatically the initial issue.

shahariaazam · 2019-11-26T21:27:15Z

In your commit message, could you also add:
Closes #846  
So when merged, this will close automatically the initial issue.

Sorry to mention that in my latest changes.

dadoonet · 2019-11-29T16:55:02Z

Could you rebase your branch on master and solve the conflicts?

shahariaazam · 2019-11-30T16:26:23Z

Could you rebase your branch on master and solve the conflicts?

Done. Please check.

dadoonet · 2019-11-30T17:25:32Z

I don't think it's a rebase.

Did you really rebase on master?

shahariaazam · 2019-11-30T17:26:15Z

I don't think it's a rebase.

Did you really rebase on master?

Yes, it was a rebase.

shahariaazam · 2019-11-30T17:26:51Z

I don't think it's a rebase.

Did you really rebase on master?

Does any comment seem to be missing?

shahariaazam · 2019-11-30T17:35:21Z

I don't think it's a rebase.
Did you really rebase on master?

Does any comment seem to be missing?

If there is any confusion, can you look at this: https://github.com/shahariaazam/fscrawler/commits/issue-846-v2 ? If it's OK, then maybe I can push this to this PR again.

dadoonet · 2019-11-30T17:40:45Z

It looks to me like you somehow merged the master branch into your branch but did not rebase on master.

shahariaazam · 2019-11-30T17:41:22Z

It looks to me like you somehow merged the master branch into your branch but did not rebase on master.

https://github.com/shahariaazam/fscrawler/commits/issue-846-v2 Did you see this?

shahariaazam · 2019-11-30T17:52:49Z

It looks to me like you somehow merged the master branch into your branch but did not rebase on master.

I am extremely sorry, @dadoonet. It seems I made a mistake: there was a merge in the history. I just fixed it and rebased on master. It should be fine now. Please let me know.

dadoonet

One last thing to change.
I'm also going to send you a PR to start adding some tests. I'm not sure yet how this will work in the end, specifically because of the integration tests, which might fail now. Let's see...

dadoonet · 2019-12-09T07:37:48Z

core/src/main/java/fr/pilato/elasticsearch/crawler/fs/FsParserAbstract.java

@@ -469,6 +469,9 @@ private void indexFile(FileAbstractModel fileAbstractModel, ScanStatistic stats,
                } else if (fsSettings.getFs().isXmlSupport()) {
                    // https://github.com/dadoonet/fscrawler/issues/185 : Support Xml files
                    doc.setObject(XmlDocParser.generateMap(inputStream));
+                } else if (fsSettings.getFs().isSkipTika()) {


I was thinking about this today while adding some tests, and I found that we should move this logic into the TikaDocParser#generate method.

Two reasons:

We can more easily unit test this

The UploadAPI class (REST API) must also see this setting. UploadAPI calls the generate method, which is why the logic should be called from there.

Could you change that please?
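
The two reasons above can be sketched as follows: if the skip decision lives inside one generate-style method, a unit test can exercise it directly, and every caller (crawler or REST upload) gets the same behaviour. This is a minimal sketch under stated assumptions, not fscrawler's real code: DocGeneratorSketch, its constructor, and the Function used as a stand-in for the Tika extraction step are all hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical sketch; names are illustrative, not fscrawler's real API.
final class DocGeneratorSketch {
    private final boolean skipTika;                   // stand-in for the skip_tika setting
    private final Function<byte[], String> extractor; // stand-in for the Tika extraction step

    DocGeneratorSketch(boolean skipTika, Function<byte[], String> extractor) {
        this.skipTika = skipTika;
        this.extractor = extractor;
    }

    // Single entry point: the file crawler and a REST upload handler
    // would both call this, so both honour the setting.
    String generate(InputStream in) {
        byte[] bytes;
        try {
            bytes = in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (skipTika) {
            // Raw content, decoded as UTF-8 (an assumption for this sketch).
            return new String(bytes, StandardCharsets.UTF_8);
        }
        return extractor.apply(bytes);
    }

    public static void main(String[] args) {
        byte[] data = "<img src=\"{$src}\">".getBytes(StandardCharsets.UTF_8);
        // A fake extractor makes the branch taken visible without needing Tika.
        Function<byte[], String> fakeExtractor = b -> "PARSED";
        System.out.println(new DocGeneratorSketch(true, fakeExtractor)
                .generate(new ByteArrayInputStream(data)));
        System.out.println(new DocGeneratorSketch(false, fakeExtractor)
                .generate(new ByteArrayInputStream(data)));
    }
}
```

Injecting the extractor is a design choice made only to keep the sketch testable in isolation; the actual refactoring may look quite different.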

dadoonet · 2020-01-07T10:32:29Z

Hey @shahariaazam! Would you like to update your PR?

dadoonet · 2020-01-17T07:36:38Z

@shahariaazam any news?

shahariaazam · 2020-01-22T05:59:29Z

@shahariaazam any news?

This weekend I will update this PR. Sorry for the delay. Load of work recently. :)

dadoonet · 2020-01-22T06:04:13Z

No worries! I just wanted to make sure you're still interested in this. 😉

dadoonet · 2020-02-26T19:34:34Z

@shahariaazam Any spare time to move this forward?

dadoonet · 2020-12-23T10:16:55Z

Hey! Is that something you would like to bring in?

shahariaazam mentioned this pull request Nov 25, 2019

Can we ignore parsing content at all by adding config key in settings? #846

Open

dadoonet requested changes Nov 26, 2019

dadoonet changed the title ~~Issue #846 - Tika Parser can be avoided by adding skip_tika~~ Skip Tika parsing with skip_tika new option Nov 26, 2019

dadoonet added the new For new features or options label Nov 26, 2019

dadoonet added this to the 2.7 milestone Nov 26, 2019

dadoonet self-assigned this Nov 26, 2019

shahariaazam requested a review from dadoonet November 26, 2019 21:28

shahariaazam added 3 commits November 30, 2019 23:32

Tika parser can be avoided by adding skip_tika: true in the config. By default skip_tika: false of course. Relevant Issue dadoonet#846

949b333

Some minor coding standards fix as per PR review

dd38903

skip_tika config added in the documentation

515dae0

shahariaazam force-pushed the issue-846 branch from de68a7d to 515dae0 Compare November 30, 2019 17:50

dadoonet requested changes Dec 9, 2019

dadoonet added the wait for feedback Waiting for the user feedback label Jan 17, 2020

janhoy mentioned this pull request Sep 6, 2020

Processing pipeline support #1004

Draft

dadoonet removed this from the 2.7 milestone Dec 23, 2020
