skip_tika option in job configuration is ignored #979

Open
aram535 opened this issue Jul 13, 2020 · 5 comments
Labels
check_for_bug Needs to be reproduced

Comments

@aram535
Contributor

aram535 commented Jul 13, 2020

Describe the bug

While scanning text files, the app rejects some files as invalid, badly formatted XML. I figured Tika must be misinterpreting the files, so I tried to turn Tika off with the "skip_tika" setting described in #846, but that did not work (I restarted the scan with --restart). The affected files are not fully searchable in Kibana and are not included in the result set (tested with a small sample of files).

Job Settings

name: "idx"
fs:
  url: "/fs/archive/files/"
  update_rate: "60m"
  includes:
    - "*/*.txt"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: true
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
  skip_tika: true
elasticsearch:
  nodes:
    - url: "http://192.168.1.5:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

Logs

06:05:25,072 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [/fs/archive/files/13613.txt]  -> XML parse error -> The markup in the document following the root element must be well-formed.

Expected behavior

The full content of file should be indexed and searchable.

Versions:

  • OS: Linux (CentOS 8)
  • Version: fscrawler-es7-2.7-SNAPSHOT ... ES: 7.8.0
@aram535 aram535 added the check_for_bug Needs to be reproduced label Jul 13, 2020
@dadoonet
Owner

Could you try changing this setting to -1? https://fscrawler.readthedocs.io/en/latest/admin/fs/local-fs.html#extracted-characters

@aram535
Contributor Author

aram535 commented Jul 13, 2020

Set it to -1:
"Failed to extract [-1] characters of text for .... "

I also tried setting the value to 100%:
"Failed to extract [25376] characters of text for ... "

Oddly enough, it only affects some files ... they all have the same permissions and the same owner/group (the same user that runs the app), and no ACLs. Even SELinux is disabled on this system.

EDIT: Some of the files are quite small, so it's not a size problem:

08:13:15,135 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [9023] characters of text for [/fs/archive/files/33213.txt]  -> XML parse error -> The markup in the document following the root element must be well-formed.
08:13:15,135 WARN  [f.p.e.c.f.FsParserAbstract] trying to add new file while closing crawler. Document [idx]/[f3211c7cacf16e729944445afff9242b] has been ignored
^C
$ wc -l /fs/archive/files/33213.txt
143 /fs/archive/files/33213.txt
$ wc -c /fs/archive/files/33213.txt
9023 /fs/archive/files/33213.txt

EDIT #2:
There might be something odd about the content of the file. I think the email headers at the beginning of the file are being interpreted as XML. I removed all the headers from the file and restarted the crawler, and the WARN disappeared.
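
For anyone who wants to repeat that manual step on more than one file, here is a minimal sketch. It assumes the files begin with an email-style header block that ends at the first blank line, which is how such messages are normally laid out. `HeaderStripper` and `stripHeaders` are hypothetical names made up for this illustration; they are not part of FSCrawler or Tika.

```java
import java.util.Arrays;
import java.util.List;

public class HeaderStripper {
    // Hypothetical helper (not FSCrawler or Tika code): returns the lines
    // after the first blank line, i.e. the body when the file starts with
    // an email header block. Without a blank line, the input is returned as-is.
    public static List<String> stripHeaders(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).trim().isEmpty()) {
                return lines.subList(i + 1, lines.size());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "From: someone@example.com",
            "Subject: hello",
            "",
            "Body line 1",
            "Body line 2");
        // prints [Body line 1, Body line 2]
        System.out.println(stripHeaders(sample));
    }
}
```

Note that a plain text file without headers would lose everything up to its first blank line, so this is only suitable for a test copy of files known to start with headers.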

@dadoonet
Owner

Could you run with the --debug option?

@aram535
Contributor Author

aram535 commented Jul 13, 2020

It is Tika that's doing it ... but why Tika looks at a foo.txt containing plain text and sees XML, I have no idea. Doesn't the skip_tika option completely skip using Tika to identify a file?

09:03:34,379 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [8690] characters of text for .....
org.apache.tika.exception.TikaException: XML parse error
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:81) ~[tika-parsers-1.24.1.jar:1.24.1]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.24.1.jar:1.24.1]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.24.1.jar:1.24.1]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.24.1.jar:1.24.1]

In the meantime I'll download Tika as a standalone and see what it's doing ...

@aram535 aram535 changed the title from "XML Parse error on txt files" to "skip_tika option in job configuration is ignored" Jul 13, 2020
@aram535
Contributor Author

aram535 commented Jul 13, 2020

Okay, 100% it is Tika that's doing it ...

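import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;

public class TikaTest {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get(args[0]);
        File f = path.toFile();

        TikaConfig tika = new TikaConfig();
        Metadata metadata = new Metadata();
        // Pass the file name as a hint so detection can also use the extension.
        metadata.set(Metadata.RESOURCE_NAME_KEY, f.toString());
        try (TikaInputStream stream = TikaInputStream.get(f)) {
            System.out.println("File " + f + " is " + tika.getDetector().detect(stream, metadata));
        }
    }
}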

... 1.txt is the plain email with header
... 2.txt is just the body of the email (removed the top 12 lines)

$ java -cp .:./TikaTest.class:./lib/tika-core-1.24.1.jar TikaTest ./1.txt
File ./1.txt is application/xml
$ java -cp .:./TikaTest.class:./lib/tika-core-1.24.1.jar TikaTest ./2.txt
File ./2.txt is text/plain
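
One way to narrow down which of the removed header lines sways the detection is to generate variants of 1.txt, each missing a single header line, and run the detector on every variant. The sketch below only builds the variants in memory and uses the standard library. `HeaderBisect` and `dropEachHeaderLine` are hypothetical names for this illustration, and the header count (12 in the test above) is passed in by the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeaderBisect {
    // Hypothetical helper: builds one variant per header line, each with
    // exactly that line removed, so each variant can be checked separately.
    public static List<List<String>> dropEachHeaderLine(List<String> lines, int headerCount) {
        List<List<String>> variants = new ArrayList<>();
        int limit = Math.min(headerCount, lines.size());
        for (int i = 0; i < limit; i++) {
            List<String> variant = new ArrayList<>(lines);
            variant.remove(i); // removes the element at index i
            variants.add(variant);
        }
        return variants;
    }

    public static void main(String[] args) {
        // Made-up sample content, not taken from the affected files.
        List<String> sample = Arrays.asList(
            "From: someone@example.com",
            "Subject: hello",
            "",
            "Body");
        for (List<String> variant : dropEachHeaderLine(sample, 2)) {
            System.out.println(String.join("\n", variant));
            System.out.println("----");
        }
    }
}
```

If the detected type flips to text/plain for exactly one variant, the line missing from that variant is the likely trigger. If several lines contribute, no single variant may flip, and removing lines in groups would be the next thing to try.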

We can turn this issue into a bug report that "skip_tika" is being ignored in the configuration. I'll work out the detection problem separately with the Tika project.


2 participants