skip_tika option in job configuration is ignored #979
Comments
Could you try to change this setting to -1? https://fscrawler.readthedocs.io/en/latest/admin/fs/local-fs.html#extracted-characters
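For readers following along, the suggestion above might look like the fragment below. This is only a sketch: it assumes a YAML job file as used by FSCrawler 2.7 (older releases use JSON), and the job name and directory are placeholders, not values from this report. The linked "extracted characters" section documents the setting as `indexed_chars` under `fs`.

```yaml
# Hypothetical job settings file; "my_job" and the url are placeholders.
name: "my_job"
fs:
  url: "/path/to/files"
  # "-1" asks FSCrawler to extract the whole content instead of a
  # limited number of characters. A percentage such as "100%" is the
  # other form tried later in this thread.
  indexed_chars: "-1"
```

After changing the setting, the job presumably has to be re-run (the reporter used --restart) so that already-scanned files are indexed again.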
Set it to -1:
I also tried setting the value to 100%:
Oddly enough, it's only some files. They all have the same permissions and the same owner/group (the same user that runs the app), and no ACLs. Even SELinux is disabled on this system.
EDIT: Some of the files are quite small, so it's not a size problem:
EDIT #2:
Could you run with the |
It is Tika that's doing it. Why Tika looks at a foo.txt containing plain text and sees XML, I have no idea. Doesn't the skip_tika option completely skip using Tika to identify a file?
In the meantime I'll download Tika as a standalone and see what it's doing.
Okay, it is 100% Tika that's doing it ...
... 1.txt is the plain email with header
$ java -cp .:./TikaTest.class:./lib/tika-core-1.24.1.jar TikaTest ./1.txt
We can turn this into a bug that "skip_tika" is being ignored in the configuration. I'll work out where the problem with Tika is with them.
Describe the bug
While scanning text files, the app rejects some files as invalid, badly formatted XML. I figured Tika must be misinterpreting the file, so I tried to turn Tika off with the "skip_tika" setting as described in #846, but that did not work (I restarted the scan with --restart). The file is not fully searchable in Kibana and is not included in the result set (tested with a small file sample).
Job Settings
Logs
Expected behavior
The full content of file should be indexed and searchable.
Versions: