DOMSplitter's children have wrong ContentType #95

mariuspruski · 2019-05-08T16:42:25Z

Issue Description
I am using two DOMSplitters in sequence in my configuration. The configuration runs on pages with the ContentType text/html. However, the child documents created by the first DOMSplitter are assigned the default ContentType application/octet-stream, even though they are clearly HTML fragments. Because the DOMSplitter only runs on content of type text/html, the second DOMSplitter is skipped entirely.

Current Workaround
I have to manually override the ContentType of the child documents with the value text/html.

Suggestion
(1) As far as I can see, the children created by the DOMSplitter will always be HTML documents themselves. We could simply initialize them with the ContentType of their parent document, so the contentTypeDetector would not have to look at them.
(2) Do not treat the DOM selector as if it were a filename. The reference of a document is not always a file name, so we would need to distinguish the cases.

Further information
The child documents created by the DOMSplitter carry no ContentType (=null). For that reason, automatic ContentType detection is subsequently run on them (Importer.java, line 227). The contentTypeDetector treats the document's reference (which is the DOM selector previously given to the DOMSplitter) as if it were a filename. It tries to extract a file extension from the DOM selector, so it picks up any CSS class in the selector. This misleads the contentTypeDetector, which ends up classifying the document as application/octet-stream.
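
To make the failure mode above concrete, here is a minimal sketch of extension-based type guessing applied to a reference that is really a DOM selector. It is not the Importer's actual detection code: the class name, the `extension` and `guessType` helpers, and the two-entry type table are all assumptions made for illustration.

```java
import java.util.Map;

class SelectorAsFilename {
    // Hypothetical, tiny extension-to-type table; not the Importer's real mapping.
    static final Map<String, String> TYPES =
            Map.of("html", "text/html", "xml", "application/xml");

    // Returns whatever follows the last dot, as a naive filename parser would.
    static String extension(String reference) {
        int dot = reference.lastIndexOf('.');
        return dot < 0 ? "" : reference.substring(dot + 1);
    }

    // Falls back to the generic binary type when the "extension" is unknown.
    static String guessType(String reference) {
        return TYPES.getOrDefault(extension(reference), "application/octet-stream");
    }

    public static void main(String[] args) {
        // A real file name: the extension is "html".
        System.out.println(guessType("page.html"));   // text/html
        // A DOM selector: the CSS class "article" is mistaken for an extension.
        System.out.println(guessType("div.article")); // application/octet-stream
    }
}
```

In this sketch the selector `div.article` yields the bogus extension `article`, which matches nothing in the table, so the fallback type wins.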

essiembre · 2019-05-14T18:29:26Z

It makes sense to keep the parent content type when splitting an XML/HTML document. I suppose there could be odd cases where it has negative side effects, such as splitting an XML document to obtain embedded JSON snippets. I agree this would be a nice feature, so I am marking it as such.

I would probably make it a flag to keep or not keep the parent content type, or maybe even allow specifying it explicitly (since you would likely know the content type of the children, if different).
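
One possible reading of the flag-or-explicit-type idea, sketched in plain Java. The `resolve` method and its `keepParentType` and `explicitType` parameters are hypothetical names chosen for this example, not existing DOMSplitter settings; the precedence (explicit type first, then the parent's type, otherwise leave it unset) is likewise an assumption.

```java
class ChildContentType {
    /**
     * Decides the content type to assign to a child document.
     * Returns null to mean "leave unset and let detection run".
     * All parameter names are hypothetical, for illustration only.
     */
    static String resolve(String parentType, boolean keepParentType, String explicitType) {
        if (explicitType != null) {
            return explicitType;   // the user knows the children's type
        }
        if (keepParentType) {
            return parentType;     // inherit from the document being split
        }
        return null;               // fall back to automatic detection
    }

    public static void main(String[] args) {
        // HTML split into HTML fragments: keep the parent's type.
        System.out.println(resolve("text/html", true, null));
        // XML split into embedded JSON snippets: state the type explicitly.
        System.out.println(resolve("application/xml", true, "application/json"));
        // Flag off and nothing explicit: detection would run as it does today.
        System.out.println(resolve("text/html", false, null));
    }
}
```

Giving the explicit type priority over the flag covers the XML-with-embedded-JSON case mentioned above without requiring the flag to be switched off.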

essiembre added the feature-request label May 14, 2019
