Issue Description
I am using two DOMSplitters in sequence in my configuration. The configuration runs on pages with the ContentType text/html. However, the child documents created by the first DOMSplitter are assigned the default ContentType application/octet-stream, even though they are clearly HTML fragments. Because the DOMSplitter only runs on content of type text/html, the second DOMSplitter is skipped entirely.
Current Workaround
I have to manually override the ContentType of the child documents with the value text/html.
Suggestion
(1) As far as I can see, the children created by the DOMSplitter will always be HTML documents themselves. They could simply be initialized with the ContentType of their parent document, so the contentTypeDetector would not have to look at them.
(2) Do not treat the DOM selector as if it were a filename. The reference of a document is not always a file name, so the two cases would need to be distinguished.
Further information
The child documents created by the DOMSplitter carry no ContentType (null). For that reason, the automatic ContentType detection mechanism is subsequently run on them (Importer.java, line 227). The contentTypeDetector treats the document reference (which is the DOM selector previously given to the DOMSplitter) as if it were a filename. It tries to extract a file extension from the selector, so it picks up any CSS class in it. This misleads the contentTypeDetector, which ends up classifying the document as application/octet-stream.
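To illustrate the failure mode described above, here is a minimal sketch of extension-based detection applied to a selector-style reference. It is not the Importer's actual detector: the class `SelectorExtensionDemo`, its extension table, and the sample references are all hypothetical, and the real contentTypeDetector may use additional signals.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an extension-based content type lookup.
// It only demonstrates why a CSS selector is a poor "filename".
public class SelectorExtensionDemo {

    // Assumed, deliberately tiny extension table (not the Importer's).
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "html", "text/html",
            "htm", "text/html",
            "pdf", "application/pdf");

    // Returns whatever follows the last dot, or "" if there is none.
    public static String extensionOf(String reference) {
        int dot = reference.lastIndexOf('.');
        if (dot < 0 || dot == reference.length() - 1) {
            return "";
        }
        return reference.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Falls back to the generic binary type for unknown extensions.
    public static String guessContentType(String reference) {
        return BY_EXTENSION.getOrDefault(
                extensionOf(reference), "application/octet-stream");
    }

    public static void main(String[] args) {
        // A URL-like reference has a meaningful extension.
        System.out.println(guessContentType("http://example.com/page.html"));
        // A selector such as "div.article" yields the CSS class "article"
        // as its "extension", which matches nothing in the table.
        System.out.println(guessContentType("div.article"));
    }
}
```

In this sketch the CSS class after the dot is taken for a file extension, so the lookup falls through to the default type, which matches the behavior reported in this issue.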
It makes sense to keep the parent content type when splitting XML/HTML. I suppose there could be odd cases where it has negative side effects, such as splitting an XML document to obtain embedded JSON snippets. I agree this would be a nice feature, so I am marking it as such.
I would probably make it a flag to keep or discard the parent content type, or maybe even allow specifying it explicitly (since you would likely know the content type of the children if it differs).
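The flag idea above could be sketched roughly as follows. This is only an illustration of the precedence being discussed (explicit type first, then the parent's type, otherwise leave it to detection); `ChildContentTypePolicy` and its fields are hypothetical names and do not correspond to existing DOMSplitter options.

```java
// Hypothetical policy object; not part of the Importer API.
public class ChildContentTypePolicy {

    private final boolean keepParentContentType;
    private final String explicitContentType; // null when not configured

    public ChildContentTypePolicy(
            boolean keepParentContentType, String explicitContentType) {
        this.keepParentContentType = keepParentContentType;
        this.explicitContentType = explicitContentType;
    }

    // Returns the content type to assign to a child document, or null
    // to leave the child untyped so automatic detection still runs.
    public String resolve(String parentContentType) {
        if (explicitContentType != null) {
            return explicitContentType;   // explicit setting wins
        }
        if (keepParentContentType) {
            return parentContentType;     // inherit from the parent
        }
        return null;                      // fall back to detection
    }

    public static void main(String[] args) {
        // HTML split into HTML fragments: inherit the parent's type.
        System.out.println(
                new ChildContentTypePolicy(true, null).resolve("text/html"));
        // XML split into embedded JSON snippets: state the type explicitly.
        System.out.println(
                new ChildContentTypePolicy(false, "application/json")
                        .resolve("application/xml"));
    }
}
```

Giving the explicit value precedence over the flag covers the XML-with-embedded-JSON case mentioned above, while leaving both unset preserves the current detection behavior.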