DOMSplitter's children have wrong ContentType #95

Open
mariuspruski opened this issue May 8, 2019 · 1 comment

Comments

@mariuspruski

mariuspruski commented May 8, 2019

Issue Description
I am using two DOMSplitter instances in sequence in my configuration. The configuration runs on pages with the ContentType text/html. However, the child documents created by the first DOMSplitter are assigned the default ContentType application/octet-stream, even though they are clearly HTML fragments. Because the DOMSplitter only runs on content of type text/html, the second DOMSplitter is skipped entirely.

Current Workaround
I have to manually override the ContentType of the child documents, setting it to text/html.

Suggestion
(1) As far as I can see, the children created by the DOMSplitter will always be HTML documents themselves. We could simply initialize them with the ContentType of their parent document, so the contentTypeDetector would not have to look at them.
(2) Do not treat the DOM Selector as if it were a filename. The reference of a document is not always a file name, so the detector would need to distinguish between these cases.
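
To make suggestion (1) concrete, here is a minimal sketch of a splitter that copies the parent's content type onto each child. It is not the Importer's actual code: the `Doc` class, the `split` method, and the child reference format are hypothetical stand-ins chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class InheritContentTypeSketch {

    // Hypothetical stand-in for an importer document (not a Norconex class).
    static final class Doc {
        final String reference;
        final String contentType; // null would mean "not yet detected"
        final String content;

        Doc(String reference, String contentType, String content) {
            this.reference = reference;
            this.contentType = contentType;
            this.content = content;
        }
    }

    // Builds one child per extracted fragment and initializes each child
    // with the parent's content type instead of leaving it null.
    static List<Doc> split(Doc parent, List<String> fragments, String selector) {
        List<Doc> children = new ArrayList<>();
        for (int i = 0; i < fragments.size(); i++) {
            // Assumed reference format, for illustration only.
            String childRef = parent.reference + "!" + selector + "_" + i;
            children.add(new Doc(childRef, parent.contentType, fragments.get(i)));
        }
        return children;
    }

    public static void main(String[] args) {
        Doc parent = new Doc("http://example.com/page.html", "text/html", "<html></html>");
        for (Doc child : split(parent, List.of("<div>a</div>", "<div>b</div>"), "div.item")) {
            System.out.println(child.reference + " -> " + child.contentType);
        }
    }
}
```

With the content type already set, a later handler restricted to text/html would accept the children without any detection step.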

Further information
The child documents created by the DOMSplitter carry no ContentType (it is null). For that reason, the automatic ContentType detection is subsequently run on them (Importer.java, line 227). The contentTypeDetector treats the document reference (which will be the DOM Selector previously given to the DOMSplitter) as if it were a filename and tries to extract a file extension from it, so any CSS class in the DOM Selector is picked up as an extension. This misleads the contentTypeDetector, which ends up classifying the document as application/octet-stream.
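
The failure mode can be illustrated with a deliberately simplified, extension-only guesser. This is an assumption-laden sketch, not the real contentTypeDetector (which is more elaborate): the class name, the extension table, and the sample references are all made up for the example.

```java
import java.util.Map;

final class SelectorAsFilenameDemo {

    // Hypothetical extension table; a real detector knows far more types.
    private static final Map<String, String> KNOWN_EXTENSIONS = Map.of(
            "html", "text/html",
            "htm", "text/html",
            "xml", "application/xml",
            "json", "application/json");

    static final String FALLBACK = "application/octet-stream";

    // Naive guess: treat everything after the last dot as a file extension.
    static String guessFromReference(String reference) {
        int dot = reference.lastIndexOf('.');
        if (dot < 0 || dot == reference.length() - 1) {
            return FALLBACK;
        }
        String ext = reference.substring(dot + 1).toLowerCase();
        return KNOWN_EXTENSIONS.getOrDefault(ext, FALLBACK);
    }

    public static void main(String[] args) {
        // A URL ending in ".html" resolves as expected.
        System.out.println(guessFromReference("http://example.com/page.html"));
        // A CSS selector: the class name "article-body" is mistaken for an extension.
        System.out.println(guessFromReference("div.article-body"));
    }
}
```

In this sketch the selector `div.article-body` yields the "extension" `article-body`, which matches nothing and falls back to application/octet-stream, mirroring the behavior described above.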

@essiembre
Contributor

It makes sense to keep the parent content type when splitting an XML/HTML document. I suppose there could be odd cases where it would have negative side effects, such as splitting an XML document to obtain embedded JSON snippets. I agree this would be a nice feature, so I am marking it as such.

I would probably make it a flag that controls whether the parent content type is kept, or maybe even allow specifying it explicitly (since you would likely know the content type of the children if it differs).
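
One way such options could combine is sketched below. The names `keepParentType` and `explicitType` are hypothetical, not existing DOMSplitter settings, and the precedence (explicit type first, then the flag, then detection) is only one plausible choice.

```java
final class ChildContentTypePolicy {

    // Decides which content type a child document starts with.
    // Returns null when the type should be left to automatic detection.
    static String resolve(String parentType, boolean keepParentType, String explicitType) {
        // An explicitly configured type wins, e.g. JSON snippets embedded in XML.
        if (explicitType != null && !explicitType.isEmpty()) {
            return explicitType;
        }
        // Otherwise inherit from the parent only when the flag is set.
        if (keepParentType) {
            return parentType;
        }
        // Neither option applies: defer to detection.
        return null;
    }

    public static void main(String[] args) {
        System.out.println(resolve("text/html", true, null));
        System.out.println(resolve("application/xml", false, "application/json"));
        System.out.println(resolve("text/html", false, null));
    }
}
```

Letting the explicit value take precedence covers the XML-with-embedded-JSON case mentioned above, while the flag covers the common HTML-fragment case.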
