-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support for HTML fragments in Document Assembler Content tag #86
Added support for HTML fragments in Document Assembler Content tag #86
Conversation
… and a Tutorial under docs.
Updated DA Multiline content test because the HtmlConverter creates paragraphs rather than inserting soft line breaks.
…nts in a Content select (joins them with a new line). Ran csharpier on all files. Added HtmlAgilityPack to paket references.
…t rather than using paket initially!
|
||
namespace Clippit.Word.Assembler | ||
{ | ||
internal static class HtmlConverter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PowerTools already has an HTML to Wml converter that supports a wide range of HTML tags, css styles, fonts, ol/ui lists, etc.
https://github.com/sergey-tihon/Clippit/blob/master/Clippit/Html/HtmlToWmlConverterCore.cs#L184-L190
It seems like it currently builds the whole document, but maybe we can refactor it and reuse it from the document assembler?
@MalcolmJohnston what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sergey-tihon thanks for coming back to me.
When I first added the support for simple HTML in DocumentAssembler I did consider whether to use the HtmlToWmlConverter.
The reason I didn't go with that approach was partially, because I had a hard time grasping all of the HTML to Wml Converter code, but also because I think the use-cases are different.
As you say currently the Html to Wml Convert is designed for entire documents, and it has a lot of clever functionality for dealing with styles and fonts so that you get a high fidelity conversion.
With the Document Assembler use case I think that we do not need that, and actually it could be retrograde. The reason is that with Document Assembler we are typically dealing with highly structured documents with specific document styles. So we wouldn't want the fonts and styles to be altered in the way that Html to Wml Converter does.
If we could find a way to bring HtmlConverter.cs more inline with Html to Wml Converter though then I could see the benefit of that.
Not sure that helps a lot, but some thoughts :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not mind keeping two implementations for a while (at least till we refactor the old one and make it universal and usable for both use cases)
However, I do like that we need to add a dependency on HtmlAgilityPack
just for this simpler implementation. As a nugget package author, we need to depend on as low a version of HtmlAgilityPack
as possible to minimize the change of version conflict with another package that has a direct or transitive dependency on this package. Same time we need to keep track of all security vulnerabilities inside HtmlAgilityPack
and bump versions when they are fixed. Can we leave with something like XElement xhtml
for a while (as it done inside HtmlToWmlConverter)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for using HtmlAgilityPack
is to handle HTML to XML conversion. The use-case for me is that we are sometimes pulling through HTML
that is not well formed or at least it is not XHTML
so that is the reason for adding HtmlAgilityPack
.
I understand your concerns about adding dependencies though, I guess the argument can be made that the caller should ensure that the HTML
is XHTML
compliant when using this library, and so we only support XHTML
in Clippit?
That would remove the requirement for HtmlAgilityPack
being a dependency. It would also make the conversion of my existing HtmlConverter
code to something closer to HtmlToWmlConvertorCore
simpler in future.
Let me know your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the argument can be made that the caller should ensure that the HTML is XHTML compliant when using this library, so we only support XHTML in Clippit?
Yes, exactly, the caller will have a choice:
- When
HTML
is known during compile time, it may be easy enough to guarantee that it isXHTML
compliant and does not pay runtime costs. - Add extra dependency to
HtmlAgilityPack
or equivalent, and support nonXHTML
compliantHTML
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok so the action here is to remove HtmlAgilityPack and update the implementation accordingly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, thank you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello Sergey,
Apologies for the delay in getting back to you. I had another look at this and decided to have a go at refactoring the HtmlConverter code so that it is closer to HtmlToWmlConverterCore.
This is not a complete refactoring as there is a fair amount of work to do, however it does start us along the journey.
What I have done is to create a trimmed down version of the Transform method from HtmlConverterCore in HtmlConverter. To facilitate this I have made a number of methods and classes internal
rather than private
.
The main difference between the two methods are:
- We do not support all elements in the HtmlConverter version
- We do not apply paragraph level styles in the HtmlConverter version (because these should be taken from the template)
The implementation as it is adds support for sub
, sup
and s
HTML elements which is an improvement on before. I also think it provides a pathway towards additional refactoring and further support for HTML tags in HtmlConverter.
I am not intending to do these now, but some of the next steps that would be required are:
- Decide on a mechansim for not applying Paragraph properties in the Transform method (use existing settings class?)
- Understand and add support for OL/LI tags (requires some pre-processing based on HtmlToWmlConverterCore
- Understand how CSS application works and add to HtmlConverter
- Refactor the Transform method so that it can take a "specification" of the elements that are supported, this could also be considered as an extensibility mechanism i.e. callers could override the default behaviour if needed
This should bring us a lot closer to a common code base for HTML processing which works both for entire documents and for document fragments as required by DocumentAssembler.
Let me know your thoughts.
As an aside did you see that https://github.com/OpenXmlDev/Open-Xml-PowerTools is looking for a new maintainer? I thought it might be worth approaching them and suggesting they forward people onto Clippit instead? As a lot of work has already been done in this repository.
Cheers,
Malcolm
Added tests for HTML content (not supported) and non-well formed XHTML. Looking at possibility of replacing XMLReader approach with something closer to HtmlToWmlConverter.
…-HTML-support-via-HtmlToWmlConverter Feature/document assembler html support via html to wml converter
@@ -1003,7 +965,7 @@ OpenXmlPart part | |||
// assign unique image and paragraph ids. Image id is document property Id (wp:docPr) | |||
// and relationship id is rId. Their numbering is different. | |||
const string imageId = InvalidImageId; // Ids will be replaced with real ones later, after transform is done | |||
var relationshipId = GetNextImageRelationshipId(part); | |||
var relationshipId = Relationships.GetNewRelationshipId(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MalcolmJohnston just wondering why you decided to use random generated id instead of deterministic/sequential one?
Released in NuGet Gallery | Clippit 2.4.0 |
Hello Sergey,
Hope you are well and thanks for merging my previous PRs.
This PR adds support for using HTML fragments with the Content tag in Document Assembler. The supported inline tags are:
Both div and p tags are supported block level tags, any other HTML tag will simply be ignored. Tags can be combined so for example the following works:
<a href="bbc.co.uk"><b><i><u>BBC</u></i></b></a>
I have added a new test, and a tutorial to the docs site.
This resolves #70 in my opinion although as noted in the Tutorial there are some other tags that we could potentially add support for.
One thing to note is that this adds a dependency on the HTMLAgilityPack.
Any questions then let me know, I'll hopefully do another PR later in the week for supporting pulling in Document/Templates from other sources.
Cheers,
Malcolm