Added support for HTML fragments in Document Assembler Content tag #86

MalcolmJohnston · 2024-11-11T13:37:13Z

Hello Sergey,

Hope you are well and thanks for merging my previous PRs.

This PR adds support for using HTML fragments with the Content tag in Document Assembler. The supported inline tags are:

b/strong
i/em
u
a
br

Both div and p tags are supported block level tags, any other HTML tag will simply be ignored. Tags can be combined so for example the following works:

<a href="bbc.co.uk">BBC</a>

I have added a new test, and a tutorial to the docs site.

This resolves #70 in my opinion although as noted in the Tutorial there are some other tags that we could potentially add support for.

One thing to note is that this adds a dependency on the HTMLAgilityPack.

Any questions then let me know, I'll hopefully do another PR later in the week for supporting pulling in Document/Templates from other sources.

Cheers,
Malcolm

… and a Tutorial under docs.

Updated DA Multiline content test because the HtmlConverter creates paragraphs rather than inserting soft line breaks.

…nts in a Content select (joins them with a new line). Ran csharpier on all files. Added HtmlAgilityPack to paket references.

…t rather than using paket initially!

sergey-tihon · 2024-11-17T18:55:40Z

Clippit/Word/Assembler/HtmlConverter.cs

+
+namespace Clippit.Word.Assembler
+{
+    internal static class HtmlConverter


PowerTools already has an HTML to Wml converter that supports a wide range of HTML tags, css styles, fonts, ol/ui lists, etc.

https://github.com/sergey-tihon/Clippit/blob/master/Clippit/Html/HtmlToWmlConverterCore.cs#L184-L190

It seems like it currently builds the whole document, but maybe we can refactor it and reuse it from the document assembler?

@MalcolmJohnston what do you think?

@sergey-tihon thanks for coming back to me.

When I first added the support for simple HTML in DocumentAssembler I did consider whether to use the HtmlToWmlConverter.

The reason I didn't go with that approach was partially, because I had a hard time grasping all of the HTML to Wml Converter code, but also because I think the use-cases are different.

As you say currently the Html to Wml Convert is designed for entire documents, and it has a lot of clever functionality for dealing with styles and fonts so that you get a high fidelity conversion.

With the Document Assembler use case I think that we do not need that, and actually it could be retrograde. The reason is that with Document Assembler we are typically dealing with highly structured documents with specific document styles. So we wouldn't want the fonts and styles to be altered in the way that Html to Wml Converter does.

If we could find a way to bring HtmlConverter.cs more inline with Html to Wml Converter though then I could see the benefit of that.

Not sure that helps a lot, but some thoughts :)

I do not mind keeping two implementations for a while (at least till we refactor the old one and make it universal and usable for both use cases)

However, I do like that we need to add a dependency on HtmlAgilityPack just for this simpler implementation. As a nugget package author, we need to depend on as low a version of HtmlAgilityPack as possible to minimize the change of version conflict with another package that has a direct or transitive dependency on this package. Same time we need to keep track of all security vulnerabilities inside HtmlAgilityPack and bump versions when they are fixed. Can we leave with something like XElement xhtml for a while (as it done inside HtmlToWmlConverter)?

The reason for using HtmlAgilityPack is to handle HTML to XML conversion. The use-case for me is that we are sometimes pulling through HTML that is not well formed or at least it is not XHTML so that is the reason for adding HtmlAgilityPack.

I understand your concerns about adding dependencies though, I guess the argument can be made that the caller should ensure that the HTML is XHTML compliant when using this library, and so we only support XHTML in Clippit?

That would remove the requirement for HtmlAgilityPack being a dependency. It would also make the conversion of my existing HtmlConverter code to something closer to HtmlToWmlConvertorCore simpler in future.

Let me know your thoughts.

I guess the argument can be made that the caller should ensure that the HTML is XHTML compliant when using this library, so we only support XHTML in Clippit?

Yes, exactly, the caller will have a choice:

When HTML is known during compile time, it may be easy enough to guarantee that it is XHTML compliant and does not pay runtime costs.

Add extra dependency to HtmlAgilityPack or equivalent, and support non XHTML compliant HTML.

Ok so the action here is to remove HtmlAgilityPack and update the implementation accordingly?

yes, thank you

Hello Sergey,

Apologies for the delay in getting back to you. I had another look at this and decided to have a go at refactoring the HtmlConverter code so that it is closer to HtmlToWmlConverterCore.

This is not a complete refactoring as there is a fair amount of work to do, however it does start us along the journey.

What I have done is to create a trimmed down version of the Transform method from HtmlConverterCore in HtmlConverter. To facilitate this I have made a number of methods and classes internal rather than private.

The main difference between the two methods are:

We do not support all elements in the HtmlConverter version

We do not apply paragraph level styles in the HtmlConverter version (because these should be taken from the template)

The implementation as it is adds support for sub, sup and s HTML elements which is an improvement on before. I also think it provides a pathway towards additional refactoring and further support for HTML tags in HtmlConverter.

I am not intending to do these now, but some of the next steps that would be required are:

Decide on a mechansim for not applying Paragraph properties in the Transform method (use existing settings class?)

Understand and add support for OL/LI tags (requires some pre-processing based on HtmlToWmlConverterCore

Understand how CSS application works and add to HtmlConverter

Refactor the Transform method so that it can take a "specification" of the elements that are supported, this could also be considered as an extensibility mechanism i.e. callers could override the default behaviour if needed

This should bring us a lot closer to a common code base for HTML processing which works both for entire documents and for document fragments as required by DocumentAssembler.

Let me know your thoughts.

As an aside did you see that https://github.com/OpenXmlDev/Open-Xml-PowerTools is looking for a new maintainer? I thought it might be worth approaching them and suggesting they forward people onto Clippit instead? As a lot of work has already been done in this repository.

Cheers,
Malcolm

Added tests for HTML content (not supported) and non-well formed XHTML. Looking at possibility of replacing XMLReader approach with something closer to HtmlToWmlConverter.

Clippit/Word/Assembler/HtmlConverter.cs

…-HTML-support-via-HtmlToWmlConverter Feature/document assembler html support via html to wml converter

sergey-tihon · 2024-12-08T19:12:56Z

Clippit/Word/DocumentAssembler.cs

@@ -1003,7 +965,7 @@ OpenXmlPart part
            // assign unique image and paragraph ids. Image id is document property Id  (wp:docPr)
            // and relationship id is rId. Their numbering is different.
            const string imageId = InvalidImageId; // Ids will be replaced with real ones later, after transform is done
-            var relationshipId = GetNextImageRelationshipId(part);
+            var relationshipId = Relationships.GetNewRelationshipId();


@MalcolmJohnston just wondering why you decided to use random generated id instead of deterministic/sequential one?

sergey-tihon · 2024-12-08T19:39:47Z

Released in NuGet Gallery | Clippit 2.4.0

MalcolmJohnston and others added 6 commits November 11, 2024 14:30

Added inline HTML support in Document Assembler with associated tests…

69704e9

… and a Tutorial under docs.

Ran csharpier!

4199871

Updated DA Multiline content test because the HtmlConverter creates paragraphs rather than inserting soft line breaks.

Fixed behaviour of HtmlConverter to allow returning of multiple eleme…

562f5cd

…nts in a Content select (joins them with a new line). Ran csharpier on all files. Added HtmlAgilityPack to paket references.

Fixed issue with duplicate package references as I had added via nuge…

e7198e6

…t rather than using paket initially!

feat: merge master

676acd7

fix: no more paket

ae73d4f

sergey-tihon reviewed Nov 17, 2024

View reviewed changes

MalcolmJohnston added 2 commits November 19, 2024 17:42

Removed HTML Agility Pack.

2605f71

Added tests for HTML content (not supported) and non-well formed XHTML. Looking at possibility of replacing XMLReader approach with something closer to HtmlToWmlConverter.

Work in progress. Builds!

9dcc093

sergey-tihon reviewed Nov 22, 2024

View reviewed changes

Clippit/Word/Assembler/HtmlConverter.cs Outdated Show resolved Hide resolved

MalcolmJohnston and others added 3 commits November 26, 2024 17:24

Updated to use HtmlToWmlConverter type approach.

33499a5

Merge pull request #1 from MalcolmJohnston/feature/document-assembler…

5c013d8

…-HTML-support-via-HtmlToWmlConverter Feature/document assembler html support via html to wml converter

Csharpier fixes.

761ec97

sergey-tihon reviewed Dec 8, 2024

View reviewed changes

sergey-tihon merged commit ab3be1f into sergey-tihon:master Dec 8, 2024
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added support for HTML fragments in Document Assembler Content tag #86

Added support for HTML fragments in Document Assembler Content tag #86

MalcolmJohnston commented Nov 11, 2024 •

edited

Loading

sergey-tihon Nov 17, 2024 •

edited

Loading

MalcolmJohnston Nov 17, 2024

sergey-tihon Nov 17, 2024

MalcolmJohnston Nov 18, 2024 •

edited

Loading

sergey-tihon Nov 18, 2024

MalcolmJohnston Nov 18, 2024

sergey-tihon Nov 18, 2024

MalcolmJohnston Nov 26, 2024

sergey-tihon Dec 8, 2024

sergey-tihon commented Dec 8, 2024

Added support for HTML fragments in Document Assembler Content tag #86

Added support for HTML fragments in Document Assembler Content tag #86

Conversation

MalcolmJohnston commented Nov 11, 2024 • edited Loading

sergey-tihon Nov 17, 2024 • edited Loading

Choose a reason for hiding this comment

MalcolmJohnston Nov 17, 2024

Choose a reason for hiding this comment

sergey-tihon Nov 17, 2024

Choose a reason for hiding this comment

MalcolmJohnston Nov 18, 2024 • edited Loading

Choose a reason for hiding this comment

sergey-tihon Nov 18, 2024

Choose a reason for hiding this comment

MalcolmJohnston Nov 18, 2024

Choose a reason for hiding this comment

sergey-tihon Nov 18, 2024

Choose a reason for hiding this comment

MalcolmJohnston Nov 26, 2024

Choose a reason for hiding this comment

sergey-tihon Dec 8, 2024

Choose a reason for hiding this comment

sergey-tihon commented Dec 8, 2024

MalcolmJohnston commented Nov 11, 2024 •

edited

Loading

sergey-tihon Nov 17, 2024 •

edited

Loading

MalcolmJohnston Nov 18, 2024 •

edited

Loading