Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for HTML fragments in Document Assembler Content tag #86

Conversation

MalcolmJohnston
Copy link
Contributor

@MalcolmJohnston MalcolmJohnston commented Nov 11, 2024

Hello Sergey,

Hope you are well and thanks for merging my previous PRs.

This PR adds support for using HTML fragments with the Content tag in Document Assembler. The supported inline tags are:

  • b/strong
  • i/em
  • u
  • a
  • br

Both div and p tags are supported block level tags, any other HTML tag will simply be ignored. Tags can be combined so for example the following works:

<a href="bbc.co.uk"><b><i><u>BBC</u></i></b></a>

I have added a new test, and a tutorial to the docs site.

This resolves #70 in my opinion although as noted in the Tutorial there are some other tags that we could potentially add support for.

One thing to note is that this adds a dependency on the HTMLAgilityPack.

Any questions then let me know, I'll hopefully do another PR later in the week for supporting pulling in Document/Templates from other sources.

Cheers,
Malcolm

MalcolmJohnston and others added 6 commits November 11, 2024 14:30
Updated DA Multiline content test because the HtmlConverter creates paragraphs rather than inserting soft line breaks.
…nts in a Content select (joins them with a new line).

Ran csharpier on all files.
Added HtmlAgilityPack to paket references.

namespace Clippit.Word.Assembler
{
internal static class HtmlConverter
Copy link
Owner

@sergey-tihon sergey-tihon Nov 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PowerTools already has an HTML to Wml converter that supports a wide range of HTML tags, css styles, fonts, ol/ui lists, etc.

https://github.com/sergey-tihon/Clippit/blob/master/Clippit/Html/HtmlToWmlConverterCore.cs#L184-L190

It seems like it currently builds the whole document, but maybe we can refactor it and reuse it from the document assembler?

@MalcolmJohnston what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sergey-tihon thanks for coming back to me.

When I first added the support for simple HTML in DocumentAssembler I did consider whether to use the HtmlToWmlConverter.

The reason I didn't go with that approach was partially, because I had a hard time grasping all of the HTML to Wml Converter code, but also because I think the use-cases are different.

As you say currently the Html to Wml Convert is designed for entire documents, and it has a lot of clever functionality for dealing with styles and fonts so that you get a high fidelity conversion.

With the Document Assembler use case I think that we do not need that, and actually it could be retrograde. The reason is that with Document Assembler we are typically dealing with highly structured documents with specific document styles. So we wouldn't want the fonts and styles to be altered in the way that Html to Wml Converter does.

If we could find a way to bring HtmlConverter.cs more inline with Html to Wml Converter though then I could see the benefit of that.

Not sure that helps a lot, but some thoughts :)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not mind keeping two implementations for a while (at least till we refactor the old one and make it universal and usable for both use cases)

However, I do like that we need to add a dependency on HtmlAgilityPack just for this simpler implementation. As a nugget package author, we need to depend on as low a version of HtmlAgilityPack as possible to minimize the change of version conflict with another package that has a direct or transitive dependency on this package. Same time we need to keep track of all security vulnerabilities inside HtmlAgilityPack and bump versions when they are fixed. Can we leave with something like XElement xhtml for a while (as it done inside HtmlToWmlConverter)?

Copy link
Contributor Author

@MalcolmJohnston MalcolmJohnston Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for using HtmlAgilityPack is to handle HTML to XML conversion. The use-case for me is that we are sometimes pulling through HTML that is not well formed or at least it is not XHTML so that is the reason for adding HtmlAgilityPack.

I understand your concerns about adding dependencies though, I guess the argument can be made that the caller should ensure that the HTML is XHTML compliant when using this library, and so we only support XHTML in Clippit?

That would remove the requirement for HtmlAgilityPack being a dependency. It would also make the conversion of my existing HtmlConverter code to something closer to HtmlToWmlConvertorCore simpler in future.

Let me know your thoughts.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the argument can be made that the caller should ensure that the HTML is XHTML compliant when using this library, so we only support XHTML in Clippit?

Yes, exactly, the caller will have a choice:

  1. When HTML is known during compile time, it may be easy enough to guarantee that it is XHTML compliant and does not pay runtime costs.
  2. Add extra dependency to HtmlAgilityPack or equivalent, and support non XHTML compliant HTML.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so the action here is to remove HtmlAgilityPack and update the implementation accordingly?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, thank you

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello Sergey,

Apologies for the delay in getting back to you. I had another look at this and decided to have a go at refactoring the HtmlConverter code so that it is closer to HtmlToWmlConverterCore.

This is not a complete refactoring as there is a fair amount of work to do, however it does start us along the journey.

What I have done is to create a trimmed down version of the Transform method from HtmlConverterCore in HtmlConverter. To facilitate this I have made a number of methods and classes internal rather than private.

The main difference between the two methods are:

  1. We do not support all elements in the HtmlConverter version
  2. We do not apply paragraph level styles in the HtmlConverter version (because these should be taken from the template)

The implementation as it is adds support for sub, sup and s HTML elements which is an improvement on before. I also think it provides a pathway towards additional refactoring and further support for HTML tags in HtmlConverter.

I am not intending to do these now, but some of the next steps that would be required are:

  1. Decide on a mechansim for not applying Paragraph properties in the Transform method (use existing settings class?)
  2. Understand and add support for OL/LI tags (requires some pre-processing based on HtmlToWmlConverterCore
  3. Understand how CSS application works and add to HtmlConverter
  4. Refactor the Transform method so that it can take a "specification" of the elements that are supported, this could also be considered as an extensibility mechanism i.e. callers could override the default behaviour if needed

This should bring us a lot closer to a common code base for HTML processing which works both for entire documents and for document fragments as required by DocumentAssembler.

Let me know your thoughts.

As an aside did you see that https://github.com/OpenXmlDev/Open-Xml-PowerTools is looking for a new maintainer? I thought it might be worth approaching them and suggesting they forward people onto Clippit instead? As a lot of work has already been done in this repository.

Cheers,
Malcolm

Added tests for HTML content (not supported) and non-well formed XHTML.
Looking at possibility of replacing XMLReader approach with something closer to HtmlToWmlConverter.
MalcolmJohnston and others added 3 commits November 26, 2024 17:24
…-HTML-support-via-HtmlToWmlConverter

Feature/document assembler html support via html to wml converter
@@ -1003,7 +965,7 @@ OpenXmlPart part
// assign unique image and paragraph ids. Image id is document property Id (wp:docPr)
// and relationship id is rId. Their numbering is different.
const string imageId = InvalidImageId; // Ids will be replaced with real ones later, after transform is done
var relationshipId = GetNextImageRelationshipId(part);
var relationshipId = Relationships.GetNewRelationshipId();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MalcolmJohnston just wondering why you decided to use random generated id instead of deterministic/sequential one?

@sergey-tihon sergey-tihon merged commit ab3be1f into sergey-tihon:master Dec 8, 2024
4 checks passed
@sergey-tihon
Copy link
Owner

Released in NuGet Gallery | Clippit 2.4.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DocumentAssembler: Text formatting?
2 participants