#417149 - Platinum vs. Classic - Display of Specifications #11
Two issues:

Whitelist sanitizers like Caja strip unknown tags, so the single-tag case will be removed. Blacklist sanitizers will leave it in, but when the content is inserted into the DOM the browser will treat pseudo tags as real tags and they will 'vanish'. So whichever sanitizer we use, the content is effectively removed if we apply sanitization on output.

It is easy to sanitize keyboard input: tags are escaped as they are typed. The hard part is pasted input. How do we sanitize it? If we knew it was plain text we could escape everything very quickly, but more often than not it will be HTML, and you can't simply escape HTML.

Carriage returns are also an issue, particularly for us: even though carriage returns don't exist in HTML, we treat them as if they do and have to try to preserve returns even in HTML content.

The issue is further complicated because we don't distinguish between stored HTML and stored text data. An example is the short and long description in Specifications: the input uses a TextArea control, so it can only be pure text, but on display it runs through the flex formatter and is output as HTML with CRLF converted to BR tags. So if the user enters a pseudo tag in the TextArea it is valid and cannot be escaped (or it will not be editable again), but on display it must be escaped because it is being output as HTML, and we can't possibly post-fix/escape pseudo tags inside HTML.

This is further compounded by the question of what format was used in Classic. If it was text, we can solve the problem by changing the formatting in Platinum for this specific case from html to text; but if it was HTML-formatted in Classic, we can't simply switch the formatting. The use of flex (as opposed to html) supposedly indicates it was HTML-formatted in Classic.

Similarly with content pasted into generic input controls: this is pure text and should never be sanitized.
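The TextArea-to-HTML display path described above (escape entities, then convert carriage returns to BR tags) can be sketched as follows. This is a minimal sketch, not the actual flex formatter; `textToHtml` and `escapeHtml` are illustrative names:

```javascript
// Sketch of the text -> HTML path: escape entities first, then convert
// carriage returns to <br> tags. Not the real flex formatter.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

function textToHtml(text) {
  // Escape first so a pseudo tag like <ProjectName> survives as visible
  // text, then turn CRLF / CR / LF into <br> for HTML display.
  return escapeHtml(text).replace(/\r\n|\r|\n/g, '<br>');
}
```

Done in this order, a pseudo tag typed into a TextArea is displayed literally instead of vanishing.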
Another issue is invalid HTML input, generally caused by cutting and pasting partial tags: the user selects some text in another application, and when copied it includes the opening tag but not the closing tag. If not fixed on input, this will break the application on output display.
The do-it-on-input or do-it-on-output choice is also very important from a performance perspective. On input we can live with, say, a frame's delay while we perform an async sanitization: it's for one user only, and one time only. On output there are generally dozens to hundreds of separate pieces of text to be sanitized at once (say, in a list), and the process gets repeated every time the DataSource is refreshed. On input we know if it's text or HTML; on output we don't. It really can't make sense to do this on output.
So the solution is multipart. Any pasted content should be sanitized before it is inserted into the application. As far as I am aware, the only place RTF content can be pasted into the app is via the RTE (pasting into TextArea or input controls automatically strips all markup). If this were done, no sanitization would be required at runtime, which would be a performance boost.
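A sketch of intercepting paste and cleaning it before insertion, on the assumption that generic controls only ever need plain text (the RTE path would run a full sanitizer instead). `cleanPastedText` and `attachPasteSanitizer` are illustrative names, not existing framework functions:

```javascript
// Sketch: intercept paste, take the plain-text flavour of the clipboard,
// escape it, and insert the cleaned result ourselves.
function cleanPastedText(text) {
  // Generic inputs are pure text, so escaping is all that is needed.
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Wiring (browser only): runs once per paste, for one user, so even a
// slow/async sanitizer is acceptable here, unlike on output.
function attachPasteSanitizer(el) {
  el.addEventListener('paste', function (e) {
    e.preventDefault();
    const text = (e.clipboardData || window.clipboardData).getData('text/plain');
    document.execCommand('insertHTML', false, cleanPastedText(text));
  });
}
```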
But how do we handle historic content that has not been sanitized? Really this needs handling now, and it will be a real pain, but the problem will decline with time. Key to this is being able to distinguish between sanitized and unsanitized content. The only way I can see to do this is to add a hidden flag into sanitized content. This would be some HTML tag that is invisible to the user: either a tag they would never use, or one with a unique attribute.
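One way to implement the hidden flag is a harmless invisible element carrying a unique attribute, prepended to sanitized content. The attribute name here is a made-up placeholder, not an agreed convention:

```javascript
// Sketch: mark content as sanitized with an invisible span carrying a
// unique data attribute. 'data-wmj-sanitized' is a hypothetical name.
const SANITIZED_MARK = '<span data-wmj-sanitized="1" style="display:none"></span>';

function markSanitized(html) {
  // Idempotent: don't double-mark already flagged content.
  return isSanitized(html) ? html : SANITIZED_MARK + html;
}

function isSanitized(html) {
  // Historic, unflagged content returns false and takes the slow path.
  return html.indexOf('data-wmj-sanitized="1"') !== -1;
}
```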
So that was the how; now, what to use for sanitization. The plugin looks like it was abandoned 3 years ago and some important updates are not being made (it is not HTML5-aware); it also doesn't, for example, 'fix' single quotes. Documentation seems almost non-existent, or has been lost over the years and generally points at blank pages, but it mostly does the job, I guess. It is a whitelist-based sanitizer and requires installing ANT, the Java-based task runner, to compile a whitelist. It only supports HTML4, not HTML5; maybe we can work around that with a custom whitelist. It seems most people push it into an iframe or a web worker because of overhead and conflicts.

But really we need to both whitelist certain known tags and blacklist certain known tags, so that we can then allow all other pseudo tags through by escaping them. A simple whitelist approach won't be sufficient: we either need a sanitizer that supports both blacklists and whitelists, or one that whitelists but has a callback to check every node on the fly, so we can control the process.
So to summarise:
Any sanitizer MUST be able to achieve all of the above in order to successfully fix all the issues we currently have.
All sanitizers that work directly with the DOM can only be used for input sanitization; the overhead involved in inserting content into an iframe, then parsing and instantiating it before manipulation can start, rules this approach out for output.

Possible modules:

- https://github.com/ecto/bleach - really it is `he` and `fs` wrapped, but looks very easy to modify for our purposes
- https://github.com/punkave/sanitize-html - supports filter and transformText
- https://github.com/JamesRyanATX/janitor-js - DOM based; use only as base code
- https://github.com/google/caja/blob/master/src/com/google/caja/plugin/html-sanitizer.js - Caja, whitelist only
- https://github.com/dortzur/simple-sanitizer - requires a DOM; would require rewriting to do what we want, but looks straightforward to do so
- https://github.com/gbirke/Sanitize.js - whitelist, but sophisticated options including transformers and filters
Regex simple sanitizer
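The original snippet is missing here, so the following is a minimal sketch of the kind of regex sanitizer discussed: strip script/style blocks and inline event handlers, and escape any tag not on a small whitelist so pseudo tags survive as visible text. The whitelist is illustrative, not our real one:

```javascript
// Sketch of a simple regex sanitizer. The whitelist is illustrative.
const ALLOWED = /^(b|i|u|em|strong|p|br|ul|ol|li|a|pre)$/i;

function regexSanitize(html) {
  return html
    // drop script/style elements and their contents
    .replace(/<(script|style)[^>]*>[\s\S]*?<\/\1>/gi, '')
    // drop inline event handlers like onclick="..."
    .replace(/\son\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi, '')
    // escape any tag whose name is not whitelisted (pseudo tags become text)
    .replace(/<\/?([a-zA-Z][\w-]*)([^>]*)>/g, function (m, name) {
      return ALLOWED.test(name) ? m : m.replace(/</g, '&lt;').replace(/>/g, '&gt;');
    });
}
```

As the discussion above implies, regex sanitization is only a stopgap: it cannot handle malformed or nested markup reliably.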
Most non-DOM sanitizers are built on top of an HTML parser. Caja is based on the old Java SAX parser; most of the rest are derived from John Resig's original htmlParser.
DOM walking/sanitizing code
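The original code is gone here as well; below is a minimal DOM-walking sketch, for input-side use only as argued above. It assumes a browser (or jsdom) environment, and the whitelist is again illustrative:

```javascript
// Sketch of a DOM walker: visit every node depth-first, unwrap disallowed
// elements, and strip event-handler attributes. Browser/jsdom only.
const KEEP = { B: 1, I: 1, EM: 1, STRONG: 1, P: 1, BR: 1, UL: 1, OL: 1, LI: 1, A: 1, PRE: 1 };

function walkSanitize(node) {
  // Copy childNodes first: we mutate the tree while walking it.
  Array.prototype.slice.call(node.childNodes).forEach(function (child) {
    if (child.nodeType === 1) {            // element node
      walkSanitize(child);                 // clean descendants first
      if (!KEEP[child.nodeName]) {
        // Replace a disallowed element with its own (already clean) children.
        while (child.firstChild) node.insertBefore(child.firstChild, child);
        node.removeChild(child);
      } else {
        // Strip on* attributes from kept elements.
        Array.prototype.slice.call(child.attributes).forEach(function (attr) {
          if (/^on/i.test(attr.name)) child.removeAttribute(attr.name);
        });
      }
    }
  });
}
```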
Conclusion: We need to split the sanitization into two phases, input and output.

INPUT

So we use a contenteditable but strip the string before insertion, as this seems to work OK at the moment. The overhead involved will be acceptable for a single instance and a one-time process. The cleaned-up input will be tagged as such.

OUTPUT

For historic content that has not been processed, we have to acknowledge we cannot sanitize it completely at runtime: the overhead is too great given the number of simultaneous instances that need processing, and the work would be repeated on every refresh. We also recognise that six months down the line the input-phase fix will have effectively eliminated the problem. The aim here, then, is to sanitize in a performant but limited way, i.e. no worse than we currently do. This could probably be done by adding a couple of functions to our existing code rather than needing a new module.
Sample tree balancing code
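The sample is missing; one way to balance truncated paste fragments (the unclosed-tag problem described earlier) is a small tag-stack pass. This is a simplified sketch: it ignores comments, CDATA, and attribute values containing `>`, and the void-element list is an assumption:

```javascript
// Sketch: close any tags left open by a partial paste, and drop stray
// closing tags that have no matching opener.
const VOID = /^(br|hr|img|input|meta|link|area|base|col|embed|source|wbr)$/i;

function balanceTags(html) {
  const open = [];                          // stack of open tag names
  let out = '';
  const re = /<(\/?)([a-zA-Z][\w-]*)([^>]*)>|[^<]+|</g;
  let m;
  while ((m = re.exec(html)) !== null) {
    if (m[2] === undefined) { out += m[0]; continue; }   // text or stray '<'
    const closing = m[1] === '/';
    const name = m[2].toLowerCase();
    if (VOID.test(name)) { if (!closing) out += m[0]; continue; }
    if (!closing) { open.push(name); out += m[0]; continue; }
    if (open.length && open[open.length - 1] === name) {
      open.pop(); out += m[0];              // matched closer
    }                                       // else: unmatched closer, drop it
  }
  while (open.length) out += '</' + open.pop() + '>';    // close leftovers
  return out;
}
```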
Sample non-html text with pseudo tags
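The original sample is lost; the following is an illustrative example of the kind of plain text that contains pseudo tags (placeholders and comparisons the user typed, not markup), matching the `< >` symptom in the report below:

```
Please insert the <ClientName> and <ProjectNumber> before printing.
Sizes: W > 10cm and H < 5cm.
```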
Note:

A prefix means don't wrap links; `flex` means the source could potentially be from Classic in HTML format. Critically, the return of this function is always HTML, and it is marked as user content only.
There are 345 TextArea and 47 TextEditor controls. A problem noted above is text that is input via a TextArea and then output as HTML using the flex or html formats. Text can contain things that need encoding for HTML display; it could also be preformatted, containing multiple sequential spaces or tabs. The entities need escaping, and the spacing and formatting need preserving by wrapping in a pre tag. The issue is how to identify these situations.

We could encode on save from a TextArea and decode when putting the content back into the TextArea. This is the most efficient option, as it happens only once, and we also know for sure that the content is text. But how do we know every TextArea output is really meant to be HTML? Given there are 345 text controls but only 47 RTF controls, it seems the vast majority are text input meant for text output, so encode/decode at input is probably not feasible; but who knows?

We could encode at run time on output, but how do we know that one string is pure text that should be encoded and is preformatted, while another string is just HTML and shouldn't be encoded? If the format is set to 'text', that is an indicator that it is pure text and can be safely encoded; but in reality there are formats set as 'html' that are really pure text. We need to support these, as otherwise they are regarded as framework 'bugs', of which this issue is an example. So we scan every string to see if it contains HTML, and since that is problematic we approximate it by looking for closing tags and void elements like img.

If you wrap text with pre to preserve formatting, it won't word-wrap unless the user put in returns: that's the point of preformatted content, it is preformatted by the user. You can add styling to make it wrap, but then maybe you break someone's preformatting. So what to do? Every choice risks being wrong in some case. And should we try to convert 'links' in text to be clickable?
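The "scan every string" approximation described above (look for closing tags and void elements) can be sketched as follows; the exact tag list is an assumption:

```javascript
// Sketch of the heuristic: treat a string as HTML only if it contains a
// closing tag or a void element such as <img> or <br>. Everything else is
// assumed to be plain text that needs encoding (and <pre> wrapping if it
// contains runs of spaces or tabs). Tag list is illustrative.
function looksLikeHtml(s) {
  return /<\/[a-zA-Z][\w-]*\s*>/.test(s) ||               // a closing tag
         /<(img|br|hr|input|meta|link)\b[^>]*\/?>/i.test(s); // a void element
}
```

Pseudo tags like `<ClientName>` and bare comparisons like `W > 10cm` fail both tests, so they take the encode-as-text path instead of vanishing.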
`word-wrap: break-word;` (an alias for `overflow-wrap`): `word-wrap` is the MS proprietary property implemented by most browsers (IE, Chrome, Safari); `overflow-wrap` is the standard implemented by all modern browsers. But note the subtle issue: you need both to get full coverage!
Yet another significant issue: removeFormat
So what can we do?
https://padmin.workamajig.com/platinum/?aid=wjadmin.brain.edit&k=417149
Marlene -
Found this issue today between Classic and Platinum.
This is in Job ID # 1806295
Related to the print specification sections.
See the below (3) screen shots.
Apparently, Platinum does not like the following characters < >
The specification preview screen does not show the words between those characters, but they do show in the editing specification screen. I think this is a glitch in the system and we need it fixed.
Please share and let me know.
Thanks,