Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: ignore unparsed entity logic for predefined xml entities #266

Merged
merged 1 commit into from
May 27, 2024

Conversation

SethFalco
Copy link
Contributor

@SethFalco SethFalco commented Dec 26, 2023

There was a bug in the implementation for resolving entities. Predefined XML entities like <, &, and " should not be interpreted, as this breaks parsing of the document.

For example, see the following XML document (an SVG) and render in the browser:

<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd" [
<!ENTITY entity_reference "<text x='4' y='64'>entity reference</text>">
<!ENTITY escaped_entity_reference "&lt;text&gt;escaped entity reference&lt;/text&gt;">
]>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 300 128">
  <text x="4" y="32">
    &lt;text&gt;escaped literal&lt;/text&gt;
  </text>
  &entity_reference;
  <text x="4" y="96">
    &escaped_entity_reference;
  </text>
</svg>

We should resolve custom entities like entity_reference and escaped_entity_reference. However, not &amp, &lt;, or &#34;, otherwise they get interpreted by the parser instead.

Currently, when parsing this XML document in sax v1.3.0 we get this error:

Error: Invalid character in tag name
Line: 6
Column: 14
Char: &
    at error (/home/seth/Documents/repos/maintainer/svg/svgo/node_modules/sax/lib/sax.js:652:10)
    at strictFail (/home/seth/Documents/repos/maintainer/svg/svgo/node_modules/sax/lib/sax.js:678:7)
    at SAXParser.write (/home/seth/Documents/repos/maintainer/svg/svgo/node_modules/sax/lib/sax.js:1284:17)

Related


Just an aside, if you'd like some help with maintaining sax, I'd love to lend a hand. We've been using this in SVGO, and since joining as a maintainer I've been meaning to move away from our (unmaintained) fork back to upstream. No pressure, and I'll see if I can open other pull requests in time.

If you'd accept it, I think it would be very valuable to have a task in CI that attempts to parse the XML Conformance Test Suites just to ensure no errors are thrown. That also would've caught this bug. ^-^'

https://www.w3.org/XML/Test/

Update

Since opening this pull request, I threw together a script for regression testing.

- name: Regression
  run: |
    wget https://www.w3.org/XML/Test/xmlts20130923.tar.gz
    tar -tf xmlts20130923.tar.gz | grep -E '^xmlconf/xmltest/valid/sa/[0-9]+\.xml$' > filter.txt
    tar -xf xmlts20130923.tar.gz -T filter.txt
    FAILURES=0
    for FILE in $(grep -REL xmlconf/ -e '<!ENTITY\s+(\S+)\s+"')
    do
      if [ $(file -bi $FILE | sed -e "s/.* charset=//") == 'utf-16le' ]; then echo "$(iconv -f utf-16le -t utf-8 $FILE)" > $FILE; fi
      node -e "require('./lib/sax').parser(true).write(require('fs').readFileSync('$FILE', { encoding: 'utf8' })).close();" \
        || echo "$FILE failed" && ((FAILURES+=1))
      node -e "require('./lib/sax').parser(true, { unparsedEntities: true }).write(require('fs').readFileSync('$FILE', { encoding: 'utf8' })).close();" \
        || echo "$FILE failed with { unparsedEntities: true }" && ((FAILURES=+1))
    done
    if [ FAILURES != 0 ]; then exit 1; fi
  • Fetches the conformance test suite from W3C.
  • Extracts only valid standalone XML documents. This task is only to ensure the parser doesn't throw exceptions on valid documents.
  • Skips documents that declare custom entities, since the library doesn't handle that on its own.
  • Re-encode utf-16le files as utf-8 so it's easy to read from Node.js.

This actually helped me identify a problem with my initial solution. I didn't account for the fact that &quot; isn't the only way to insert quotes. One can do &#34; instead, so now I'm comparing against Object.values(XML_ENTITIES) rather than checking if the entity string exists as a property of XML_ENTITIES.

@SethFalco SethFalco changed the title fix: ignore unparsed entity logic for std xml entities fix: ignore unparsed entity logic for predefined xml entities Dec 26, 2023
@SethFalco SethFalco force-pushed the fix-unparsed-entities branch 3 times, most recently from a90d788 to fbe7404 Compare December 27, 2023 02:22
Copy link
Contributor

@HeikoTheissen HeikoTheissen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I hadn't thought of this case when I suggested #260.

@SethFalco
Copy link
Contributor Author

@isaacs Sorry to ping, but I was wondering if you had time to review my three pull requests to this repo? I'd be happy to sponsor $100 via GitHub Sponsors if we can get all three reviewed and hopefully merged/released?

These would all help close issues in SVGO, which depends on sax-js.

@isaacs isaacs merged commit 93a63dd into isaacs:main May 27, 2024
isaacs pushed a commit that referenced this pull request May 27, 2024
@SethFalco SethFalco deleted the fix-unparsed-entities branch May 27, 2024 20:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parse result of entity replacement
3 participants