Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AMP_DOM_Document & meta tag sanitizer #3758

Merged
merged 67 commits into from
Dec 19, 2019
Merged
Show file tree
Hide file tree
Changes from 8 commits
Commits
Show all changes
67 commits
Select commit Hold shift + click to select a range
e2951f5
Add initial test for meta tag sanitization
schlessera Nov 15, 2019
395169d
Add initial implementation of meta tag sanitizer
schlessera Nov 15, 2019
a569500
Satisfy PHPCS
schlessera Nov 15, 2019
043187a
Accept uppercase UTF-8 charset
schlessera Nov 15, 2019
847c753
Improve tests
schlessera Nov 15, 2019
834f26c
Extract more code out of AMP_Theme_Support
schlessera Nov 15, 2019
1ceb739
Get rid of viewport references in AMP_Theme_Support
schlessera Nov 15, 2019
ab1b29a
Add additional scripts support
schlessera Nov 18, 2019
97d6207
Refactor sanitizer to reduce DOM queries and manipulations
schlessera Nov 18, 2019
e68f592
Fix PHPCS issues
schlessera Nov 18, 2019
d726fee
Remove meta tag handling from AMP_Theme_Support
schlessera Nov 19, 2019
e388a22
Add and test amp-script-src handling
schlessera Nov 19, 2019
71780f0
Fix PHPCS issues
schlessera Nov 19, 2019
7192d67
Modify broken test in AMP_Theme_Support
schlessera Nov 19, 2019
7b57648
Apply suggestions from code review
schlessera Nov 20, 2019
c36eff4
Add explanatory comment to reordering
schlessera Nov 20, 2019
8efc564
Remove pass-by-reference on $dom
schlessera Nov 20, 2019
4e952d9
Add AMP_DOM_Document abstraction
schlessera Nov 22, 2019
7a5ca46
Keep meta sanitizer at the end of the sanitisation run
schlessera Nov 22, 2019
9267e8e
Fix sanitizer ordering test
schlessera Nov 22, 2019
1e50685
Satisfy PHPCS
schlessera Nov 22, 2019
ed93ad6
Wrap mb_* in function_exists()
schlessera Nov 25, 2019
c98e7e5
Fix DOMXPath type case mismatch
schlessera Nov 25, 2019
51a2002
Remove useless setting of encoding property
schlessera Nov 25, 2019
55b9877
Add safeguard around removing http-equiv charset
schlessera Nov 25, 2019
2bdf1b9
Adapt menu toggle test to ignore head
schlessera Nov 25, 2019
cf34477
Harden document structure parsing
schlessera Nov 25, 2019
a9f1a04
Converting automatically to UTF-8 seems to work
schlessera Nov 27, 2019
fdd72cc
Simplify meta sanitizer
schlessera Nov 27, 2019
9ba845c
Simplify meta sanitizer tests
schlessera Nov 27, 2019
c7ae4c3
Use convenience of extended DOM document
schlessera Nov 27, 2019
71a3e1f
Move theme support and dom utils to new dom document
schlessera Nov 27, 2019
788ee17
Move DOMElementList and AMP_DOM_Document into Dom subnamespace
schlessera Dec 5, 2019
66efba6
Move remaining special DOM handling over to Dom\Document and use it e…
schlessera Dec 5, 2019
c39e1a3
Fix wrong replacement in mustache placeholders
schlessera Dec 5, 2019
fcf7a14
Fix wrong replacement in self-closing tags logic
schlessera Dec 5, 2019
2d280e7
Fix structure normalization in DOM space
schlessera Dec 5, 2019
189b7ca
Fix Twitter embed handler tests
schlessera Dec 5, 2019
442fb11
Normalize into data-amp-bind notation (as it already was before)
schlessera Dec 5, 2019
efa0bed
Move and fix amp-bind conversion test
schlessera Dec 6, 2019
a9c4d8e
Refactor structure normalization to avoid excessive backtracking
schlessera Dec 9, 2019
b50fd79
Fix bug in script reordering
schlessera Dec 9, 2019
714b762
Adjust alignment
schlessera Dec 9, 2019
1eec1f4
Fix alignment in twitter embed tests after rebase
schlessera Dec 9, 2019
9088fdc
Enforce usage of Dom\Document and deprecated some DOM_Utils helpers
schlessera Dec 11, 2019
72ae0c6
Provide a way to retrieve the Dom\Document for a given node
schlessera Dec 17, 2019
18e35d2
Remove Twitter shortcode tests, [tweet] handling was moved to Jetpack
schlessera Dec 17, 2019
3bd83a4
Enforce consistent charset meta tag
schlessera Dec 17, 2019
3bdc795
Make PHPCS happy
schlessera Dec 17, 2019
f4c1013
Fix tests broken from rebase again
schlessera Dec 18, 2019
9a57bc2
Avoid cloning the dom in test
schlessera Dec 18, 2019
19334a0
Remove unneeded test
schlessera Dec 18, 2019
2010c49
Fix reference to xpath_from_css_selector method
westonruter Dec 18, 2019
01848c7
Eliminate usage of deprecated AMP_DOM_Utils::AMP_BIND_DATA_ATTR_PREFIX
westonruter Dec 18, 2019
e763fe3
Harden type checking for XPath queries
westonruter Dec 18, 2019
cc4badb
Improve phpdoc
westonruter Dec 18, 2019
8cbe6d6
Remove unused DEFAULT_ARGS from Meta sanitizer; add phpdoc and todo c…
westonruter Dec 18, 2019
fbd790f
Apply suggestions from code review
schlessera Dec 19, 2019
41094b4
Move is_valid_head_node() into Dom\Document
schlessera Dec 19, 2019
6ff286c
Return both content & encoding via array in detect_and_strip_encoding()
schlessera Dec 19, 2019
e94e3ca
Optimize sanitize_encoding()
schlessera Dec 19, 2019
c5415d2
Actually use charset in document test
schlessera Dec 19, 2019
7c037a9
Always use imported relative class name Document in @covers annotations
schlessera Dec 19, 2019
8abb927
Reset internal optimizations on clone
schlessera Dec 19, 2019
a214a25
Avoid depending on AMP_DOM_Utils in Dom\Document
schlessera Dec 19, 2019
46af79f
Remove double line break
schlessera Dec 19, 2019
c739df0
Make reset() private
schlessera Dec 19, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion includes/sanitizers/class-amp-script-sanitizer.php
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ public function sanitize() {
$fragment->appendChild( $this->dom->createComment( 'noscript' ) );
while ( $noscript->firstChild ) {
if ( $is_inside_head_el && ! $must_move_to_body ) {
$must_move_to_body = ! AMP_DOM_Utils::is_valid_head_node( $noscript->firstChild );
$must_move_to_body = ! $this->dom->is_valid_head_node( $noscript->firstChild );
}
$fragment->appendChild( $noscript->firstChild );
}
Expand Down
2 changes: 1 addition & 1 deletion includes/sanitizers/class-amp-style-sanitizer.php
Original file line number Diff line number Diff line change
Expand Up @@ -2795,7 +2795,7 @@ private function remove_admin_bar_if_css_excluded() {
if ( ! $included ) {
// Remove admin-bar class from body element.
// @todo It would be nice if any style rules which refer to .admin-bar could also be removed, but this would mean retroactively going back over the CSS again and re-shaking it.
if ( $this->dom->body instanceof DOMElement && $this->dom->body->hasAttribute( 'class' ) ) {
if ( $this->dom->body->hasAttribute( 'class' ) ) {
$this->dom->body->setAttribute(
'class',
preg_replace( '/(^|\s)admin-bar(\s|$)/', ' ', $this->dom->body->getAttribute( 'class' ) )
Expand Down
36 changes: 6 additions & 30 deletions includes/utils/class-amp-dom-utils.php
Original file line number Diff line number Diff line change
Expand Up @@ -47,39 +47,19 @@ class AMP_DOM_Utils {
*/
const HTML_BODY_CONTENTS_PATTERN = '#^.*?<body.*?>(.*)</body>.*?$#si';

/**
* List of elements allowed in head.
*
* @link https://github.com/ampproject/amphtml/blob/445d6e3be8a5063e2738c6f90fdcd57f2b6208be/validator/engine/htmlparser.js#L83-L100
* @link https://www.w3.org/TR/html5/document-metadata.html
*
* @todo Turn into const array once PHP minimum is bumped to 5.6+.
*
* @var array
*/
private static $elements_allowed_in_head = [
'title',
'base',
'link',
'meta',
'style',
'noscript',
'script',
];

/**
* Return a valid Dom\Document representing HTML document passed as a parameter.
*
* @since 0.7
* @see AMP_DOM_Utils::get_content_from_dom_node()
* @deprecated Use Dom\Document::from_html( $html, $encoding ) instead.
* @deprecated Use Amp\AmpWP\Dom\Document::from_html( $html, $encoding ) instead.
*
* @param string $document Valid HTML document to be represented by a Dom\Document.
* @param string $encoding Optional. Encoding to use for the content.
* @return Document|false Returns Dom\Document, or false if conversion failed.
*/
public static function get_dom( $document, $encoding = null ) {
_deprecated_function( __METHOD__, '1.5.0', 'Dom\Document::from_html()' );
_deprecated_function( __METHOD__, '1.5.0', 'Amp\AmpWP\Dom\Document::from_html()' );
return Document::from_html( $document, $encoding );
}

Expand All @@ -88,18 +68,14 @@ public static function get_dom( $document, $encoding = null ) {
*
* @link https://github.com/ampproject/amphtml/blob/445d6e3be8a5063e2738c6f90fdcd57f2b6208be/validator/engine/htmlparser.js#L83-L100
* @link https://www.w3.org/TR/html5/document-metadata.html
* @deprecated Use Amp\AmpWP\Dom\Document->is_valid_head_node() instead.
*
* @param DOMNode $node Node.
* @return bool Whether valid head node.
*/
public static function is_valid_head_node( DOMNode $node ) {
return (
( $node instanceof DOMElement && in_array( $node->nodeName, self::$elements_allowed_in_head, true ) )
||
( $node instanceof DOMText && preg_match( '/^\s*$/', $node->nodeValue ) ) // Whitespace text nodes are OK.
||
$node instanceof DOMComment
);
_deprecated_function( __METHOD__, '1.5.0', 'Amp\AmpWP\Dom\Document->is_valid_head_node()' );
return Document::from_node( $node )->is_valid_head_node( $node );
}

/**
Expand Down Expand Up @@ -219,7 +195,7 @@ public static function get_content_from_dom( Document $dom ) {
* @return string Returns the HTML content represented in the DOMNode
*/
public static function get_content_from_dom_node( Document $dom, $node ) {
_deprecated_function( __METHOD__, '1.5.0', 'Dom\Document->saveHtml( $node )' );
_deprecated_function( __METHOD__, '1.5.0', 'Amp\AmpWP\Dom\Document::saveHtml()' );
return $dom->saveHTML( $node );
}

Expand Down
111 changes: 88 additions & 23 deletions src/Dom/Document.php
Original file line number Diff line number Diff line change
Expand Up @@ -7,10 +7,11 @@

namespace Amp\AmpWP\Dom;

use AMP_DOM_Utils;
use DOMComment;
use DOMDocument;
use DOMElement;
use DOMNode;
use DOMText;
use DOMXPath;

/**
Expand Down Expand Up @@ -119,7 +120,6 @@ final class Document extends DOMDocument {
const HTML_FIND_TAG_WITH_ATTRIBUTE_PATTERN = '/<%1$s [^>]*?\s*%2$s=[^>]*?>[^<]*(?:<\/%1$s>)?/i';
const HTML_EXTRACT_ATTRIBUTE_VALUE_PATTERN = '/%s=(?:([\'"])(?<full>.*)?\1|(?<partial>[^ \'";]+))/';


// Tags constants used throughout.
const TAG_HEAD = 'head';
const TAG_BODY = 'body';
Expand Down Expand Up @@ -181,6 +181,26 @@ final class Document extends DOMDocument {
'wbr',
];

/**
* List of elements allowed in head.
*
* @link https://github.com/ampproject/amphtml/blob/445d6e3be8a5063e2738c6f90fdcd57f2b6208be/validator/engine/htmlparser.js#L83-L100
* @link https://www.w3.org/TR/html5/document-metadata.html
*
* @todo Turn into const array once PHP minimum is bumped to 5.6+.
*
* @var string[]
*/
private static $elements_allowed_in_head = [
'title',
'base',
'link',
'meta',
'style',
'noscript',
'script',
];

/**
* Store the placeholder comments that were generated to replace <noscript> elements.
*
Expand Down Expand Up @@ -258,6 +278,22 @@ public static function from_node( DOMNode &$node ) {
return $dom;
}

/**
* Reset the internal optimizations of the Document object.
*
* This might be needed if you are doing an operation that causes the cached
* nodes and XPath objects to point to the wrong document.
*
* @return self Reset version of the Document object.
*/
public function reset() {
schlessera marked this conversation as resolved.
Show resolved Hide resolved
// Drop references to old DOM document.
unset( $this->xpath, $this->head, $this->body );

// Reference of the document itself doesn't change here, but might need to change in the future.
return $this;
}

/**
* Load HTML from a string.
*
Expand All @@ -268,15 +304,14 @@ public static function from_node( DOMNode &$node ) {
* @return bool true on success or false on failure.
*/
public function loadHTML( $source, $options = 0 ) {
// Drop references to old DOM document.
unset( $this->xpath, $this->head, $this->body );
$this->reset();

$source = $this->convert_amp_bind_attributes( $source );
$source = $this->replace_self_closing_tags( $source );
$source = $this->normalize_document_structure( $source );
$source = $this->maybe_replace_noscript_elements( $source );

$this->original_encoding = $this->detect_and_strip_encoding( $source );
list( $source, $this->original_encoding ) = $this->detect_and_strip_encoding( $source );

if ( self::AMP_ENCODING !== strtolower( $this->original_encoding ) ) {
$source = $this->adapt_encoding( $source );
Expand Down Expand Up @@ -306,7 +341,8 @@ public function loadHTML( $source, $options = 0 ) {
}

// Add the required utf-8 meta charset tag.
$charset = AMP_DOM_Utils::create_node( $this, 'meta', [ 'charset' => self::AMP_ENCODING ] );
$charset = $this->createElement( 'meta' );
$charset->setAttribute( 'charset', self::AMP_ENCODING );
$this->head->insertBefore( $charset, $this->head->firstChild );

// Do some further clean-up.
Expand All @@ -328,15 +364,9 @@ public function saveHTML( DOMNode $node = null ) {

// Force-add http-equiv charset to make DOMDocument behave as it should.
// See: http://php.net/manual/en/domdocument.loadhtml.php#78243.
$charset = AMP_DOM_Utils::create_node(
$this,
'meta',
[
'http-equiv' => self::HTML_HTTP_EQUIV_VALUE,
'content' => self::HTML_HTTP_EQUIV_CONTENT_VALUE,
]
);

$charset = $this->createElement( 'meta' );
$charset->setAttribute( 'http-equiv', self::HTML_HTTP_EQUIV_VALUE );
$charset->setAttribute( 'content', self::HTML_HTTP_EQUIV_CONTENT_VALUE );
$this->head->insertBefore( $charset, $this->head->firstChild );

if ( null === $node || PHP_VERSION_ID >= 70300 ) {
Expand Down Expand Up @@ -479,7 +509,7 @@ private function move_invalid_head_nodes_to_body() {
$node = $this->head->lastChild;
while ( $node ) {
$next_sibling = $node->previousSibling;
if ( ! AMP_DOM_Utils::is_valid_head_node( $node ) ) {
if ( ! $this->is_valid_head_node( $node ) ) {
$this->body->insertBefore( $this->head->removeChild( $node ), $this->body->firstChild );
}
$node = $next_sibling;
Expand Down Expand Up @@ -661,7 +691,7 @@ private function convert_amp_bind_attributes( $html ) {
$html
);

/**
/*
* If the regex engine incurred an error during processing, for example exceeding the backtrack
* limit, $converted will be null. In this case we return the originally passed document to allow
* DOMDocument to attempt to load it. If the AMP HTML doesn't make use of amp-bind or similar
Expand Down Expand Up @@ -706,9 +736,14 @@ private function adapt_encoding( $source ) {
* Detect the encoding of the document.
*
* @param string $content Content of which to detect the encoding.
* @return string|false Detected encoding of the document, or false if none.
* @return array {
* Detected encoding of the document, or false if none.
*
* @type string $content Potentially modified content.
* @type string|false $encoding Encoding of the content, or false if not detected.
* }
*/
private function detect_and_strip_encoding( &$content ) {
private function detect_and_strip_encoding( $content ) {
$encoding = self::UNKNOWN_ENCODING;

// Check for HTML 4 http-equiv meta tags.
Expand All @@ -734,7 +769,7 @@ private function detect_and_strip_encoding( &$content ) {
$content = str_replace( $charset_tag, '', $content );
}

return $encoding;
return [ $content, $encoding ];
}

/**
Expand Down Expand Up @@ -828,11 +863,13 @@ private function sanitize_encoding( $encoding ) {
$known_encodings = array_map( 'strtolower', mb_list_encodings() );
}

if ( array_key_exists( strtolower( $encoding ), self::$encoding_map ) ) {
$encoding = self::$encoding_map[ strtolower( $encoding ) ];
$lc_encoding = strtolower( $encoding );

if ( isset( self::$encoding_map[ $lc_encoding ] ) ) {
$encoding = self::$encoding_map[ $lc_encoding ];
}

if ( ! in_array( strtolower( $encoding ), $known_encodings, true ) ) {
if ( ! in_array( $lc_encoding, $known_encodings, true ) ) {
return self::UNKNOWN_ENCODING;
}

Expand Down Expand Up @@ -932,6 +969,25 @@ private function get_mustache_tag_placeholders() {
return $placeholders;
}

/**
* Determine whether a node can be in the head.
*
* @link https://github.com/ampproject/amphtml/blob/445d6e3be8a5063e2738c6f90fdcd57f2b6208be/validator/engine/htmlparser.js#L83-L100
* @link https://www.w3.org/TR/html5/document-metadata.html
*
* @param DOMNode $node Node.
* @return bool Whether valid head node.
*/
public function is_valid_head_node( DOMNode $node ) {
return (
( $node instanceof DOMElement && in_array( $node->nodeName, self::$elements_allowed_in_head, true ) )
||
( $node instanceof DOMText && preg_match( '/^\s*$/', $node->nodeValue ) ) // Whitespace text nodes are OK.
||
$node instanceof DOMComment
);
}

/**
* Magic getter to implement lazily-created, cached properties for the document.
*
Expand Down Expand Up @@ -965,4 +1021,13 @@ public function __get( $name ) {
trigger_error( self::PROPERTY_GETTER_ERROR_MESSAGE . $name, E_NOTICE ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions,WordPress.Security.EscapeOutput
return null;
}

/**
* Make sure we properly reinitialize on clone.
*
* @return void
*/
public function __clone() {
$this->reset();
}
}
2 changes: 1 addition & 1 deletion tests/php/test-amp-style-sanitizer.php
Original file line number Diff line number Diff line change
Expand Up @@ -2406,7 +2406,7 @@ public function test_prioritized_stylesheets( $html_generator, $assert ) {
$html = $html_generator();

$original_dom = Document::from_html( $html );
$amphtml_dom = Document::from_html( $html );
$amphtml_dom = clone( $original_dom );

$error_codes = [];
$args = [
Expand Down
Loading