Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AMP_DOM_Document & meta tag sanitizer #3758

Merged
merged 67 commits into from
Dec 19, 2019

Conversation

schlessera
Copy link
Collaborator

@schlessera schlessera commented Nov 15, 2019

Summary

This PR adds an abstraction for the DOM document and a new sanitizer AMP_Meta_Sanitizer that sanitizes meta tags in general, but more specifically for now the charset tag.

  • Provide Dom\Document to abstract away charset and document structure requirements.
  • Provide Dom\Document::from_html() to construct from markup (and deprecate AMP_DOM_Utils::get_dom()).
  • Provide Dom\Document::from_node() to construct from a node (and deprecate AMP_DOM_Utils::get_dom_from_content_node()).
  • Enforce basic HTML markup structure, including <head> and <body> elements.
  • Detect existing charset across possible HTML 4 / HTML 5 combinations.
  • Normalize all charsets to HTML 5 <meta charset="<charset>"> format.
  • Ensure a charset is present and add a default one as needed.
  • Detect when the AMP requirement for utf-8 is not met.
  • Convert non-UTF-8 encoding to UTF-8.
  • Move all relevant implementation specifics and compat code from AMP_DOM_Utils to internal Dom\Document methods and deprecate accordingly.

Fixes #3469
Fixes #855

Checklist

  • My pull request is addressing an open issue (please create one otherwise).
  • My code is tested and passes existing tests.
  • My code follows the Engineering Guidelines (updates are often made to the guidelines, check it out periodically).

@googlebot googlebot added the cla: yes Signed the Google CLA label Nov 15, 2019
@schlessera
Copy link
Collaborator Author

Note to self: don't push into a PR you've initiated but haven't created yet, it will force-create an empty one.

@schlessera schlessera changed the title ## Summary Add meta tag sanitizer to deal with http-equiv charsets Nov 15, 2019
@schlessera schlessera changed the title Add meta tag sanitizer to deal with http-equiv charsets [WIP] Add meta tag sanitizer to deal with http-equiv charsets Nov 15, 2019
@schlessera schlessera added Sanitizers Bug Something isn't working labels Nov 15, 2019
@schlessera
Copy link
Collaborator Author

We already have most of the meta tag logic in AMP_Theme_Support, but in a rather haphazard way.

Right now, the above code seems to fix the main bug we were chasing, but it introduces duplicate code.

I'll make some more changes now to extract the meta tag handling out of the AMP_Theme_Support class and put them into the new sanitizer instead. This will improve the quality of the sanitizers for stand-alone use and simplify the procedural mess that happens in AMP_Theme_Support.

@schlessera
Copy link
Collaborator Author

Note: the code in the AMP_Theme_Support class already states that it would be better off in a sanitizer instead:

* @todo All of this might be better placed inside of a sanitizer.

@schlessera schlessera changed the title [WIP] Add meta tag sanitizer to deal with http-equiv charsets [WIP] Add meta tag sanitizer Nov 18, 2019
@schlessera
Copy link
Collaborator Author

What was supposed to be a small refactor here turned out to be more of a mess than expected.

I started with a AMP_Meta_Sanitizer as the gist plugin that solved the initial bug used that, but I'm not happy with the current shape of that sanitizer.

I don't want to spend much more time on refactoring here, as I've already collected higher-level thoughts about this in #3763. I think one principle we should adhere to in future refactorings is to split sanitization (changing from incorrect into correct) from optimization (reordering and similar). The reordering is what not only causes the entire AMP_Theme_Support::ensure_required_markup() to be strictly procedural and hard to untangle, but also causes many other headaches when refactoring into more encapsulated code.

@schlessera
Copy link
Collaborator Author

There's two open issues here:

What do we do if we encounter a charset other than utf-8?

Do we try to convert the entire content encoding? Do we work on a known subset only that we can test? Do we just throw an error and force the user to deal with this before trying to use AMP?

What are the requirements for the viewport?

As already stated in Slack, the existing tests currently expect maximum-scale=1.0 to be sanitized away, while I can't find a reference in the documentation that states this as a requirement.

So, given the input <meta name="viewport" content="maximum-scale=1.0">, should the output be:
a.) <meta name="viewport" content="width=device-width"> (what the test expects)
b.) <meta name="viewport" content="width=device-width,maximum-scale=1.0"> (what I'm currently producing based on how I interpreted the docs.

@westonruter
Copy link
Member

What are the requirements for the viewport?

As already stated in Slack, the existing tests currently expect maximum-scale=1.0 to be sanitized away, while I can't find a reference in the documentation that states this as a requirement.

So, given the input <meta name="viewport" content="maximum-scale=1.0">, should the output be:
a.) <meta name="viewport" content="width=device-width"> (what the test expects)
b.) <meta name="viewport" content="width=device-width,maximum-scale=1.0"> (what I'm currently producing based on how I interpreted the docs.

Here are the allowed properties as well as the required device-width=width:

https://github.com/ampproject/amphtml/blob/e135c18c9460d4834578d6809ae5f77a1e3a231a/validator/validator-main.protoascii#L436-L445

You're right that it <meta name="viewport" content="maximum-scale=1.0"> should ideally be sanitized as <meta name="viewport" content="width=device-width,maximum-scale=1.0">. What is happening now is a bit more clear cut since it detects an invalid meta tag and then removes it (while triggering a validation error) and then later the proper one is put in its place. I suppose this could be done equally well with conversion merging… if we encounter a meta viewport tag we parse the properties, and if it doesn't include width=device-width then we can still trigger the validation error, and if the validation error is sanitized then instead of removing the node we can instead merge the properties.

@schlessera
Copy link
Collaborator Author

The PR already does the merging, I just wanted to verify with you whether I got the requirements right because of the broken test.

@westonruter
Copy link
Member

What do we do if we encounter a charset other than utf-8?

Do we try to convert the entire content encoding? Do we work on a known subset only that we can test? Do we just throw an error and force the user to deal with this before trying to use AMP?

Throwing a warning other than what we're currently doing?

// @todo If 'utf-8' is not the blog charset, then we'll need to do some character encoding conversation or "entityification".
if ( 'utf-8' !== strtolower( get_bloginfo( 'charset' ) ) ) {
/* translators: %s: the charset of the current site. */
trigger_error( esc_html( sprintf( __( 'The database has the %s encoding when it needs to be utf-8 to work with AMP.', 'amp' ), get_bloginfo( 'charset' ) ) ), E_USER_WARNING ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions.error_log_trigger_error
}

I think we should try to convert the entire encoding if possible. I was thinking that something like this should be possible (untested code):

diff --git a/includes/class-amp-theme-support.php b/includes/class-amp-theme-support.php
index 11b7d9a3..fb518e4c 100644
--- a/includes/class-amp-theme-support.php
+++ b/includes/class-amp-theme-support.php
@@ -2249,6 +2249,16 @@ class AMP_Theme_Support {
 			);
 		}
 
+		// AMP requires UTF-8, so convert encoding.
+		if ( strtolower( get_bloginfo( 'charset' ) ) !== 'utf-8' ) {
+			if ( function_exists( 'mb_convert_encoding' ) ) {
+				$response = mb_convert_encoding( $response, 'utf-8', get_bloginfo( 'charset' ) );
+			} else {
+				/* translators: %s: the charset of the current site. */
+				trigger_error( esc_html( sprintf( __( 'The database has the %s encoding when it needs to be utf-8 to work with AMP.', 'amp' ), get_bloginfo( 'charset' ) ) ), E_USER_WARNING ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions.error_log_trigger_error
+			}
+		}
+
 		$dom   = AMP_DOM_Utils::get_dom( $response );
 		$xpath = new DOMXPath( $dom );
 		$head  = $dom->getElementsByTagName( 'head' )->item( 0 );
@@ -2358,12 +2368,6 @@ class AMP_Theme_Support {
 			}
 		}
 
-		// @todo If 'utf-8' is not the blog charset, then we'll need to do some character encoding conversation or "entityification".
-		if ( 'utf-8' !== strtolower( get_bloginfo( 'charset' ) ) ) {
-			/* translators: %s: the charset of the current site. */
-			trigger_error( esc_html( sprintf( __( 'The database has the %s encoding when it needs to be utf-8 to work with AMP.', 'amp' ), get_bloginfo( 'charset' ) ) ), E_USER_WARNING ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions.error_log_trigger_error
-		}
-
 		AMP_Validation_Manager::finalize_validation(
 			$dom,
 			[

However, last I tried it I don't recall having success. It has been awhile however.

Even so, if UTF-8 is not the encoding, we should still at least issue a warning in Site Health.


$this->ensure_charset_is_present( $charset );

if ( ! $this->is_correct_charset() ) { // phpcs:ignore Generic.CodeAnalysis.EmptyStatement
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this will need to be done prior to the DOMDocument being constructed, but it didn't occur to me that perhaps DOMDocument allows for encoding to be changed after parsing. I don't think it facilitates this, however.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DOMDocument works with UTF-8 internally, so I think we need to have it converted before sending it to DOMDocument, or otherwise we might already have messed it up via loadHTML().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So then is this is_correct_ charset() check needed then? Re-encoding it into UTF-8 cannot be done at this point. It would need to be done earlier. This sanitizer just needs to make sure that the utf-8 meta charset is present.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've put this into AMP_DOM_Document now.

static function ( $element ) {
return $element->parentNode->removeChild( $element );
},
iterator_to_array( $elements, false )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL! We could be using this lots of places where we're currently iterating over a DOMNodeList to push onto an array which we then iterate over to potentially remove elements. Removing elements from the DOM while iterating over a DOMNodeList is problematic since it is a live node list.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the DOMNodeList does not seem to have a built-in way to turn itself into an array, but any iterator can be turned into one via iterator_to_array().

Array/Traversable/Iterable handling in PHP is just a big mess, unfortunately.

*
* @return DOMElement The document's <head> element.
*/
protected function ensure_head_is_present() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is, in fact, needed, because now we prevent processing responses when no <head> is present:

/*
* Abort if the response was not HTML. To be post-processed as an AMP page, the output-buffered document must
* have the HTML mime type and it must start with <html> followed by <head> tag (with whitespace, doctype, and comments optionally interspersed).
*/
if ( 'text/html' !== substr( AMP_HTTP::get_response_content_type(), 0, 9 ) || ! preg_match( '#^(?:<!.*?>|\s+)*<html.*?>(?:<!.*?>|\s+)*<head\b(.*?)>#is', $response ) ) {
return $response;
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should treat the sanitizers independently from the AMP_Theme_Support processing here, as we want to produce a stand-alone sanitizer library in the long term.

While it is perfectly fine to do early bails in AMP_Theme_Support for various reasons as part of the "WP integration" of the plugin, I think that "asserting" the requirements in the sanitizer should still be done nevertheless.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, good call. 👍

includes/sanitizers/class-amp-meta-sanitizer.php Outdated Show resolved Hide resolved
includes/sanitizers/class-amp-meta-sanitizer.php Outdated Show resolved Hide resolved
return;
}

$this->meta_tags[ self::TAG_CHARSET ][] = $this->create_charset_element( $charset ?: static::AMP_CHARSET );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can just always use static::AMP_CHARSET because this is the only thing that AMP allows (utf-8).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but only if we do indeed transform the document first. Otherwise, we're not only in the wrong charset, but we've also lost the information about what the actual charset is => broken^2. :)

Comment on lines 210 to 163
protected function ensure_viewport_is_present() {
if ( empty( $this->meta_tags[ self::TAG_VIEWPORT ] ) ) {
$this->meta_tags[ self::TAG_VIEWPORT ][] = $this->create_viewport_element( static::AMP_VIEWPORT );
return;
}

// Ensure we have the 'width=device-width' setting included.
$viewport_tag = $this->meta_tags[ self::TAG_VIEWPORT ][0];
$viewport_content = $viewport_tag->getAttribute( 'content' );
$viewport_settings = array_map( 'trim', explode( ',', $viewport_content ) );
$width_found = false;

foreach ( $viewport_settings as $index => $viewport_setting ) {
list( $property, $value ) = array_map( 'trim', explode( '=', $viewport_setting ) );
if ( 'width' === $property ) {
if ( 'device-width' !== $value ) {
$viewport_settings[ $index ] = 'width=device-width';
}
$width_found = true;
break;
}
}

if ( ! $width_found ) {
array_unshift( $viewport_settings, 'width=device-width' );
}

$viewport_tag->setAttribute( 'content', implode( ',', $viewport_settings ) );
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to obtain the tag spec for this meta tag essentially incorporate this logic:

/**
* Check if attribute has valid properties.
*
* @since 0.7
*
* @param DOMElement $node Node.
* @param string $attr_name Attribute name.
* @param array[]|string[] $attr_spec_rule Attribute spec rule.
*
* @return string:
* - AMP_Rule_Spec::PASS - $attr_name has a value that matches the rule.
* - AMP_Rule_Spec::FAIL - $attr_name has a value that does *not* match rule.
* - AMP_Rule_Spec::NOT_APPLICABLE - $attr_name does not exist or there
* is no rule for this attribute.
*/
private function check_attr_spec_rule_value_properties( DOMElement $node, $attr_name, $attr_spec_rule ) {
if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_PROPERTIES ] ) && $node->hasAttribute( $attr_name ) ) {
$properties = [];
foreach ( explode( ',', $node->getAttribute( $attr_name ) ) as $pair ) {
$pair_parts = explode( '=', $pair, 2 );
if ( 2 !== count( $pair_parts ) ) {
return 0;
}
$properties[ strtolower( trim( $pair_parts[0] ) ) ] = trim( $pair_parts[1] );
}
// Fail if there are unrecognized properties.
if ( count( array_diff( array_keys( $properties ), array_keys( $attr_spec_rule[ AMP_Rule_Spec::VALUE_PROPERTIES ] ) ) ) > 0 ) {
return AMP_Rule_Spec::FAIL;
}
foreach ( $attr_spec_rule[ AMP_Rule_Spec::VALUE_PROPERTIES ] as $prop_name => $property_spec ) {
// Mandatory property is missing.
if ( ! empty( $property_spec['mandatory'] ) && ! isset( $properties[ $prop_name ] ) ) {
return AMP_Rule_Spec::FAIL;
}
if ( ! isset( $properties[ $prop_name ] ) ) {
continue;
}
$prop_value = $properties[ $prop_name ];
// Required value is absent, so fail.
$required_value = null;
if ( isset( $property_spec['value'] ) ) {
$required_value = $property_spec['value'];
} elseif ( isset( $property_spec['value_double'] ) ) {
$required_value = $property_spec['value_double'];
$prop_value = (float) $prop_value;
}
if ( isset( $required_value ) && $prop_value !== $required_value ) {
return AMP_Rule_Spec::FAIL;
}
}
return AMP_Rule_Spec::PASS;
}
return AMP_Rule_Spec::NOT_APPLICABLE;
}

Otherwise, if someone creates a meta tag like:

<meta name=viewport content="width=content-width,BOGUS=SUPER_BOGUS">

Then the tag-and-attribute sanitizer will end up removing this meta tag that you created here.

In order to facilitate obtaining the tag spec for meta name=viewport, instead of iterating over all meta tags it and finding the one that has that spec_name, it would be useful to perhaps to refactor the spec generation logic to use the spec_name as the array key, like so:

--- a/includes/sanitizers/class-amp-allowed-tags-generated.php
+++ b/includes/sanitizers/class-amp-allowed-tags-generated.php
@@ -10106,7 +10106,7 @@ class AMP_Allowed_Tags_Generated {
 					'unique' => true,
 				),
 			),
-			array(
+			'meta name=viewport' => array(
 				'attr_spec_list' => array(
 					'content' => array(
 						'mandatory' => true,
@@ -10135,7 +10135,6 @@ class AMP_Allowed_Tags_Generated {
 				'tag_spec' => array(
 					'mandatory' => true,
 					'mandatory_parent' => 'head',
-					'spec_name' => 'meta name=viewport',
 					'spec_url' => 'https://amp.dev/documentation/guides-and-tutorials/learn/spec/amphtml#required-markup',
 					'unique' => true,
 				),

Then we could extend the \AMP_Allowed_Tags_Generated::get_allowed_tag() with a new $spec_name arg:

--- a/includes/sanitizers/class-amp-allowed-tags-generated.php
+++ b/includes/sanitizers/class-amp-allowed-tags-generated.php
@@ -17625,10 +17625,13 @@ class AMP_Allowed_Tags_Generated {
 	 *
 	 * @since 0.7
 	 * @param string $node_name Tag name.
+	 * @param string $spec_name Spec name, to reduce the tag specs to a single one.
 	 * @return array|null Allowed tag, or null if the tag does not exist.
 	 */
-	public static function get_allowed_tag( $node_name ) {
-		if ( isset( self::$allowed_tags[ $node_name ] ) ) {
+	public static function get_allowed_tag( $node_name, $spec_name = null ) {
+		if ( isset( $spec_name, self::$allowed_tags[ $node_name ][ $spec_name ] ) ) {
+			return self::$allowed_tags[ $node_name ][ $spec_name ];
+		} elseif ( isset( self::$allowed_tags[ $node_name ] ) ) {
 			return self::$allowed_tags[ $node_name ];
 		}
 		return null;

And then here in the ensure_viewport_is_present it could easily locate the tag spec via:

$tag_spec = \AMP_Allowed_Tags_Generated::get_allowed_tag( 'meta', 'meta name=viewport' );

This would be useful elsewhere that we refer to tag specs, including in the style sanitizer.

Copy link
Collaborator Author

@schlessera schlessera Nov 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More generally, would it be possible to refactor the tag & meta sanitizer to strip the offending properties only, and then check if the entire thing is still valid or not? This way, we wouldn't have to duplicate this logic all over the place...

it would be useful to perhaps to refactor the spec generation logic to use the spec_name as the array key

Is this guaranteed not to have collisions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More generally, would it be possible to refactor the tag & meta sanitizer to strip the offending properties only, and then check if the entire thing is still valid or not? This way, we wouldn't have to duplicate this logic all over the place...

Yes, this could be done. There could be validation errors with type invalid_meta_property. I like that.

Is this guaranteed not to have collisions?

It turns out, yes! Sometimes the spec_name is omitted when it is just a regular HTML element, I think. And for script we just need to derive a spec_name from the extension_spec. When this is done, all tag specs are unique: https://gist.github.com/westonruter/8e6a8ca427c69b23cdd8e26dcccbfc3d

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed during the plugin sync, I'll tackle the enhanced validation-stripping-granularity in a separate PR to keep this one moderately sane.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related: #3780. May be best to wait to work on this until that PR is merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I can include the property sanitization as part of #3780.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or rather a subsequent PR built off of that PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened a new issue to track this: #4070

// Force-add http-equiv charset to make DOMDocument behave as it should.
// See: http://php.net/manual/en/domdocument.loadhtml.php#78243.
$source = str_replace(
'<head>',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if $source has a <head> that hsa attributes like <head profile="http://www.acme.com/profiles/core">

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified this to use a regex to match any head tag. However, tests revealed now that DOMDocument seems to strip all attributes from the head tag...

Comment on lines 149 to 158
if ( false === strpos( $substring, '<head>' ) ) {
// Create the required HTML structure if none exists yet.
$content = "<html><head></head><body>{$content}</body></html>";
} else {
// <head> seems to be present without <body>.
$content = preg_replace( '#</head>(.*)</html>#', '</head><body>$1</body>', $content );
}
} elseif ( false === strpos( $substring, '<head>' ) ) {
// Create a <head> element if none exists yet.
$content = str_replace( '<body', '<head></head><body', $content );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if <head> is present but it contains attributes, like:

<head profile='http://www.acme.com/profiles/core'>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will change to a preg_replace().

);
$head->insertBefore( $charset, $head->firstChild );

return str_replace( '<meta http-equiv="content-type" content="text/html; charset=' . self::AMP_ENCODING . '">', '', parent::saveHTML( $node ) );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This replacement key makes me a bit nervous, as it means we have have to rely on libxml always serializing attributes the same way. As long as we have tests for it, then I suppose nothing to worry about.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a tag with two string-based attributes. I'm not sure how far off this could get.

However, I'll replace it with a preg_replace() to make it case-insensitive and "quote-insensitive".

if ( $success ) {
$this->encoding = self::AMP_ENCODING;
$head = $this->getElementsByTagName( 'head' )->item( 0 );
$head->removeChild( $head->firstChild );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the above str_replace() didn't actually do any replacement. It could remove the wrong node here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a safeguard around it.

$success = parent::loadHTML( $source, $options );

if ( $success ) {
$this->encoding = self::AMP_ENCODING;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't DOMDocument populate this? Why is this needed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's one of the many bits I found online that might or might not improve DOMDocument behavior with UTF-8.

However, I don't think this does much, so I'll remove it.

Comment on lines 173 to 183
$this->original_encoding = mb_detect_encoding( $source );
}

// Guessing the encoding seems to have failed, so we assume UTF-8 instead.
if ( empty( $this->original_encoding ) ) {
$this->original_encoding = self::AMP_ENCODING;
}

$this->original_encoding = $this->sanitize_encoding( $this->original_encoding );

$target = mb_convert_encoding( $source, self::AMP_ENCODING, $this->original_encoding );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to short-circuit if the mbstring extension is not loaded, since we do not currently include it among the $_amp_required_extensions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I suggest that if a document is not utf-8 and we cannot convert into utf-8 that we throw an exception. AMP would just not be available.

Comment on lines 211 to 212
$http_equiv_tag && str_replace( $http_equiv_tag, '', $content );
$charset_tag && str_replace( $charset_tag, '', $content );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem to be unused expressions. Shouldn't they be assigned to something?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are string replacements that only happen when the strings evaluate to true. Too cryptic?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string replacements aren't done in place though, as $content is not passed by reference. I would expect something like this:

if ( $charset_tag ) {
    $content = str_replace( $charset_tag, '', $content );
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, this is nonsense. I'll change it.

Comment on lines 304 to 411
case 'xpath':
$this->xpath = new DOMXPath( $this );
return $this->xpath;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice.

*
* @since 1.5
*
* @property DOMXpath $xpath XPath query object for this document.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @property DOMXpath $xpath XPath query object for this document.
* @property DOMXPath $xpath XPath query object for this document.

includes/utils/class-amp-dom-document.php Outdated Show resolved Hide resolved
@schlessera
Copy link
Collaborator Author

Yes, some more hardening is needed here, this was the first attempt that finally passed all the tests.

Also, I need write a lot of additional tests for the new code, move some tests over, and think about how to best test the actual encoding conversion.

@schlessera schlessera changed the title [WIP] Add meta tag sanitizer [WIP] Add AMP_DOM_Document & meta tag sanitizer Nov 27, 2019
@schlessera
Copy link
Collaborator Author

There's still a few minor kinks to figure out, but I think I've got general automated character set conversion working.

Also, the new AMP_DOM_Document provides a way of centralizing some operations that are reused again and again, saving some processing time. I've done a preliminary change for fetching $xpath, $head and $body for some classes now.

@@ -96,6 +97,7 @@ class AMP_Autoloader {
'AMP_Content' => 'includes/templates/class-amp-content',
'AMP_Content_Sanitizer' => 'includes/templates/class-amp-content-sanitizer',
'AMP_Post_Template' => 'includes/templates/class-amp-post-template',
'AMP_DOM_Document' => 'includes/utils/class-amp-dom-document',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this go into the Amp\AmpWP namespace now that we have it? Or is this premature?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think it makes sense, if we can agree that we'll use this new abstraction everywhere going forward.

I assume it should be Amp\AmpWP\DOMDocument, due to its importance?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, or what about Amp\AmpWP\DOM\Document?

Copy link
Collaborator Author

@schlessera schlessera Nov 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use a subnamespace like that, we would also add DOMElementList in there as well.

It would allow us to important the namespace only and then use the relative DOM subnamespace in the code, if we want:

use Amp\AmpWP\DOM;

$dom  = new DOM\Document();
$list = ( new DOM\ElementList() )
    ->add( $image1 )
    ->add( $image2 );

However, we might want to go with Amp\AmpWP\Dom\Document to match Amp. It would make the result less "shouty".

includes/class-amp-theme-support.php Show resolved Hide resolved
Comment on lines 24 to 31
/**
* Placeholder for default arguments, to be set in child classes.
*
* @var array
*/
protected $DEFAULT_ARGS = [ // phpcs:ignore WordPress.NamingConventions.ValidVariableName.PropertyNotSnakeCase
'use_document_element' => true, // We want to work on the header, so we need the entire document.
];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this will have any practical effect. Since we're not referencing $this-args['use_document_element'] in the sanitizer, this is just setting a member which is going to be either unused or just overridden in the first place.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this can be removed now.

Comment on lines 210 to 163
protected function ensure_viewport_is_present() {
if ( empty( $this->meta_tags[ self::TAG_VIEWPORT ] ) ) {
$this->meta_tags[ self::TAG_VIEWPORT ][] = $this->create_viewport_element( static::AMP_VIEWPORT );
return;
}

// Ensure we have the 'width=device-width' setting included.
$viewport_tag = $this->meta_tags[ self::TAG_VIEWPORT ][0];
$viewport_content = $viewport_tag->getAttribute( 'content' );
$viewport_settings = array_map( 'trim', explode( ',', $viewport_content ) );
$width_found = false;

foreach ( $viewport_settings as $index => $viewport_setting ) {
list( $property, $value ) = array_map( 'trim', explode( '=', $viewport_setting ) );
if ( 'width' === $property ) {
if ( 'device-width' !== $value ) {
$viewport_settings[ $index ] = 'width=device-width';
}
$width_found = true;
break;
}
}

if ( ! $width_found ) {
array_unshift( $viewport_settings, 'width=device-width' );
}

$viewport_tag->setAttribute( 'content', implode( ',', $viewport_settings ) );
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or rather a subsequent PR built off of that PR.

Comment on lines 413 to 416
$this->head = $this->getElementsByTagName( 'head' )->item( 0 );
return $this->head;
case 'body':
$this->body = $this->getElementsByTagName( 'body' )->item( 0 );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If $this->head or $this->body are null, should this create those elements just in time?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does. We're not defining the internal properties $head and $body, so the first time around, they don't exist and a call to $this->head will trigger the magic __get(). Within the magic getter, we dynamically set the property $head, so the next call will immediately retrieve it and not hit the magic getter anymore.

}

// Mimic regular PHP behavior for missing notices.
trigger_error( "Undefined property: AMP_DOM_Document::${$name}", E_NOTICE ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions,WordPress.Security.EscapeOutput
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
trigger_error( "Undefined property: AMP_DOM_Document::${$name}", E_NOTICE ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions,WordPress.Security.EscapeOutput
trigger_error( "Undefined property: " . __CLASS__ . "::${$name}", E_NOTICE ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions,WordPress.Security.EscapeOutput

Comment on lines -186 to -196
// Make sure there is a head and a body.
$head = $dom->getElementsByTagName( 'head' )->item( 0 );
if ( ! $head ) {
$head = $dom->createElement( 'head' );
$dom->documentElement->insertBefore( $head, $dom->documentElement->firstChild );
}
$body = $dom->getElementsByTagName( 'body' )->item( 0 );
if ( ! $body ) {
$body = $dom->createElement( 'body' );
$dom->documentElement->appendChild( $body );
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the ensuring of the body and head being done now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AMP_DOM_Document now enforces a base structure:

/**
* Normalize the document structure.
*
* This makes sure the document adheres to the general structure that AMP requires:
* ```
* <!doctype html>
* <html>
* <head>
* <meta charset="utf-8">
* </head>
* <body>
* </body>
* </html>
* ```
*
* @param string $content Content to normalize the structure of.
* @return string Normalized content.
*/
private function normalize_document_structure( $content ) {

@westonruter
Copy link
Member

westonruter commented Nov 29, 2019 via email

@schlessera
Copy link
Collaborator Author

⚠️ Rebasing now to get access to the new namespace code.

@schlessera
Copy link
Collaborator Author

⚠️ Rebasing now to get resolve conflicts.

@schlessera schlessera force-pushed the fix/3469-convert-http-equiv branch from bbf08a3 to 4bcad65 Compare December 9, 2019 17:08
@pierlon
Copy link
Contributor

pierlon commented Dec 11, 2019

Hey @schlessera, reverting 4b39169 does resolve the failing Twitter embed tests, but I'm not sure as to why they are failing in the first place as yet.

src/Dom/Document.php Outdated Show resolved Hide resolved
src/Dom/Document.php Outdated Show resolved Hide resolved
src/Dom/Document.php Outdated Show resolved Hide resolved
tests/php/test-class-amp-dom-document.php Show resolved Hide resolved
tests/php/test-class-amp-dom-document.php Outdated Show resolved Hide resolved
src/Dom/Document.php Outdated Show resolved Hide resolved
tests/php/test-class-amp-dom-document.php Outdated Show resolved Hide resolved
src/Dom/Document.php Outdated Show resolved Hide resolved
@westonruter
Copy link
Member

Alpha build for testing: amp.zip (v1.5.0-alpha-20191218T231919Z-8cbe6d60)

@westonruter
Copy link
Member

  • Cloning the Dom\Document can lead to such edge cases, so this is also something to use with caution until we were able to play around more with it.

Wouldn't this be addressed by implementing __clone()? For example:

function __clone() {
    unset( $this->xpath, $this->head, $this->body );
}

This will reset those properties so that the next time the getter is invoked, they will get re-populated.

@schlessera
Copy link
Collaborator Author

  • The <meta charset="utf-8"> is not strictly necessary according to the AMP spec, but I think it is a good practice to just always have it around, especially as we now might have a different charset for the AMP page than for the non-AMP page.

Actually, it is required according to the spec:

Ah, yes, I mixed stuff up in my head. In terms of parsing the DOMDocument, it is not required (as it doesn't work properly, actually, and DOMDocument needs an HTML4 http-equiv) and adding it would normally happen in sanitizers.

But, still, I think it is good to have here, and when we agree it is worth changing the existing tests for that, then its settled.

@schlessera
Copy link
Collaborator Author

  • Cloning the Dom\Document can lead to such edge cases, so this is also something to use with caution until we were able to play around more with it.

Wouldn't this be addressed by implementing __clone()? For example:

function __clone() {
    unset( $this->xpath, $this->head, $this->body );
}

This will reset those properties so that the next time the getter is invoked, they will get re-populated.

I thought about that but was wondering how far that rabbit-hole might take me if the DOMDocument does by itself already cause issues. But I can try adding this and reverting the test change to see what we'll get.

@schlessera
Copy link
Collaborator Author

@westonruter All of your feedback from the review should be addressed now. Also, I got rid of the dependency on AMP_DOM_Utils in Dom\Document, as I think that doesn't make sense and creates a circular dependency.

src/Dom/Document.php Outdated Show resolved Hide resolved
tests/php/test-class-amp-dom-document.php Outdated Show resolved Hide resolved
@schlessera schlessera force-pushed the fix/3469-convert-http-equiv branch from b70ea78 to c739df0 Compare December 19, 2019 14:33
@schlessera
Copy link
Collaborator Author

Note: I force-pushed because I messed up a commit.

@westonruter
Copy link
Member

Great job on this large effort.

Comment on lines +89 to +95
const HTML_STRUCTURE_DOCTYPE_PATTERN = '/^[^<]*<!doctype(?:\s+[^>]+)?>/i';
const HTML_STRUCTURE_HTML_START_TAG = '/^[^<]*(?<html_start><html(?:\s+[^>]*)?>)/i';
const HTML_STRUCTURE_HTML_END_TAG = '/(?:<\/html(?:\s+[^>]*)?>)[^<>]*$/i';
const HTML_STRUCTURE_HEAD_START_TAG = '/^[^<]*(?:<head(?:\s+[^>]*)?>)/i';
const HTML_STRUCTURE_BODY_START_TAG = '/^[^<]*(?:<body(?:\s+[^>]*)?>)/i';
const HTML_STRUCTURE_BODY_END_TAG = '/(?:<\/body(?:\s+[^>]*)?>)[^<>]*$/i';
const HTML_STRUCTURE_HEAD_TAG = '/^(?:[^<]*(?:<head(?:\s+[^>]*)?>).*?<\/head(?:\s+[^>]*)?>)/is';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These patterns aren't all accounting for the inclusion of HTML comments which are added when doing validation requests. This results in HTML documents being corrupted when validating. See #4104.

* Sanitize.
*/
public function sanitize() {
$elements = $this->dom->getElementsByTagName( static::$tag );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This caused a bug. See #4502.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working cla: yes Signed the Google CLA Sanitizers
Projects
None yet
4 participants