-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add AMP_DOM_Document & meta tag sanitizer #3758
Conversation
Note to self: don't push into a PR you've initiated but haven't created yet, it will force-create an empty one. |
We already have most of the meta tag logic in Right now, the above code seems to fix the main bug we were chasing, but it introduces duplicate code. I'll make some more changes now to extract the meta tag handling out of the |
Note: the code in the amp-wp/includes/class-amp-theme-support.php Line 1536 in 3b70f6d
|
What was supposed to be a small refactor here turned out to be more of a mess than expected. I started with a I don't want to spend much more time on refactoring here, as I've already collected higher-level thoughts about this in #3763. I think one principle we should adhere to in future refactorings is to split sanitization (changing from incorrect into correct) from optimization (reordering and similar). The reordering is what not only causes the entire |
There's two open issues here: What do we do if we encounter a charset other than Do we try to convert the entire content encoding? Do we work on a known subset only that we can test? Do we just throw an error and force the user to deal with this before trying to use AMP? What are the requirements for the viewport? As already stated in Slack, the existing tests currently expect So, given the input |
Here are the allowed properties as well as the required You're right that it |
The PR already does the merging, I just wanted to verify with you whether I got the requirements right because of the broken test. |
Throwing a warning other than what we're currently doing? amp-wp/includes/class-amp-theme-support.php Lines 2361 to 2365 in f157602
I think we should try to convert the entire encoding if possible. I was thinking that something like this should be possible (untested code): diff --git a/includes/class-amp-theme-support.php b/includes/class-amp-theme-support.php
index 11b7d9a3..fb518e4c 100644
--- a/includes/class-amp-theme-support.php
+++ b/includes/class-amp-theme-support.php
@@ -2249,6 +2249,16 @@ class AMP_Theme_Support {
);
}
+ // AMP requires UTF-8, so convert encoding.
+ if ( strtolower( get_bloginfo( 'charset' ) ) !== 'utf-8' ) {
+ if ( function_exists( 'mb_convert_encoding' ) ) {
+ $response = mb_convert_encoding( $response, 'utf-8', get_bloginfo( 'charset' ) );
+ } else {
+ /* translators: %s: the charset of the current site. */
+ trigger_error( esc_html( sprintf( __( 'The database has the %s encoding when it needs to be utf-8 to work with AMP.', 'amp' ), get_bloginfo( 'charset' ) ) ), E_USER_WARNING ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions.error_log_trigger_error
+ }
+ }
+
$dom = AMP_DOM_Utils::get_dom( $response );
$xpath = new DOMXPath( $dom );
$head = $dom->getElementsByTagName( 'head' )->item( 0 );
@@ -2358,12 +2368,6 @@ class AMP_Theme_Support {
}
}
- // @todo If 'utf-8' is not the blog charset, then we'll need to do some character encoding conversation or "entityification".
- if ( 'utf-8' !== strtolower( get_bloginfo( 'charset' ) ) ) {
- /* translators: %s: the charset of the current site. */
- trigger_error( esc_html( sprintf( __( 'The database has the %s encoding when it needs to be utf-8 to work with AMP.', 'amp' ), get_bloginfo( 'charset' ) ) ), E_USER_WARNING ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions.error_log_trigger_error
- }
-
AMP_Validation_Manager::finalize_validation(
$dom,
[ However, last I tried it I don't recall having success. It has been awhile however. Even so, if UTF-8 is not the encoding, we should still at least issue a warning in Site Health. |
|
||
$this->ensure_charset_is_present( $charset ); | ||
|
||
if ( ! $this->is_correct_charset() ) { // phpcs:ignore Generic.CodeAnalysis.EmptyStatement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this will need to be done prior to the DOMDocument
being constructed, but it didn't occur to me that perhaps DOMDocument
allows for encoding to be changed after parsing. I don't think it facilitates this, however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DOMDocument works with UTF-8 internally, so I think we need to have it converted before sending it to DOMDocument, or otherwise we might already have messed it up via loadHTML()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So then is this is_correct_ charset()
check needed then? Re-encoding it into UTF-8 cannot be done at this point. It would need to be done earlier. This sanitizer just needs to make sure that the utf-8
meta charset is present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've put this into AMP_DOM_Document
now.
static function ( $element ) { | ||
return $element->parentNode->removeChild( $element ); | ||
}, | ||
iterator_to_array( $elements, false ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL! We could be using this lots of places where we're currently iterating over a DOMNodeList
to push onto an array which we then iterate over to potentially remove elements. Removing elements from the DOM while iterating over a DOMNodeList
is problematic since it is a live node list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the DOMNodeList
does not seem to have a built-in way to turn itself into an array, but any iterator can be turned into one via iterator_to_array()
.
Array/Traversable/Iterable handling in PHP is just a big mess, unfortunately.
* | ||
* @return DOMElement The document's <head> element. | ||
*/ | ||
protected function ensure_head_is_present() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this is, in fact, needed, because now we prevent processing responses when no <head>
is present:
amp-wp/includes/class-amp-theme-support.php
Lines 2052 to 2058 in f157602
/* | |
* Abort if the response was not HTML. To be post-processed as an AMP page, the output-buffered document must | |
* have the HTML mime type and it must start with <html> followed by <head> tag (with whitespace, doctype, and comments optionally interspersed). | |
*/ | |
if ( 'text/html' !== substr( AMP_HTTP::get_response_content_type(), 0, 9 ) || ! preg_match( '#^(?:<!.*?>|\s+)*<html.*?>(?:<!.*?>|\s+)*<head\b(.*?)>#is', $response ) ) { | |
return $response; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should treat the sanitizers independently from the AMP_Theme_Support
processing here, as we want to produce a stand-alone sanitizer library in the long term.
While it is perfectly fine to do early bails in AMP_Theme_Support
for various reasons as part of the "WP integration" of the plugin, I think that "asserting" the requirements in the sanitizer should still be done nevertheless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, good call. 👍
return; | ||
} | ||
|
||
$this->meta_tags[ self::TAG_CHARSET ][] = $this->create_charset_element( $charset ?: static::AMP_CHARSET ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can just always use static::AMP_CHARSET
because this is the only thing that AMP allows (utf-8
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but only if we do indeed transform the document first. Otherwise, we're not only in the wrong charset, but we've also lost the information about what the actual charset is => broken^2. :)
protected function ensure_viewport_is_present() { | ||
if ( empty( $this->meta_tags[ self::TAG_VIEWPORT ] ) ) { | ||
$this->meta_tags[ self::TAG_VIEWPORT ][] = $this->create_viewport_element( static::AMP_VIEWPORT ); | ||
return; | ||
} | ||
|
||
// Ensure we have the 'width=device-width' setting included. | ||
$viewport_tag = $this->meta_tags[ self::TAG_VIEWPORT ][0]; | ||
$viewport_content = $viewport_tag->getAttribute( 'content' ); | ||
$viewport_settings = array_map( 'trim', explode( ',', $viewport_content ) ); | ||
$width_found = false; | ||
|
||
foreach ( $viewport_settings as $index => $viewport_setting ) { | ||
list( $property, $value ) = array_map( 'trim', explode( '=', $viewport_setting ) ); | ||
if ( 'width' === $property ) { | ||
if ( 'device-width' !== $value ) { | ||
$viewport_settings[ $index ] = 'width=device-width'; | ||
} | ||
$width_found = true; | ||
break; | ||
} | ||
} | ||
|
||
if ( ! $width_found ) { | ||
array_unshift( $viewport_settings, 'width=device-width' ); | ||
} | ||
|
||
$viewport_tag->setAttribute( 'content', implode( ',', $viewport_settings ) ); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need to obtain the tag spec for this meta tag essentially incorporate this logic:
amp-wp/includes/sanitizers/class-amp-tag-and-attribute-sanitizer.php
Lines 1593 to 1652 in f157602
/** | |
* Check if attribute has valid properties. | |
* | |
* @since 0.7 | |
* | |
* @param DOMElement $node Node. | |
* @param string $attr_name Attribute name. | |
* @param array[]|string[] $attr_spec_rule Attribute spec rule. | |
* | |
* @return string: | |
* - AMP_Rule_Spec::PASS - $attr_name has a value that matches the rule. | |
* - AMP_Rule_Spec::FAIL - $attr_name has a value that does *not* match rule. | |
* - AMP_Rule_Spec::NOT_APPLICABLE - $attr_name does not exist or there | |
* is no rule for this attribute. | |
*/ | |
private function check_attr_spec_rule_value_properties( DOMElement $node, $attr_name, $attr_spec_rule ) { | |
if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_PROPERTIES ] ) && $node->hasAttribute( $attr_name ) ) { | |
$properties = []; | |
foreach ( explode( ',', $node->getAttribute( $attr_name ) ) as $pair ) { | |
$pair_parts = explode( '=', $pair, 2 ); | |
if ( 2 !== count( $pair_parts ) ) { | |
return 0; | |
} | |
$properties[ strtolower( trim( $pair_parts[0] ) ) ] = trim( $pair_parts[1] ); | |
} | |
// Fail if there are unrecognized properties. | |
if ( count( array_diff( array_keys( $properties ), array_keys( $attr_spec_rule[ AMP_Rule_Spec::VALUE_PROPERTIES ] ) ) ) > 0 ) { | |
return AMP_Rule_Spec::FAIL; | |
} | |
foreach ( $attr_spec_rule[ AMP_Rule_Spec::VALUE_PROPERTIES ] as $prop_name => $property_spec ) { | |
// Mandatory property is missing. | |
if ( ! empty( $property_spec['mandatory'] ) && ! isset( $properties[ $prop_name ] ) ) { | |
return AMP_Rule_Spec::FAIL; | |
} | |
if ( ! isset( $properties[ $prop_name ] ) ) { | |
continue; | |
} | |
$prop_value = $properties[ $prop_name ]; | |
// Required value is absent, so fail. | |
$required_value = null; | |
if ( isset( $property_spec['value'] ) ) { | |
$required_value = $property_spec['value']; | |
} elseif ( isset( $property_spec['value_double'] ) ) { | |
$required_value = $property_spec['value_double']; | |
$prop_value = (float) $prop_value; | |
} | |
if ( isset( $required_value ) && $prop_value !== $required_value ) { | |
return AMP_Rule_Spec::FAIL; | |
} | |
} | |
return AMP_Rule_Spec::PASS; | |
} | |
return AMP_Rule_Spec::NOT_APPLICABLE; | |
} |
Otherwise, if someone creates a meta tag like:
<meta name=viewport content="width=content-width,BOGUS=SUPER_BOGUS">
Then the tag-and-attribute sanitizer will end up removing this meta
tag that you created here.
In order to facilitate obtaining the tag spec for meta name=viewport
, instead of iterating over all meta
tags it and finding the one that has that spec_name
, it would be useful to perhaps to refactor the spec generation logic to use the spec_name
as the array key, like so:
--- a/includes/sanitizers/class-amp-allowed-tags-generated.php
+++ b/includes/sanitizers/class-amp-allowed-tags-generated.php
@@ -10106,7 +10106,7 @@ class AMP_Allowed_Tags_Generated {
'unique' => true,
),
),
- array(
+ 'meta name=viewport' => array(
'attr_spec_list' => array(
'content' => array(
'mandatory' => true,
@@ -10135,7 +10135,6 @@ class AMP_Allowed_Tags_Generated {
'tag_spec' => array(
'mandatory' => true,
'mandatory_parent' => 'head',
- 'spec_name' => 'meta name=viewport',
'spec_url' => 'https://amp.dev/documentation/guides-and-tutorials/learn/spec/amphtml#required-markup',
'unique' => true,
),
Then we could extend the \AMP_Allowed_Tags_Generated::get_allowed_tag()
with a new $spec_name
arg:
--- a/includes/sanitizers/class-amp-allowed-tags-generated.php
+++ b/includes/sanitizers/class-amp-allowed-tags-generated.php
@@ -17625,10 +17625,13 @@ class AMP_Allowed_Tags_Generated {
*
* @since 0.7
* @param string $node_name Tag name.
+ * @param string $spec_name Spec name, to reduce the tag specs to a single one.
* @return array|null Allowed tag, or null if the tag does not exist.
*/
- public static function get_allowed_tag( $node_name ) {
- if ( isset( self::$allowed_tags[ $node_name ] ) ) {
+ public static function get_allowed_tag( $node_name, $spec_name = null ) {
+ if ( isset( $spec_name, self::$allowed_tags[ $node_name ][ $spec_name ] ) ) {
+ return self::$allowed_tags[ $node_name ][ $spec_name ];
+ } elseif ( isset( self::$allowed_tags[ $node_name ] ) ) {
return self::$allowed_tags[ $node_name ];
}
return null;
And then here in the ensure_viewport_is_present
it could easily locate the tag spec via:
$tag_spec = \AMP_Allowed_Tags_Generated::get_allowed_tag( 'meta', 'meta name=viewport' );
This would be useful elsewhere that we refer to tag specs, including in the style sanitizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More generally, would it be possible to refactor the tag & meta sanitizer to strip the offending properties only, and then check if the entire thing is still valid or not? This way, we wouldn't have to duplicate this logic all over the place...
it would be useful to perhaps to refactor the spec generation logic to use the spec_name as the array key
Is this guaranteed not to have collisions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More generally, would it be possible to refactor the tag & meta sanitizer to strip the offending properties only, and then check if the entire thing is still valid or not? This way, we wouldn't have to duplicate this logic all over the place...
Yes, this could be done. There could be validation errors with type invalid_meta_property
. I like that.
Is this guaranteed not to have collisions?
It turns out, yes! Sometimes the spec_name
is omitted when it is just a regular HTML element, I think. And for script
we just need to derive a spec_name
from the extension_spec
. When this is done, all tag specs are unique: https://gist.github.com/westonruter/8e6a8ca427c69b23cdd8e26dcccbfc3d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed during the plugin sync, I'll tackle the enhanced validation-stripping-granularity in a separate PR to keep this one moderately sane.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related: #3780. May be best to wait to work on this until that PR is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I can include the property sanitization as part of #3780.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or rather a subsequent PR built off of that PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened a new issue to track this: #4070
// Force-add http-equiv charset to make DOMDocument behave as it should. | ||
// See: http://php.net/manual/en/domdocument.loadhtml.php#78243. | ||
$source = str_replace( | ||
'<head>', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if $source
has a <head>
that hsa attributes like <head profile="http://www.acme.com/profiles/core">
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified this to use a regex to match any head tag. However, tests revealed now that DOMDocument
seems to strip all attributes from the head tag...
if ( false === strpos( $substring, '<head>' ) ) { | ||
// Create the required HTML structure if none exists yet. | ||
$content = "<html><head></head><body>{$content}</body></html>"; | ||
} else { | ||
// <head> seems to be present without <body>. | ||
$content = preg_replace( '#</head>(.*)</html>#', '</head><body>$1</body>', $content ); | ||
} | ||
} elseif ( false === strpos( $substring, '<head>' ) ) { | ||
// Create a <head> element if none exists yet. | ||
$content = str_replace( '<body', '<head></head><body', $content ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if <head>
is present but it contains attributes, like:
<head profile='http://www.acme.com/profiles/core'>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will change to a preg_replace()
.
); | ||
$head->insertBefore( $charset, $head->firstChild ); | ||
|
||
return str_replace( '<meta http-equiv="content-type" content="text/html; charset=' . self::AMP_ENCODING . '">', '', parent::saveHTML( $node ) ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This replacement key makes me a bit nervous, as it means we have have to rely on libxml always serializing attributes the same way. As long as we have tests for it, then I suppose nothing to worry about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a tag with two string-based attributes. I'm not sure how far off this could get.
However, I'll replace it with a preg_replace()
to make it case-insensitive and "quote-insensitive".
if ( $success ) { | ||
$this->encoding = self::AMP_ENCODING; | ||
$head = $this->getElementsByTagName( 'head' )->item( 0 ); | ||
$head->removeChild( $head->firstChild ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the above str_replace()
didn't actually do any replacement. It could remove the wrong node here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a safeguard around it.
$success = parent::loadHTML( $source, $options ); | ||
|
||
if ( $success ) { | ||
$this->encoding = self::AMP_ENCODING; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't DOMDocument
populate this? Why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's one of the many bits I found online that might or might not improve DOMDocument behavior with UTF-8.
However, I don't think this does much, so I'll remove it.
$this->original_encoding = mb_detect_encoding( $source ); | ||
} | ||
|
||
// Guessing the encoding seems to have failed, so we assume UTF-8 instead. | ||
if ( empty( $this->original_encoding ) ) { | ||
$this->original_encoding = self::AMP_ENCODING; | ||
} | ||
|
||
$this->original_encoding = $this->sanitize_encoding( $this->original_encoding ); | ||
|
||
$target = mb_convert_encoding( $source, self::AMP_ENCODING, $this->original_encoding ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need to short-circuit if the mbstring
extension is not loaded, since we do not currently include it among the $_amp_required_extensions
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I suggest that if a document is not utf-8
and we cannot convert into utf-8
that we throw an exception. AMP would just not be available.
$http_equiv_tag && str_replace( $http_equiv_tag, '', $content ); | ||
$charset_tag && str_replace( $charset_tag, '', $content ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem to be unused expressions. Shouldn't they be assigned to something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are string replacements that only happen when the strings evaluate to true
. Too cryptic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string replacements aren't done in place though, as $content
is not passed by reference. I would expect something like this:
if ( $charset_tag ) {
$content = str_replace( $charset_tag, '', $content );
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, this is nonsense. I'll change it.
case 'xpath': | ||
$this->xpath = new DOMXPath( $this ); | ||
return $this->xpath; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice.
* | ||
* @since 1.5 | ||
* | ||
* @property DOMXpath $xpath XPath query object for this document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @property DOMXpath $xpath XPath query object for this document. | |
* @property DOMXPath $xpath XPath query object for this document. |
Yes, some more hardening is needed here, this was the first attempt that finally passed all the tests. Also, I need write a lot of additional tests for the new code, move some tests over, and think about how to best test the actual encoding conversion. |
There's still a few minor kinks to figure out, but I think I've got general automated character set conversion working. Also, the new |
includes/class-amp-autoloader.php
Outdated
@@ -96,6 +97,7 @@ class AMP_Autoloader { | |||
'AMP_Content' => 'includes/templates/class-amp-content', | |||
'AMP_Content_Sanitizer' => 'includes/templates/class-amp-content-sanitizer', | |||
'AMP_Post_Template' => 'includes/templates/class-amp-post-template', | |||
'AMP_DOM_Document' => 'includes/utils/class-amp-dom-document', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this go into the Amp\AmpWP
namespace now that we have it? Or is this premature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think it makes sense, if we can agree that we'll use this new abstraction everywhere going forward.
I assume it should be Amp\AmpWP\DOMDocument
, due to its importance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, or what about Amp\AmpWP\DOM\Document
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use a subnamespace like that, we would also add DOMElementList
in there as well.
It would allow us to important the namespace only and then use the relative DOM
subnamespace in the code, if we want:
use Amp\AmpWP\DOM;
$dom = new DOM\Document();
$list = ( new DOM\ElementList() )
->add( $image1 )
->add( $image2 );
However, we might want to go with Amp\AmpWP\Dom\Document
to match Amp
. It would make the result less "shouty".
/** | ||
* Placeholder for default arguments, to be set in child classes. | ||
* | ||
* @var array | ||
*/ | ||
protected $DEFAULT_ARGS = [ // phpcs:ignore WordPress.NamingConventions.ValidVariableName.PropertyNotSnakeCase | ||
'use_document_element' => true, // We want to work on the header, so we need the entire document. | ||
]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this will have any practical effect. Since we're not referencing $this-args['use_document_element']
in the sanitizer, this is just setting a member which is going to be either unused or just overridden in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think this can be removed now.
protected function ensure_viewport_is_present() { | ||
if ( empty( $this->meta_tags[ self::TAG_VIEWPORT ] ) ) { | ||
$this->meta_tags[ self::TAG_VIEWPORT ][] = $this->create_viewport_element( static::AMP_VIEWPORT ); | ||
return; | ||
} | ||
|
||
// Ensure we have the 'width=device-width' setting included. | ||
$viewport_tag = $this->meta_tags[ self::TAG_VIEWPORT ][0]; | ||
$viewport_content = $viewport_tag->getAttribute( 'content' ); | ||
$viewport_settings = array_map( 'trim', explode( ',', $viewport_content ) ); | ||
$width_found = false; | ||
|
||
foreach ( $viewport_settings as $index => $viewport_setting ) { | ||
list( $property, $value ) = array_map( 'trim', explode( '=', $viewport_setting ) ); | ||
if ( 'width' === $property ) { | ||
if ( 'device-width' !== $value ) { | ||
$viewport_settings[ $index ] = 'width=device-width'; | ||
} | ||
$width_found = true; | ||
break; | ||
} | ||
} | ||
|
||
if ( ! $width_found ) { | ||
array_unshift( $viewport_settings, 'width=device-width' ); | ||
} | ||
|
||
$viewport_tag->setAttribute( 'content', implode( ',', $viewport_settings ) ); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or rather a subsequent PR built off of that PR.
$this->head = $this->getElementsByTagName( 'head' )->item( 0 ); | ||
return $this->head; | ||
case 'body': | ||
$this->body = $this->getElementsByTagName( 'body' )->item( 0 ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If $this->head
or $this->body
are null, should this create those elements just in time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does. We're not defining the internal properties $head
and $body
, so the first time around, they don't exist and a call to $this->head
will trigger the magic __get()
. Within the magic getter, we dynamically set the property $head
, so the next call will immediately retrieve it and not hit the magic getter anymore.
} | ||
|
||
// Mimic regular PHP behavior for missing notices. | ||
trigger_error( "Undefined property: AMP_DOM_Document::${$name}", E_NOTICE ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions,WordPress.Security.EscapeOutput |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trigger_error( "Undefined property: AMP_DOM_Document::${$name}", E_NOTICE ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions,WordPress.Security.EscapeOutput | |
trigger_error( "Undefined property: " . __CLASS__ . "::${$name}", E_NOTICE ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions,WordPress.Security.EscapeOutput |
// Make sure there is a head and a body. | ||
$head = $dom->getElementsByTagName( 'head' )->item( 0 ); | ||
if ( ! $head ) { | ||
$head = $dom->createElement( 'head' ); | ||
$dom->documentElement->insertBefore( $head, $dom->documentElement->firstChild ); | ||
} | ||
$body = $dom->getElementsByTagName( 'body' )->item( 0 ); | ||
if ( ! $body ) { | ||
$body = $dom->createElement( 'body' ); | ||
$dom->documentElement->appendChild( $body ); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the ensuring of the body
and head
being done now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The AMP_DOM_Document
now enforces a base structure:
amp-wp/includes/utils/class-amp-dom-document.php
Lines 170 to 188 in 7513e41
/** | |
* Normalize the document structure. | |
* | |
* This makes sure the document adheres to the general structure that AMP requires: | |
* ``` | |
* <!doctype html> | |
* <html> | |
* <head> | |
* <meta charset="utf-8"> | |
* </head> | |
* <body> | |
* </body> | |
* </html> | |
* ``` | |
* | |
* @param string $content Content to normalize the structure of. | |
* @return string Normalized content. | |
*/ | |
private function normalize_document_structure( $content ) { |
Sounds good to me.
|
|
7513e41
to
ed2ef59
Compare
|
bbf08a3
to
4bcad65
Compare
Hey @schlessera, reverting 4b39169 does resolve the failing Twitter embed tests, but I'm not sure as to why they are failing in the first place as yet. |
Alpha build for testing: amp.zip (v1.5.0-alpha-20191218T231919Z-8cbe6d60) |
Wouldn't this be addressed by implementing function __clone() {
unset( $this->xpath, $this->head, $this->body );
} This will reset those properties so that the next time the getter is invoked, they will get re-populated. |
Co-Authored-By: Weston Ruter <[email protected]>
Ah, yes, I mixed stuff up in my head. In terms of parsing the DOMDocument, it is not required (as it doesn't work properly, actually, and DOMDocument needs an HTML4 http-equiv) and adding it would normally happen in sanitizers. But, still, I think it is good to have here, and when we agree it is worth changing the existing tests for that, then its settled. |
I thought about that but was wondering how far that rabbit-hole might take me if the DOMDocument does by itself already cause issues. But I can try adding this and reverting the test change to see what we'll get. |
@westonruter All of your feedback from the review should be addressed now. Also, I got rid of the dependency on |
b70ea78
to
c739df0
Compare
Note: I force-pushed because I messed up a commit. |
Great job on this large effort. |
const HTML_STRUCTURE_DOCTYPE_PATTERN = '/^[^<]*<!doctype(?:\s+[^>]+)?>/i'; | ||
const HTML_STRUCTURE_HTML_START_TAG = '/^[^<]*(?<html_start><html(?:\s+[^>]*)?>)/i'; | ||
const HTML_STRUCTURE_HTML_END_TAG = '/(?:<\/html(?:\s+[^>]*)?>)[^<>]*$/i'; | ||
const HTML_STRUCTURE_HEAD_START_TAG = '/^[^<]*(?:<head(?:\s+[^>]*)?>)/i'; | ||
const HTML_STRUCTURE_BODY_START_TAG = '/^[^<]*(?:<body(?:\s+[^>]*)?>)/i'; | ||
const HTML_STRUCTURE_BODY_END_TAG = '/(?:<\/body(?:\s+[^>]*)?>)[^<>]*$/i'; | ||
const HTML_STRUCTURE_HEAD_TAG = '/^(?:[^<]*(?:<head(?:\s+[^>]*)?>).*?<\/head(?:\s+[^>]*)?>)/is'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These patterns aren't all accounting for the inclusion of HTML comments which are added when doing validation requests. This results in HTML documents being corrupted when validating. See #4104.
* Sanitize. | ||
*/ | ||
public function sanitize() { | ||
$elements = $this->dom->getElementsByTagName( static::$tag ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This caused a bug. See #4502.
Summary
This PR adds an abstraction for the DOM document and a new sanitizer
AMP_Meta_Sanitizer
that sanitizes meta tags in general, but more specifically for now the charset tag.Dom\Document
to abstract away charset and document structure requirements.Dom\Document::from_html()
to construct from markup (and deprecateAMP_DOM_Utils::get_dom()
).Dom\Document::from_node()
to construct from a node (and deprecateAMP_DOM_Utils::get_dom_from_content_node()
).<head>
and<body>
elements.<meta charset="<charset>">
format.utf-8
is not met.AMP_DOM_Utils
to internalDom\Document
methods and deprecate accordingly.Fixes #3469
Fixes #855
Checklist