Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve validating sanitizer with context for why element/attribute is invalid #3780

Merged
merged 41 commits into from
Dec 12, 2019

Conversation

westonruter
Copy link
Member

@westonruter westonruter commented Nov 16, 2019

Summary

  • Introduce fine-grained error codes. Instead of just invalid_element when a required attribute is missing, the more specific error ATTR_REQUIRED_BUT_MISSING is raised which explains why it is invalid. Similarly, when an attribute value is incorrect, instead of invalid_attribute occurring when an attribute value violates a regex pattern, a more specific INVALID_ATTR_VALUE_REGEX pattern is raised. Methods which check for whether a node is valid now return an error code on failure rather than just false. These fine-grained error codes will be used in a subsequent PR to add detailed messages, though at the moment just codes are used.
  • Remove redundant checks inAMP_Tag_And_Attribute_Sanitizer::validate_attr_spec_list_for_node() which prevent subsequent ability to raise specific error codes while sanitizing attributes.
  • Discontinue premature removal of element which misses a mandatory attribute.
  • Align error codes with those used by the actual AMP validator, where possible.
  • The underlying spec_name is included in the validation error, allowing in a future PR to link a validation error directly to the spec which defines specifically why it is invalid.
  • Add assertions when generating the spec to ensure the expected spec names are present, that all spec names are unique (which will be needed later), and that the error_message for blacklisted_cdata_regex is as expected (since we map to an error code).
  • Update conversion sanitizers (image, video, audio, iframe) to emit ATTR_REQUIRED_BUT_MISSING validation errors when src is missing.
  • Make AMP_Tag_And_Attribute_Sanitizer::sanitize_disallowed_attribute_values_in_node read only, deferring removal or emptying of invalid attribute to AMP_Base_Sanitizer::remove_invalid_attribute.
  • Similarly, discontinue removing invalid descendants in AMP_Tag_And_Attribute_Sanitizer::validate_attr_spec_list_for_node() (which should be read-only) to a later point.
  • Merge redundant get_disallowed_attributes_in_node into AMP_Tag_And_Attribute_Sanitizer::sanitize_disallowed_attribute_values_in_node.
  • Eliminate missing_body_element validation error in Style sanitizer in favor of just adding the body.
  • Make AMP_Tag_And_Attribute_Sanitizer::validate_attr_spec_list_for_node() always return integer scores, rather than use a float for the special case.
  • Improve tests for AMP_Tag_And_Attribute_Sanitizer by eliminating duplicated assertions and adding error code checking to each test.
  • Fix handling of special case URL __amp_source_origin, preventing it from being interpreted as a relative URL.

See #1420.

For Subsequent PRs

Checklist

  • My pull request is addressing an open issue (please create one otherwise).
  • My code is tested and passes existing tests.
  • My code follows the Engineering Guidelines (updates are often made to the guidelines, check it out periodically).

@googlebot googlebot added the cla: yes Signed the Google CLA label Nov 16, 2019
@westonruter westonruter added this to the v1.5 milestone Nov 16, 2019
@westonruter westonruter force-pushed the add/invalid-markup-reason branch from a909107 to aa129aa Compare November 22, 2019 23:22
@westonruter westonruter force-pushed the add/invalid-markup-reason branch 2 times, most recently from a2ff9d1 to e5ef273 Compare November 26, 2019 13:47
@westonruter westonruter marked this pull request as ready for review November 26, 2019 14:39
Copy link
Collaborator

@schlessera schlessera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at everything in detail yet, but added a few general comments and questions about the approach.

@@ -537,13 +551,15 @@ public function prepare_validation_error( array $error = [], array $data = [] )

if ( $node instanceof DOMElement ) {
if ( ! isset( $error['code'] ) ) {
$error['code'] = AMP_Validation_Error_Taxonomy::INVALID_ELEMENT_CODE;
$error['code'] = AMP_Tag_And_Attribute_Sanitizer::DISALLOWED_TAG;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The base class retrieves internals from one of its extending classes.

Can we extract all of the validation error constants into a separate class? I tend to create an interface with only constants in such a case. An interface with constants better communicates what is happening: two different pieces of code have a contract about how to refer to a specific problem. Contracts should be encoded in interface, not use internal logic of one another.

namespace Amp\AmpWP;

interface ValidationError {
	const DISALLOWED_TAG             = 'DISALLOWED_TAG';
	const DISALLOWED_CHILD_TAG       = 'DISALLOWED_CHILD_TAG';
	const DISALLOWED_FIRST_CHILD_TAG = 'DISALLOWED_FIRST_CHILD_TAG';
	// [...]
}

Thi is separate of any inheritance chain, and provides one central place where you retrieve constants from. The interface name also makes the syntax nicer to read:

$error['code'] = ValidationError::DISALLOWED_TAG;

If we need to have more grouping, we can have multiple interfaces, like CssValidationError, HtmlValidationError, ...?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. On one hand, I think this should just be removed entirely, and let the code be UNKNOWN_ERROR. Each sanitizer should be responsible for indicating the code, specifically.

Otherwise, I think this should be scoped with what you discussed in #3780 (comment). It's bound up with generating PHP objects from the spec.

So I think we should do it, but probably not in the scope of this PR.

$error_code = self::INVALID_BLACKLISTED_VALUE_REGEX;
} elseif ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_PROPERTIES ] ) &&
AMP_Rule_Spec::FAIL === $this->check_attr_spec_rule_value_properties( $node, $attr_name, $attr_spec_rule ) ) {
// @todo Should there be a separate validation error for each invalid property?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have 100 properties, you'd love to know exactly which ones are invalid and which ones are valid.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Either the invalid properties could be communicated as a property of the one validation error, or there could be multiple validation errors each for a single property. That's the question here. I think the latter is better.

@westonruter
Copy link
Member Author

Rebasing after large PR merged into develop

@westonruter
Copy link
Member Author

@schlessera There are 3 error codes left which do not have test assertions:

  • CDATA_VIOLATES_BLACKLIST: The first one should never occur because it will only happen if the switch statement passes through all cases, which we currently catch when parsing the spec.
  • CSS_SYNTAX_PARSE_ERROR: I am not aware of this being a possible error code in practice because the PHP CSS Parser is configured with “LenientParsing” enabled: “When used (which is true by default), the parser will not choke on unexpected tokens but simply ignore them.”
  • STYLESHEET_INVALID_FILE_PATH: I couldn't find a way to cause this error scenario.

…id-markup-reason

* 'develop' of github.com:ampproject/amp-wp:
  Pull the built `block-libray` package from Gutenberg SVN if it does not exists (#3847)
  Update dependency @babel/plugin-transform-react-jsx to v7.7.4 (#3688)
  Update dependency @babel/plugin-proposal-class-properties to v7… (#3687)
  Update dependency @babel/cli to v7.7.4 (#3685)
  Update dependency browserslist to v4.7.3 (#3792)
  Update dependency postcss to v7.0.23 (#3791)
  Update dependency autoprefixer to v9.7.2 (#3679)
  For the Gallery block, use the recommended amp-lightbox-gallery
@schlessera
Copy link
Collaborator

STYLESHEET_INVALID_FILE_PATH does not make much sense in our current context. Could it be that this only relates to Optimized AMP SSR?

@@ -82,7 +82,7 @@ abstract class AMP_Base_Sanitizer {
*
* @var array
*/
private $should_not_removed_nodes = [];
private $nodes_to_keep = [];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😜

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, maybe I should mention I searched for usage first:

Image 2019-12-03 at 4 43 35 PM

@@ -1466,6 +1518,10 @@ private function check_attr_spec_rule_disallowed_relative( DOMElement $node, $at
if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ][ AMP_Rule_Spec::ALLOW_RELATIVE ] ) && ! $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ][ AMP_Rule_Spec::ALLOW_RELATIVE ] ) {
if ( $node->hasAttribute( $attr_name ) ) {
foreach ( $this->extract_attribute_urls( $node->getAttributeNode( $attr_name ) ) as $url ) {
if ( '__amp_source_origin' === $url ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use a constant for this as well?

@@ -1486,6 +1542,10 @@ private function check_attr_spec_rule_disallowed_relative( DOMElement $node, $at
foreach ( $attr_spec_rule[ AMP_Rule_Spec::ALTERNATIVE_NAMES ] as $alternative_name ) {
if ( $node->hasAttribute( $alternative_name ) ) {
foreach ( $this->extract_attribute_urls( $node->getAttributeNode( $alternative_name ), $attr_name ) as $url ) {
if ( '__amp_source_origin' === $url ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, use of a constant makes sense.

tests/php/test-tag-and-attribute-sanitizer.php Outdated Show resolved Hide resolved
@schlessera
Copy link
Collaborator

Did a quick search, and no one seems to be using the constants in AMP_Validation_Error_Taxonomy, so BC should be fine.

westonruter and others added 4 commits December 6, 2019 19:17
…id-markup-reason

* 'develop' of github.com:ampproject/amp-wp: (143 commits)
  Update dependency @wordpress/block-editor to v3.3.0 (#3691)
  Update dependency @wordpress/editor to v9.8.0 (#3693)
  Update dependency @wordpress/compose to v3.8.0 (#3736)
  Use comment as array key for data set to show when failure happens
  Remove unused AMP_YouTube_Embed_Handler::sanitize_v_arg method after 7a97571
  Bump stylesheet cache group after #3866 (#3880)
  Delete AMP_YouTube_Embed_Handler::shortcode() and oembed()
  Delete AMP_Twitter_Embed_Handler::oembed()
  Prevent wrapping plugin names in code tags
  Update dependency @wordpress/blocks to v6.8.0 (#3734)
  Update dependency @wordpress/core-data to v2.8.0 (#3737)
  Update dependency @wordpress/edit-post to v3.9.0 (#3692)
  Update dependency @wordpress/components to v8.4.0 (#3735)
  Update dependency @wordpress/element to v2.9.0 (#3741)
  Align @param descriptions in test_video_override
  Replace a call to ->shortcode() with the logic from shortcode()
  Refactor AMP_Vimeo_Embed_Handler::shortcode() into video_override()
  Deprecate AMP_YouTube_Embed_Handler::shortcode()
  Restore AMP_YouTube_Embed_Handler::video_override()
  Improve theme inline CSS checks
  ...
Remove STYLESHEET_INVALID_FILE_PATH code and always fall-back to HTTP request when file not on file system
@westonruter
Copy link
Member Author

STYLESHEET_INVALID_FILE_PATH does not make much sense in our current context. Could it be that this only relates to Optimized AMP SSR?

Not really. It's an error that was returned (sometimes) when a file looked like it was on the filesystem but it could not be read. However, I've now eliminated this in 827659a by always falling back to an HTTP request when the file can't seem to be found on the filesystem.

@kienstra
Copy link
Contributor

kienstra commented Dec 9, 2019

If it's alright, tomorrow might be the earliest I could review this.

@schlessera
Copy link
Collaborator

@westonruter You've moved this back into Ready for Review, but looking at it, it seems to be merged and there was no mention of this needing additional changes. Is this Done?

@westonruter
Copy link
Member Author

I hadn't moved it to Done yet because #1420 isn't done yet and I wanted to make sure the other follow-ups had been filed as issues. They have now, so I'll move this to done.

@schlessera schlessera deleted the add/invalid-markup-reason branch March 7, 2020 06:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Signed the Google CLA
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants