HTML API: Add CSS selector support #7857

sirreal · 2024-11-21T10:56:52Z

This is not ready for final review but is ready for early feedback, especially around the open questions listed below.

Introduce CSS selector based traversal of HTML documents in the HTML API. Add new select_all and select methods to the Tag Processor and HTML Processor.

// With select_all to traverse a document stopping on matching selectors
$processor = WP_HTML_Processor::create_full_parser( '<p match><div att match><em><i match></i><a match>' );
foreach ( $processor->select_all( 'p, [att], em > *' ) as $_ ) {
	assert( $processor->get_attribute( 'match' ) );
}

// With select to move to a matching selector
$processor = WP_HTML_Processor::create_full_parser( '<p match><div att match><em><i match></i><a match>' );
assert( $processor->select( 'p, [att], em > *' ) );
assert( $processor->get_attribute( 'match' ) );
assert( 'P' === $processor->get_tag() );

A subset of the CSS selector grammar is available as described here:

wordpress-develop/src/wp-includes/html-api/class-wp-css-compound-selector-list.php

Lines 24 to 39 in 355c9a2

    
    * This class is analogous to <compound-selector-list> in the grammar. The supported grammar is:
    *
    *     <selector-list> = <complex-selector-list>
    *     <complex-selector-list> = <complex-selector>#
    *     <compound-selector-list> = <compound-selector>#
    *     <complex-selector> = [ <type-selector> <combinator>? ]* <compound-selector>
    *     <compound-selector> = [ <type-selector>? <subclass-selector>* ]!
    *     <combinator> = '>' | [ '|' '|' ]
    *     <type-selector> = <ident-token> | '*'
    *     <subclass-selector> = <id-selector> | <class-selector> | <attribute-selector>
    *     <id-selector> = <hash-token>
    *     <class-selector> = '.' <ident-token>
    *     <attribute-selector> = '[' <ident-token> ']' |
    *                            '[' <ident-token> <attr-matcher> [ <string-token> | <ident-token> ] <attr-modifier>? ']'
    *     <attr-matcher> = [ '~' | '|' | '^' | '$' | '*' ]? '='
    *     <attr-modifier> = i | s
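For reviewers less familiar with the `<attr-matcher>` and `<attr-modifier>` productions, here is a minimal sketch of what each matcher means according to the selectors specification. The function name `sketch_attr_matches` and its signature are hypothetical and are not taken from this PR; the real matching code may be structured quite differently.

```php
<?php
/**
 * Hypothetical helper (not part of this PR): applies one attribute matcher
 * to an attribute value, following the semantics in the selectors specification.
 */
function sketch_attr_matches( string $actual, string $matcher, string $expected, bool $case_insensitive = false ): bool {
	if ( $case_insensitive ) {
		// The `i` modifier asks for ASCII case-insensitive comparison.
		$actual   = strtolower( $actual );
		$expected = strtolower( $expected );
	}

	switch ( $matcher ) {
		case '=': // Exact match.
			return $actual === $expected;

		case '~=': // One of the whitespace-separated words equals the value.
			if ( '' === $expected || false !== strpbrk( $expected, " \t\n\f\r" ) ) {
				return false;
			}
			$words = preg_split( '/[ \t\n\f\r]+/', $actual, -1, PREG_SPLIT_NO_EMPTY );
			return in_array( $expected, $words, true );

		case '|=': // Exact match, or the value followed by a hyphen.
			return $actual === $expected || 0 === strpos( $actual, $expected . '-' );

		case '^=': // Prefix match. An empty value never matches.
			return '' !== $expected && 0 === strpos( $actual, $expected );

		case '$=': // Suffix match. An empty value never matches.
			return '' !== $expected && substr( $actual, -strlen( $expected ) ) === $expected;

		case '*=': // Substring match. An empty value never matches.
			return '' !== $expected && false !== strpos( $actual, $expected );
	}

	return false;
}
```

For example, `[lang|="en"]` would correspond to `sketch_attr_matches( 'en-US', '|=', 'en' )`, and `[type="TEXT" i]` to passing `true` as the last argument.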

Notable variations from selectors specification:

Pseudo-element selectors are not supported. Pseudo elements will not exist in the HTML and it's unclear what benefit they would bring.

Pseudo-class selectors are not supported. Pseudo classes could be useful, but the logic to parse and match pseudo-class selectors would add significant complexity. There's also a lot of variety in pseudo-selectors, and rather than supporting simpler selectors (e.g. :empty) and not supporting more complex selectors (e.g. :nth-child()), pseudo-class selectors are completely unsupported. This is a clear and simple rule that greatly simplifies the implementation.

Complex selectors have limited support. Complex selectors combine selectors using one of the combinators: descendant (whitespace), child (>), next-sibling (+), or subsequent-sibling (~). The Tag Processor does not support complex selectors at all: it has no concept of document structure, and all complex selectors are structural. The HTML Processor does not support the sibling combinators + or ~; it only supports parent/child and ancestor/descendant relationships. These relationships can be checked by analyzing breadcrumbs, without tracking additional state in the document. Finally, only type selectors are allowed in non-final position, again because this allows matching against breadcrumbs without tracking additional state:

Supported: body heading > h1.page-title[attribute]
Unsupported (class / id selector in non final position): #page > main, .page main
Unsupported (sibling selectors not supported): ul li ~ li, ul li + li.
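To illustrate why type selectors in non-final position can be matched from breadcrumbs alone, here is a rough sketch. It assumes breadcrumbs are a list of uppercase tag names from the root down to the parent of the current element, and that the final compound selector has already matched the current tag. The function `sketch_match_ancestors` and the `[ type, combinator ]` pair format are hypothetical and are not the data structures used in this PR.

```php
<?php
/**
 * Hypothetical sketch: match the non-final part of a complex selector
 * against ancestor tag names only.
 *
 * @param array $parts     Pairs of [ type, combinator ], left to right.
 *                         The combinator is '>' (child) or ' ' (descendant)
 *                         and relates the type to whatever follows it.
 * @param array $ancestors Tag names from the root down to the parent.
 */
function sketch_match_ancestors( array $parts, array $ancestors ): bool {
	if ( empty( $parts ) ) {
		return true; // Every non-final selector found a matching ancestor.
	}

	// Work right to left, starting from the nearest ancestor.
	[ $type, $combinator ] = array_pop( $parts );

	while ( ! empty( $ancestors ) ) {
		$candidate = array_pop( $ancestors );
		$type_ok   = '*' === $type || $type === $candidate;

		if ( $type_ok && sketch_match_ancestors( $parts, $ancestors ) ) {
			return true;
		}

		if ( '>' === $combinator ) {
			return false; // Child combinator: only the nearest ancestor counts.
		}
		// Descendant combinator: keep looking further up the tree.
	}

	return false;
}

// `body header > h1` with the current element H1 inside HTML > BODY > HEADER.
$matched = sketch_match_ancestors(
	array( array( 'BODY', ' ' ), array( 'HEADER', '>' ) ),
	array( 'HTML', 'BODY', 'HEADER' )
);
```

A selector such as `#page > main` cannot be handled this way because the ID of the ancestor is not recorded in the list of tag names.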

Importantly, the selectors supported by the HTML Processor should be sufficient to support all core block attribute selectors according to this PR:

wordpress-develop/src/wp-includes/html-api/class-wp-html-attribute-sourcer.php

Lines 18 to 49 in 88e7d30

    
    /**
     * Existing Core Selectors
     *
     *     .blocks-gallery-caption
     *     .blocks-gallery-item
     *     .blocks-gallery-item__caption
     *     .book-author
     *     .message
     *     a
     *     a[download]
     *     audio
     *     blockquote
     *     cite
     *     code
     *     div
     *     figcaption
     *     figure > a
     *     figure a
     *     figure img
     *     figure video,figure img
     *     h1,h2,h3,h4,h5,h6
     *     img
     *     li
     *     ol,ul
     *     p
     *     pre
     *     tbody tr
     *     td,th
     *     tfoot tr
     *     thead tr
     *     video
     */

Most of the listed selectors are also supported by the Tag Processor with the exception of the complex selectors:

figure > a
figure a
figure img
figure video,figure img
tbody tr
tfoot tr
thead tr

Open questions

Implementation

The implementation introduces a number of classes. The classes roughly correspond to different parts of the selector grammar. Parsing is handled in the _list classes that represent the top of the grammar. Matching logic is handled by each selector class. The selector classes implement a matches interface.

~~I'm not happy with how the implementation is distributed. I'd like to move in one of two directions:~~

Move the parsing logic for selectors into the selector classes, so each selector class is responsible for parsing itself.

I've implemented this change.

OR

~~Move matching logic into the top level *_list classes so that the lower level selectors are only data containers.~~

~~My preference at this time (and the original implementation) would be the former.~~

Selector traversal APIs

select_all is implemented as a generator and, apart from triggering _doing_it_wrong, has no way to differentiate between "nothing matched" and "the selector was invalid or unsupported". It simply yields nothing in both cases.
select uses select_all internally and has the same limitation.
select_all expects the document position to remain the same in order to visit all matching tags. Because none of the supported selectors rely on stateful logic, this is not an issue at this time.
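The generator limitation above can be seen in a small standalone sketch. The function `sketch_select_all` and its toy validity check (bare type selectors only) are hypothetical stand-ins and do not reflect the real select_all implementation; the point is only that a generator which returns early looks the same to the caller as one that found no matches.

```php
<?php
/**
 * Hypothetical stand-in for a generator-based traversal.
 * Only bare type selectors count as "valid" in this toy version.
 */
function sketch_select_all( array $tags, string $selector ): Generator {
	if ( 1 !== preg_match( '/^[a-z][a-z0-9]*$/i', $selector ) ) {
		// Invalid or unsupported selector: the generator ends without yielding.
		return;
	}

	foreach ( $tags as $tag ) {
		if ( strtoupper( $selector ) === $tag ) {
			yield $tag;
		}
	}
}

// A valid selector that matches nothing...
$no_match = iterator_to_array( sketch_select_all( array( 'P', 'DIV' ), 'em' ), false );

// ...and an unsupported selector both produce an empty iteration.
$invalid = iterator_to_array( sketch_select_all( array( 'P', 'DIV' ), 'p:empty' ), false );
```

A caller that needs to tell the two cases apart would need a separate signal, for example validating the selector before iterating. That is one possible direction, not something this PR currently provides.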

Todo:

Get and address feedback on open questions.
Split into smaller PRs for review. Specifically, the selector lists can be split into separate PRs starting with compound and following with complex selectors.

sirreal/html-api-debugger#5 can be used for testing the parsing and matching behavior.

Trac ticket: https://core.trac.wordpress.org/ticket/62653

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

github-actions · 2024-11-21T11:18:36Z

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

sirreal · 2024-11-26T18:41:48Z

src/wp-includes/html-api/class-wp-css-selectors.php

+ * - The following combinators:
+ *   - descendant (e.g. `.parent .descendant`)
+ *   - child (`.parent > .child`)


I don't think this will be supported initially. Maybe no combinators at all, only simple selectors.

These two combinators can probably be supported by the HTML API as long as the non-final selector is an element (type) selector:

div > .className ✅ supported

.className > div ⛔️ not supported

Tags can be "seen" via breadcrumbs, while things like IDs, classes, attributes etc. would require seeking or more advanced internal tracking.

Is the problem that we'd need to keep track of all the attributes requested by the selector in the entire breadcrumbs trail?

Exactly. Either breadcrumbs would need to store all of their attributes or we'd need to seek around the document to check them which would be very costly.

All of the supported selectors are things that can be checked from a given tag in the document without any seeking as the HTML API is implemented now.

If we knew the selector upfront, we could always store the attributes needed to match it, e.g. store the class attribute if we know we'll only ever look for div > .className. Only that wouldn't be practical.

However, keeping track of classnames in all the breadcrumbs elements seems okay-ish resource-wise.

Perhaps we could even have a "CSS-enabled" mode where we'd keep track of all the breadcrumbs attributes shorter than, say, 100 bytes? It wouldn't be perfect but should suffice for most attribute-based matches.

There will be ways to support more selectors, one good idea is to set some state to track data for certain selectors as the document is traversed.

One tricky thing with that implementation is that it may be necessary to return to the start of the document depending on the selectors used and scan again from there so that the necessary data is collected during traversal. This probably involves analyzing the selectors, deciding what data needs to be stored, seeking to the beginning if necessary, seeking back to the current location, and then trying to find the selector.

The complexity here is significant. It can be overcome, but I think the set of selectors supported in this PR as it is now is a good starting place with very limited matching complexity. Iteration to improve support can be added in future PRs.

Oh! Everything I said I meant for the future, no need to hold up this work :)

On seeking - sounds reasonable, although it won't combine well with streaming. What would be an example of a selector that requires backtracking?

What would be an example of a selector that requires backtracking?

This is mostly necessary to support selecting from arbitrary positions in the document. If we knew we were at the start of the document, we could determine what additional state needed to be tracked and start collecting it.

The sibling combinators + and ~ select based on preceding elements, so it would be necessary to back up to check for matches.

Support for subclass selectors (ID, class, and attribute) in non-final position, like #my-id div, would require backing up to parse some attributes along with breadcrumbs.

If pseudo-class selectors were added, I'd expect many of them to require more understanding of the document structure.

github-actions · 2024-12-09T19:56:40Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell, zieladam.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

Experimental support for selectors in WordPress/wordpress-develop#7857

sirreal · 2024-12-10T17:30:50Z

This can be tested/demoed using the playground + HTML API debugger plugin

sirreal · 2024-12-10T17:52:11Z

src/wp-includes/html-api/class-wp-css-complex-selector.php

+	 * @param string[] $breadcrumbs Breadcrumbs.
+	 * @return bool True if a match is found, otherwise false.
+	 */
+	private function explore_matches( array $selectors, array $breadcrumbs ): bool {


This is a recursive function. It's bounded by the length of the breadcrumbs list and the depth of the complex selector, but it may be good to set some limits here.
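One way such a limit could look is a step budget shared across the recursion. This is a sketch only: `sketch_explore`, the `[ type, combinator ]` pair format, and the default budget of 1000 are hypothetical and are not taken from explore_matches in this PR.

```php
<?php
/**
 * Hypothetical sketch of a recursive ancestor match with a step budget.
 *
 * @param array $selectors   Pairs of [ type, combinator ], left to right.
 * @param array $breadcrumbs Ancestor tag names from the root down to the parent.
 * @param int   $steps       Running count of calls, shared by reference.
 * @param int   $max_steps   Hypothetical budget. Exceeding it gives up.
 */
function sketch_explore( array $selectors, array $breadcrumbs, int &$steps, int $max_steps = 1000 ): bool {
	if ( ++$steps > $max_steps ) {
		// Budget exhausted. Note this is reported the same way as "no match".
		return false;
	}

	if ( empty( $selectors ) ) {
		return true;
	}

	[ $type, $combinator ] = array_pop( $selectors );

	while ( ! empty( $breadcrumbs ) ) {
		$candidate = array_pop( $breadcrumbs );

		if ( ( '*' === $type || $type === $candidate ) && sketch_explore( $selectors, $breadcrumbs, $steps, $max_steps ) ) {
			return true;
		}

		if ( '>' === $combinator ) {
			return false; // Child combinator: only the nearest ancestor counts.
		}
	}

	return false;
}

// A long chain of `*` descendant selectors over deep nesting is the costly case.
$selectors = array_merge( array( array( 'X', ' ' ) ), array_fill( 0, 10, array( '*', ' ' ) ) );
$steps     = 0;
$result    = sketch_explore( $selectors, array_fill( 0, 30, 'DIV' ), $steps );
```

Returning false when the budget runs out conflates "gave up" with "no match", so a real limit would probably need its own way of being reported.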

sirreal · 2024-12-10T17:54:33Z

I'm not sure how useful it would be, but I think it would be interesting to add namespaces to type selectors. We could declare 4 namespaces to support: *, html, svg, and math. Then things could be selected like:

Any MathML element: math|*
SVG Title elements: svg|title
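As a rough illustration of the idea, parsing a namespaced type selector could look something like the following. Everything here is hypothetical: the function name, the returned array shape, the simplified identifier pattern, and the choice to treat an unprefixed selector as "any namespace" are assumptions for the sketch, not part of this PR.

```php
<?php
/**
 * Hypothetical sketch: split `ns|name` into a namespace and a local name,
 * accepting only the four namespaces suggested above.
 *
 * @return array|null Parsed parts, or null if the selector is not accepted.
 */
function sketch_parse_namespaced_type( string $selector ): ?array {
	$allowed = array( '*', 'html', 'svg', 'math' );
	$parts   = explode( '|', $selector );

	if ( 1 === count( $parts ) ) {
		// Assumption: no prefix means the type may be in any namespace.
		$namespace = '*';
		$name      = $parts[0];
	} elseif ( 2 === count( $parts ) ) {
		$namespace = strtolower( $parts[0] );
		$name      = $parts[1];
	} else {
		return null;
	}

	if ( ! in_array( $namespace, $allowed, true ) ) {
		return null;
	}

	// Simplified identifier check. Real <ident-token> parsing is more involved.
	if ( '*' !== $name && 1 !== preg_match( '/^[a-zA-Z][a-zA-Z0-9-]*$/', $name ) ) {
		return null;
	}

	return array(
		'namespace' => $namespace,
		'name'      => $name,
	);
}

$any_mathml = sketch_parse_namespaced_type( 'math|*' );
$svg_title  = sketch_parse_namespaced_type( 'svg|title' );
```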

sirreal · 2024-12-11T18:24:45Z

I'm not happy with how the implementation is distributed. I'd like to move in one of two directions:

Move the parsing logic for selectors into the selector classes, so each selector class is responsible for parsing itself.

I've changed the class structures so that all the selectors extend an abstract base class that provides some constants and utilities for parsing selectors. Each selector class now defines how it is parsed to construct an instance of itself.

Most of the classes expose properties that should be private and only used in their internal match methods. The properties are inspected in tests now, but this should be changed.

sirreal force-pushed the html-api/add-css-selector-parser branch 2 times, most recently from 184ad41 to 8d9d9e0 Compare November 25, 2024 17:33

sirreal commented Nov 26, 2024

View reviewed changes

sirreal force-pushed the html-api/add-css-selector-parser branch 3 times, most recently from 370539b to 9ecaab5 Compare November 29, 2024 15:25

sirreal force-pushed the html-api/add-css-selector-parser branch from 8659898 to e5e4c7e Compare December 4, 2024 20:57

sirreal changed the title ~~HTML API: Add CSS selector parsing~~ HTML API: Add CSS selector support Dec 5, 2024

sirreal force-pushed the html-api/add-css-selector-parser branch from e5e4c7e to b7e032e Compare December 5, 2024 13:52

sirreal added 20 commits December 5, 2024 22:52

WIP class skeleton

0e8c4fb

Document class

2d3d283

Do not support namespaced selectors

40222d3

Flesh out stuff

6092642

Starting to actually parse

3e3b2b2

Add ident tests

967557f

Fix ident non-ascii bug

2ec1db3

Use class after defined

ee2c7ce

Fix some char stuff

0f708ba

Improve tests

3cb455d

Housekeeping

5609e50

Require new file in WP

4f25bc2

Fix offset type

943293f

Add more tests and invalid tests

24c9744

Fix wrong offset var usage

a7c10b9

comment tweak

dd718b7

Implement codepoint escape with strspn

5884aca

Test with UPPER HEX

a9a077f

Add ID tests

5f53e0a

Improve tests

effbbbe

sirreal marked this pull request as ready for review December 9, 2024 19:56

sirreal mentioned this pull request Dec 9, 2024

Add CSS selector support sirreal/html-api-debugger#5

Merged

Merge branch 'trunk' into html-api/add-css-selector-parser

7ef67c1

gziolo mentioned this pull request Dec 10, 2024

Blocks: Optional support that optimizes handling blocks for usage as structured data WordPress/gutenberg#49437

Open

This was referenced Dec 10, 2024

WIP: Introduce class for sourcing block attributes from HTML WordPress/gutenberg#46345

Draft

HTML API: Add attribute sourcing #6388

Draft

Dennis' list of broad and interesting things. WordPress/gutenberg#62437

Open

sirreal added a commit to sirreal/html-api-debugger that referenced this pull request Dec 10, 2024

Add CSS selector support (#5)

fc62b3f

Experimental support for selectors in WordPress/wordpress-develop#7857

sirreal commented Dec 10, 2024

View reviewed changes

sirreal added 12 commits December 11, 2024 14:54

Merge branch 'trunk' into html-api/add-css-selector-parser

abb4d25

Merge branch 'trunk' into html-api/add-css-selector-parser

1c1e29a

Move parsing back to selector classes

2cf7e46

Update tests for class parsing

aca3d89

Use whitepsace chars constant

28aafe8

parse_whitespace should be protected

aa27be3

Update interface to abstract class require

2a5fc0c

Document base class

eb12c60

Invert and comment confusing compound selector condition

1bbb82b

Use switch in compound selector parsing

da79b23

Fix up some todo-s

e193621

Make most selector constructors private

85e4d6a

sirreal added 4 commits December 11, 2024 19:32

Fix test class implementation of abstract class

40d914b

Remove php 8+ ?static return types

fff9309

Fix typo in Exception class name

776d4b3

Remove ?static return type from test

dd6d8cd

sirreal mentioned this pull request Dec 12, 2024

Block API: Add attribute-sourced block attributes on server #8000

Draft
