Skip to content

Commit

Permalink
HTML API: Ensure that breadcrumbs are properly retained after seeking.
Browse files Browse the repository at this point in the history
In some cases, it's possible to seek back into a location found inside
an element which has been closed before the point in the document where
the `seek()` was made. In these cases the breadcrumb stack is lost, and
calling `get_breadcrumbs()` after the seek will return the wrong information.

In this patch, the HTML Processor takes a conservative approach and
moves to the front of the document, then reparses the document until
it reaches the sought-after location. This ensures consistency on
the stack of open elements and active formats, and preserves
breadcrumbs.

Developed in WordPress#6185
Discussed in https://core.trac.wordpress.org/ticket/60687

Props jonsurrell.
Follow-up to [60687].
See #58517.
Fixes #60687.



git-svn-id: https://develop.svn.wordpress.org/trunk@57768 602fd350-edb4-49c9-b593-d223f7449a82
  • Loading branch information
dmsnell committed Mar 5, 2024
1 parent 70f3746 commit 4d562a1
Show file tree
Hide file tree
Showing 2 changed files with 94 additions and 39 deletions.
111 changes: 72 additions & 39 deletions src/wp-includes/html-api/class-wp-html-processor.php
Original file line number Diff line number Diff line change
Expand Up @@ -241,31 +241,31 @@ public static function create_fragment( $html, $context = '<body>', $encoding =
return null;
}

$p = new self( $html, self::CONSTRUCTOR_UNLOCK_CODE );
$p->state->context_node = array( 'BODY', array() );
$p->state->insertion_mode = WP_HTML_Processor_State::INSERTION_MODE_IN_BODY;
$processor = new self( $html, self::CONSTRUCTOR_UNLOCK_CODE );
$processor->state->context_node = array( 'BODY', array() );
$processor->state->insertion_mode = WP_HTML_Processor_State::INSERTION_MODE_IN_BODY;

// @todo Create "fake" bookmarks for non-existent but implied nodes.
$p->bookmarks['root-node'] = new WP_HTML_Span( 0, 0 );
$p->bookmarks['context-node'] = new WP_HTML_Span( 0, 0 );
$processor->bookmarks['root-node'] = new WP_HTML_Span( 0, 0 );
$processor->bookmarks['context-node'] = new WP_HTML_Span( 0, 0 );

$p->state->stack_of_open_elements->push(
$processor->state->stack_of_open_elements->push(
new WP_HTML_Token(
'root-node',
'HTML',
false
)
);

$p->state->stack_of_open_elements->push(
$processor->state->stack_of_open_elements->push(
new WP_HTML_Token(
'context-node',
$p->state->context_node[0],
$processor->state->context_node[0],
false
)
);

return $p;
return $processor;
}

/**
Expand Down Expand Up @@ -1226,6 +1226,10 @@ public function release_bookmark( $bookmark_name ) {
/**
* Moves the internal cursor in the HTML Processor to a given bookmark's location.
*
* Be careful! Seeking backwards to a previous location resets the parser to the
* start of the document and reparses the entire contents up until it finds the
* sought-after bookmarked location.
*
* In order to prevent accidental infinite loops, there's a
* maximum limit on the number of times seek() can be called.
*
Expand All @@ -1247,44 +1251,73 @@ public function seek( $bookmark_name ) {
$bookmark_starts_at = $this->bookmarks[ $actual_bookmark_name ]->start;
$direction = $bookmark_starts_at > $processor_started_at ? 'forward' : 'backward';

switch ( $direction ) {
case 'forward':
// When moving forwards, reparse the document until reaching the same location as the original bookmark.
while ( $this->step() ) {
if ( $bookmark_starts_at === $this->bookmarks[ $this->state->current_token->bookmark_name ]->start ) {
return true;
}
/*
* If seeking backwards, it's possible that the sought-after bookmark exists within an element
* which has been closed before the current cursor; in other words, it has already been removed
* from the stack of open elements. This means that it's insufficient to simply pop off elements
* from the stack of open elements which appear after the bookmarked location and then jump to
* that location, as the elements which were open before won't be re-opened.
*
* In order to maintain consistency, the HTML Processor rewinds to the start of the document
* and reparses everything until it finds the sought-after bookmark.
*
* There are potentially better ways to do this: cache the parser state for each bookmark and
* restore it when seeking; store an immutable and idempotent register of where elements open
* and close.
*
* If caching the parser state it will be essential to properly maintain the cached stack of
* open elements and active formatting elements when modifying the document. This could be a
* tedious and time-consuming process as well, and so for now will not be performed.
*
* It may be possible to track bookmarks for where elements open and close, and in doing so
* be able to quickly recalculate breadcrumbs for any element in the document. It may even
* be possible to remove the stack of open elements and compute it on the fly this way.
* If doing this, the parser would need to track the opening and closing locations for all
* tokens in the breadcrumb path for any and all bookmarks. By utilizing bookmarks themselves
* this list could be automatically maintained while modifying the document. Finding the
* breadcrumbs would then amount to traversing that list from the start until the token
* being inspected. Once an element closes, if there are no bookmarks pointing to locations
* within that element, then all of these locations may be forgotten to save on memory use
* and computation time.
*/
if ( 'backward' === $direction ) {
/*
* Instead of clearing the parser state and starting fresh, calling the stack methods
* maintains the proper flags in the parser.
*/
foreach ( $this->state->stack_of_open_elements->walk_up() as $item ) {
if ( 'context-node' === $item->bookmark_name ) {
break;
}

return false;

case 'backward':
/*
* When moving backwards, clear out all existing stack entries which appear after the destination
* bookmark. These could be stored for later retrieval, but doing so would require additional
* memory overhead and also demand that references and bookmarks are updated as the document
* changes. In time this could be a valuable optimization, but it's okay to give up that
* optimization in exchange for more CPU time to recompute the stack, to re-parse the
* document that may have already been parsed once.
*/
foreach ( $this->state->stack_of_open_elements->walk_up() as $item ) {
if ( $bookmark_starts_at >= $this->bookmarks[ $item->bookmark_name ]->start ) {
break;
}
$this->state->stack_of_open_elements->remove_node( $item );
}

$this->state->stack_of_open_elements->remove_node( $item );
foreach ( $this->state->active_formatting_elements->walk_up() as $item ) {
if ( 'context-node' === $item->bookmark_name ) {
break;
}

foreach ( $this->state->active_formatting_elements->walk_up() as $item ) {
if ( $bookmark_starts_at >= $this->bookmarks[ $item->bookmark_name ]->start ) {
break;
}
$this->state->active_formatting_elements->remove_node( $item );
}

$this->state->active_formatting_elements->remove_node( $item );
}
parent::seek( 'context-node' );
$this->state->insertion_mode = WP_HTML_Processor_State::INSERTION_MODE_IN_BODY;
$this->state->frameset_ok = true;
}

// When moving forwards, reparse the document until reaching the same location as the original bookmark.
if ( $bookmark_starts_at === $this->bookmarks[ $this->state->current_token->bookmark_name ]->start ) {
return true;
}

return parent::seek( $actual_bookmark_name );
while ( $this->step() ) {
if ( $bookmark_starts_at === $this->bookmarks[ $this->state->current_token->bookmark_name ]->start ) {
return true;
}
}

return false;
}

/**
Expand Down
22 changes: 22 additions & 0 deletions tests/phpunit/tests/html-api/wpHtmlProcessorBreadcrumbs.php
Original file line number Diff line number Diff line change
Expand Up @@ -545,4 +545,26 @@ public function test_can_seek_back_and_forth() {
$processor->seek( 'second' );
$this->assertTrue( $processor->get_attribute( 'two' ) );
}

/**
* Ensures that breadcrumbs are properly reported after seeking backward to a location
* inside an element which has been fully closed before the seek.
*
* @ticket 60687
*/
public function test_retains_proper_bookmarks_after_seeking_back_to_closed_element() {
$processor = WP_HTML_Processor::create_fragment( '<div><img></div><div><hr></div>' );

$processor->next_tag( 'IMG' );
$processor->set_bookmark( 'first' );

$processor->next_tag( 'HR' );

$processor->seek( 'first' );
$this->assertSame(
array( 'HTML', 'BODY', 'DIV', 'IMG' ),
$processor->get_breadcrumbs(),
'Should have retained breadcrumbs from bookmarked location after seeking backwards to it.'
);
}
}

0 comments on commit 4d562a1

Please sign in to comment.