BLUEPRINT Reader Mode

Lots of web design out there is really bloated, ugly, and full of trackers. I want Odysseus to help fix that by adding a reader mode. The experience here is quite simple: an addressbar icon will show if Odysseus determines that it can make the site look nicer (if this icon doesn't show, it means the site is either really bloated or not bloated at all), and clicking it will toggle an alternate rendering of the page (under a reader: URI scheme) without that bloat. This setting will be persisted on a per-site basis, because pages on a site tend to share a template.

This will be based loosely on Firefox's implementation, but I'll keep it basic. I'll also want to implement it in a way that minimizes its performance hit on page load.

That is, I'll implement it on top of a SAX-style API rather than a DOM-style API, because that makes better use of memory, it can be run in a separate thread, and, given the API for it (which I don't appear to have), the computation can be done earlier.
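To make the SAX-versus-DOM point concrete, here is a minimal sketch using Python's event-driven html.parser as a stand-in for whatever parser Odysseus would actually use. The class name ClosingTagStats and the VOID_TAGS set are hypothetical. It keeps only a stack of currently open elements and reports text lengths as each closing tag arrives, so no tree is ever built.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag (partial list, for illustration only).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClosingTagStats(HTMLParser):
    """Hypothetical sketch: gather per-element text lengths from parse events."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one [tag, text_length] entry per open element
        self.closed = []  # (tag, text_length), in the order elements close

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append([tag, 0])

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][1] += len(data.strip())

    def handle_endtag(self, tag):
        if not self.stack or self.stack[-1][0] != tag:
            return  # sketch only: mismatched closing tags are ignored
        _, length = self.stack.pop()
        if self.stack:
            self.stack[-1][1] += length  # contained text counts for the parent too
        self.closed.append((tag, length))


parser = ClosingTagStats()
parser.feed("<div><p>Hello, world</p><p>Bye</p></div>")
parser.close()
print(parser.closed)  # [('p', 12), ('p', 3), ('div', 15)]
```

Memory use here is proportional to nesting depth rather than page size, which is the property the paragraph above is after.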

Algorithm for whether to offer a Reader Mode

For each closing tag (lightly adjusted per HTML5), score the element:

Ignore elements that are set to be invisible or are empty.
Track whether the siblings are inline or block, and the number of child elements.
Track the total length of contained text and the total length of contained link text.
Each comma in text adds 1 to childscore.
Each content-level block element (section, h2, h3, h4, h5, h6, p, td, pre, inline div) scores 1 + childscore, both individually and to childscore, unless it contains < 25 chars.
For those, every 100 characters also score a point, max 3.
Divs wrapping a single element are treated like they don't exist, as if they're just their children.
Container blocks score childscore, contributing 1/3 of that to childscore.
All blocks have their individual scores adjusted for text/link ratio (the biggest indicator of nav vs content).
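The scoring rules above can be sketched as two small functions. This is one reading of the list, not a settled design: ElementStats, score_content_block and score_container are hypothetical names, the "contribution" is assumed to go to the parent's childscore unadjusted, and the text/link adjustment is assumed to be a multiplication by (1 - link density), since the list does not give a formula.

```python
from dataclasses import dataclass


@dataclass
class ElementStats:
    """Hypothetical per-element totals gathered while parsing."""
    text_len: int      # total length of contained text
    link_len: int      # total length of contained link text
    childscore: float  # commas plus contributions from scored descendants


def link_density(stats):
    # Share of the element's text that sits inside links (assumed adjustment basis).
    return stats.link_len / stats.text_len if stats.text_len else 0.0


def score_content_block(stats):
    """Return (own_score, contribution_to_childscore) for p, td, pre, etc."""
    if stats.text_len < 25:
        return 0.0, 0.0  # too short to count as content
    base = 1 + stats.childscore + min(stats.text_len // 100, 3)
    return base * (1 - link_density(stats)), base


def score_container(stats):
    """Return (own_score, contribution_to_childscore) for container blocks."""
    own = stats.childscore * (1 - link_density(stats))
    return own, stats.childscore / 3


print(score_content_block(ElementStats(text_len=250, link_len=125, childscore=4)))
# (3.5, 7)
print(score_container(ElementStats(text_len=400, link_len=100, childscore=6)))
# (4.5, 2.0)
```

In the first call the paragraph earns 1 + 4 + 2 = 7 before adjustment, and half of its text being links halves its individual score.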

For each scored element, the top 5 are tracked, provided they score no lower than 2/3 of the top candidate.
Consider parents of the top 5 content elements if their score doesn't drop too low. This involves:

Considering any parent of the top candidate which contains at least 3 of the other top candidates, and up.
Choosing the best parent whose score doesn't decrease to less than 1/3 of the top candidate.
Ignoring single-parent elements.

If a top candidate is found (indicating we can tidy the page up), and it's not the body (indicating the page doesn't need tidying up), show the addressbar icon offering to tidy the page up. In either of the other cases I don't want to show it.
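As a rough illustration of the candidate tracking and the final decision, the sketch below works on a plain list of (tag, score) pairs. The function names are hypothetical, the parent-promotion step is left out, and a streaming implementation would maintain the top 5 incrementally instead of sorting at the end.

```python
def top_candidates(scored, limit=5, floor=2 / 3):
    """Keep up to `limit` elements scoring at least `floor` of the best one.

    `scored` is a list of (tag, score) pairs; this shape is an assumption.
    """
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] <= 0:
        return []
    best = ranked[0][1]
    return [item for item in ranked[:limit] if item[1] >= best * floor]


def should_offer_reader_mode(scored):
    """Show the icon only if a top candidate exists and it is not the body."""
    top = top_candidates(scored)
    return bool(top) and top[0][0] != "body"


page = [("nav", 2.0), ("article", 9.0), ("p", 7.0), ("p", 6.5), ("aside", 3.0)]
print(top_candidates(page))            # [('article', 9.0), ('p', 7.0), ('p', 6.5)]
print(should_offer_reader_mode(page))  # True
print(should_offer_reader_mode([("body", 10.0), ("p", 3.0)]))  # False
```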

Algorithm for rendering Reader Mode

Handle all reader-mode: URIs, stripping the scheme off to get the URI of the page to render.
Apply the former algorithm to that page in order to find the page's content.
Parse the top element into an XML tree.
Incorporate any siblings that score at least 1/5 of the top candidate (min 10).

add bonus points for sharing the same classes.
also include paragraphs > 80 characters with < 1/4 of it being links, or any that contains sentences.

Clean up any junk that may be interleaved in the content.

remove any presentation attributes (predominantly style, or height/width on tables).
remove any tables that appear to be for presentation. That is, keep the ones that don't have role=presentation, datatable=0, or nested tables (a light suggestion), and that do have a summary or caption, a (col, colgroup, tfoot, thead, th) element, >= 10 rows, > 4 columns, or > 10 grid cells.
(Unlike Firefox I don't want to remove any forms in the page's content)
Remove any embeds/iframes unless they're YouTube or Vimeo videos (ideally I'd translate those into native-ish controls).
Remove any footers
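The table rule in the cleanup list is the fiddliest one, so here is one possible reading of it as code. TableInfo and is_data_table are hypothetical names, the fields would be filled in while parsing, and the order of checks (explicit markers first, nested tables only as a weak hint, size last) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableInfo:
    """Hypothetical summary of one table, filled in while parsing."""
    role: Optional[str] = None
    datatable: Optional[str] = None
    has_summary_or_caption: bool = False
    has_structure_tags: bool = False  # col, colgroup, tfoot, thead, or th
    has_nested_table: bool = False
    rows: int = 0
    columns: int = 0


def is_data_table(t):
    """True if the table looks like data and should be kept."""
    if t.role == "presentation" or t.datatable == "0":
        return False  # explicitly marked as layout
    if t.has_summary_or_caption or t.has_structure_tags:
        return True   # explicit data-table markup
    if t.has_nested_table:
        return False  # light suggestion that this is layout
    return t.rows >= 10 or t.columns > 4 or t.rows * t.columns > 10


print(is_data_table(TableInfo(rows=2, columns=2)))  # False
print(is_data_table(TableInfo(rows=3, columns=4)))  # True
```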

Embed in a new HTML page with a base tag to adjust all URLs.
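The first and last steps of the rendering algorithm are simple enough to sketch directly. The function names are hypothetical, the scheme prefix is taken from the first step of the list, and the page skeleton is only a guess at what the new HTML page might contain.

```python
from html import escape

SCHEME = "reader-mode:"  # assumption: prefix as written in the first step


def target_url(reader_uri):
    """Strip the scheme off to get the URI of the page to render."""
    if not reader_uri.startswith(SCHEME):
        raise ValueError(f"not a reader-mode URI: {reader_uri!r}")
    return reader_uri[len(SCHEME):]


def wrap_content(content_html, page_url, title=""):
    """Embed extracted content in a new page whose base tag fixes relative URLs."""
    return (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        '<meta charset="utf-8">\n'
        f'<base href="{escape(page_url, quote=True)}">\n'
        f"<title>{escape(title)}</title>\n"
        "</head><body>\n"
        f"{content_html}\n"
        "</body></html>\n"
    )


url = target_url("reader-mode:https://example.com/post?id=1&lang=en")
print(wrap_content('<p><a href="next">Next</a></p>', url, title="A post"))
```

With the base tag in place, the relative link "next" resolves against the original page's URL rather than against the reader-mode URI.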
