Danny Lin edited this page Aug 2, 2024 · 21 revisions

Data scheme

Below is the data scheme WebScrapBook follows. You may also want to check the data scheme of ScrapBook X for comparison.

The scrapbook directory

WebScrapBook has three top-level subdirectories for a scrapbook (root):

  • .wsb: to store configs for WebScrapBook or the backend server (PyWebScrapBook).
  • tree: to store metadata and static site files.
  • data: to store captured data.

Both tree and data are configurable in the backend server:

  • data can be root or a subdirectory of any level under root, but cannot be put under .wsb.
  • tree can be a subdirectory of any level under root, and can be put under .wsb. Although not strictly enforced, it is recommended to use the reserved .wsb/tree when putting it under .wsb, to prevent unexpected interference from the backend server.

For simplicity, we will call them data and tree in this documentation.

The data directory

In WebScrapBook, the corresponding index file of an item can either be */index.html, *.htz, *.maff, *.html, or *.htm under data or its subdirectory of any level.

The corresponding index file of an item can be freely specified via the index property; though it's generally recommended to keep them matched. Imagine an item Wikipedia with an index file named whiskey-idiot/index.html... That's odd, isn't it?
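
As a rough illustration of the five index file patterns listed above, the following sketch maps an index path (relative to data) to a category. The function name and the category labels are hypothetical, and matching the extension case-insensitively is an assumption rather than documented behavior.

```python
from pathlib import PurePosixPath


def classify_index(index: str) -> str:
    """Return a rough category for an item's index path (relative to data).

    The category labels are illustrative only; they are not names used
    by WebScrapBook itself.
    """
    path = PurePosixPath(index)
    # Assumption: file names are compared case-insensitively.
    if path.name.lower() == "index.html" and len(path.parts) > 1:
        return "folder"       # */index.html
    suffix = path.suffix.lower()
    if suffix == ".htz":
        return "htz"          # ZIP archive with index.html inside
    if suffix == ".maff":
        return "maff"         # ZIP archive following the MAFF layout
    if suffix == ".html":
        return "single-html"  # single HTML file
    if suffix == ".htm":
        return "htm"          # *.htm index file
    return "unknown"
```

For example, `classify_index("subdir/item1/index.html")` yields the folder category, while `classify_index("subdir/item4.html")` yields the single-file category.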

There are some things to note, however:

  • Resource files are only allowed for an item with an index file of */index.html, *.htz, or *.maff.

    • For */index.html, all files and directories under */ are considered part of the item.
    • For *.htz, it should be a ZIP file containing index.html as index.
    • For *.maff, it should be a ZIP file containing */index.rdf or */index.* as index, following MAFF specification.
    • *.html is for a single-HTML-file item (one with all resources embedded). Any resource linked from it is considered external and not part of the item.
    • *.htm, when used as the index file, is reserved for a bookmark item. It should contain a meta refresh pointing to the source URL, which corresponds to the source property of the item.
  • To prevent confusion, different items should not share the same index file.

  • To prevent confusion, the index file and resource files of an item should not reside under the directory of another item (of */index.html).

    • More strictly, any item-containing subdirectory should not contain index.html, even if it's not yet defined as an item, to prevent misinterpretation by the WebScrapBook indexer.
    • For the same reason, a single-HTML-file item (*.html) should not be named index.html, or it may be misinterpreted as being */index.html.

    Take the following data structure for example:

    data
        subdir
            item1
                index.html
                ...
            item2.htz
            item3.maff
            item4.html
            item5.htm
            index.html
    
    • If subdir contains index.html, subdir will be considered an item, and all files and directories under it, i.e. item1...5, will be treated as parts of it. In such a case, item1...5 can only be resources of the subdir item, and cannot be other items.
    • On the other hand, if subdir has one or more of item1, item2, item3, item4, or item5 as item(s), it should never contain the index.html file.
  • The index file can also be a meta refresh targeting a resource file, which can be either a web page or a file.

    It's been a convention for several web page saving applications to save a page as <name>.htm(l) with a supporting folder named <name>.files or <name>_files. If such pages are put under a directory like this:

    dir
        page1.htm
        page1_files
        page2.htm
        page2_files
        ...
    

    and are fed directly to the WebScrapBook indexer, they will be indexed as single-HTML-file items, which can cause unexpected behavior in the future.

    For WebScrapBook to handle them correctly, put each of them in an individual subdirectory with an arbitrary index.html that meta-refreshes to the main page, like:

    dir
        page1
            index.html  (meta-refreshes to page1.htm)
            page1.htm
            page1_files
        page2
            index.html  (meta-refreshes to page2.htm)
            page2.htm
            page2_files
        ...
    
  • To prevent cross-platform compatibility issues:

    • Chars not supported on some platforms should not be included in a filename, such as :, ", ?, *, \, |, <, >, and control chars (0x00-0x1F, 0x7F, 0x80-0x9F).
    • Files with names that differ only in case should not exist in the same directory, such as myimage.jpg and MyImage.JPG.
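
The two cross-platform rules above could be checked with something like the sketch below. The helper names are hypothetical, the character list is exactly the one given above, and treating 0x80-0x9F as Unicode code points U+0080 to U+009F is an assumption.

```python
import re
from collections import defaultdict

# The characters named in the text, plus the control ranges 0x00-0x1F,
# 0x7F, and 0x80-0x9F (assumed here to mean code points, not raw bytes).
UNSAFE_CHARS = re.compile(r'[:"?*\\|<>\x00-\x1f\x7f\x80-\x9f]')


def has_unsafe_chars(filename: str) -> bool:
    """Return True if the filename contains any of the listed characters."""
    return UNSAFE_CHARS.search(filename) is not None


def case_collisions(filenames):
    """Group names from one directory that differ only in letter case."""
    groups = defaultdict(list)
    for name in filenames:
        groups[name.casefold()].append(name)
    return [names for names in groups.values() if len(names) > 1]
```

Calling `case_collisions` on the listing of a single directory returns each group of conflicting names, so an empty list means the directory satisfies the second rule.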

The tree directory

Index files

There are three types of index files that hold the metadata, table of contents, and fulltext cache of the scrapbook items, which roughly correspond to scrapbook.rdf and cache.rdf in a scrapbook of legacy ScrapBook.

  • meta#.js: stores item metadata.
  • toc#.js: stores table of contents.
  • fulltext#.js: stores fulltext cache.

Each type of index file can be split into multiple files, with # being empty or a sequential positive integer. The content of an index file should be a single call of scrapbook.<fn>(<data>), with <data> being JSON-compatible text. This is to ensure cross-browser compatibility for static site files.
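
Because an index file is a single scrapbook.<fn>(<data>) call wrapping JSON-compatible text, it can be read without a JavaScript engine. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the text before the call is assumed not to contain another `scrapbook.<name>(` sequence, and the last `)` in the file is assumed to close the call.

```python
import json
import re

# meta#.js, toc#.js, fulltext#.js with # empty or a positive integer.
INDEX_NAME = re.compile(r"^(meta|toc|fulltext)([1-9][0-9]*)?\.js$")
CALL_START = re.compile(r"scrapbook\.(\w+)\(")


def parse_index_file(text: str):
    """Return (fn, data) from a single scrapbook.<fn>(<data>) call."""
    match = CALL_START.search(text)
    if match is None:
        raise ValueError("no scrapbook.<fn>( call found")
    end = text.rfind(")")  # assumed to be the closing parenthesis of the call
    if end < match.end():
        raise ValueError("call is not closed")
    return match.group(1), json.loads(text[match.end():end])
```

Usage, with a made-up function name, item ID, and data shape chosen purely for illustration:

```python
fn, data = parse_index_file('scrapbook.meta({"20200101000000000": {"title": "Wikipedia"}})')
```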

Static site files

  • map.html: an index page that dynamically loads index files and lists all items in the scrapbook.
  • index.html: an index page with a static table of contents for SEO and for browsers without JavaScript. It still dynamically loads index files when JavaScript is enabled, and runs slower than map.html.
  • frame.html: a framed page with map.html being the sidebar TOC and a main frame to show an item.
  • search.html: the page for searching.

The above pages are automatically generated by WebScrapBook and should not be edited. Their behavior can be customized using a corresponding <name>.css and <name>.js.

Additional entries for static site files:

  • feed.atom: Atom feed file for recently updated items.
  • icon: icons for static site pages.
  • favicon: cached favicons for items with embedded favicon.

Other entries

  • templates: template files of the scrapbook. Currently note_template.html for an HTML note and note_template.md for a markdown note are supported.

The .wsb directory

  • config.ini: config of the backend server.
  • locks: writing protection lock files.
  • themes: custom themes of the backend server.
  • server: runtime data and cache of the backend server.
  • backup: backup files.

Metadata for a scrapbook item

Legacy ScrapBook uses <ID>/index.dat for an item. In WebScrapBook, item metadata are recorded directly in the root element of its index file (*/index.html, or index.html inside *.htz or *.maff, or the single HTML file):

  • data-scrapbook-id: ID of the item. This can be omitted, in which case it is managed automatically by the extension.
  • data-scrapbook-type: type of the item. This can be omitted for a regular page item, but is required for other types such as "file" or "note".
  • data-scrapbook-source: source URL of the item.
  • data-scrapbook-create: create datetime of the item.
  • data-scrapbook-modify: modify datetime of the item. This can be omitted and the modified time of the file will be used as fallback.
  • data-scrapbook-title: title of the item. This can be omitted and the first title element in the page will be used as fallback.
  • data-scrapbook-icon: favicon URL of the item. This can be omitted and the favicon defined in the page will be used as fallback.
  • data-scrapbook-charset: charset of the item. This is usually used by text files.
  • data-scrapbook-comment: comment of the item.

WebScrapBook stores all item metadata in the tree index files (see the tree section above), and updates them when a scrapbook item is modified. The above attributes are not mandatory and only represent the initial status when the item is created; they can be useful when importing an item into another scrapbook, as the WebScrapBook site indexer will recognize them. Note that this behavior differs from legacy ScrapBook, which updates index.dat in addition to scrapbook.rdf whenever an item is modified.
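
To illustrate how the data-scrapbook-* attributes on the root element might be collected, here is a small sketch using Python's standard html.parser. The class and function names are hypothetical, and this is not how the WebScrapBook indexer itself is implemented; the first start tag encountered is assumed to be the root element.

```python
from html.parser import HTMLParser

PREFIX = "data-scrapbook-"


class RootMetadataParser(HTMLParser):
    """Collect data-scrapbook-* attributes from the first start tag."""

    def __init__(self):
        super().__init__()
        self.metadata = None

    def handle_starttag(self, tag, attrs):
        # Only the first start tag (assumed to be the root element) is used.
        if self.metadata is None:
            self.metadata = {
                name[len(PREFIX):]: value
                for name, value in attrs
                if name.startswith(PREFIX)
            }


def read_item_metadata(html_text: str) -> dict:
    """Return the root element's metadata with the common prefix stripped."""
    parser = RootMetadataParser()
    parser.feed(html_text)
    return parser.metadata or {}
```

Attributes without the prefix (such as lang) are ignored, and a file without any start tag yields an empty dictionary.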

For a resourced item, the index file can contain a meta refresh pointing to the real main item page. However, item metadata should be recorded in the index file rather than in the meta-refreshed page in such a case.
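
A wrapper index file of this kind could be generated along the following lines. This is a sketch, not the output format WebScrapBook itself produces: the function name is hypothetical, and the exact markup (doctype, charset declaration, zero-second delay) is an assumption based on the standard meta refresh syntax.

```python
from html import escape
from urllib.parse import quote


def build_refresh_index(target: str, metadata: dict) -> str:
    """Build an index.html that meta-refreshes to `target`.

    `metadata` maps attribute suffixes (e.g. "source") to values, which
    are written as data-scrapbook-* attributes on the root element.
    """
    attrs = "".join(
        f' data-scrapbook-{key}="{escape(str(value), quote=True)}"'
        for key, value in metadata.items()
    )
    # Percent-encode the relative path, then escape it for the attribute.
    url = escape(quote(target), quote=True)
    return (
        "<!DOCTYPE html>\n"
        f"<html{attrs}>\n"
        "<head>\n"
        '<meta charset="UTF-8">\n'
        f'<meta http-equiv="refresh" content="0; url={url}">\n'
        "</head>\n"
        "</html>\n"
    )
```

The metadata is placed on the root element of the wrapper, in line with the recommendation above to record it in the index file rather than in the page being redirected to.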

Special attributes for a scrapbook page

The attributes below record the original status of a web page captured at <time>, before it was rewritten by WebScrapBook, if the related capture options are specified:

  • data-scrapbook-orig-null-node-<time>: the element does not exist in the original page.
  • data-scrapbook-orig-null-attr-<attr>-<time>: the element does not have the attribute <attr> in the original page.
  • data-scrapbook-orig-attr-<attr>-<time>: value of attribute <attr> of the element in the original page.
  • data-scrapbook-orig-textContent-<time>: text content of the element in the original page.
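
The naming pattern of these attributes can be taken apart as in the sketch below. It assumes that <time> contains no hyphen (so the last hyphen-separated component is the time), and it lowercases the name first because HTML parsers generally report attribute names in lowercase. The function name and the sample timestamp in the usage line are hypothetical.

```python
PREFIX = "data-scrapbook-orig-"
NO_ATTR_KINDS = ("null-node", "textcontent")
ATTR_KINDS = ("null-attr", "attr")


def parse_orig_attribute(name: str):
    """Split a data-scrapbook-orig-* name into (kind, attr, time), or None."""
    lowered = name.lower()
    if not lowered.startswith(PREFIX):
        return None
    # Assumption: <time> has no hyphen, so it is the last component.
    rest, sep, time = lowered[len(PREFIX):].rpartition("-")
    if not sep or not time:
        return None
    if rest in NO_ATTR_KINDS:
        return rest, None, time
    for kind in ATTR_KINDS:
        if rest.startswith(kind + "-") and len(rest) > len(kind) + 1:
            return kind, rest[len(kind) + 1:], time
    return None
```

For instance, `parse_orig_attribute("data-scrapbook-orig-attr-href-20240802000000000")` separates the kind, the original attribute name, and the time component.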

The following attributes record certain information in the original page that is not representable by standard HTML:

  • data-scrapbook-css-disabled: the stylesheet element is disabled in the original page.
  • data-scrapbook-input-checked: checked status of the input element in the original page.
  • data-scrapbook-input-indeterminate: the input element is indeterminate in the original page.
  • data-scrapbook-input-value: value of the input element in the original page.
  • data-scrapbook-option-selected: selected status of the option element in the original page.
  • data-scrapbook-textarea-value: value of the textarea element in the original page.
  • data-scrapbook-canvas: image data of the canvas element in the original page.
  • data-scrapbook-shadowdom: shadow DOM content of the element in the original page.
  • data-scrapbook-shadowdom-mode: mode property of the shadow DOM of the element in the original page.
  • data-scrapbook-shadowdom-clonable: clonable property of the shadow DOM of the element in the original page.
  • data-scrapbook-shadowdom-delegates-focus: delegatesFocus property of the shadow DOM of the element in the original page.
  • data-scrapbook-shadowdom-serializable: serializable property of the shadow DOM of the element in the original page.
  • data-scrapbook-shadowdom-slot-assignment: slotAssignment property of the shadow DOM of the element in the original page.
  • data-scrapbook-adoptedstylesheet-<index>: the <index>th constructed stylesheet in the original page.
  • data-scrapbook-adoptedstylesheets: constructed stylesheets used by the element related DOM in the original page.
  • data-scrapbook-slot-assigned: indexes of nodes manually assigned to the SLOT element in the original page.
  • data-scrapbook-slot-index: index of this element, which is manually assigned to a SLOT element, in the original page.

And, similarly, special comments:

  • <!--scrapbook-capture-selected-->...<!--/scrapbook-capture-selected-->: the selected range(s) when "Capture selection" is performed by WebScrapBook.
  • <!--scrapbook-capture-selected-splitter-->...<!--/scrapbook-capture-selected-splitter-->: the splitter between different selected ranges in a same text-like node when "Capture selection" is performed by WebScrapBook.
  • <!--scrapbook-slot-index=<index>-->...<!--scrapbook-slot-index-->: index of this text node, which is manually assigned to a SLOT element, in the original page.

Special attributes for page editor and saver

Below are special attributes used by WebScrapBook for editing and saving a page:

  • data-scrapbook-elem: define this element as a special element, added by WebScrapBook or the user, such as a loader, a highlighting, an annotation, etc. Some names are specifically for custom usage and can be arbitrarily edited:
    • data-scrapbook-elem="todo": mark this element (usually a checkbox) as a TODO element.
    • data-scrapbook-elem="title": for the main item page (the index page or the page meta-refreshed from index page), text content of the element updates to match item title when saved.
    • data-scrapbook-elem="title-src": for the main item page, text content of the element overwrites item title and other data-scrapbook-elem="title" elements when saved.
    • data-scrapbook-elem="custom": mark this element as a custom annotation that can be removed by the WebScrapBook page editor.
    • data-scrapbook-elem="custom-wrapper": mark this element as a custom highlighting that can be unwrapped (removing the element but keeping its descendants) by the WebScrapBook page editor.
    • data-scrapbook-elem="custom-css": mark this element as a custom stylesheet which should not be altered by the editor (e.g. not erased via "erase selection").
    • data-scrapbook-elem="custom-script": mark this element as a custom script which should not be altered by the editor (e.g. not erased via "erase selection").
    • data-scrapbook-elem="custom-script-safe": same as custom-script, but additionally mark the script as safe for page resaving, so that it won't trigger a warning.
  • data-scrapbook-id: an ID to group related special elements together. Elements with the same ID are treated as a whole.

And, similarly, special comments:

  • <!--scrapbook-erased-<time>=<content>-->: the HTML content erased with WebScrapBook, which can be restored by the WebScrapBook page editor.

Note templates

The placeholders below can be used in a note template and will be replaced with dynamically generated values when creating a note:

  • %%: a literal %.
  • %NOTE_TITLE%: item title, or filename (without ".html").
  • %SCRAPBOOK_DIR%: relative path to the scrapbook root directory. For example, use <link rel="stylesheet" href="%SCRAPBOOK_DIR%/common/common.css"> to refer to a shared common stylesheet of the scrapbook.
  • %TREE_DIR%: relative path to the tree directory.
  • %DATA_DIR%: relative path to the data directory.
  • %ITEM_DIR%: relative path to root directory of the item.
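
The substitution described above can be approximated with a single regular expression pass, as in this sketch. The function name is hypothetical, leaving unknown placeholder names untouched is an assumption, and values are inserted verbatim (whether the real implementation HTML-escapes them is not stated here).

```python
import re

# Matches %% (empty name) as well as %NAME% placeholders.
PLACEHOLDER = re.compile(r"%(\w*)%")


def render_note_template(template: str, values: dict) -> str:
    """Replace %NAME% placeholders using `values`; %% becomes a literal %."""

    def replace(match):
        key = match.group(1)
        if key == "":
            return "%"
        # Assumption: unknown placeholder names are left unchanged.
        return values.get(key, match.group(0))

    return PLACEHOLDER.sub(replace, template)
```

A single pass matters here: replacing %% first and the named placeholders afterwards in separate steps could wrongly expand text that only looks like a placeholder after unescaping.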