Danny Lin edited this page Apr 14, 2020 · 21 revisions

Data scheme

Below is the data scheme WebScrapBook follows. You may also want to check the data scheme of ScrapBook X, for comparison.

The scrapbook directory

WebScrapBook has three top-level subdirectories for a scrapbook (root):

  • .wsb: to store configs for WebScrapBook or the backend server (PyWebScrapBook).
  • tree: to store metadata and static site files.
  • data: to store captured data.

Both tree and data are configurable in the backend server:

  • data can be root or a subdirectory of any level under root, but cannot be put under .wsb.
  • tree can be a subdirectory of any level under root, and can be put under .wsb. Although this is not strictly enforced, it is recommended to use the reserved .wsb/tree when putting it under .wsb, to avoid unexpected interference from the backend server.

For simplicity, we will call them data and tree in this documentation.
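
The placement rules above can be sketched as a small check. This is only an illustration, not how the backend server validates its configuration: the function name `check_layout` and its messages are hypothetical, paths are taken relative to root, and "subdirectory" is read as excluding root itself for tree.

```python
from pathlib import PurePosixPath


def check_layout(data_dir: str, tree_dir: str) -> list[str]:
    """Return a list of problems for data/tree paths given relative to root.

    An empty string or "." stands for the scrapbook root itself.
    """
    problems = []
    data = PurePosixPath(data_dir or ".")
    tree = PurePosixPath(tree_dir or ".")

    for name, path in (("data", data), ("tree", tree)):
        if path.is_absolute() or ".." in path.parts:
            problems.append(f"{name} must stay inside root")

    # data may be root itself, but never inside .wsb
    if data.parts[:1] == (".wsb",):
        problems.append("data cannot be put under .wsb")

    # tree must be a subdirectory; inside .wsb, .wsb/tree is the recommended spot
    if tree == PurePosixPath("."):
        problems.append("tree must be a subdirectory of root")
    elif tree.parts[:1] == (".wsb",) and tree != PurePosixPath(".wsb/tree"):
        problems.append("tree under .wsb should be .wsb/tree (recommendation)")

    return problems
```

An empty list means the pair of paths is consistent with the rules as read here.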

The data directory

In WebScrapBook, the corresponding index file of an item can either be */index.html, *.htz, *.maff, *.html, or *.htm under data or its subdirectory of any level.
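
As a rough illustration, the five patterns can be told apart by file name alone. The function name `classify_index` and the returned labels are hypothetical and exist only for this sketch; matching is case-sensitive here, which is an assumption.

```python
from pathlib import PurePosixPath


def classify_index(index: str) -> str:
    """Classify an item's index path (relative to data) by its file pattern."""
    path = PurePosixPath(index)
    if path.name == "index.html" and len(path.parts) > 1:
        return "folder"       # */index.html
    if path.suffix == ".htz":
        return "htz"          # *.htz
    if path.suffix == ".maff":
        return "maff"         # *.maff
    if path.suffix == ".html":
        return "single-html"  # *.html
    if path.suffix == ".htm":
        return "bookmark"     # *.htm
    return "unknown"
```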

The corresponding index file of an item can be freely specified via the index property; though it's generally recommended to keep them matched. Imagine an item Wikipedia with an index file named whiskey-idiot/index.html... That's odd, isn't it?

There are some things to note, however:

  • Resource files are only allowed for an item with an index file of */index.html, *.htz, or *.maff.

    • For */index.html, all files and directories under */ are considered part of the item.
    • For *.htz, it should be a ZIP file containing index.html as index.
    • For *.maff, it should be a ZIP file containing */index.rdf or */index.* as index, following MAFF specification.
    • *.html is for an item of single HTML file (which has all resources embedded in). Any resource linked from it is considered external and not part of the item.
    • *.htm, when used as the index file, is reserved for a bookmark item. It should contain a meta refresh pointing to the source file, which corresponds to the source property of the item.
  • To prevent confusion, different items should not share the same index file.

  • To prevent confusion, the index file and resource files of an item should not reside under the directory of another item (of */index.html).

    • More strictly, any item-containing subdirectory should not contain index.html, even if it's not yet defined as an item, to prevent misinterpretation by the WebScrapBook indexer.
    • For the same reason, an item of single HTML file (*.html) should not be named index.html, or it may be misinterpreted as being */index.html.

    Take the following data structure for example:

    data
        subdir
            item1
                index.html
                ...
            item2.htz
            item3.maff
            item4.html
            item5.htm
            index.html
    
    • If subdir contains index.html, subdir will be considered an item, and all files and directories under it, i.e. item1...5, will be treated as parts of it. In such a case, item1...5 can only be resources of the subdir item, and cannot be other items.
    • On the other hand, if subdir has one or more of item1, item2, item3, item4, or item5 as item(s), it should never contain the index.html file.
  • The index file can also be a meta refresh targeting a resource file, which can be either a web page or a file.

    It's been a convention for several web page saving applications to save a page as <name>.htm(l) with a supporting folder named <name>.files or <name>_files. If such pages are put under a directory like this:

    dir
        page1.htm
        page1_files
        page2.htm
        page2_files
        ...
    

    and fed directly to the WebScrapBook indexer, they will be indexed as items of single HTML file, which can potentially cause unexpected behavior in the future.

    For WebScrapBook to handle them correctly, put each of them in an individual subdirectory with an arbitrary index.html which meta-refreshes to the main page, like:

    dir
        page1
            index.html  (meta-refreshes to page1.htm)
            page1.htm
            page1_files
        page2
            index.html  (meta-refreshes to page2.htm)
            page2.htm
            page2_files
        ...
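
The restructuring above can be scripted. The following is a minimal sketch rather than a WebScrapBook tool: `wrap_saved_page` is a hypothetical helper that moves one saved page and its supporting folder into a subdirectory of the same name and writes an index.html that meta-refreshes to it. It assumes the target subdirectory does not exist yet and raises an error otherwise.

```python
import html
import shutil
from pathlib import Path
from urllib.parse import quote


def wrap_saved_page(page: Path) -> Path:
    """Move <name>.htm(l) and its support folder into <name>/ and add index.html.

    Returns the path of the generated index.html.
    """
    name = page.stem
    item_dir = page.parent / name
    item_dir.mkdir()  # raises FileExistsError if <name>/ already exists

    # Move whichever supporting folder convention is present.
    for support in (f"{name}_files", f"{name}.files"):
        src = page.parent / support
        if src.is_dir():
            shutil.move(str(src), str(item_dir / support))
    shutil.move(str(page), str(item_dir / page.name))

    # URL-encode the file name, then escape it for use in an HTML attribute.
    target = html.escape(quote(page.name), quote=True)
    index = item_dir / "index.html"
    index.write_text(
        "<!DOCTYPE html>\n"
        '<meta charset="UTF-8">\n'
        f'<meta http-equiv="refresh" content="0; url={target}">\n',
        encoding="utf-8",
    )
    return index
```

Calling it once per page1.htm, page2.htm, and so on would produce the layout shown above.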
    

The tree directory

Index files

There are three types of index files, which hold the metadata, table of contents, and fulltext cache of the scrapbook items, and roughly correspond to scrapbook.rdf and cache.rdf in a scrapbook of legacy ScrapBook.

  • meta#.js: stores item metadata.
  • toc#.js: stores table of contents.
  • fulltext#.js: stores fulltext cache.

Each type of index file can be split into multiple files, with # being empty or a positive integer in sequence. The content of an index file should contain a single call of scrapbook.<fn>(<data>) with <data> being a JSON-compatible text. This is to ensure cross-browser compatibility for static site files.
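
Because <data> is JSON-compatible, an index file can be read outside the browser by stripping the call wrapper. The sketch below is an illustration under assumptions, not the parser used by WebScrapBook: the helper names are hypothetical, any text before the call (such as a comment) is skipped, and the file numbering is read as `<prefix>.js`, `<prefix>1.js`, `<prefix>2.js`, and so on.

```python
import json
import re

# Start of the single scrapbook.<fn>( call; <fn> is captured as group 1.
_CALL_START = re.compile(r"scrapbook\.(\w+)\(")


def parse_index_file(text: str):
    """Return (fn, data) from the text of an index file."""
    m = _CALL_START.search(text)
    end = text.rfind(")")  # closing parenthesis of the call
    if m is None or end < m.end():
        raise ValueError("no scrapbook.<fn>(<data>) call found")
    return m.group(1), json.loads(text[m.end():end])


def index_filenames(prefix: str, count: int) -> list[str]:
    """Names of `count` sequential files: <prefix>.js, <prefix>1.js, ..."""
    return [f"{prefix}{i or ''}.js" for i in range(count)]
```

Merging the parsed objects of all files of one type, in sequence, would then give the complete data for that type.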

Static site files

  • map.html: an index page that dynamically loads index files and lists all items in the scrapbook.
  • index.html: an index page with a static table of contents for SEO and for browsers without JavaScript support. It still dynamically loads index files when JavaScript is enabled, and runs slower than map.html.
  • frame.html: a framed page with map.html being the sidebar TOC and a main frame to show an item.
  • search.html: the page for searching.

The above pages are automatically generated by WebScrapBook and should not be edited. Their behavior can be customized using a corresponding <name>.css and <name>.js.

Additional entries for static site files:

  • feed.atom: Atom feed file for recently updated items.
  • icon: icons for static site pages.
  • favicon: cached favicons for items with embedded favicon.

Other entries

  • templates: template files of the scrapbook. Currently note_template.html for an HTML note and note_template.md for a markdown note are supported.

The .wsb directory

  • config.ini: config of the backend server.
  • themes: custom themes of the backend server.
  • server: runtime data and cache of the backend server.

Metadata for a scrapbook item

Legacy ScrapBook uses <ID>/index.dat for an item. In WebScrapBook, item metadata are recorded directly in the root element of its index file (*/index.html, or index.html inside *.htz or *.maff, or the single HTML file):

  • data-scrapbook-type: type of the item. This can be omitted for a regular page item, but required for other types such as "file" or "note".
  • data-scrapbook-source: source URL of the item.
  • data-scrapbook-create: create date of the item.
  • data-scrapbook-title: title of the item. This can be omitted and the first title element in the page will be used as fallback.
  • data-scrapbook-icon: favicon URL of the item. This can be omitted and the favicon defined in the page will be used as fallback.

WebScrapBook stores all item metadata in tree index files (see the tree section above), and updates them when a scrapbook item is modified. The above attributes are not mandatory and only represent the initial status when the item is created; they can be useful when importing an item to another scrapbook, as the WebScrapBook site indexer will recognize them. Note that this behavior differs from legacy ScrapBook, which updates index.dat as well as scrapbook.rdf whenever an item is modified.

For a resourced item, the index file can contain a meta refresh pointing to the real main item page. However, item metadata should be recorded in the index file rather than the meta refreshed page in such case.
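
Reading these attributes back from an index file can be sketched with the standard library. This is an illustration only, not the WebScrapBook indexer: `read_item_metadata` is a hypothetical helper, and it assumes the first start tag in the file is the root element (that is, the html tag is written out explicitly).

```python
from html.parser import HTMLParser


class _RootAttrs(HTMLParser):
    """Collect data-scrapbook-* attributes from the first start tag."""

    PREFIX = "data-scrapbook-"

    def __init__(self):
        super().__init__()
        self.meta = None

    def handle_starttag(self, tag, attrs):
        # Only the first start tag (assumed to be the root element) counts.
        if self.meta is None:
            self.meta = {
                name[len(self.PREFIX):]: value
                for name, value in attrs
                if name.startswith(self.PREFIX)
            }


def read_item_metadata(html_text: str) -> dict:
    """Return e.g. {"type": ..., "source": ...} from the root element."""
    parser = _RootAttrs()
    parser.feed(html_text)
    return parser.meta or {}
```

Attributes that are absent are simply missing from the result, which matches the note above that none of them is mandatory.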

Special attributes for a scrapbook page

The attributes below record the original status of a captured page before being rewritten by WebScrapBook, if related capture options are specified:

  • data-scrapbook-orig-attr-<attr>-<time>: original value of attribute <attr> of the element in the original page.
  • data-scrapbook-orig-null-attr-<attr>-<time>: the element does not have the attribute <attr> in the original page.
  • data-scrapbook-orig-textContent-<time>: text content of the element in the original page.

Following attributes record certain information not recorded by standard HTML in the original page:

  • data-scrapbook-css-disabled: whether the stylesheet element is disabled in the original page.
  • data-scrapbook-canvas: image data of the canvas element in the original page.
  • data-scrapbook-shadowroot: shadowroot content of the element in the original page.

Special attributes for page editor and saver

Below are special attributes WebScrapBook uses when editing and saving a page:

  • data-scrapbook-elem: define this element as a special element, added by WebScrapBook or the user, such as a loader, a highlighting, an annotation, etc. Some names are specifically for custom usage and can be arbitrarily edited:
    • data-scrapbook-elem="todo": mark this element as a TODO element. A form element (such as input[type="checkbox"]) with this mark has its current value saved when the page is saved.
    • data-scrapbook-elem="title": for the main item page (the index page or the page meta-refreshed from index page), text content of the element updates to match item title when saved.
    • data-scrapbook-elem="title-src": for the main item page, text content of the element overwrites item title and other data-scrapbook-elem="title" elements when saved.
    • data-scrapbook-elem="custom": mark this element as a custom annotation, which can be removed by the WebScrapBook page editor.
    • data-scrapbook-elem="custom-wrapper": mark this element as a custom highlighting, which can be unwrapped (removing the element but keeping its descendants) by the WebScrapBook page editor.
  • data-scrapbook-id: an ID to group related special elements together. Elements with the same ID are treated as a whole.

For backward compatibility, WebScrapBook can recognize legacy attributes used by ScrapBook X (e.g. data-sb-*) and legacy class names used by ScrapBook (like class="linemarker"), but won't create them when editing a page.

And, similarly, special comments:

  • <!--scrapbook-erased-<time>=<content>-->: the HTML content erased with WebScrapBook, and can be reverted by WebScrapBook page editor.

Note templates

The placeholders below can be used in a note template and will be replaced with dynamically generated values when creating a note:

  • %%: a literal %.
  • %NOTE_TITLE%: item title, or filename (without ".html").
  • %SCRAPBOOK_DIR%: relative path to the scrapbook root directory. For example, use <link rel="stylesheet" href="%SCRAPBOOK_DIR%/common/common.css"> to refer to a shared common stylesheet of the scrapbook.
  • %TREE_DIR%: relative path to the tree directory.
  • %DATA_DIR%: relative path to the data directory.
  • %ITEM_DIR%: relative path to root directory of the item.
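
The substitution can be sketched as a single left-to-right pass, so that %% is never confused with the start of a named placeholder. This is a minimal illustration, not the WebScrapBook implementation: `render_note_template` is a hypothetical name, unknown placeholders are left untouched, and values are inserted without any HTML escaping, both of which are assumptions.

```python
import re

# %% (empty name) or %NAME%
_PLACEHOLDER = re.compile(r"%(\w*)%")


def render_note_template(template: str, values: dict) -> str:
    """Replace %NAME% placeholders in one pass; %% becomes a literal %."""

    def repl(match):
        key = match.group(1)
        if key == "":
            return "%"
        # Unknown names are kept as-is (an assumption of this sketch).
        return values.get(key, match.group(0))

    return _PLACEHOLDER.sub(repl, template)


# Example values; real values would be computed for the note being created.
values = {"NOTE_TITLE": "My note", "SCRAPBOOK_DIR": "../.."}
page = render_note_template(
    '<link rel="stylesheet" href="%SCRAPBOOK_DIR%/common/common.css">'
    "<title>%NOTE_TITLE%</title>",
    values,
)
```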