Scheme
Below is the data scheme WebScrapBook follows. You may also want to check the data scheme of ScrapBook X for comparison.
WebScrapBook has three top-level subdirectories for a scrapbook (root):
- .wsb: to store configs for WebScrapBook or the backend server (PyWebScrapBook).
- tree: to store metadata and static site files.
- data: to store captured data.
Both tree and data are configurable in the backend server:
- data can be root or a subdirectory of any level under root, but cannot be put under .wsb.
- tree can be a subdirectory of any level under root, and can be put under .wsb. Although not strictly enforced, it is recommended to use the reserved .wsb/tree when putting it under .wsb, to prevent unexpected interference with the backend server.
For simplicity, we will call them data and tree in this documentation.
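The placement rules above can be expressed as a small check. This is only a sketch, not the backend server's own validation: `check_layout` is a hypothetical helper, paths are assumed to be given relative to root with `/` separators, and it reads the tree rule as meaning that tree cannot be root itself.

```python
from pathlib import PurePosixPath

WSB_DIR = ".wsb"


def check_layout(data_dir: str, tree_dir: str) -> list:
    """Return rule violations for data/tree paths given relative to root.

    An empty string (or ".") stands for root itself.
    """
    problems = []
    data_parts = PurePosixPath(data_dir).parts
    tree_parts = PurePosixPath(tree_dir).parts

    # data may be root or any subdirectory, but never inside .wsb
    if data_parts[:1] == (WSB_DIR,):
        problems.append("data must not be placed under .wsb")

    # Assumption of this sketch: tree must be a real subdirectory of root
    if not tree_parts:
        problems.append("tree must be a subdirectory of root")
    # Under .wsb, the reserved .wsb/tree is the recommended location
    elif tree_parts[0] == WSB_DIR and tree_parts != (WSB_DIR, "tree"):
        problems.append("tree under .wsb should be the reserved .wsb/tree")
    return problems
```

For example, `check_layout("", ".wsb/tree")` reports no problems, while `check_layout(".wsb/data", "tree")` reports the data placement.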
In WebScrapBook, the index file of an item can be */index.html, *.htz, *.maff, *.html, or *.htm under data or any of its subdirectories.
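As a rough illustration, the five index file forms can be told apart from the path alone. The sketch below is not WebScrapBook's indexer: `index_form` is a hypothetical helper, paths are assumed to be relative to data with `/` separators, and matching is assumed to be case-sensitive.

```python
from pathlib import PurePosixPath
from typing import Optional

# File-extension forms; */index.html is handled separately below.
FORMS = {".htz": "*.htz", ".maff": "*.maff", ".html": "*.html", ".htm": "*.htm"}


def index_form(index: str) -> Optional[str]:
    """Map an index path (relative to data) to one of the five forms."""
    path = PurePosixPath(index)
    # An index.html inside a directory makes that directory the item.
    if path.name == "index.html" and len(path.parts) > 1:
        return "*/index.html"
    # Note: a bare index.html directly under data falls through to *.html
    # here, which is an ambiguous name this sketch does not try to resolve.
    return FORMS.get(path.suffix)
```

A path that matches none of the forms, such as `image.png`, yields `None`.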
The index file of an item can be freely specified via the index property, though it is generally recommended to keep the item and its index file matched. Imagine an item named Wikipedia with an index file named whiskey-idiot/index.html... That's odd, isn't it?
There are some things to note, however:
- Resource files are only allowed for an item with an index file of */index.html, *.htz, or *.maff.
  - For */index.html, all files and directories under */ are considered part of the item.
  - For *.htz, it should be a ZIP file containing index.html as index.
  - For *.maff, it should be a ZIP file containing */index.rdf or */index.* as index, following the MAFF specification.
  - *.html is for an item of a single HTML file (which has all resources embedded). Any resource linked from it is considered external and not part of the item.
  - *.htm, when used as the index file, is reserved for a bookmark item. It should contain a meta refresh pointing to the source URL, which corresponds to the source property of the item.
- To prevent confusion, different items should not share the same index file.
- To prevent confusion, the index file and resource files of an item should not reside under the directory of another item (of */index.html).
  - More strictly, any item-containing subdirectory should not contain index.html, even if it's not yet defined as an item, to prevent misinterpretation by the WebScrapBook indexer.
  - For the same reason, an item of a single HTML file (*.html) should not be named index.html, or it may be misinterpreted as being */index.html.

  Take the following data structure for example:

    data
      subdir
        item1
          index.html
          ...
        item2.htz
        item3.maff
        item4.html
        item5.htm
        index.html

  - If subdir contains index.html, subdir will be considered an item, and all files and directories under it, i.e. item1...5, will be treated as parts of it. In that case, item1...5 can only be resources of the subdir item, and cannot be other items.
  - On the other hand, if subdir has one or more of item1, item2, item3, item4, or item5 as item(s), it should never contain the index.html file.
- The index file can also be a meta refresh targeting a resource file, which can be either a web page or a file.

  It has been a convention for several web page saving applications to save a page as <name>.htm(l) with a supporting folder named <name>.files or <name>_files. If such pages are put under a directory like this:

    dir
      page1.htm
      page1_files
      page2.htm
      page2_files
      ...

  and fed directly to the WebScrapBook indexer, they will be indexed as items of a single HTML file, which can cause unexpected behavior in the future.

  For WebScrapBook to handle them correctly, put each of them in an individual subdirectory with an arbitrary index.html which meta-refreshes to the main page, like:

    dir
      page1
        index.html (meta-refreshes to page1.htm)
        page1.htm
        page1_files
      page2
        index.html (meta-refreshes to page2.htm)
        page2.htm
        page2_files
      ...
- To prevent cross-platform compatibility issues:
  - Characters not supported on some platforms should not be included in a filename, such as :, ", ?, *, \, |, <, >, and control characters (0x00–0x1F, 0x7F).
  - Files with names that differ only in case should not exist in the same directory, such as myimage.jpg and MyImage.JPG.
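The cross-platform filename rules in the last point lend themselves to an automated check. The following is a minimal sketch, assuming only the characters listed above are of interest; `unsafe_chars` and `case_collisions` are hypothetical helper names, not part of WebScrapBook.

```python
import re

# Characters listed above as unsupported on some platforms, plus the
# control characters 0x00-0x1F and 0x7F.
UNSAFE_CHARS = re.compile(r'[:"?*\\|<>\x00-\x1f\x7f]')


def unsafe_chars(filename: str) -> list:
    """Return the distinct unsafe characters found in a filename, sorted."""
    return sorted(set(UNSAFE_CHARS.findall(filename)))


def case_collisions(names) -> list:
    """Group names from one directory that differ only in case."""
    groups = {}
    for name in names:
        groups.setdefault(name.casefold(), []).append(name)
    return [group for group in groups.values() if len(group) > 1]
```

Calling `case_collisions(["myimage.jpg", "MyImage.JPG", "other.png"])` groups the first two names together, since they would clash on a case-insensitive file system.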
There are three types of index files that hold the metadata, table of contents, and fulltext cache of the scrapbook items, which roughly correspond to scrapbook.rdf and cache.rdf in a scrapbook of legacy ScrapBook.
- meta#.js: stores item metadata.
- toc#.js: stores table of contents.
- fulltext#.js: stores fulltext cache.
Each type of index file can be split into multiple files, with # being empty or a positive integer in sequence. The content of an index file should contain a single call of scrapbook.<fn>(<data>) with <data> being JSON-compatible text. This is to ensure cross-browser compatibility for static site files.
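Because an index file is a single call wrapping JSON-compatible text, it can be read and written with an ordinary JSON library. The sketch below assumes that `#` runs as empty, 1, 2, and so on, and that anything outside the call (such as a leading comment) can be ignored. The helper names and the sample data are made up for illustration.

```python
import json
import re

# A single call of scrapbook.<fn>(<data>); <fn> is captured, not assumed.
CALL = re.compile(r"scrapbook\.(\w+)\((.*)\)", re.DOTALL)


def dump_index(fn: str, data) -> str:
    """Serialize data as the text of one index file."""
    return f"scrapbook.{fn}({json.dumps(data, ensure_ascii=False, indent=1)})"


def load_index(text: str):
    """Return (fn, data) parsed from the text of one index file."""
    match = CALL.search(text)
    if match is None:
        raise ValueError("no scrapbook.<fn>(<data>) call found")
    return match.group(1), json.loads(match.group(2))


def index_filenames(kind: str, count: int) -> list:
    """File names for `count` parts: '#' is empty first, then 1, 2, ..."""
    return [f"{kind}{i or ''}.js" for i in range(count)]
```

With these helpers, `load_index(dump_index("meta", data))` round-trips the data, and `index_filenames("meta", 3)` gives `meta.js`, `meta1.js`, and `meta2.js`.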
- map.html: an index page that dynamically loads index files and lists all items in the scrapbook.
- index.html: an index page with a static table of contents for SEO and for browsers without JavaScript. It dynamically loads index files when JavaScript is enabled anyway, and runs slower than map.html.
- frame.html: a framed page with map.html being the sidebar TOC and a main frame to show an item.
- search.html: the page for searching.
The above pages are automatically generated by WebScrapBook and should not be edited. Their behavior can be customized using a corresponding <name>.css and <name>.js.
Additional entries for static site files:
- feed.atom: Atom feed file for recently updated items.
- icon: icons for static site pages.
- favicon: cached favicons for items with an embedded favicon.

- templates: template files of the scrapbook. Currently note_template.html for an HTML note and note_template.md for a Markdown note are supported.

- config.ini: config of the backend server.
- locks: write protection lock files.
- themes: custom themes of the backend server.
- server: runtime data and cache of the backend server.
- backup: backup files.
Legacy ScrapBook uses <ID>/index.dat for an item. In WebScrapBook, item metadata are recorded directly in the root element of its index file (*/index.html, or index.html inside *.htz or *.maff, or the single HTML file):
- data-scrapbook-type: type of the item. This can be omitted for a regular page item, but is required for other types such as "file" or "note".
- data-scrapbook-source: source URL of the item.
- data-scrapbook-create: create date of the item.
- data-scrapbook-title: title of the item. This can be omitted, in which case the first title element in the page is used as fallback.
- data-scrapbook-icon: favicon URL of the item. This can be omitted, in which case the favicon defined in the page is used as fallback.
WebScrapBook stores all item metadata in tree index files (see the tree section above), and updates them when a scrapbook item is modified. The above attributes are not mandatory and only represent the initial state when the item was created; they can be useful when importing an item into another scrapbook, as the WebScrapBook site indexer recognizes them. Note that this behavior differs from legacy ScrapBook, which updates index.dat in addition to scrapbook.rdf whenever an item is modified.
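To show how these root-element attributes might be picked up when importing an item, here is a minimal sketch using Python's standard HTML parser. It is not the WebScrapBook site indexer: `read_item_meta` is a hypothetical helper, and it does not implement the title or favicon fallbacks described above.

```python
from html.parser import HTMLParser

PREFIX = "data-scrapbook-"
FIELDS = ("type", "source", "create", "title", "icon")


class RootMetaParser(HTMLParser):
    """Collect data-scrapbook-* metadata from the root <html> element."""

    def __init__(self):
        super().__init__()
        self.meta = None

    def handle_starttag(self, tag, attrs):
        # Only the first <html> start tag (the root element) is inspected.
        if tag == "html" and self.meta is None:
            found = dict(attrs)
            self.meta = {
                field: found[PREFIX + field]
                for field in FIELDS
                if PREFIX + field in found
            }


def read_item_meta(html_text: str) -> dict:
    """Return the metadata attributes present on the root element."""
    parser = RootMetaParser()
    parser.feed(html_text)
    return parser.meta or {}
```

Attributes that are absent are simply left out of the result, which matches their optional nature.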
For an item with resource files, the index file can contain a meta refresh pointing to the real main item page. In that case, however, item metadata should be recorded in the index file rather than in the meta-refreshed page.
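A meta-refreshing index file that also carries the item metadata could be generated along these lines. This is a sketch under assumptions: `meta_refresh_index` is a hypothetical helper, the exact markup WebScrapBook itself writes may differ, and the target is assumed to be a path relative to the index file.

```python
from html import escape
from urllib.parse import quote


def meta_refresh_index(target: str, meta=None) -> str:
    """Build an index.html that meta-refreshes to `target`.

    `meta` maps names such as "title" or "source" to values, which are
    written as data-scrapbook-* attributes on the root element.
    """
    attrs = "".join(
        f' data-scrapbook-{name}="{escape(value)}"'
        for name, value in (meta or {}).items()
    )
    url = escape(quote(target))  # percent-encode, then HTML-escape
    return (
        "<!DOCTYPE html>\n"
        f"<html{attrs}>\n"
        "<head>\n"
        '<meta charset="UTF-8">\n'
        f'<meta http-equiv="refresh" content="0; url={url}">\n'
        "</head>\n"
        "</html>\n"
    )
```

For instance, `meta_refresh_index("page1.htm", {"title": "Page 1"})` produces a page whose root element holds the title while the browser is redirected to page1.htm.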
The attributes below record the original state of a captured page before it was rewritten by WebScrapBook, if the related capture options are specified:
- data-scrapbook-orig-null-node-<time>: the element does not exist in the original page.
- data-scrapbook-orig-null-attr-<attr>-<time>: the element does not have the attribute <attr> in the original page.
- data-scrapbook-orig-attr-<attr>-<time>: value of attribute <attr> of the element in the original page.
- data-scrapbook-orig-textContent-<time>: text content of the element in the original page.
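The four name patterns above can be taken apart with a regular expression. The sketch below assumes that the `<time>` token contains no hyphen, so that it can be separated from an `<attr>` that does (such as `data-src`), and it matches case-insensitively because HTML parsers lowercase attribute names. `parse_orig` is a hypothetical helper name.

```python
import re

ORIG = re.compile(
    r"^data-scrapbook-orig-(?:"
    r"(?P<null_node>null-node)"
    r"|null-attr-(?P<null_attr>.+)"
    r"|attr-(?P<attr>.+)"
    r"|(?P<text>textContent)"
    r")-(?P<time>[^-]+)$",
    re.IGNORECASE,
)


def parse_orig(name: str):
    """Return (kind, attr, time) for an original-state attribute name,
    or None if the name does not follow one of the four patterns."""
    m = ORIG.match(name)
    if m is None:
        return None
    if m["null_node"]:
        return ("null-node", None, m["time"])
    if m["null_attr"]:
        return ("null-attr", m["null_attr"], m["time"])
    if m["attr"]:
        return ("attr", m["attr"], m["time"])
    return ("textContent", None, m["time"])
```

The time values used when calling it are whatever appears in the attribute name; the sketch does not interpret them.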
The following attributes record certain information in the original page that is not representable by standard HTML:
- data-scrapbook-css-disabled: the stylesheet element is disabled in the original page.
- data-scrapbook-input-checked: checked status of the input element in the original page.
- data-scrapbook-input-indeterminate: the input element is indeterminate in the original page.
- data-scrapbook-input-value: value of the input element in the original page.
- data-scrapbook-option-selected: selected status of the option element in the original page.
- data-scrapbook-textarea-value: value of the textarea element in the original page.
- data-scrapbook-canvas: image data of the canvas element in the original page.
- data-scrapbook-shadowdom: shadow DOM content of the element in the original page.
Below are special attributes used by WebScrapBook for editing and saving a page:
- data-scrapbook-elem: defines this element as a special element, added by WebScrapBook or the user, such as a loader, a highlight, an annotation, etc. Some names are specifically for custom usage and can be arbitrarily edited:
  - data-scrapbook-elem="todo": marks this element (usually a checkbox) as a TODO element.
  - data-scrapbook-elem="title": for the main item page (the index page or the page meta-refreshed from the index page), text content of the element is updated to match the item title when saved.
  - data-scrapbook-elem="title-src": for the main item page, text content of the element overwrites the item title and other data-scrapbook-elem="title" elements when saved.
  - data-scrapbook-elem="custom": marks this element as a custom annotation, which can be removed by the WebScrapBook page editor.
  - data-scrapbook-elem="custom-wrapper": marks this element as a custom highlight, which can be unwrapped (removing the element but keeping its descendants) by the WebScrapBook page editor.
  - data-scrapbook-elem="custom-css": marks this element as a custom stylesheet which should not be altered by the capturer or the editor (e.g. not rewritten or removed when captured, and not erased via "erase selection").
  - data-scrapbook-elem="custom-script": marks this element as a custom script which should not be altered by the capturer or the editor (e.g. not rewritten or removed when captured, and not erased via "erase selection").
  - data-scrapbook-elem="custom-script-safe": in addition to the behavior of custom-script, marks the element as a safe custom script that won't trigger a warning when the page is resaved.
- data-scrapbook-id: an ID to group related special elements together. Elements with the same ID are treated as a whole.
For backward compatibility, WebScrapBook can recognize legacy attributes used by ScrapBook X (e.g. data-sb-*) and legacy class names used by ScrapBook (like class~="linemarker-marked-line"), but won't create them when editing a page.
Similarly, there are special comments:
- <!--scrapbook-erased-<time>=<content>-->: the HTML content erased with WebScrapBook, which can be restored by the WebScrapBook page editor.
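For illustration, erased-content comments of this shape can be located with a regular expression. The sketch assumes the stored content is whatever sits between `=` and the closing `-->`; how WebScrapBook encodes content that itself contains comment delimiters is not covered here. `find_erased` is a hypothetical helper name.

```python
import re

# <!--scrapbook-erased-<time>=<content>-->
ERASED = re.compile(r"<!--scrapbook-erased-([^=]*)=(.*?)-->", re.DOTALL)


def find_erased(html_text: str) -> list:
    """Return (time, content) pairs for each erased-content comment."""
    return [(m.group(1), m.group(2)) for m in ERASED.finditer(html_text)]
```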
The placeholders below can be used in a note template and will be replaced with a dynamically generated value when creating a note:
- %%: a literal %.
- %NOTE_TITLE%: item title, or filename (without ".html").
- %SCRAPBOOK_DIR%: relative path to the scrapbook root directory. For example, use <link rel="stylesheet" href="%SCRAPBOOK_DIR%/common/common.css"> to refer to a shared common stylesheet of the scrapbook.
- %TREE_DIR%: relative path to the tree directory.
- %DATA_DIR%: relative path to the data directory.
- %ITEM_DIR%: relative path to the root directory of the item.
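The substitution described above can be sketched as a single regular-expression pass, so that `%%` is treated as an escape rather than expanded twice. This is an assumption-laden illustration, not WebScrapBook's implementation: `render_template` is a hypothetical helper, unknown placeholders are left untouched by choice, and no escaping is applied to the inserted values.

```python
import re


def render_template(template: str, values: dict) -> str:
    """Replace %NAME% placeholders with entries from `values`."""

    def replace(match):
        key = match.group(1)
        if key == "":
            return "%"  # %% is a literal percent sign
        # Unknown placeholders are kept as-is (a choice of this sketch).
        return values.get(key, match.group(0))

    return re.sub(r"%(\w*)%", replace, template)
```

For example, with `NOTE_TITLE` set to "My note" and `SCRAPBOOK_DIR` set to "../..", the stylesheet link from the list above resolves to `../../common/common.css`.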