Scheme
Below is the data scheme WebScrapBook follows. You may also want to check the data scheme of ScrapBook X for comparison.
WebScrapBook has three top-level subdirectories for a scrapbook (root):
- .wsb: to store configs for WebScrapBook or the backend server (PyWebScrapBook).
- tree: to store metadata and static site files.
- data: to store captured data.
Both tree and data are configurable in the backend server:
- data can be root or a subdirectory of any level under root, but cannot be put under .wsb.
- tree can be a subdirectory of any level under root, and can be put under .wsb. Although not strictly restricted, it is recommended to use the reserved .wsb/tree when put under .wsb, to prevent unexpected interference with the backend server.

For simplicity, we will call them data and tree in this documentation.
In WebScrapBook, the corresponding index file of an item can be either */index.html, *.htz, *.maff, *.html, or *.htm under data or its subdirectory of any level.
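As a rough illustration of the patterns above, the following sketch classifies an index path by its form. The function name `index_kind` and the returned labels are hypothetical, and matching the extension case-insensitively is an assumption, not something this document specifies.

```python
def index_kind(index):
    """Classify an item's index path (relative to data) by its pattern.

    The labels are illustrative only; they are not WebScrapBook identifiers.
    """
    lower = index.lower()  # assumption: extensions are matched case-insensitively
    if lower.endswith("/index.html"):
        return "folder"       # */index.html
    if lower.endswith(".htz"):
        return "htz"          # *.htz
    if lower.endswith(".maff"):
        return "maff"         # *.maff
    if lower.endswith(".html"):
        return "single-html"  # *.html
    if lower.endswith(".htm"):
        return "htm"          # *.htm
    return None               # not one of the listed index file patterns
```

For example, `index_kind("20200101/index.html")` yields `"folder"`, while `index_kind("readme.txt")` yields `None`.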
The corresponding index file of an item can be freely specified via the index property, though it's generally recommended to keep them matched. Imagine an item Wikipedia with an index file named whiskey-idiot/index.html ... That's odd, isn't it?

There are some things to note, however:
- Resource files are only allowed for an item with an index file of */index.html, *.htz, or *.maff.
  - For */index.html, all files and directories under */ are considered part of the item.
  - For *.htz, it should be a ZIP file containing index.html as index.
  - For *.maff, it should be a ZIP file containing */index.rdf or */index.* as index, following the MAFF specification.
  - *.html is for an item of single HTML file (which has all resources embedded in). Any resource linked from it is considered external and not part of the item.
  - *.htm, when used as the index file, is reserved for a bookmark item. It should contain a meta refresh pointing to the source URL, which corresponds to the source property of the item.
- To prevent confusion, different items should not share the same index file.
- To prevent confusion, the index file and resource files of an item should not reside under the directory of another item (of */index.html).
  - More strictly, any item-containing subdirectory should not contain index.html, even if it's not yet defined as an item, to prevent misinterpretation by the WebScrapBook indexer.
  - For the same reason, an item of single HTML file (*.html) should not be named index.html, or it may be misinterpreted as being */index.html.

  Take the following data structure for example:

      data
        subdir
          item1
            index.html
            ...
          item2.htz
          item3.maff
          item4.html
          item5.htm
          index.html

  - If subdir contains index.html, subdir will be considered an item, and all files and directories under it, i.e. item1...5, will be treated as parts of it. In such a case, item1...5 can only be resources of the subdir item, and cannot be other items.
  - On the other hand, if subdir has one or more of item1, item2, item3, item4, or item5 as item(s), it should never contain the index.html file.
- The index file can also be a meta refresh targeting a resource file, which can be either a web page or a file.

  It's been a convention for several web page saving applications to save a page as <name>.htm(l) with a supporting folder named <name>.files or <name>_files. If such pages are put under a directory like this:

      dir
        page1.htm
        page1_files
        page2.htm
        page2_files
        ...

  and fed directly to the WebScrapBook indexer, they will be indexed as items of single HTML file, which can potentially cause unexpected behavior in the future.

  For WebScrapBook to handle them correctly, put each of them in an individual subdirectory with an arbitrary index.html which meta-refreshes to the main page, like:

      dir
        page1
          index.html (meta-refreshes to page1.htm)
          page1.htm
          page1_files
        page2
          index.html (meta-refreshes to page2.htm)
          page2.htm
          page2_files
        ...
- To prevent cross-platform compatibility issues:
  - Chars not supported on some platforms should not be included in a filename, such as :, ", ?, *, \, |, <, >, and control chars (0x00–0x1F, 0x7F, 0x80–0x9F).
  - Files with names that differ only in case should not exist in the same directory, such as myimage.jpg and MyImage.JPG.
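The cross-platform filename rules above could be checked with something like the following minimal sketch. The function names are hypothetical, and the character set only covers the examples listed above (which are introduced with "such as", so the real set of problematic characters may be larger).

```python
import re

# Characters named in the list above: : " ? * \ | < > and the
# control ranges 0x00-0x1F, 0x7F, 0x80-0x9F.
UNSAFE_CHARS = re.compile(r'[:"?*\\|<>\x00-\x1f\x7f\x80-\x9f]')


def has_unsafe_chars(filename):
    """Return True if the filename contains any of the listed characters."""
    return UNSAFE_CHARS.search(filename) is not None


def case_collisions(filenames):
    """Group names in one directory that differ only in letter case."""
    groups = {}
    for name in filenames:
        groups.setdefault(name.casefold(), []).append(name)
    return [group for group in groups.values() if len(group) > 1]
```

Running `case_collisions` over the entries of a single directory reports groups such as `myimage.jpg` and `MyImage.JPG` that would clash on a case-insensitive file system.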
There are three types of index files that hold the metadata, table of contents, and fulltext cache of the scrapbook items, which roughly correspond to scrapbook.rdf and cache.rdf in a scrapbook of legacy ScrapBook.
- meta#.js: stores item metadata.
- toc#.js: stores table of contents.
- fulltext#.js: stores fulltext cache.
Each type of index file can be split into multiple files, with # being empty or a sequential positive integer. The content of an index file should contain a single call of scrapbook.<fn>(<data>) with <data> being JSON-compatible text. This is to ensure cross-browser compatibility for static site files.
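Because the payload is JSON-compatible, an index file can in principle be read without a JavaScript engine. Below is a minimal sketch, assuming the file contains exactly one `scrapbook.<fn>(...)` call and no other text that resembles one; `parse_index_file` and the sample item ID and fields are hypothetical.

```python
import json
import re

# Matches scrapbook.<fn>(<data>); the greedy (.*) extends to the last ")".
INDEX_CALL = re.compile(r"scrapbook\.(\w+)\((.*)\)", re.DOTALL)


def parse_index_file(text):
    """Return (fn, data) from the text of a tree index file."""
    match = INDEX_CALL.search(text)
    if match is None:
        raise ValueError("no scrapbook.<fn>(<data>) call found")
    return match.group(1), json.loads(match.group(2))


# Hypothetical content for illustration; real files may differ.
sample = 'scrapbook.meta({"20200101000000000": {"title": "Example"}})\n'
fn, data = parse_index_file(sample)
```

Here `fn` is the function name from the call and `data` is the decoded JSON object.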
- map.html: an index page that dynamically loads index files and lists all items in the scrapbook.
- index.html: an index page with a static table of contents for SEO and for browsers without JavaScript. It still dynamically loads index files when JavaScript is enabled, and runs slower than map.html.
- frame.html: a framed page with map.html as the sidebar TOC and a main frame to show an item.
- search.html: the page for searching.

The above pages are automatically generated by WebScrapBook and should not be edited. Their behavior can be customized using a corresponding <name>.css and <name>.js.
Additional entries for static site files:
- feed.atom: RSS feed file for recently updated items.
- icon: icons for static site pages.
- favicon: cached favicons for items with embedded favicon.
- templates: template files of the scrapbook. Currently note_template.html for an HTML note and note_template.md for a markdown note are supported.
- config.ini: config of the backend server.
- locks: writing protection lock files.
- themes: custom themes of the backend server.
- server: runtime data and cache of the backend server.
- backup: backup files.
Legacy ScrapBook uses <ID>/index.dat for an item. In WebScrapBook, item metadata are recorded directly in the root element of its index file (*/index.html, or index.html inside *.htz or *.maff, or the single HTML file):
- data-scrapbook-id: ID of the item. This can be omitted and be automatically managed by the extension.
- data-scrapbook-type: type of the item. This can be omitted for a regular page item, but is required for other types such as "file" or "note".
- data-scrapbook-source: source URL of the item.
- data-scrapbook-create: create datetime of the item.
- data-scrapbook-modify: modify datetime of the item. This can be omitted and the modified time of the file will be used as fallback.
- data-scrapbook-title: title of the item. This can be omitted and the first title element in the page will be used as fallback.
- data-scrapbook-icon: favicon URL of the item. This can be omitted and the favicon defined in the page will be used as fallback.
- data-scrapbook-charset: charset of the item. This is usually used by text files.
- data-scrapbook-comment: comment of the item.
WebScrapBook stores all item metadata in tree index files (see the tree section above), and updates them when a scrapbook item is modified. The above attributes are not mandatory and only represent the initial status when the item is created. They can be useful when importing an item into another scrapbook, as the WebScrapBook site indexer will recognize them. Note that this behavior differs from legacy ScrapBook, which updates index.dat in addition to scrapbook.rdf whenever an item is modified.
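To illustrate how the metadata attributes on the root element could be read, here is a minimal sketch using Python's standard `html.parser`. It treats the first start tag in the document as the root element; `RootMetadataParser`, `read_item_metadata`, and the sample values are hypothetical, and this is not the WebScrapBook indexer's actual logic.

```python
from html.parser import HTMLParser


class RootMetadataParser(HTMLParser):
    """Collect data-scrapbook-* attributes from the first start tag."""

    PREFIX = "data-scrapbook-"

    def __init__(self):
        super().__init__()
        self.metadata = None

    def handle_starttag(self, tag, attrs):
        if self.metadata is not None:
            return  # only the first (root) element is inspected
        self.metadata = {
            name[len(self.PREFIX):]: value
            for name, value in attrs
            if name.startswith(self.PREFIX)
        }


def read_item_metadata(html_text):
    """Return a dict such as {"type": ..., "title": ...} for an index file."""
    parser = RootMetadataParser()
    parser.feed(html_text)
    parser.close()
    return parser.metadata or {}


# Hypothetical index file content for illustration.
sample = (
    "<!DOCTYPE html>\n"
    '<html data-scrapbook-type="note" data-scrapbook-title="My note" lang="en">'
    "<head><title>Fallback title</title></head></html>"
)
metadata = read_item_metadata(sample)
```

Attributes that are absent simply do not appear in the result, which matches the point above that none of them is mandatory.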
For a resourced item, the index file can contain a meta refresh pointing to the real main item page. However, in such a case item metadata should be recorded in the index file rather than in the meta refreshed page.
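A meta-refreshing index file that also carries the item metadata might be generated as in the sketch below. The function name `make_redirect_index` and the exact markup are assumptions for illustration; WebScrapBook's own generated files may look different.

```python
from html import escape
from urllib.parse import quote


def make_redirect_index(target, metadata=None):
    """Build an index.html that meta-refreshes to `target`.

    `metadata` maps names such as "type" or "source" to values, which are
    written as data-scrapbook-* attributes on the root element.
    """
    attrs = "".join(
        f' data-scrapbook-{key}="{escape(str(value), quote=True)}"'
        for key, value in (metadata or {}).items()
    )
    # Percent-encode the relative path, then HTML-escape it for the attribute.
    url = escape(quote(target), quote=True)
    return (
        "<!DOCTYPE html>\n"
        f"<html{attrs}>\n"
        "<head>\n"
        '<meta charset="UTF-8">\n'
        f'<meta http-equiv="refresh" content="0; url={url}">\n'
        "</head>\n"
        "</html>\n"
    )


page = make_redirect_index(
    "page1.htm", {"source": "https://example.com/?a=1&b=2"}
)
```

Keeping the metadata on the index file rather than on `page1.htm` follows the recommendation above.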
The attributes below record the original status of a web page captured at <time>, before it was rewritten by WebScrapBook, if the related capture options are specified:
- data-scrapbook-orig-null-node-<time>: the element does not exist in the original page.
- data-scrapbook-orig-null-attr-<attr>-<time>: the element does not have the attribute <attr> in the original page.
- data-scrapbook-orig-attr-<attr>-<time>: value of attribute <attr> of the element in the original page.
- data-scrapbook-orig-textContent-<time>: text content of the element in the original page.
The following attributes record certain information not representable by standard HTML in the original page:
- data-scrapbook-css-disabled: the stylesheet element is disabled in the original page.
- data-scrapbook-input-checked: checked status of the input element in the original page.
- data-scrapbook-input-indeterminate: the input element is indeterminate in the original page.
- data-scrapbook-input-value: value of the input element in the original page.
- data-scrapbook-option-selected: selected status of the option element in the original page.
- data-scrapbook-textarea-value: value of the textarea element in the original page.
- data-scrapbook-canvas: image data of the canvas element in the original page.
- data-scrapbook-shadowdom: shadow DOM content of the element in the original page.
- data-scrapbook-shadowdom-mode: mode property of the shadow DOM of the element in the original page.
- data-scrapbook-shadowdom-clonable: clonable property of the shadow DOM of the element in the original page.
- data-scrapbook-shadowdom-delegates-focus: delegatesFocus property of the shadow DOM of the element in the original page.
- data-scrapbook-shadowdom-serializable: serializable property of the shadow DOM of the element in the original page.
- data-scrapbook-shadowdom-slot-assignment: slotAssignment property of the shadow DOM of the element in the original page.
- data-scrapbook-adoptedstylesheet-<index>: the <index>-th constructed stylesheet in the original page.
- data-scrapbook-adoptedstylesheets: constructed stylesheets used by the element related DOM in the original page.
- data-scrapbook-slot-assigned: indexes of nodes manually assigned to the SLOT element in the original page.
- data-scrapbook-slot-index: index of this element, which is manually assigned to a SLOT element, in the original page.
And, similarly, special comments:
- <!--scrapbook-capture-selected--> ... <!--/scrapbook-capture-selected-->: the selected range(s) when "Capture selection" is performed by WebScrapBook.
- <!--scrapbook-capture-selected-splitter--> ... <!--/scrapbook-capture-selected-splitter-->: the splitter between different selected ranges in a same text-like node when "Capture selection" is performed by WebScrapBook.
- <!--scrapbook-slot-index=<index>--> ... <!--scrapbook-slot-index-->: index of this text node, which is manually assigned to a SLOT element, in the original page.
Below are special attributes used by WebScrapBook for editing and saving a page:
- data-scrapbook-elem: defines this element as a special element, added by WebScrapBook or the user, such as a loader, a highlighting, an annotation, etc. Some names are specifically for custom usage and can be arbitrarily edited:
  - data-scrapbook-elem="todo": marks this element (usually a checkbox) as a TODO element.
  - data-scrapbook-elem="title": for the main item page (the index page or the page meta-refreshed from the index page), text content of the element updates to match the item title when saved.
  - data-scrapbook-elem="title-src": for the main item page, text content of the element overwrites the item title and other data-scrapbook-elem="title" elements when saved.
  - data-scrapbook-elem="custom": marks this element as a custom annotation, which can be removed by the WebScrapBook page editor.
  - data-scrapbook-elem="custom-wrapper": marks this element as a custom highlighting, which can be unwrapped (removing the element but keeping its descendants) by the WebScrapBook page editor.
  - data-scrapbook-elem="custom-css": marks this element as a custom stylesheet which should not be altered by the editor (e.g. not erased via "erase selection").
  - data-scrapbook-elem="custom-script": marks this element as a custom script which should not be altered by the editor (e.g. not erased via "erase selection").
  - data-scrapbook-elem="custom-script-safe": marks this element as a safe custom script for page resaving, which won't trigger a warning, in addition to what custom-script does.
- data-scrapbook-id: an ID to group related special elements together. Elements with the same ID are treated as a whole.
And, similarly, special comments:
- <!--scrapbook-erased-<time>=<content>-->: the HTML content erased with WebScrapBook, which can be reverted by the WebScrapBook page editor.
The placeholders below can be used in a note template and will be replaced with a dynamically generated value when creating a note:
- %%: a literal %.
- %NOTE_TITLE%: item title, or filename (without ".html").
- %SCRAPBOOK_DIR%: relative path to the scrapbook root directory. For example, use <link rel="stylesheet" href="%SCRAPBOOK_DIR%/common/common.css"> to refer to a shared common stylesheet of the scrapbook.
- %TREE_DIR%: relative path to the tree directory.
- %DATA_DIR%: relative path to the data directory.
- %ITEM_DIR%: relative path to the root directory of the item.
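The substitution described above could be sketched as a single-pass replacement, so that a literal %% is never confused with the start of another placeholder. The function name `render_note_template` is hypothetical, and leaving unknown placeholders untouched is an assumption; the actual behavior of WebScrapBook may differ.

```python
import re

PLACEHOLDER = re.compile(r"%(\w*)%")


def render_note_template(template, values):
    """Replace %NAME% placeholders using `values`; %% becomes a literal %."""
    def replace(match):
        name = match.group(1)
        if name == "":
            return "%"  # %% is a literal percent sign
        # Assumption: unknown placeholders are left as-is.
        return values.get(name, match.group(0))

    return PLACEHOLDER.sub(replace, template)


rendered = render_note_template(
    '<link rel="stylesheet" href="%SCRAPBOOK_DIR%/common/common.css">'
    "<title>%NOTE_TITLE%</title> 100%% %UNKNOWN%",
    {"SCRAPBOOK_DIR": "../..", "NOTE_TITLE": "My note"},
)
```

In this sketch the caller supplies the relative paths; computing them from the item's location is outside its scope.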