Abstract Syntax Tree as JSON
In `wtf_wikipedia`, an Abstract Syntax Tree (AST) is a tree representation of the MediaWiki syntax of the source text in wiki markup. The data of a wiki article is stored in a JSON structure, which is valuable for managing the extracted content elements. The AST can be represented as JSON as well: each node of the tree denotes a content element (e.g. paragraph, header/title, image, mathematical expression) occurring in the source text downloaded via the MediaWiki API with `wtf.fetch(...)`. The syntax tree is "abstract" in the sense that it does not represent a specific output format in detail (e.g. HTML, LaTeX, Markdown, ...). The AST nodes encode the wiki markup, and the tree structure can be used to generate the different output formats by applying the appropriate tree-node handlers for titles, images, sentences, etc. to the AST.
As in programming languages, an abstract syntax tree can be derived from a concrete syntax tree, traditionally generated by parsing a given string compliant with the (rather idiosyncratic) grammar of the wiki source text.
A Wikipedia or Wikiversity article is decomposed into an AST in the following order. The current documentation refers to `wtf_wikipedia` version 7.1.
A wiki source article is parsed and the root node of the AST is of type `Document`.
- The root node of the Abstract Syntax Tree (AST), `Document`, is stored in an object in `/src/01-document/index.js`:

```javascript
const main = function(wiki, options) {
  options = options || {};
  wiki = wiki || '';
  let r = {
    type: 'page',
    sections: [],
    interwiki: {},
    categories: [],
    coordinates: [],
    citations: []
  };
  ...
```
- The `Document` object is created in `src/01-document/document.js`.
- A `Document` object is decomposed into sections defined in `src/02-section` as object `Section`.
- A `Section` object is decomposed into paragraphs defined in `src/03-paragraph` as object `Paragraph`.
- A `Paragraph` object is decomposed into sentences defined in `src/04-sentence` as object `Sentence`.
- A `Sentence` object could contain other subobjects like `Table`s in `src/table`, or subobjects like `References`, `Citations` or the mathematical expression `MathExpr` (currently not defined).
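The decomposition above yields a nested containment hierarchy. A minimal sketch of how a parsed page could nest (plain objects standing in for the real `wtf_wikipedia` classes, which carry more attributes):

```javascript
// Hypothetical sketch of the containment hierarchy
// Document -> Section -> Paragraph -> Sentence.
// These are plain stand-in objects, not the actual wtf_wikipedia classes.
const doc = {
  type: 'page',
  sections: [
    {
      type: 'section',
      title: 'Introduction',
      paragraphs: [
        {
          type: 'paragraph',
          sentences: [
            { type: 'sentence', text: 'Water is a chemical substance.' },
            { type: 'sentence', text: 'Its formula is H2O.' }
          ]
        }
      ]
    }
  ]
};

// Walk the hierarchy level by level and count the leaf sentences.
const countSentences = (doc) =>
  doc.sections
    .flatMap((sec) => sec.paragraphs)
    .flatMap((par) => par.sentences)
    .length;

console.log(countSentences(doc)); // 2
```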
The root node of the AST is the `r` object. It is populated with the sections, citations, etc. during parsing of the wiki source document. See the method `createNode4AST()` for a generic approach.
The downloaded wiki source of an article is stored in `wtf.wiki`. The wiki source text is processed into an abstract syntax tree that could be stored in `wtf.tree`. If developers want to export the JSON AST into a well-formatted text file, they can use:

```javascript
var vTreeString = JSON.stringify(wtf.tree, null, 4);
```
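Since `wtf.tree` is a proposal rather than a released feature, the round trip can be sketched with a stand-in tree; it also shows that the AST is plain serializable data:

```javascript
// Stand-in for wtf.tree (storing the AST on the wtf object is a
// proposal in this document, not a released wtf_wikipedia feature).
const tree = {
  type: 'page',
  children: [
    { type: 'sentence', value: 'My first sentence.', children: [] }
  ]
};

// Pretty-print with 4-space indentation for test fixtures and debugging.
const vTreeString = JSON.stringify(tree, null, 4);

// Parsing the string back yields a structurally equal tree.
const roundTrip = JSON.parse(vTreeString);
console.log(roundTrip.children[0].value); // "My first sentence."
```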
- The string of the AST can be used in the test cases in `tests/basis_ast.test.js`.
- It is useful for debugging and for designing new export formats.
The following example is currently not an available feature in `wtf_wikipedia`. It could serve as a basis for generating other export formats that may never be implemented in `wtf_wikipedia` itself:

```
Wiki Page >-> wtf_wikipedia.js >-> AST >-> ast2odf.js >-> Open Document Format
```
```json
{
    "language": "en",
    "domain": "wikiversity",
    "article": "Water",
    "ast": [
        {
            "type": "page",
            "value": "",
            "children": [
                {
                    "type": "sentence",
                    "value": "My first sentence.",
                    "children": []
                },
                {
                    "type": "sentence",
                    "value": "My Second sentence.",
                    "children": []
                }
            ]
        },
        {
            "type": "title",
            "value": "My Title",
            "children": null
        },
        {
            "type": "math",
            "value": "\\sum_{k=1}^{n} k^2",
            "children": []
        }
    ]
}
```
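To illustrate how such an AST can drive different output formats, here is a hedged sketch of a tree-node handler table that renders the `ast` array above to Markdown-like text. The handler table and its node types are assumptions for illustration; no such renderer exists in `wtf_wikipedia`:

```javascript
// Hypothetical renderer: maps AST node types to output handlers.
// Swapping the handler table retargets the output to HTML, LaTeX, ODF, ...
const handlers = {
  page: (node, render) => node.children.map(render).join(' '),
  sentence: (node) => node.value,
  title: (node) => '## ' + node.value,
  math: (node) => '$' + node.value + '$'
};

// Dispatch on node.type; unknown node types render as empty strings.
const render = (node) => {
  const handler = handlers[node.type];
  return handler ? handler(node, render) : '';
};

// The "ast" array from the JSON example above.
const ast = [
  {
    type: 'page',
    value: '',
    children: [
      { type: 'sentence', value: 'My first sentence.', children: [] },
      { type: 'sentence', value: 'My Second sentence.', children: [] }
    ]
  },
  { type: 'title', value: 'My Title', children: null },
  { type: 'math', value: '\\sum_{k=1}^{n} k^2', children: [] }
];

const output = ast.map(render).join('\n');
console.log(output);
// My first sentence. My Second sentence.
// ## My Title
// $\sum_{k=1}^{n} k^2$
```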
The method `createNode4AST()` creates a very simple node for the Abstract Syntax Tree (AST) by returning a hash with just the `type` attribute. This AST node can be populated with additional attributes that may be relevant for the generation of output formats. E.g. a `Section` needs a `title` and a `depth` attribute, as implemented in `src/02-section/index.js` (around line 40) with the function `splitSections()`.
```javascript
const createNode4AST = function(nodeid) {
  return {
    "type": nodeid
  };
};
```
A tree node for `Section` will be created with `createNode4AST("Section")` and populated with more content parsed from the section body. (It is an open question whether the `contentlist` is equivalent to the `templates` attribute in the `section` hash, or whether those are really templates.)
A node-specific constructor could use a `switch` statement to add type-specific additional attributes:
```javascript
const createNode4AST = function(nodeid) {
  let ast_node = {
    "type": nodeid
  };
  switch (nodeid) {
    // Note: a comma-separated label like case "Paragraph","TextBlock":
    // would only match "TextBlock" in JavaScript, so fall-through
    // case labels are used instead.
    case "Paragraph":
    case "TextBlock":
    case "BulletList":
    case "Sentence":
      ast_node.contentlist = new ContentList();
      break;
    case "EnumList":
      ast_node.enumtype = "arabic"; // used for output: 1. 2. 3.
      // enumtype = "alph" > a) b) c) - not used in release 5.0
      ast_node.contentlist = new ContentList();
      break;
    case "Section":
      ast_node.title = "";
      ast_node.depth = -1;
      ast_node.templates = [];
      ast_node.contentlist = new ContentList();
      break;
    default:
      break;
  }
  return ast_node;
};
```
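A usage sketch of such a constructor, repeating a compact version of it so the snippet runs standalone. `ContentList` is stubbed here, since its real implementation is not shown in this document:

```javascript
// Stub for ContentList; the real class would hold the decomposed
// content elements of the wiki text (an assumption for this sketch).
class ContentList {
  constructor() { this.items = []; }
}

// Compact version of the node-specific constructor described above.
const createNode4AST = function(nodeid) {
  let ast_node = { type: nodeid };
  switch (nodeid) {
    case "Paragraph":
    case "Sentence":
      ast_node.contentlist = new ContentList();
      break;
    case "Section":
      ast_node.title = "";
      ast_node.depth = -1;
      ast_node.templates = [];
      ast_node.contentlist = new ContentList();
      break;
    default:
      break;
  }
  return ast_node;
};

// Create a Section node and populate it during parsing.
const section = createNode4AST("Section");
section.title = "Water";
section.depth = 2;
console.log(section.type);  // "Section"
console.log(section.title); // "Water"
```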
The main branching element for parsing the children is the `ContentList`. It decomposes the part of the wiki source into a sequence of content elements (branches of the tree node). The elements in the content list can be processed further and decomposed into subtrees by parsing the decomposed wiki text fragments. `ContentList` performs two abstract tasks:
- block level: decomposition into `Paragraph`, `Table`, `EnumList`, `BulletList`, `MathBlock`, ...
- sentence level: decomposition into sentence fragments `Bold`, `Italics`, `Underline`, `MathInline`, `ImageInline` (like icons), ...
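A hedged sketch of the block-level task: splitting a wiki text fragment into blocks on blank lines and classifying each block by its first character. The real `ContentList` logic in `wtf_wikipedia` is considerably more involved (tables, templates, math blocks, nested lists are all ignored here):

```javascript
// Naive block-level decomposition: a blank line separates blocks.
// Blocks starting with '*' are bullet lists, with '#' enumerated lists;
// everything else is treated as a paragraph. This is only a sketch of
// the decomposition idea, not the actual wtf_wikipedia parser.
const splitBlocks = (wiki) =>
  wiki
    .split(/\n\s*\n/)              // blank line = block boundary
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .map((text) => {
      if (/^\*/.test(text)) return { type: 'BulletList', text };
      if (/^#/.test(text)) return { type: 'EnumList', text };
      return { type: 'Paragraph', text };
    });

const wiki = 'First paragraph.\n\n* item one\n* item two\n\n# step one';
const blocks = splitBlocks(wiki);
console.log(blocks.map((b) => b.type)); // [ 'Paragraph', 'BulletList', 'EnumList' ]
```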
- The MediaWiki parser Parsoid (see https://www.mediawiki.org/wiki/Parsoid) parses wiki markup to HTML and HTML back to wiki markup. `parsoid` could be used to replace the parser inside `wtf_wikipedia` for the output format `HTML`: fetch just the wiki markup with `wtf_fetch` and apply `parsoid` to the downloaded/fetched wiki source text.
- Use a generic parser generator for specific expressions, e.g. PEG.js, which can be used to generate a JavaScript parser for a specific purpose in `wtf_wikipedia`.
- Parsing concepts are based on Parsoid - https://www.mediawiki.org/wiki/Parsoid
- Output: based on concepts of the swiss-army knife of document conversion, Pandoc, developed by John MacFarlane - https://www.pandoc.org