XML and Challenges
Wiki markdown is a line-oriented definition of content elements. The first character of a line determines, for example, whether the line belongs to an enumeration or to a bullet list (itemize in LaTeX).
this is a line 1 of a TextBlock
* this is a line 1 of a BulletList
* this is a line 2 of a BulletList
this is a line 1 of the next TextBlock
this is a line 2 of the next TextBlock with inline math <math>k^2</math>
# this is a line 1 of a EnumList
# this is a line 2 of a EnumList
the following lines define a MathBlock as separate rendered line
:<math>
\sum_{k=1}^{n} k^2
+ n^3
</math>
One problem of line-oriented parsing is that the regular syntax of wiki markdown does not apply inside XML elements. In particular, wiki markdown is not a pure syntax of its own: within the XML tag math, the expression has to follow LaTeX syntax with its own grammar. The following example shows how a simple arithmetic division within a math tag can break a simple line-oriented parser that uses only the first character of a line to decide which object of the Abstract Syntax Tree (here Indentation) the parser should generate.
the next to lines of the document are indented (text block shifted right).
: this is a line 1 of an Indentation
: this is a line 2 of an Indentation
Considering the next mathematical expression of an arithmetic task
to devide 12 by 4, which is equal to 3 as inline math
will be defined as <math>12 : 4 = 3</math>. Inside the MathInline we can
have new lines that corrupt the line oriented parsing. With <math>24
: 4 = 6 </math> and without tokenize the math expression this line is
interpreted as an indentation.
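The problem can be reproduced with a minimal sketch of a first-character classifier. The function name classifyLine and the returned type names are assumptions chosen for illustration; they are not taken from the wtf_wikipedia source.

```javascript
// Hypothetical helper: decide the element type from the first character only.
function classifyLine(line) {
  switch (line.charAt(0)) {
    case "*": return "BulletList";
    case "#": return "EnumList";
    case ":": return "Indentation";
    default: return "TextBlock";
  }
}

// The inline math expression spans two lines of the wiki source.
const sampleLines = [
  "have new lines that corrupt the line oriented parsing. With <math>24",
  ": 4 = 6 </math> and without tokenize the math expression this line is",
  "interpreted as an indentation."
];

const lineTypes = sampleLines.map(classifyLine);
console.log(lineTypes);
// → [ 'TextBlock', 'Indentation', 'TextBlock' ]
```

The second line is classified as an Indentation although it is the continuation of an inline math expression, because the classifier does not know that it is inside a math tag.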
A first solution, used in previous versions and in version 5.0 of wtf_wikipedia, is to remove XML tags with kill_xml.js. Mathematical expressions are removed to keep the parser robust.
A Tokenizer replaces problematic XML objects such as ref tags or math tags with tokens and stores the references or mathematical expressions in the JSON generated by wtf_wikipedia. To understand what a Tokenizer does in lexical analysis, see Lexical Analysis - Tokenizer.
The following workaround emulates a real tokenization in an Abstract Syntax Tree (AST).
doc.math_expr = [
  // the backslash of the LaTeX command must be escaped inside a string literal
  " \\sum_{k=1}^{n} k^2\n + n^3 ",
  "12 : 4 = 3",
  "24\n: 4 = 6"
]
A parser for mathematical expressions will push the LaTeX code into the JSON of the Document object.
The source text is then modified as follows:
...
the following lines define a MathBlock as separate rendered line
:___MATH_0___
the next to lines of the document are indented (text block shifted right).
: this is a line 1 of an Indentation
: this is a line 2 of an Indentation
Considering the next mathematical expression of an arithmetic task
to devide 12 by 4, which is equal to 3 as inline math
will be defined as ___MATH_1___. Inside the MathInline we can
have new lines that corrupt the line oriented parsing. With ___MATH_2___
and without tokenize the math expression this line is
interpreted as an indentation.
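The replacement shown above can be sketched as follows. This is a minimal illustration under assumptions: the function name tokenizeMath is hypothetical and not part of the wtf_wikipedia API, and only the array name doc.math_expr and the token format ___MATH_n___ are taken from the text above.

```javascript
// Hypothetical helper, not part of the wtf_wikipedia API.
// Replaces every <math>...</math> element by a token and stores its LaTeX code.
function tokenizeMath(wiki, doc) {
  doc.math_expr = doc.math_expr || [];
  // [\s\S] also matches newlines, so multi-line expressions are captured;
  // (?:\s[^>]*)? allows optional attributes on the opening tag.
  const mathTag = /<math(?:\s[^>]*)?>([\s\S]*?)<\/math>/gi;
  return wiki.replace(mathTag, function (match, latex) {
    const index = doc.math_expr.push(latex) - 1;
    return "___MATH_" + index + "___";
  });
}

const mathDoc = {};
const wikiSource =
  "will be defined as <math>12 : 4 = 3</math>. With <math>24\n: 4 = 6 </math> and";
const tokenized = tokenizeMath(wikiSource, mathDoc);

console.log(tokenized);
// → will be defined as ___MATH_0___. With ___MATH_1___ and
console.log(mathDoc.math_expr);
// → [ '12 : 4 = 3', '24\n: 4 = 6 ' ]
```

After this step no line of the source starts with a colon that belongs to a math expression, so a line-oriented parser can run on the tokenized text, and the stored LaTeX code keeps its original newlines.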
The tokenizer is a workaround that allows parsing of string elements within the structure currently defined by the maintainer of wtf_wikipedia. A bottom-up analysis (compiler theory) would create tree nodes for the parsed elements during parsing, but for now, at version 5.0, the tokenizer seems to be an option to move forward.
Parsing mathematical expressions and tokenizing them can be performed at the section level in release 5.0, after parsing the XML templates (see /src/section/index.js at l.18 ff., method doSection()):
const doSection = function(section, wiki, options) {
  wiki = parse.xmlTemplates(section, wiki, options);
  //parse the <ref></ref> tags
  wiki = parse.references(section, wiki, options);
  //parse-out all {{templates}}
  wiki = parse.templates(section, wiki, options);
  //parse-math inline "<math>...</math>" and block ":<math>...</math>"
  // and tokenize the mathematical LaTeX expression and store LaTeX Code.
  wiki = parse.math(section, wiki, options);
  ...
}
The tokens for mathematical expressions, such as ___MATH_1___, can be replaced later at the sentence level by a MathInline tree node. This node is an element of a ContentList, which splits a sentence into its content elements.
At the sentence level, the token ___MATH_1___ is replaced when a Sentence is converted into a ContentList, for example for the following sentence:
Considering the next mathematical expression of an arithmetic task
to devide 12 by 4, which is equal to 3 as inline math
will be defined as ___MATH_1___.
The following is the part of the Abstract Syntax Tree (AST) that represents the sentence above:
{
  "type":"sentence",
  "value":"",
  "children":[
    {
      "type":"text",
      "value":"Considering the next ... defined as ",
      "children":[]
    },
    {
      "type":"mathinline",
      "value":"12 : 4 = 3",
      "children":[]
    },
    {
      "type":"text",
      "value":".",
      "children":[]
    }
  ]
}
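A tree of this shape could be produced from a tokenized sentence roughly as in the following sketch. The function name sentenceToAst is an assumption for illustration, not an existing wtf_wikipedia function; the node layout (type, value, children) follows the JSON above.

```javascript
// Hypothetical helper: split a tokenized sentence into ContentList children.
function sentenceToAst(text, mathExpr) {
  const children = [];
  // Because the pattern has a capture group, split() keeps the token index
  // at the odd positions of the resulting array.
  const parts = text.split(/___MATH_(\d+)___/);
  parts.forEach(function (part, i) {
    if (i % 2 === 1) {
      children.push({ type: "mathinline", value: mathExpr[Number(part)], children: [] });
    } else if (part !== "") {
      children.push({ type: "text", value: part, children: [] });
    }
  });
  return { type: "sentence", value: "", children: children };
}

const storedMath = ["\\sum_{k=1}^{n} k^2", "12 : 4 = 3"];
const sentenceAst = sentenceToAst("will be defined as ___MATH_1___.", storedMath);
console.log(JSON.stringify(sentenceAst, null, 2));
```

For this input the sentence node has three children: a text node, a mathinline node with the value "12 : 4 = 3", and a text node for the closing period.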
The LaTeX code for the mathematical expression is stored in the JSON accessible with the array index and output generation will be
Bold, italics, and underline are processed in /src/sentence/Sentence.js. These format settings can wrap a paragraph, and therefore multiple sentences, or just parts of a sentence that are formatted, e.g. in bold. Are bold settings allowed if they wrap multiple sentences? In MediaWiki this is permitted, but it is not good practice for writing articles. If wtf_wikipedia wants to support multiple sentences in bold, a sentence has to inherit its format settings from Paragraph or TextBlocks (neither is implemented as a tree node in the AST in 5.0).
A first step to make parsing of mathematical expressions robust can also be the removal of newlines inside the mathematical expressions:
this.removeMathNewlines = function(wikicode) {
  if (!wikicode) {
    return wikicode;
  }
  // [\s\S] also matches newlines, which "." does not
  //var vSearch = /(<math[^>]*?>)([\s\S]*?)(<\/math>)/gi;
  var vSearch = /(<math>)([\s\S]*?)(<\/math>)/gi;
  // strings are immutable, so the result of replace() has to be returned
  return wikicode.replace(vSearch, function(vFound, vOpen, vLatex, vClose) {
    // replace newlines only inside the math expression and keep the tags
    return vOpen + vLatex.replace(/\n/g, " ") + vClose;
  });
}
Newlines might be valuable for the comprehension of a mathematical formula and should therefore be preserved, especially in the exported LaTeX code, even though removing them has no impact on how the mathematical expression is rendered.
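With tokens instead of newline removal, the original newlines survive until export. The following sketch illustrates this under assumptions: restoreMath is a hypothetical helper name, and the token format ___MATH_n___ is the one used earlier on this page.

```javascript
// Hypothetical helper: put the stored LaTeX code back in place of each token.
function restoreMath(text, mathExpr) {
  return text.replace(/___MATH_(\d+)___/g, function (match, index) {
    const latex = mathExpr[Number(index)];
    // Leave unknown tokens untouched instead of emitting "undefined".
    return latex === undefined ? match : "<math>" + latex + "</math>";
  });
}

const storedExpr = ["24\n: 4 = 6 "];
const restored = restoreMath("With ___MATH_0___ and", storedExpr);
console.log(JSON.stringify(restored));
// → "With <math>24\n: 4 = 6 </math> and"
```

The newline inside the expression is still present in the restored text, so an exporter can decide whether to keep it.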
- Parsing concepts are based on Parsoid - https://www.mediawiki.org/wiki/Parsoid
- Output: based on concepts of Pandoc, the swiss-army knife of document conversion, developed by John MacFarlane - https://www.pandoc.org