XML and Challenges

Wiki markdown is line oriented: the first character of a line determines which content element the line belongs to, e.g. an enumeration or a bullet list (itemize in LaTeX).

this is a line 1 of a TextBlock 
* this is a line 1 of a BulletList
* this is a line 2 of a BulletList
this is a line 1 of the next TextBlock
this is a line 2 of the next TextBlock with inline math <math>k^2</math>
# this is a line 1 of an EnumList
# this is a line 2 of an EnumList
the following lines define a MathBlock as separate rendered line
:<math>
  \sum_{k=1}^{n} k^2
  + n^3  
</math>

One problem of line-oriented parsing is that the regular wiki markdown syntax does not apply inside XML elements, so wiki markdown cannot be parsed as one uniform syntax. Within the XML tag math, the expression has to follow LaTeX syntax with its own grammar. The following example shows how a simple arithmetic division inside a math tag can break a line-oriented parser that uses only the first character of a line to decide which node of the Abstract Syntax Tree (here Indentation) the parser should generate.

the next two lines of the document are indented (text block shifted right). 
: this is a line 1 of an Indentation
: this is a line 2 of an Indentation
Considering the next mathematical expression of an arithmetic task
to divide 12 by 4, which is equal to 3 as inline math 
will be defined as <math>12 : 4 = 3</math>. Inside the MathInline we can 
have new lines that corrupt the line oriented parsing. With <math>24
: 4 = 6 </math> and without tokenize the math expression this line is 
interpreted as an indentation.
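
The failure can be reproduced with a minimal sketch of a first-character classifier. The function classifyLine is hypothetical and not part of wtf_wikipedia; the node names are taken from the examples above.

```javascript
// Hypothetical classifier: decides the node type from the first character only.
function classifyLine(line) {
  switch (line.charAt(0)) {
    case "*": return "BulletList";
    case "#": return "EnumList";
    case ":": return "Indentation";
    default: return "TextBlock";
  }
}

// Three lines from the example above; the last one starts inside a math tag.
const source = [
  ": this is a line 1 of an Indentation",
  "have new lines that corrupt the line oriented parsing. With <math>24",
  ": 4 = 6 </math> and without tokenize the math expression this line is"
];

const kinds = source.map(classifyLine);
// kinds is ["Indentation", "TextBlock", "Indentation"]:
// the third line is wrongly classified, because its leading ":" is the
// division sign of the LaTeX expression and not wiki markup.
console.log(kinds);
```

The classifier has no notion of an open math tag, so it cannot tell a division sign from an indentation marker.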

A first solution, used in previous versions and in 5.0 of wtf_wikipedia, is to remove XML tags with kill_xml.js. Mathematical expressions are removed to obtain a robust parser.

Tokenizer

A Tokenizer replaces problematic XML objects like ref-tags or math-tags by tokens and stores the references or mathematical expressions in the generated JSON of wtf_wikipedia. To understand what a tokenizer does in lexical analysis, see Lexical Analysis - Tokenizer. The following workaround emulates a real tokenization in an Abstract Syntax Tree (AST).

doc.math_expr = [
   // the backslash is doubled so that the JavaScript string keeps "\sum"
   "  \\sum_{k=1}^{n} k^2\n  + n^3  ",
   "12 : 4 = 3",
   "24\n: 4 = 6"
];

A parser for mathematical expressions pushes the LaTeX code into the JSON of the Document object. The source text is then modified to:

...
the following lines define a MathBlock as separate rendered line
:___MATH_0___ 
the next two lines of the document are indented (text block shifted right). 
: this is a line 1 of an Indentation
: this is a line 2 of an Indentation
Considering the next mathematical expression of an arithmetic task
to divide 12 by 4, which is equal to 3 as inline math 
will be defined as ___MATH_1___. Inside the MathInline we can 
have new lines that corrupt the line oriented parsing. With ___MATH_2___ 
and without tokenize the math expression this line is 
interpreted as an indentation.
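
The substitution shown above could be sketched as follows. The function tokenizeMath is a hypothetical helper and not an existing function of wtf_wikipedia; only the property name math_expr and the token format ___MATH_n___ come from the examples above.

```javascript
// Sketch only: tokenizeMath is an assumed helper name.
// It replaces every math element by a token and stores the LaTeX code.
function tokenizeMath(wiki, doc) {
  doc.math_expr = doc.math_expr || [];
  // [\s\S] also matches newlines, so an expression may span several lines
  return wiki.replace(/<math[^>]*>([\s\S]*?)<\/math>/gi, function (match, latex) {
    // push() returns the new length, so the index of the new entry is length - 1
    const index = doc.math_expr.push(latex) - 1;
    return "___MATH_" + index + "___";
  });
}

const doc = {};
const wiki = "defined as <math>12 : 4 = 3</math>. With <math>24\n: 4 = 6 </math> and";
const tokenized = tokenizeMath(wiki, doc);

// tokenized no longer contains a line that starts with ":"
console.log(tokenized);
console.log(doc.math_expr);
```

After this step no line of the tokenized text starts with the division sign, so the line-oriented parsing of the remaining wiki markdown is not affected by the content of the math tags.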

The tokenizer is a workaround that allows parsing of string elements within the current structure defined by the maintainer of wtf_wikipedia. A bottom-up analysis (Compiler Theory) would create tree nodes for parsed elements during parsing, but for now, at version 5.0, the tokenizer seems to be an option to move forward.

Parsing and tokenizing mathematical expressions can be performed at the section level in release 5.0 after parsing the XML templates (see /src/section/index.js at l.18 ff., method doSection()):

const doSection = function(section, wiki, options) {
  wiki = parse.xmlTemplates(section, wiki, options);
  //parse the <ref></ref> tags
  wiki = parse.references(section, wiki, options);
  //parse-out all {{templates}}
  wiki = parse.templates(section, wiki, options);
  //parse-math inline "<math>...</math>" and block ":<math>...</math>"
  // and tokenize the mathematical LaTeX expression and store LaTeX Code.
  wiki = parse.math(section, wiki, options);
  ...
}

Token to AST Tree Node

The tokens for mathematical expressions, such as ___MATH_1___, can be replaced later at the sentence level by MathInline tree nodes. Each node becomes an element of a ContentList that splits a sentence into its content elements.

At the sentence level, the token ___MATH_1___ is replaced when the Sentence is converted into a ContentList, for example for:

Considering the next mathematical expression of an arithmetic task
to divide 12 by 4, which is equal to 3 as inline math 
will be defined as ___MATH_1___.

The following is the part of the Abstract Syntax Tree (AST) that represents this sentence:

{
  "type": "sentence",
  "value": "",
  "children": [
    {
      "type": "text",
      "value": "Considering the next ... defined as ",
      "children": []
    },
    {
      "type": "mathinline",
      "value": "12 : 4 = 3",
      "children": []
    },
    {
      "type": "text",
      "value": ".",
      "children": []
    }
  ]
}

The LaTeX code for the mathematical expression is stored in the JSON, where it is accessible by its array index during output generation.
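
The conversion of a tokenized sentence into such a tree node could look like the following sketch. The function sentenceToNode and the variable mathExpr are assumptions for illustration; the node layout (type, value, children) follows the JSON above.

```javascript
// Hypothetical helper: turns a tokenized sentence into a sentence node
// whose children are text and mathinline nodes.
function sentenceToNode(text, mathExpr) {
  const children = [];
  // split() with a capture group keeps the token indices in the result:
  // even positions are plain text, odd positions are indices into mathExpr
  const parts = text.split(/___MATH_(\d+)___/);
  parts.forEach(function (part, i) {
    if (i % 2 === 1) {
      children.push({ type: "mathinline", value: mathExpr[Number(part)], children: [] });
    } else if (part !== "") {
      children.push({ type: "text", value: part, children: [] });
    }
  });
  return { type: "sentence", value: "", children: children };
}

const mathExpr = ["\\sum_{k=1}^{n} k^2", "12 : 4 = 3"];
const node = sentenceToNode("will be defined as ___MATH_1___.", mathExpr);
console.log(JSON.stringify(node, null, 2));
```

The array index inside the token is the only link between the sentence text and the stored LaTeX code, so the order of the stored expressions must not change between tokenization and this step.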

Bold, Italics, Underline

Bold, Italics, Underline are processed in /src/sentence/Sentence.js. These format settings can wrap a paragraph, and therefore multiple sentences, or just the parts of a sentence that are formatted, e.g. in bold. Are bold settings allowed if they wrap multiple sentences? In MediaWiki this would be OK, but it is not good practice for writing articles. If wtf_wikipedia wants to support multiple sentences in bold, Sentence has to inherit format settings from Paragraph or TextBlock (both not implemented in 5.0 as tree nodes in the AST).

Remove Newlines in Mathematical Expressions

A first step to make parsing of mathematical expressions robust can also be the removal of newlines inside the mathematical expressions:

this.removeMathNewlines = function(wikicode) {
	if (!wikicode) {
		return wikicode;
	}
	// [\s\S] also matches newlines, which "." does not; the optional
	// attribute part [^>]* covers tags like <math display="block">
	var vSearch = /(<math[^>]*>)([\s\S]*?)(<\/math>)/gi;
	// strings are immutable, so the result of replace() must be returned
	return wikicode.replace(vSearch, function(vMatch, vOpen, vExpr, vClose) {
		// replace each newline inside the expression by a single space
		return vOpen + vExpr.replace(/\r?\n/g, " ") + vClose;
	});
};

Newlines might be valuable for comprehension of the mathematical formula and should therefore be preserved, especially in the exported LaTeX code, even though removing them has no impact on how the mathematical expression is rendered.
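
If the expressions are tokenized instead of being modified in place, the stored LaTeX code keeps its newlines and can be restored for export. The following is a minimal sketch under that assumption; restoreMath is a hypothetical helper, and the $...$ delimiters are only one possible choice for inline LaTeX output.

```javascript
// Hypothetical export helper: puts the stored LaTeX code back into the text.
function restoreMath(text, mathExpr) {
  return text.replace(/___MATH_(\d+)___/g, function (token, index) {
    const latex = mathExpr[Number(index)];
    // leave unknown tokens untouched instead of emitting "undefined"
    return latex === undefined ? token : "$" + latex + "$";
  });
}

// The stored expression still contains the newline of the source text.
const stored = ["24\n: 4 = 6"];
const latexOut = restoreMath("With ___MATH_0___ and more text", stored);
console.log(latexOut);
```

The newline never passes through the line-oriented parser, so it survives in the exported LaTeX code without affecting the parsing of the surrounding text.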

Parsing concepts are based on Parsoid - https://www.mediawiki.org/wiki/Parsoid
Output: based on concepts of Pandoc, the swiss-army knife of document conversion developed by John MacFarlane - https://www.pandoc.org
