Tokenizer
The tokenizer may be implemented as a micro library wtf_tokenizer within the Wiki Transformation Framework (WTF). The tokenizer converts XML sections, such as content in REF tags and mathematical expressions wrapped in MATH tags, into attributes of the generated JSON. For example, the wiki source
```
text before math <MATH>
\sum_{i=1}^{\infty} [x_i]
: v_i
</MATH> text after math.
text before <ref>my reference ...</ref> and text after
cite an already defined reference with <ref name="MyLabel"/> text after citation.
```
is converted into
```
text before math ___MATH_INLINE_7238234792_ID_5___ text after math.
text before ___CITE_7238234792_ID_3___ and text after
cite an already defined reference with ___CITE_7238234792_MyLabel___ text after citation.
```
The main parsing challenge can be seen in the mathematical expression: a colon : in the first column of a line defines an indentation in wiki syntax, but within a mathematical expression it is just a division.
The number 7238234792 is a unique integer generated from the current date and time in milliseconds, to make the markers unique. Mathematical expressions, citations, and references are extracted in the preProcess() call. The tokenizer is encapsulated in /src/01-document/preProcess/tokenizer.js.
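
A minimal sketch of such an extraction step could look like the following. The function name, the refs4token store, and the record shape are illustrative assumptions, not the actual tokenizer.js API:

```js
// Sketch: replace <MATH> and <ref> tags with unique markers.
// refs4token keeps the extracted content so the output step can restore it.
const stamp = Date.now() // time-based stamp that makes the markers unique per run

function tokenize (wiki, refs4token) {
  let id = 0
  // extract math expressions: <MATH> ... </MATH>
  wiki = wiki.replace(/<MATH>([\s\S]*?)<\/MATH>/gi, (m, tex) => {
    const label = `___MATH_INLINE_${stamp}_ID_${++id}___`
    refs4token.push({ label, type: 'math', display: 'inline', tex: tex.trim() })
    return label
  })
  // extract full reference definitions: <ref> ... </ref>
  wiki = wiki.replace(/<ref(?![^>]*\/>)[^>]*>([\s\S]*?)<\/ref>/gi, (m, body) => {
    const label = `___CITE_${stamp}_ID_${++id}___`
    refs4token.push({ label, type: 'cite', body: body.trim() })
    return label
  })
  // extract citations of already defined references: <ref name="MyLabel"/>
  wiki = wiki.replace(/<ref\s+name="([^"]+)"\s*\/>/gi, (m, name) => {
    const label = `___CITE_${stamp}_${name}___`
    refs4token.push({ label, type: 'cite', name })
    return label
  })
  return wiki
}
```

Called from preProcess(), this would turn the wiki source above into the marker text shown, while refs4token collects the extracted math and references.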
The tokens/markers are treated as ordinary words in the text. The markers can be replaced in the postProcess step, or even later when the output is generated with toHTML() or toMarkDown(), because the final numbering of citations can only be generated during output if more than one article is downloaded and aggregated.
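
To illustrate how a marker survives inside the JSON, a sentence and its corresponding refs4token entry could look like this (a hypothetical shape for illustration, not the actual doc JSON schema):

```js
const doc = {
  sentences: [
    { text: 'text before math ___MATH_INLINE_7238234792_ID_5___ text after math.' }
  ],
  refs4token: [
    {
      label: '___MATH_INLINE_7238234792_ID_5___', // enables back-replacement at output time
      type: 'math',
      display: 'inline',
      tex: '\\sum_{i=1}^{\\infty} [x_i] : v_i'
    }
  ]
}
```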
So it makes sense that the markers/tokens remain even in the JSON sentences, sections, and paragraphs until the final output is generated. Currently, in my test repository, I do not populate doc.references; instead I populate data.refs4token in the same way as you populate doc.references, but with an additional label for the backwards replacement during output. So I've added the corresponding label (e.g. ___CITE_7238234792_ID_3___ or ___MATH_INLINE_7238234792_ID_5___) to the references in data.refs4token, so that later the markers for citations can be replaced by [6] in the IEEE citation style. A replacement of a citation in APA style would create e.g. (Kelly 2018) on a call of doc.text() or doc.html(). The same would be performed for mathematical inline and block expressions; they need the original location of the mathematical expression in the sentence (e.g. ___MATH_INLINE_7238234792_ID_5___).
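
A sketch of that back-replacement at output time, assuming the refs4token shape from above (the style formatting and field names like author/year are illustrative, not an existing wtf API):

```js
// Replace all markers in a rendered string according to a citation style.
function detokenize (text, refs4token, style = 'ieee') {
  let n = 0 // running citation number for IEEE style
  for (const ref of refs4token) {
    let out
    if (ref.type === 'cite') {
      n += 1
      out = style === 'apa'
        ? `(${ref.author} ${ref.year})` // e.g. "(Kelly 2018)"
        : `[${n}]`                      // e.g. "[6]" in the IEEE style
      // simplification: named re-citations like <ref name="MyLabel"/> would
      // need to reuse the number of their defining reference (omitted here)
    } else {
      // math: restore the TeX at its original position, inline or block
      out = ref.display === 'inline' ? `\\(${ref.tex}\\)` : `\\[${ref.tex}\\]`
    }
    text = text.split(ref.label).join(out)
  }
  return text
}
```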
You mentioned that you are affected by the parsing order all over the place. With this concept you can get rid of those parsing problems, because the XML in REF tags and the LaTeX in MATH tags is removed and stored for further use in the JSON. At the same time, the marker/tokenize concept preserves the position of the JSON content in the original wiki source.
This requires the introduction of a toJSON() method that replaces the content in the key-value pairs of the doc JSON file.
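
Such a toJSON() could, for example, walk the document recursively and apply the replacement to every string value. Again only a sketch, reusing the hypothetical detokenize() helper from above:

```js
// Recursively replace markers in every string value of the doc JSON.
function toJSON (node, refs4token, style) {
  if (typeof node === 'string') {
    return detokenize(node, refs4token, style)
  }
  if (Array.isArray(node)) {
    return node.map(item => toJSON(item, refs4token, style))
  }
  if (node && typeof node === 'object') {
    const out = {}
    for (const [key, value] of Object.entries(node)) {
      out[key] = toJSON(value, refs4token, style)
    }
    return out
  }
  return node // numbers, booleans, null stay as-is
}
```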
The robustness to parsing order seems to be very good and could save us some headaches, because the MATH tags and REF tags are extracted already in the preProcess step, in which they are currently removed anyway. Furthermore, this preserves the position in the text, and it preserves the mathematical expression itself with a block or inline type and a label attribute, without losing the position of the math expressions in the wiki source.
- Parsing concepts are based on Parsoid: https://www.mediawiki.org/wiki/Parsoid
- Output is based on concepts of Pandoc, the swiss-army knife of document conversion developed by John MacFarlane: https://www.pandoc.org