Tokenizer
The tokenizer may be implemented as a micro library wtf_tokenizer within the Wiki Transformation Framework (WTF). The tokenizer converts XML sections, such as content in REF tags and mathematical expressions wrapped in MATH tags, into attributes of the generated JSON. For example, the wiki source
```
text before math <MATH>
\sum_{i=1}^{\infty} [x_i]
: v_i
</MATH> text after math.
text before <ref>my reference ...</ref> and text after
cite an already defined reference with <ref name="MyLabel"/> text after citation.
```
is converted into
```
text before math ___MATH_INLINE_7238234792_ID_5___ text after math.
text before ___CITE_7238234792_ID_3___ and text after
cite an already defined reference with ___CITE_7238234792_MyLabel___ text after citation.
```
The main parsing challenge can be seen in the mathematical expression: a colon : in the first column of a line defines an indentation in wiki syntax, but within a mathematical expression it is just a division.
The number 7238234792 is a unique integer generated from the current date and time in milliseconds, to make the markers unique. Mathematical expressions, citations, and references are extracted in the preProcess() call. The tokenizer is encapsulated in /src/01-document/preProcess/tokenizer.js.
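
A minimal sketch of such an extraction step could look like the following. The function name, the refs4token store, and the record shape are illustrative assumptions, not the actual tokenizer.js API:

```js
// Sketch: replace <MATH> and <ref> tags with unique markers.
// refs4token keeps the extracted content so the output step can restore it.
const stamp = Date.now() // time-based stamp that makes the markers unique per run

function tokenize (wiki, refs4token) {
  let id = 0
  // extract math expressions: <MATH> ... </MATH>
  wiki = wiki.replace(/<MATH>([\s\S]*?)<\/MATH>/gi, (m, tex) => {
    const label = `___MATH_INLINE_${stamp}_ID_${++id}___`
    refs4token.push({ label, type: 'math', display: 'inline', tex: tex.trim() })
    return label
  })
  // extract full reference definitions: <ref> ... </ref>
  wiki = wiki.replace(/<ref(?![^>]*\/>)[^>]*>([\s\S]*?)<\/ref>/gi, (m, body) => {
    const label = `___CITE_${stamp}_ID_${++id}___`
    refs4token.push({ label, type: 'cite', body: body.trim() })
    return label
  })
  // extract citations of already defined references: <ref name="MyLabel"/>
  wiki = wiki.replace(/<ref\s+name="([^"]+)"\s*\/>/gi, (m, name) => {
    const label = `___CITE_${stamp}_${name}___`
    refs4token.push({ label, type: 'cite', name })
    return label
  })
  return wiki
}
```

Called from preProcess(), this would turn the wiki source above into the marker text shown, while refs4token collects the extracted math and references.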
The tokens/markers are treated as ordinary words in the text. The markers can be replaced in the postProcess step, or even later when the output is generated with toHTML() or toMarkDown(), because the final numbering of citations can only be generated during output if more than one article is downloaded and aggregated.
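
To illustrate how a marker survives inside the JSON, a sentence and its corresponding refs4token entry could look like this (a hypothetical shape for illustration, not the actual doc JSON schema):

```js
const doc = {
  sentences: [
    { text: 'text before math ___MATH_INLINE_7238234792_ID_5___ text after math.' }
  ],
  refs4token: [
    {
      label: '___MATH_INLINE_7238234792_ID_5___', // enables back-replacement at output time
      type: 'math',
      display: 'inline',
      tex: '\\sum_{i=1}^{\\infty} [x_i] : v_i'
    }
  ]
}
```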
So it makes sense that the markers/tokens remain even in the JSON sentences, sections, and paragraphs until the final output is generated. Currently, in my test repository, I do not populate doc.references; instead I populate data.refs4token in the same way as you populate doc.references, but with an additional label for the backwards replacement during output. So I've added the corresponding label (e.g. ___CITE_7238234792_ID_3___ or ___MATH_INLINE_7238234792_ID_5___) to the references in data.refs4token, so that later the markers for citations can be replaced by [6] in the IEEE citation style. A replacement of a citation in APA style would create e.g. (Kelly 2018) on a call of doc.text() or doc.html(). The same would be performed for mathematical inline and block expressions; they need the original location of the mathematical expression in the sentence (e.g. ___MATH_INLINE_7238234792_ID_5___).
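
A sketch of that back-replacement at output time, assuming the refs4token shape from above (the style formatting and field names like author/year are illustrative, not an existing wtf API):

```js
// Replace all markers in a rendered string according to a citation style.
function detokenize (text, refs4token, style = 'ieee') {
  let n = 0 // running citation number for IEEE style
  for (const ref of refs4token) {
    let out
    if (ref.type === 'cite') {
      n += 1
      out = style === 'apa'
        ? `(${ref.author} ${ref.year})` // e.g. "(Kelly 2018)"
        : `[${n}]`                      // e.g. "[6]" in the IEEE style
      // simplification: named re-citations like <ref name="MyLabel"/> would
      // need to reuse the number of their defining reference (omitted here)
    } else {
      // math: restore the TeX at its original position, inline or block
      out = ref.display === 'inline' ? `\\(${ref.tex}\\)` : `\\[${ref.tex}\\]`
    }
    text = text.split(ref.label).join(out)
  }
  return text
}
```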
You mentioned that you are affected by the parsing order all over the place. With this concept you can get rid of those parsing problems, because the XML in REF tags and the LaTeX in MATH tags is removed and stored for further use in the JSON. At the same time, the marker/tokenize concept preserves the position of the JSON content in the original wiki source.
This requires the introduction of a toJSON() method that replaces the content in the key-value pairs of the doc JSON file.
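
Such a toJSON() could, for example, walk the document recursively and apply the replacement to every string value. Again only a sketch, reusing the hypothetical detokenize() helper from above:

```js
// Recursively replace markers in every string value of the doc JSON.
function toJSON (node, refs4token, style) {
  if (typeof node === 'string') {
    return detokenize(node, refs4token, style)
  }
  if (Array.isArray(node)) {
    return node.map(item => toJSON(item, refs4token, style))
  }
  if (node && typeof node === 'object') {
    const out = {}
    for (const [key, value] of Object.entries(node)) {
      out[key] = toJSON(value, refs4token, style)
    }
    return out
  }
  return node // numbers, booleans, null stay as-is
}
```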
The robustness to parsing order seems to be very good and could save us some headaches, because the MATH tags and REF tags are extracted already in the preProcess step, in which they are currently removed anyway. Furthermore, this preserves the position in the text, and it preserves the mathematical expression itself with a block or inline type and a label attribute, without losing the position of the math expressions in the wiki source.
- Parsing concepts are based on Parsoid: https://www.mediawiki.org/wiki/Parsoid
- Output is based on concepts of Pandoc, the swiss-army knife of document conversion developed by John MacFarlane: https://www.pandoc.org