Tokenizer
The tokenizer may be implemented as a micro library wtf_tokenizer used with the Wiki Transformation Framework (WTF).
The tokenizer converts XML sections such as REF-tags and mathematical expressions wrapped in MATH-tags into attributes of the generated JSON. For example, it turns
text before math <MATH>
\sum_{i=1}^{\infty} [x_i]
: v_i
</MATH> text after math.
text before <ref>my reference ...</ref> and text after
cite an already defined reference with <ref name="MyLabel"/> text after citation.
into
text before math ___MATH_INLINE_7238234792_ID_5___ text after math.
text before ___CITE_7238234792_ID_3___ and text after
cite an already defined reference with ___CITE_7238234792_MyLabel___ text after citation.
The parsing challenge can be seen in the mathematical expression: a colon : in the first column of a line defines an indentation in wiki markup, but within a mathematical expression it is just a division.
The number 7238234792 is a unique integer generated from the current date and time in milliseconds, which makes the markers unique. Mathematical expressions, citations and references are extracted in the preProcess() call. The tokenizer is encapsulated in /src/01-document/preProcess/tokenizer.js.
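A minimal sketch of how such a tokenizer could look, assuming the marker format shown above; the function name, the regular expressions and the shape of data.refs4token are assumptions for illustration, not the actual code in /src/01-document/preProcess/tokenizer.js:

```js
// Sketch (assumption): cut out MATH- and REF-tags, replace them by unique
// markers and store the extracted content in data.refs4token for later use.
const stamp = Date.now(); // unique per run, e.g. 7238234792

function tokenize (wiki, data) {
  data.refs4token = data.refs4token || [];
  let id = 0;

  // extract mathematical expressions (only the inline case is sketched here)
  wiki = wiki.replace(/<math>([\s\S]*?)<\/math>/gi, (_, tex) => {
    const label = `___MATH_INLINE_${stamp}_ID_${++id}___`;
    data.refs4token.push({ label, type: 'math', mode: 'inline', tex: tex.trim() });
    return label;
  });

  // extract references: <ref>...</ref>, <ref name="x">...</ref> and <ref name="x"/>
  wiki = wiki.replace(/<ref(?:\s+name="([^"]*)")?\s*(?:\/>|>([\s\S]*?)<\/ref>)/gi, (_, name, body) => {
    const label = name ? `___CITE_${stamp}_${name}___` : `___CITE_${stamp}_ID_${++id}___`;
    data.refs4token.push({ label, type: 'cite', name: name || null, wiki: body || null });
    return label;
  });

  return wiki;
}
```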
The tokens/markers are regarded as ordinary words in the text. The markers can be replaced in the postProcess step, or even when the output is generated with toHTML() or toMarkDown(), because at output time the final numbering of citations can be generated, e.g. if more than one article is downloaded and aggregated.
So it makes sense that the markers/tokens remain in the JSON sentences, sections and paragraphs until the final output is generated. Currently, in my test repository, I do not populate doc.references; instead I populate data.refs4token in the same way as you populate doc.references, but it additionally stores the label for the backwards replacement at output time. So I've added the corresponding label (e.g. ___CITE_7238234792_ID_3___ or ___MATH_INLINE_7238234792_ID_5___) to the references in data.refs4token, so that later the markers for citations can be replaced by [6] in the IEEE citation style. A replacement of a citation in APA style will create e.g. (Kelly 2018) on a call of doc.text() or doc.html(). The same would be performed for mathematical inline and block expressions; they need the original location of the mathematical expression in the sentence (e.g. ___MATH_INLINE_7238234792_ID_5___).
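To illustrate the idea, a hedged sketch of a single data.refs4token entry and a style-dependent replacement helper; the field names and the renderCitation helper are assumptions, not the actual implementation:

```js
// Assumed shape of one entry in data.refs4token (illustration only):
const entry = {
  label: '___CITE_7238234792_ID_3___', // marker that stays in the sentences
  type: 'cite',
  number: 6,        // final number, assigned when the output is generated
  author: 'Kelly',
  year: 2018
};

// Hypothetical helper: render one citation marker for a given citation style.
function renderCitation (entry, style) {
  if (style === 'ieee') return `[${entry.number}]`;              // -> [6]
  if (style === 'apa') return `(${entry.author} ${entry.year})`; // -> (Kelly 2018)
  return entry.label; // unknown style: keep the marker untouched
}
```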
You mentioned that you are affected by the parsing order all over the place. With this concept you can get rid of those parsing problems, because the XML in REF-tags and the LaTeX in MATH-tags is removed and stored for further use in the JSON. At the same time the marker/tokenize concept preserves the position of the JSON content in the original wiki source.
This requires the introduction of a toJSON() method that replaces the content in the key-value pairs of the doc JSON output.
The robustness with respect to parsing order seems to be very good and could save us some headaches, because the MATH-tags and REF-tags are already extracted in the preProcess step, in which they are currently removed as well. Furthermore, this preserves the position in the text and the mathematical expression itself with a block or inline type and a label attribute, without losing the position of the math expressions in the wiki source.
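A possible shape of such a toJSON() step, as a sketch: it walks the sections/paragraphs/sentences of the plain doc.json() output and substitutes the markers. The traversal keys follow the usual wtf_wikipedia JSON layout, but the function itself and the render callback are assumptions:

```js
// Sketch (assumption): replace the markers in the key-value pairs of doc.json().
function toJSON (doc, data, render) {
  const json = doc.json(); // plain wtf_wikipedia JSON, still containing the markers
  const swap = (text) =>
    (data.refs4token || []).reduce((t, e) => t.split(e.label).join(render(e)), text);

  (json.sections || []).forEach((section) => {
    (section.paragraphs || []).forEach((paragraph) => {
      (paragraph.sentences || []).forEach((sentence) => {
        sentence.text = swap(sentence.text || '');
      });
    });
  });
  return json;
}

// usage (hypothetical): toJSON(doc, data, (e) => e.type === 'cite' ? `[${e.number}]` : `$${e.tex}$`);
```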
- Step 1: wtf_fetch() based on cross-fetch fetches the wiki source
  - Input:
    - language="en" or language="de" to specify the language of the wiki source
    - domain="wikipedia" or domain="wikiversity" or domain="wikispecies" to select the wiki domain from which wtf_fetch() pulls the wiki source
  - Output:
    - wiki source text, e.g. from wikipedia or wikiversity
  - Remark: wtf_fetch extracts your wtf.fetch() into a separate module (see the usage sketch below).
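A hypothetical call for step 1; the exact wtf_fetch signature is an assumption, here assumed to accept the language and domain options listed above and to resolve with the raw wiki source text:

```js
const wtf_fetch = require('wtf_fetch');

// fetch the raw wiki markup for an article (signature is an assumption)
wtf_fetch.fetch('Exponential function', { language: 'en', domain: 'wikipedia' })
  .then((wikiSource) => {
    // wikiSource: raw wiki markup, the input for step 2 (wtf_tokenize)
    console.log(wikiSource.slice(0, 200));
  });
```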
- Step 2: wtf_tokenize()
  - Input:
    - wiki source text, e.g. from wikipedia or wikiversity, fetched by wtf_fetch
  - Output:
    - wiki source text in which e.g. mathematical expressions are replaced by tokens like MATH-INLINE-839832492834_N12. wtf_wikipedia treats those tokens just as words in a sentence (see the example below).
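For illustration, a hedged before/after example of step 2; the ___...___ marker spelling follows the earlier examples (the MATH-INLINE-... spelling in this list is equivalent), and the data shape is an assumption:

```js
// input: fetched wiki source
const wikiSource = 'The series <math>\\sum_{n=0}^{\\infty} \\frac{x^n}{n!}</math> converges.<ref name="Kelly2018">Kelly, 2018</ref>';

// output of wtf_tokenize(): the tags are gone, only plain-word markers remain
const tokenized = 'The series ___MATH_INLINE_7238234792_ID_1___ converges.___CITE_7238234792_Kelly2018___';

// the extracted content is kept in data.refs4token for the detokenizer in step 4
const data = {
  refs4token: [
    { label: '___MATH_INLINE_7238234792_ID_1___', type: 'math', mode: 'inline', tex: '\\sum_{n=0}^{\\infty} \\frac{x^n}{n!}' },
    { label: '___CITE_7238234792_Kelly2018___', type: 'cite', name: 'Kelly2018', wiki: 'Kelly, 2018' }
  ]
};
```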
- Step 3: wtf_wikipedia()
  - Input:
    - wiki source text with tokenized citations and mathematical expressions
  - Output:
    - object doc of type Document. The output methods for text, html, latex and json can be applied and contain the tokens as words in sentences. The tokens appear in the output of doc.html() or doc.latex() in wtf_wikipedia and in the JSON as well (see the sketch below).
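A small sketch for step 3 using the regular wtf_wikipedia parse call; the marker simply survives as an ordinary word in the sentence:

```js
const wtf = require('wtf_wikipedia');

// parse the tokenized wiki source; the marker is treated like any other word
const doc = wtf('text before math ___MATH_INLINE_7238234792_ID_5___ text after math.');
console.log(doc.text());
// -> 'text before math ___MATH_INLINE_7238234792_ID_5___ text after math.'
```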
- Step 4: wtf_tokenize (detokenizer)
  - Input:
    - string in the export format, i.e. text with tokenized citations and mathematical expressions
  - Output:
    - detokenized export format: the out string is injected into the detokenizer, e.g. detokenize.html(out, data, options). In this case the output string out is already in HTML format. In the output out, or in any other desired output format (e.g. markdown), the token replacement is performed: for HTML the mathematical expressions are exported for MathJax, and for LaTeX the detokenizer replaces the word/token MATH-INLINE-839832492834_N12 by $\sum_{n=0}^{\infty} \frac{x^n}{n!}$ (see the sketch below).
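A hedged sketch of the step 4 detokenizer for HTML output; detokenize.html is the name used above, but its body, the entry fields and the MathJax delimiters are assumptions:

```js
// Sketch (assumption): back-substitute the markers in the generated output string.
const detokenize = {
  html: function (out, data, options) {
    // options could carry e.g. the citation style; omitted here for brevity
    (data.refs4token || []).forEach((entry) => {
      let replacement = entry.label;
      if (entry.type === 'math') {
        // export the math for MathJax: inline vs. block delimiters
        replacement = entry.mode === 'inline' ? '\\(' + entry.tex + '\\)' : '\\[' + entry.tex + '\\]';
      } else if (entry.type === 'cite') {
        replacement = '[' + entry.number + ']'; // e.g. IEEE style -> [6]
      }
      out = out.split(entry.label).join(replacement);
    });
    return out;
  }
};

// usage: const html = detokenize.html(doc.html(), data, { style: 'ieee' });
```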
It seems that the only remaining step is that the constructors for the AST tree nodes, e.g. Reference, MathFormula, ..., should be extendable with additional export formats, e.g. a doc.reveal() that visits the AST tree nodes Section, Paragraph, Sentence, ... and calls the appropriate toReveal() function for each node.
It might be sufficient to add it in the following way:
Document.reveal = function () {
....
};
Document.Section.reveal = function () {
....
};
Document.Section.Paragraph.reveal = function () {
....
};
...
Section
might own different constructors for tree nodes of the AST (Abstract Syntax Tree), so
Document.Section.Table.reveal = function () {
....
};
or, respectively, assigned at the paragraph level:
Document.Section.Paragraph.Table.reveal = function () {
....
};
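One way to realize this, as a sketch: instead of (or in addition to) attaching reveal() to the individual constructors as outlined above, a visitor function can walk the parsed document and emit one reveal.js section per wiki section. The accessor names follow wtf_wikipedia's API style, but the reveal() function itself and the emitted markup are assumptions:

```js
// Sketch (assumption): a reveal.js exporter as a visitor over the document tree.
function reveal (doc, options) {
  return doc.sections().map((section) => {
    const title = section.title() ? '<h2>' + section.title() + '</h2>\n' : '';
    const body = section.paragraphs()
      .map((paragraph) => '<p>' + paragraph.text(options) + '</p>')
      .join('\n');
    return '<section>\n' + title + body + '\n</section>';
  }).join('\n\n');
}

// usage (hypothetical): const slides = reveal(wtf(wikiSource), {});
```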
- Parsing Concepts are based on Parsoid - https://www.mediawiki.org/wiki/Parsoid
- Output: based on the concepts of the swiss-army knife of document conversion, Pandoc, developed by John MacFarlane - https://www.pandoc.org