Consistent error handling #32

hiroshinoji · 2016-03-14T08:48:05Z

Here is a proposal for how to keep track of errors in the output XML when an annotator fails.

Example:

<chunks annotators="cabocha" errors="cabocha">
<error by="cabocha">error message</error>
</chunks>

That is, an error message is wrapped in <error>, which records the annotator that caused the error.

This design can handle the situation where multiple annotators annotate the same XML element and only one of them fails:

<tokens annotators="ssplit tokenize pos" errors="pos">
<token id="0" offsetBegin="0" offsetEnd="1">I</token>
...
<error by="pos">error message</error>
</tokens>

The errors attribute on each element may be redundant, but it seems useful for checking whether errors occurred. I'm not sure.
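
To illustrate how the errors attribute could be used for such a check, here is a minimal Python sketch using only the standard library. The helper name `failed_annotators` is hypothetical (it is not part of jigg), and the input is the <tokens> example above with the elided "..." line dropped.

```python
import xml.etree.ElementTree as ET

# The <tokens> example from above, without the elided "..." line.
XML = """
<tokens annotators="ssplit tokenize pos" errors="pos">
<token id="0" offsetBegin="0" offsetEnd="1">I</token>
<error by="pos">error message</error>
</tokens>
"""

def failed_annotators(elem):
    """Return {annotator: message} for the <error> children of elem.

    Hypothetical helper: it assumes the proposed layout, where `errors`
    lists the failed annotators and each <error by="..."> holds a message.
    """
    declared = elem.get("errors", "").split()
    found = {e.get("by"): (e.text or "") for e in elem.findall("error")}
    # Keep only annotators that are both declared and have a message.
    return {name: found[name] for name in declared if name in found}

tokens = ET.fromstring(XML.strip())
result = failed_annotators(tokens)
print(result)
```

With the attribute in place, a consumer can decide whether to look for <error> children at all by checking one attribute, which is the convenience the redundancy would buy.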

hiroshinoji · 2016-03-14T09:03:52Z

When an error is detected at a higher level in the pipeline (e.g., tokenize), it seems natural that the lower-level annotators (e.g., pos) annotate nothing and just ignore that sentence (or the document, if it contains sentences with errors).

Or should the output keep an <error> tag for each annotator? This seems somewhat redundant.
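
The first option (later annotators skip a sentence that already carries an error) could be sketched roughly as follows. This is an assumption-laden illustration in Python, not jigg code: `has_error`, `run_pipeline`, and the toy `tokenize` and `pos` functions are all hypothetical.

```python
import xml.etree.ElementTree as ET

def has_error(sentence):
    """True if an <error> element appears anywhere under the sentence."""
    return next(sentence.iter("error"), None) is not None

def run_pipeline(sentence, annotators):
    """Apply annotators in order, skipping the rest once an error exists.

    `annotators` is a list of (name, function) pairs; both the pairing and
    the function signature are assumptions made for this sketch.
    """
    applied = []
    for name, annotate in annotators:
        if has_error(sentence):
            break  # lower-level annotators ignore this sentence
        annotate(sentence)
        applied.append(name)
    return applied

def tokenize(sentence):
    # Simulate a tokenizer failure by recording an <error> element.
    tokens = ET.SubElement(
        sentence, "tokens", {"annotators": "tokenize", "errors": "tokenize"}
    )
    ET.SubElement(tokens, "error", {"by": "tokenize"}).text = "error message"

def pos(sentence):
    # Would normally add POS tags; here it only marks that it ran.
    sentence.find("tokens").set("annotators", "tokenize pos")

sentence = ET.Element("sentence", {"id": "s0"})
applied = run_pipeline(sentence, [("tokenize", tokenize), ("pos", pos)])
print(applied)
```

Under this option only one <error> is written per failing sentence, which avoids the redundancy of every downstream annotator reporting the same failure.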

hiroshinoji · 2016-03-14T09:20:15Z

One problem with this approach is that an element such as <tokens> ends up with children other than <token>.
Here is another proposal:

<sentence id="s0">
  <tokens annotators="ssplit tokenize pos" errors="e0">
  ...
  </tokens>
  <errors>
    <error id="e0" by="pos">...</error>
  </errors>
</sentence>

Another merit of this approach is that the same error message can be referenced from different elements, e.g., the chunks, dependencies, etc. produced by knp.
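
As a rough sketch of how such id references could be resolved, the Python snippet below maps each element to the errors it points at. The <chunks> and <dependencies> elements, the second error `e1`, and the message texts are made up for illustration, and `resolve_errors` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Illustrative input: e1 and the knp elements are invented for this sketch.
XML = """
<sentence id="s0">
  <tokens annotators="ssplit tokenize pos" errors="e0"/>
  <chunks annotators="knp" errors="e1"/>
  <dependencies annotators="knp" errors="e1"/>
  <errors>
    <error id="e0" by="pos">pos failed</error>
    <error id="e1" by="knp">knp failed</error>
  </errors>
</sentence>
"""

def resolve_errors(sentence):
    """Map each child tag to the (annotator, message) pairs it refers to."""
    by_id = {e.get("id"): e for e in sentence.findall("./errors/error")}
    resolved = {}
    for child in sentence:
        if child.tag == "errors":
            continue
        ids = child.get("errors", "").split()
        if ids:
            resolved[child.tag] = [
                (by_id[i].get("by"), by_id[i].text) for i in ids if i in by_id
            ]
    return resolved

resolved = resolve_errors(ET.fromstring(XML.strip()))
print(resolved)
```

Here the single knp message is stored once and shared by both <chunks> and <dependencies>.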

hiroshinoji · 2016-03-16T07:11:16Z

This is the final design, now accepted in 038c850.

<sentence id="s0">
  <tokens .../>
  <error annotator="knp">...</error>
</sentence>

We record neither an error id nor links between <error> and the elements on which the error occurred.

Basically, individual annotators do not write the <error> tag themselves; it is SentenceAnnotator or DocumentAnnotator that adds <error> to a problematic sentence or document.

In the current implementation, only an AnnotationError thrown in an annotator is caught and converted to an <error> tag. Should this be changed to catch all errors during annotation?
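
The catch-and-convert behavior can be sketched as below. This is a Python illustration of the idea only, not the actual jigg implementation: the `AnnotationError` class here is a stand-in, and `annotate_sentence` and `failing_knp` are hypothetical names.

```python
import xml.etree.ElementTree as ET

class AnnotationError(Exception):
    """Stand-in for the error type an annotator raises on failure."""

def annotate_sentence(sentence, name, annotate):
    """Run `annotate`; on AnnotationError, append <error annotator=name>.

    Only AnnotationError is caught, so any other exception propagates.
    Catching Exception instead would be the "catch all errors" variant.
    """
    try:
        annotate(sentence)
    except AnnotationError as exc:
        err = ET.SubElement(sentence, "error", {"annotator": name})
        err.text = str(exc)
    return sentence

def failing_knp(sentence):
    # Hypothetical annotator that always fails.
    raise AnnotationError("Cannot make mrph")

sentence = ET.fromstring('<sentence id="s0"><tokens/></sentence>')
annotate_sentence(sentence, "knp", failing_knp)
print(ET.tostring(sentence, encoding="unicode"))
```

Because the wrapper owns the try/except, the annotator itself never touches <error>; it only raises.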

Here is a concrete example, which occurs when * is given to knp and juman does not normalize half-width characters (-juman.normalize false).

<root>
  <document id="d0">
    <sentences>
      <sentence id="s0">
        *
        <tokens annotators="juman" normalized="false">
          <token id="s0_tok0" form="*" characterOffsetBegin="0" characterOffsetEnd="1" yomi="*" lemma="*" pos="未定義語" posId="15" pos1="その他" pos1Id="1" cType="*" cTypeId="0" cForm="*" cFormId="0" misc="NIL"/>
        </tokens>
        <error annotator="knp">jigg.pipeline.ProcessError: ;; Invalid input &lt;* * * 未定義語 15 その他 1 * 0 * 0 NIL &gt; ! # S-ID:2 KNP:4.12-CF1.1 DATE:2016/03/16 SCORE:0.00000 ERROR:Cannot make mrph EOS</error>
      </sentence>
    </sentences>
  </document>
</root>

The error message from KNP is recorded in the text of <error>.
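
For reading such output back, a small Python sketch that collects every <error> per sentence might look like this. The input is a shortened version of the example above: the error text is abbreviated, the <token> is omitted, and the error-free sentence `s1` is added purely for illustration. `collect_errors` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Shortened from the example above; sentence s1 is invented for contrast.
XML = """
<root>
  <document id="d0">
    <sentences>
      <sentence id="s0">
        *
        <tokens annotators="juman" normalized="false"/>
        <error annotator="knp">jigg.pipeline.ProcessError: ...</error>
      </sentence>
      <sentence id="s1">
        <tokens annotators="juman" normalized="false"/>
      </sentence>
    </sentences>
  </document>
</root>
"""

def collect_errors(root):
    """Return (sentence id, annotator, message) for each sentence-level <error>."""
    return [
        (s.get("id"), e.get("annotator"), e.text)
        for s in root.iter("sentence")
        for e in s.findall("error")
    ]

errors = collect_errors(ET.fromstring(XML.strip()))
print(errors)
```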

hiroshinoji · 2016-05-30T02:42:53Z

TODO: check whether error handling works correctly for CoreNLP.
One issue is that all (sub)annotators in CoreNLP are currently DocumentAnnotators, which means that if an error (e.g., a parse error) occurs on one sentence, the analysis of the whole document probably fails. Unexpected behavior may also occur if an error is handled internally by some CoreNLP annotator (e.g., when it is given a sentence that is too long).
