
Consistent error handling #32

Open
hiroshinoji opened this issue Mar 14, 2016 · 4 comments

Comments

@hiroshinoji
Contributor

Here is a proposal for how to keep track of errors in the output XML when an error is detected.

Example:

<chunks annotators="cabocha" errors="cabocha">
<error by="cabocha">error message</error>
</chunks>

That is, the error message is wrapped in <error>, which records the annotator that caused the error.

This design can also handle the situation where multiple annotators annotate the same XML element and only one of them fails:

<tokens annotators="ssplit tokenize pos" errors="pos">
<token id="0" offsetBegin="0" offsetEnd="1">I</token>
...
<error by="pos">error message</error>
</tokens>

The errors attribute on each element may be redundant, but it seems useful for checking for errors quickly. I'm not sure yet.
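A minimal sketch (in Python, for illustration only; Jigg itself is written in Scala) of how a downstream consumer might use the proposed errors attribute together with the <error> children. The XML string mirrors the example above.

```python
import xml.etree.ElementTree as ET

xml = '''<tokens annotators="ssplit tokenize pos" errors="pos">
  <token id="0" offsetBegin="0" offsetEnd="1">I</token>
  <error by="pos">error message</error>
</tokens>'''

tokens = ET.fromstring(xml)

# The errors attribute names the annotators that failed on this element.
failed = set(tokens.get("errors", "").split())
succeeded = set(tokens.get("annotators", "").split()) - failed

# Each failed annotator's message is the text of its <error> child.
messages = {e.get("by"): e.text for e in tokens.findall("error")}
```

This shows the appeal of the redundant attribute: a consumer can decide whether an element is usable without scanning its children.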

@hiroshinoji
Contributor Author

When an error is detected at a higher level in the pipeline (e.g., tokenize), it seems natural for the lower-level annotators (e.g., pos) to annotate nothing and simply skip that sentence (or the whole document, if it contains sentences with errors).

Or should the output keep an <error> tag for each annotator? That seems somewhat redundant.

@hiroshinoji
Contributor Author

One problem with this approach is that an element such as <tokens> then has children other than <token>.
Here is another proposal:

<sentence id="s0">
  <tokens annotators="ssplit tokenize pos" errors="e0">
  ...
  </tokens>
  <errors>
    <error id="e0" by="pos">...</error>
  </errors>
</sentence>

Another merit of this approach is that different elements can refer to the same error message, e.g., the chunks, dependencies, etc. of knp.
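A sketch (Python, hypothetical consumer code; element names follow the proposal above, and the <chunks> element is my own illustration of the shared-reference idea) of how an errors attribute holding ids could be resolved against the shared <errors> list.

```python
import xml.etree.ElementTree as ET

xml = '''<sentence id="s0">
  <tokens annotators="ssplit tokenize pos" errors="e0"/>
  <chunks annotators="knp" errors="e0"/>
  <errors>
    <error id="e0" by="pos">error message</error>
  </errors>
</sentence>'''

sentence = ET.fromstring(xml)

# Index the shared error list by id.
by_id = {e.get("id"): e for e in sentence.find("errors")}

# Both <tokens> and <chunks> resolve to the same <error> element.
for elem in sentence.findall("tokens") + sentence.findall("chunks"):
    for eid in elem.get("errors", "").split():
        err = by_id[eid]
        print(elem.tag, "->", err.get("by"), err.text)
```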

@hiroshinoji
Contributor Author

This is the final design now accepted in 038c850.

<sentence id="s0">
  <tokens .../>
  <error annotator="knp">...</error>
</sentence>

We do not record error ids, nor links between <error> and the elements on which the error occurred.

Basically, each annotator is agnostic of the <error> tag; it is SentenceAnnotator or DocumentAnnotator that adds <error> to a problematic sentence or document.

In the current implementation, only an AnnotationError thrown by an annotator is caught and converted to an <error> tag. Should this be changed to catch all errors during annotation?
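A rough sketch of the catch-and-convert behavior described above, written in Python with stand-in names (Jigg itself is Scala; the AnnotationError class and annotate_sentence function below are placeholders, not Jigg's actual API).

```python
import xml.etree.ElementTree as ET

class AnnotationError(Exception):
    """Stand-in for the exception type thrown by annotators."""
    pass

def annotate_sentence(sentence, annotator_name, annotate):
    """Run `annotate` on a <sentence> element; on AnnotationError,
    append an <error annotator="..."> child instead of propagating."""
    try:
        annotate(sentence)
    except AnnotationError as e:
        err = ET.SubElement(sentence, "error")
        err.set("annotator", annotator_name)
        err.text = str(e)

def failing_annotator(sentence):
    raise AnnotationError("cannot analyze this sentence")

s = ET.Element("sentence", id="s0")
annotate_sentence(s, "knp", failing_annotator)
```

Catching only a dedicated exception type (rather than all errors) keeps genuine bugs loud while still recording expected analysis failures in the output.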

Here is a concrete example, which occurs when * is given as input to knp and juman does not normalize half-width space characters (-juman.normalize false).

<root>
  <document id="d0">
    <sentences>
      <sentence id="s0">
        *
        <tokens annotators="juman" normalized="false">
          <token id="s0_tok0" form="*" characterOffsetBegin="0" characterOffsetEnd="1" yomi="*" lemma="*" pos="未定義語" posId="15" pos1="その他" pos1Id="1" cType="*" cTypeId="0" cForm="*" cFormId="0" misc="NIL"/>
        </tokens>
        <error annotator="knp">jigg.pipeline.ProcessError: ;; Invalid input &lt;* * * 未定義語 15 その他 1 * 0 * 0 NIL &gt; ! # S-ID:2 KNP:4.12-CF1.1 DATE:2016/03/16 SCORE:0.00000 ERROR:Cannot make mrph EOS</error>
      </sentence>
    </sentences>
  </document>
</root>

The error message from KNP is recorded as the text of <error>.
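With the final design, downstream code can skip failed sentences by checking for an <error> child. A sketch (Python, illustrative only; the XML is a condensed version of the output format shown above):

```python
import xml.etree.ElementTree as ET

xml = '''<root><document id="d0"><sentences>
  <sentence id="s0"><tokens/><error annotator="knp">msg</error></sentence>
  <sentence id="s1"><tokens/></sentence>
</sentences></document></root>'''

root = ET.fromstring(xml)

# Keep only sentences that carry no <error> child.
ok = [s.get("id") for s in root.iter("sentence") if s.find("error") is None]
# ok == ["s1"]
```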

@hiroshinoji
Contributor Author

TODO: check whether error handling works correctly for CoreNLP.
One issue is that all (sub)annotators in CoreNLP are currently DocumentAnnotators, which means that if an error (e.g., a parse error) occurs on one sentence, analysis of the whole document probably fails. Unexpected behavior may also occur if some error is handled internally by a CoreNLP annotator (e.g., for sentences that are too long?).
