UniversalDependencies/UD_Hebrew-HTB

Summary

A Universal Dependencies Corpus for Hebrew.

Introduction

Universal Dependencies - Hebrew Dependency Treebank (v2) https://github.com/UniversalDependencies/UD_Hebrew

V1 of the corpus was built by semi-automatic conversion of the Hebrew Constituency Treebank (v2). V2 is converted from V1, using automatic conversion where possible and manual conversion and verification otherwise.

Structure

This directory contains a corpus of sentences annotated using Universal Dependencies annotation. The corpus comprises 115,535 tokens (158,855 words) and 6,216 sentences, taken from the Ha'aretz newspaper. The sentences were manually annotated as phrase-structure trees and then semi-automatically converted into Universal Dependencies.

This file is compatible with the CoNLL-U format defined for Universal Dependencies (see http://universaldependencies.github.io/docs/format.html). However, at present the files do not include lemmas for words. These may be added in a later release.
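Since the corpus reports both a token count and a word count, it can help to see how the two differ in CoNLL-U. The following is a minimal sketch, not part of the treebank tooling: it assumes the standard CoNLL-U ID conventions (integer IDs for syntactic words, ranges such as `1-2` for multiword tokens, decimal IDs for empty nodes). The function name and the sample sentence are made up for illustration.

```python
def count_tokens_and_words(conllu_text):
    """Return (sentences, surface_tokens, syntactic_words) for CoNLL-U text."""
    sentences = tokens = words = 0
    covered_until = 0      # last word ID covered by the current multiword token
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if in_sentence:
                sentences += 1
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):
            continue       # sentence-level comment
        in_sentence = True
        token_id = line.split("\t")[0]
        if "-" in token_id:
            # Multiword token range: one surface token spanning several words.
            _, end = token_id.split("-")
            tokens += 1
            covered_until = int(end)
        elif "." in token_id:
            continue       # empty node, neither a token nor a basic word
        else:
            words += 1
            if int(token_id) > covered_until:
                tokens += 1  # word not covered by a range is its own token
    if in_sentence:
        sentences += 1
    return sentences, tokens, words


# Invented example: surface token "AB" is split into the words "A" and "B".
SAMPLE = "\n".join([
    "# sent_id = demo-1",
    "1-2\tAB\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tA\t_\tADP\t_\t_\t2\tcase\t_\t_",
    "2\tB\t_\tNOUN\t_\t_\t0\troot\t_\t_",
    "3\t.\t_\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
])

print(count_tokens_and_words(SAMPLE))
```

In Hebrew, prefixed prepositions, conjunctions and the definite article are typically split off as separate syntactic words, which is why the word count exceeds the token count.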

The dependency taxonomy can be found on the Universal Dependencies web site.

The Train/Dev/Test split follows previous splits of the underlying Treebank, namely: sentences 1-484 dev (10,534 tokens), 485-5725 train (127,363 tokens), 5726-6216 test (11,386 tokens).
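The sentence ranges above can be written down as a small lookup. This is only an illustrative sketch: the names `SPLIT_RANGES` and `split_for_sentence` are hypothetical, and the released files are already split, so the code is useful mainly for checking that the ranges cover all 6,216 sentences.

```python
# Inclusive 1-based sentence ranges, copied from the description above.
SPLIT_RANGES = {
    "dev": (1, 484),
    "train": (485, 5725),
    "test": (5726, 6216),
}


def split_for_sentence(index):
    """Return the split name for a 1-based sentence index."""
    for name, (first, last) in SPLIT_RANGES.items():
        if first <= index <= last:
            return name
    raise ValueError(f"sentence index {index} is outside 1-6216")


# Number of sentences in each split, derived from the ranges.
split_sizes = {name: last - first + 1 for name, (first, last) in SPLIT_RANGES.items()}
print(split_sizes, sum(split_sizes.values()))
```

The three ranges are contiguous and together account for every sentence in the corpus.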

Some parts of the structure are more reliable than others. In particular, a word annotated with HebSource=ConvUncertainHead or HebSource=ConvUncertainLabel (originally a "morphological feature" entry, moved to the MISC column in v2.8) has head or label information, respectively, that is based on unreliable information.
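To locate the less reliable tokens, one option is to scan for the HebSource attribute. The sketch below is an assumption-laden illustration rather than an official tool: because the changelog notes that HebSource moved from FEATS to MISC in v2.8, it checks both columns. The helper names and the sample rows are invented.

```python
UNCERTAIN = {"ConvUncertainHead", "ConvUncertainLabel"}


def parse_pairs(field):
    """Parse a CoNLL-U 'Key=Value|Key=Value' field into a dict ('_' is empty)."""
    if field == "_":
        return {}
    pairs = {}
    for item in field.split("|"):
        key, sep, value = item.partition("=")
        if sep:
            pairs[key] = value
    return pairs


def uncertain_tokens(conllu_text):
    """Return (id, form, HebSource value) for tokens flagged as uncertain."""
    flagged = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        # MISC is column 10, FEATS is column 6; older releases used FEATS.
        source = (parse_pairs(cols[9]).get("HebSource")
                  or parse_pairs(cols[5]).get("HebSource"))
        if source in UNCERTAIN:
            flagged.append((cols[0], cols[1], source))
    return flagged


# Invented rows: one flagged in MISC, one in FEATS, one not flagged.
SAMPLE_ROWS = "\n".join([
    "1\tA\t_\tNOUN\t_\t_\t2\tnsubj\t_\tHebSource=ConvUncertainHead",
    "2\tB\t_\tVERB\t_\tGender=Masc|HebSource=ConvUncertainLabel\t0\troot\t_\t_",
    "3\tC\t_\tNOUN\t_\tGender=Fem\t2\tobj\t_\t_",
])

print(uncertain_tokens(SAMPLE_ROWS))
```

Tokens returned by such a scan could be excluded from evaluation or prioritized for manual review.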

Fixes

To help improve the corpus, please alert us to any errors you find in it; contact Yoav Goldberg at [email protected] or Reut Tsarfaty at [email protected]

Known issues

  • Does not yet fully annotate enhanced dependencies.

Acknowledgments

The Universal Dependencies Hebrew Treebank was created by (in alphabetical order):

  • Yoav Goldberg

  • Reut Tsarfaty

The following people were also involved in the creation of v2:

  • Amir More (adding Lemmas, detokenization, v1->v2 conversion)

  • Yuval Pinter (documentation, v2.12 fix guidelines)

  • Shoval Sadde (documentation, v2 validation and conversion)

  • Victoria Basmov (v2 validation and conversion)

The Universal Dependencies Hebrew Treebank is based on the Hebrew Constituency Treebank (v2) developed by MILA, The Knowledge Center for Processing Hebrew (http://www.mila.cs.technion.ac.il/resources_treebank.html).

References

You are encouraged to cite these papers if you use the Hebrew Universal Dependencies Treebank:

    @inproceedings{tsarfaty2013unified,
        title={A Unified Morpho-Syntactic Scheme of Stanford Dependencies},
        author={Tsarfaty, Reut},
        booktitle={Proc. of ACL},
        year={2013}
    }

    @inproceedings{mcdonald2013universal,
        title={Universal Dependency Annotation for Multilingual Parsing},
        author={McDonald, Ryan T and Nivre, Joakim and Quirmbach-Brundage, Yvonne and Goldberg, Yoav and Das, Dipanjan and Ganchev, Kuzman and Hall, Keith B and Petrov, Slav and Zhang, Hao and T{\"a}ckstr{\"o}m, Oscar and others},
        booktitle={Proc. of ACL},
        year={2013}
    }

Note that these papers do not accurately reflect the current annotation in the Treebank. A more up-to-date publication is forthcoming.

Changelog

  • v2.15

    • Construction annotations in the UCxn framework added to MISC
      • This release adds rule-based annotations of Interrogatives, Conditionals, Existentials, and NPN (noun-preposition-noun) constructions on the head of the respective phrase, plus construction elements.
      • The UCxn v1 notation and categories are documented here.
  • v2.12

    • Removed auxiliaries and copulae with children
    • Fixed meta-data errors
    • Fixed projection errors
    • Fixed discrepancies between edge types and POS tags
    • ... And many other small fixes
  • v2.8

    • Fixed validation issues with lang-spec relations and features.
    • Attribute HebSource moved from FEATS to MISC; same for undocumented Xtra=Junk.
    • HebExistential changed from True to Yes as with other boolean features in UD.
  • v2.2

    • Repository renamed from UD_Hebrew to UD_Hebrew-HTB.
  • v2.0

    • Conversion to UD v2 guidelines.
  • v1.2

    • Fixed a labeling bug.
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v1.1
License: CC BY-NC-SA 4.0
Includes text: yes
Genre: news
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: converted from manual
Contributors: Goldberg, Yoav; Tsarfaty, Reut; More, Amir; Sadde, Shoval; Basmov, Victoria; Pinter, Yuval
Contributing: elsewhere
Contact: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
===============================================================================