forked from UniversalDependencies/UD_Hebrew-HTB
Hebrew Universal Dependencies Treebank
clab/UD_Hebrew-HTB
# Summary

A Universal Dependencies Corpus for Hebrew.

# Introduction

Universal Dependencies - Hebrew Dependency Treebank (v2)

https://github.com/UniversalDependencies/UD_Hebrew

V1 of the corpus was built by semi-automatic conversion of the Hebrew Constituency Treebank (v2). V2 is converted from V1, using automatic conversion where possible and manual conversion and verification in other cases.

# Structure

This directory contains a corpus of sentences annotated using Universal Dependencies annotation. The corpus comprises 115,535 tokens (158,855 words) and 6,216 sentences, taken from the `Ha'aretz` newspaper. The sentences were manually annotated as phrase-structure trees, and then semi-automatically converted into Universal Dependencies.

This file is compatible with the CoNLL-U format defined for Universal Dependencies. See: http://universaldependencies.github.io/docs/format.html . However, at present the files do not include lemmas for words. These may be added in a later release.

The dependency taxonomy can be found on the Universal Dependencies web site:

http://universaldependencies.github.io/docs/
http://universaldependencies.github.io/docs/#language-he

The Train/Dev/Test split follows previous splits of the underlying Treebank, namely:

- sentences 1-484: dev (10,534 tokens)
- sentences 485-5725: train (127,363 tokens)
- sentences 5726-6216: test (11,386 tokens)

Some parts of the structure are more reliable than others. In particular, a token whose "morphological feature" entry contains HebSource=ConvUncertainHead or HebSource=ConvUncertainLabel has head (respectively, label) information that is based on unreliable information.

# Fixes

To help improve the corpus, please alert us to any errors you find in it; contact Yoav Goldberg at [email protected] or Reut Tsarfaty at [email protected]

# Known issues

- Does not yet fully annotate enhanced dependencies.
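The split ranges and reliability markers described in the Structure section can be applied with a few lines of Python. The following is a minimal sketch, not part of the treebank's tooling: the function names `split_for`, `read_sentences` and `uncertain_tokens` are hypothetical, and it assumes that you know each sentence's 1-based index in the underlying treebank and that rows follow the standard 10-column CoNLL-U layout. Because the changelog notes that HebSource moved from FEATS to MISC in v2.8, the sketch checks both columns.

```python
from typing import Iterable, Iterator, List, Tuple

# Sentence-index ranges (1-based, inclusive) as stated in the Structure section.
SPLITS = {"dev": (1, 484), "train": (485, 5725), "test": (5726, 6216)}

# HebSource values that mark unreliable head or label information.
UNCERTAIN = {"ConvUncertainHead", "ConvUncertainLabel"}


def split_for(sentence_index: int) -> str:
    """Return "dev", "train" or "test" for a 1-based sentence index."""
    for name, (first, last) in SPLITS.items():
        if first <= sentence_index <= last:
            return name
    raise ValueError(f"sentence index {sentence_index} is outside 1-6216")


def read_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield each sentence as a list of tab-split token rows.

    Comment lines (starting with '#') are skipped; a blank line ends a sentence.
    """
    rows: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if rows:
                yield rows
                rows = []
        elif not line.startswith("#"):
            rows.append(line.split("\t"))
    if rows:
        yield rows


def uncertain_tokens(sentence: List[List[str]]) -> List[Tuple[str, str, str]]:
    """Return (ID, FORM, marker) for tokens flagged as unreliable."""
    flagged = []
    for cols in sentence:
        # FEATS is column 6 and MISC is column 10 (indices 5 and 9).
        for field in (cols[5], cols[9]):
            for pair in field.split("|"):
                key, _, value = pair.partition("=")
                if key != "HebSource":
                    continue
                # Assumption: several values may be comma-separated.
                for marker in value.split(","):
                    if marker in UNCERTAIN:
                        flagged.append((cols[0], cols[1], marker))
    return flagged
```

To use it, open one of the `.conllu` files, pass the file object to `read_sentences`, and call `uncertain_tokens` on each sentence to decide whether to down-weight or exclude the flagged tokens.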
# Acknowledgments

The Universal Dependencies Hebrew Treebank was created by (in alphabetical order):

- Yoav Goldberg
- Reut Tsarfaty

The following people were also involved in the creation of v2:

- Amir More (adding Lemmas, detokenization, v1->v2 conversion)
- Yuval Pinter (documentation)
- Shoval Sadde (documentation, v2 validation and conversion)
- Victoria Basmov (v2 validation and conversion)

The Universal Dependencies Hebrew Treebank is based on the Hebrew Constituency Treebank (v2) developed by MILA, The Knowledge Center for Processing Hebrew. (http://www.mila.cs.technion.ac.il/resources_treebank.html)

## References

You are encouraged to cite these papers if you use the Hebrew Universal Dependencies Treebank:

    @inproceedings{tsarfaty2013unified,
      title={A Unified Morpho-Syntactic Scheme of Stanford Dependencies},
      author={Tsarfaty, Reut},
      booktitle={Proc. of ACL},
      year={2013}
    }

    @inproceedings{mcdonald2013universal,
      title={Universal Dependency Annotation for Multilingual Parsing},
      author={McDonald, Ryan T and Nivre, Joakim and Quirmbach-Brundage, Yvonne and Goldberg, Yoav and Das, Dipanjan and Ganchev, Kuzman and Hall, Keith B and Petrov, Slav and Zhang, Hao and T{\"a}ckstr{\"o}m, Oscar and others},
      booktitle={Proc. of ACL},
      year={2013}
    }

Note that these papers do not accurately reflect the current annotation in the Treebank. A more up-to-date publication is forthcoming.

# Changelog

* v2.8
  * Fixed validation issues with lang-spec relations and features.
  * Attribute HebSource moved from FEATS to MISC; same for undocumented Xtra=Junk.
  * HebExistential changed from True to Yes as with other boolean features in UD.
* v2.2
  * Repository renamed from UD_Hebrew to UD_Hebrew-HTB.
* v2.0
  * Conversion to UD v2 guidelines.
* v1.2
  * Fixed a labeling bug.

<pre>
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v1.1
License: CC BY-NC-SA 4.0
Includes text: yes
Genre: news
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: converted from manual
Contributors: Goldberg, Yoav; Tsarfaty, Reut; More, Amir; Sadde, Shoval; Basmov, Victoria
Contributing: elsewhere
Contact: [email protected], [email protected], [email protected], [email protected], [email protected]
===============================================================================
</pre>