ICS02: 7. Using Treebanks
Thursday Feb 21, 16:00 UK = 18:00 EET
Convenors: Timo Korkiakangas (Helsinki) & Marco Passarotti (UCSC, Milan)
YouTube link: https://youtu.be/EFWxTfkdzVA
Slides: Korkiakangas: Querying Perseus Treebanks; Passarotti: Universal Dependencies
The objective of this session is two-fold. First, it introduces Universal Dependencies (UD). Second, it gives a practical example of how to query treebanks to answer simple research questions.
UD (http://universaldependencies.org/) is one of the most notable ongoing projects in computational linguistics. Run by contributors from the research community, it aims to create a collection of dependency treebanks for different languages, built according to a cross-linguistically consistent annotation style that is meant to complement (but not replace) language- and treebank-specific schemes. Since the first set of guidelines appeared in 2014, UD has published a new release of the treebank collection roughly every six months. Version 2 (v2), which introduced a new set of guidelines, was released in March 2017. The current version is 2.3 (November 2018); it includes 129 treebanks in 76 languages. The session will introduce the basic aspects of the UD v2 annotation style as well as the format of the source data.
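UD source data is distributed in the CoNLL-U format: plain text with one token per line and ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). As a minimal sketch, not an official UD tool, the Python below reads such data and prints each token with its relation and the form of its head. The sample sentence and its analysis are hand-written for illustration and do not come from any released treebank; `Token` and `parse_conllu` are hypothetical names.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """The CoNLL-U columns this sketch keeps (XPOS, DEPS and MISC are dropped)."""
    id: int
    form: str
    lemma: str
    upos: str
    feats: str
    head: int
    deprel: str


# Hand-written illustrative sentence, NOT taken from a released UD treebank.
SAMPLE = (
    "# text = Puella rosam amat.\n"
    "1\tPuella\tpuella\tNOUN\t_\tCase=Nom|Gender=Fem|Number=Sing\t3\tnsubj\t_\t_\n"
    "2\trosam\trosa\tNOUN\t_\tCase=Acc|Gender=Fem|Number=Sing\t3\tobj\t_\t_\n"
    "3\tamat\tamo\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)


def parse_conllu(text: str) -> List[List[Token]]:
    """Split CoNLL-U text into sentences, each a list of Token tuples."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comment (e.g. "# text = ...").
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():
            # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
            continue
        current.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3], cols[5], int(cols[6]), cols[7])
        )
    if current:
        sentences.append(current)
    return sentences


for sentence in parse_conllu(SAMPLE):
    forms = {tok.id: tok.form for tok in sentence}
    for tok in sentence:
        # HEAD 0 marks the root of the tree, which has no parent token.
        head_form = forms.get(tok.head, "ROOT")
        print(f"{tok.form}\t{tok.deprel}\t{head_form}")
```

Reading the HEAD and DEPREL columns together is what turns the flat token list back into a dependency tree; everything else in the line is annotation attached to a single token.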
There are several software tools that can be used to query dependency treebanks. We present a use case that illustrates a simplified treebank query, from setting up the research question to interpreting the results. To exploit all the levels of linguistic annotation, a powerful query syntax is needed. We will use the PML Tree Query extension (http://ufal.mff.cuni.cz/tred/documentation/ar01-toc.html) of the TrEd Treebank Editor (https://ufal.mff.cuni.cz/tred/) on data from the Latin Dependency Treebanks. The annotation of the Latin Dependency Treebanks is based on the Perseus Guidelines (https://github.com/PerseusDL/treebank_data/blob/master/v1/latin/docs/guidelines.pdf), not on UD. Given that not all treebanks have yet been converted to Universal Dependencies, one has to deal with various types of treebank annotation, which means understanding the underlying annotation principles. After the session, the student will understand the logical steps involved in querying treebanks for study and research purposes.
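The logical steps of such a query (pick a relation, find its head, compare the two) can also be sketched outside TrEd. The Python below is only an illustration and is not the PML-TQ query used in the session: it asks whether a direct object precedes or follows its governing verb. It assumes the Perseus XML layout as we understand it (`word` elements with `id`, `head`, `relation` and a nine-character `postag` whose first character marks the part of speech); the two sample sentences are hand-made, and `object_verb_order` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-made sample in an assumed Perseus-style layout, NOT real treebank data.
SAMPLE_XML = """
<treebank>
  <sentence id="1">
    <word id="1" form="puella" lemma="puella1" postag="n-s---fn-" head="3" relation="SBJ"/>
    <word id="2" form="rosam" lemma="rosa1" postag="n-s---fa-" head="3" relation="OBJ"/>
    <word id="3" form="amat" lemma="amo1" postag="v3spia---" head="0" relation="PRED"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="videt" lemma="video1" postag="v3spia---" head="0" relation="PRED"/>
    <word id="2" form="puer" lemma="puer1" postag="n-s---mn-" head="1" relation="SBJ"/>
    <word id="3" form="canem" lemma="canis1" postag="n-s---ma-" head="1" relation="OBJ"/>
  </sentence>
</treebank>
""".strip()


def object_verb_order(xml_text: str) -> Counter:
    """Count objects placed before (OV) or after (VO) their verbal head."""
    counts: Counter = Counter()
    root = ET.fromstring(xml_text)
    for sentence in root.iter("sentence"):
        # Index the words of this sentence by their numeric id.
        words = {int(w.get("id")): w for w in sentence.iter("word")}
        for word_id, word in words.items():
            if word.get("relation") != "OBJ":
                continue
            head = words.get(int(word.get("head")))
            # Keep only objects whose head is a verb (postag starts with "v").
            if head is None or not head.get("postag", "").startswith("v"):
                continue
            counts["OV" if word_id < int(head.get("id")) else "VO"] += 1
    return counts


print(object_verb_order(SAMPLE_XML))
```

The sketch deliberately ignores coordination: in Perseus-style annotation a coordinated object carries a label such as `OBJ_CO` and hangs from the coordinator rather than from the verb, so a real query has to follow that extra step. This is exactly the kind of annotation principle one needs to know before trusting the counts.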
- Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of LREC-2016, pp. 1659-1666: http://www.lrec-conf.org/proceedings/lrec2016/pdf/348_Paper.pdf
- González Saavedra, B. and Passarotti, M. 2018. "Using Tectogrammatical Annotation for Studying Actors and Actions in Sallust's Bellum Catilinae." The Prague Bulletin of Mathematical Linguistics 111, 5-28: https://ufal.mff.cuni.cz/pbml/111/art-saavedra-passarotti.pdf (pay attention to research setting and querying rather than to technical details)
- Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies: http://universaldependencies.org/conll17/proceedings/
- EACL 2017 Tutorial on Universal Dependencies: http://universaldependencies.org/eacl17tutorial/
- NoDaLiDa Workshop on Universal Dependencies (UDW 2017): http://universaldependencies.org/udw17/program.html
- Universal Dependencies Workshop 2018 (UDW 2018): http://universaldependencies.org/udw18/
- Mambrini, F. and Passarotti, M. 2016. "Subject-Verb Agreement with Coordinated Subjects in Ancient Greek. A Treebank-Based Study." Journal of Greek Linguistics 16:1, 87–116. Available: http://booksandjournals.brillonline.com/content/journals/10.1163/15699846-01601003
- Discuss the treatment of one specific syntactic construction according to the UD annotation guidelines (possibly with examples from multiple, typologically different languages): http://universaldependencies.org/guidelines.html
- Run the on-line web application of the NLP pipeline UDPipe (http://lindat.mff.cuni.cz/services/udpipe/) on a couple of texts in different languages. Evaluate the results manually.
- OR try your hand at PML-TQ in the PML-TQ web client, where the Perseus Latin Treebank and the Index Thomisticus Treebank are available in UD, at http://lindat.mff.cuni.cz/services/pmltq/#!/home (instructions at http://ufal.mff.cuni.cz/pmltqdoc/doc/pmltq_tutorial_web_client.html).
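For the UDPipe exercise above, one possible way to summarise a manual evaluation is to record, for each token, the head and relation you consider correct next to what the parser produced, and then compute the unlabelled and labelled attachment scores (UAS and LAS). The sketch below assumes both analyses share the same tokenization; `attachment_scores` is a hypothetical helper, and the "predicted" values are invented for illustration, not actual UDPipe output.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency relation) for one token


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for two analyses of the same token sequence."""
    if len(gold) != len(predicted):
        raise ValueError("tokenization differs; align the tokens before scoring")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS: the head is right. LAS: both head and relation are right.
    heads_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    arcs_ok = sum(g == p for g, p in zip(gold, predicted))
    return heads_ok / len(gold), arcs_ok / len(gold)


# Manually checked analysis of a four-token sentence (illustrative).
gold = [(3, "nsubj"), (3, "obj"), (0, "root"), (3, "punct")]
# Invented parser output with one wrong label on the second token.
predicted = [(3, "nsubj"), (3, "nsubj"), (0, "root"), (3, "punct")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

Keeping the two scores apart shows whether the parser mostly attaches words to the wrong head or mostly picks the wrong relation label, which is useful when comparing results across languages.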