gum

Repository for the Georgetown University Multilayer Corpus (GUM)

This repository contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from four text types (interviews, news, travel guides, instructional texts). The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.

Directories

The corpus is downloadable in multiple formats. Not all formats contain all annotations. The most complete XML representation is in PAULA XML, and the easiest way to search in the corpus is using ANNIS. Other formats may be useful for other purposes. See website for more details.

_build/ - The GUM build bot and utilities for data merging and validation (see https://corpling.uis.georgetown.edu/gum/build.html)
annis/ - The entire merged corpus, with all annotations, as a relANNIS 3.3 corpus dump, importable into ANNIS (see http://corpus-tools.org/annis)
const/ - Constituent trees and PTB POS tags in the PTB bracketing format (automatic parser output)
coref/ - Entity and coreference annotation in two formats: conll shared task tabular format (with no bridging annotations) and WebAnno .tsv format, including entity and information status annotations, bridging and singleton entities
dep/ - Dependency trees in the conll 10 column format using Stanford Typed Dependencies (manually corrected) and extended PTB POS tags (following TreeTagger/Amalgam, e.g. tags like VVZ), as well as speaker and sentence type annotations
paula/ - The entire merged corpus in PAULA standoff XML, with all annotations (see https://www.sfb632.uni-potsdam.de/en/paula.html for format documentation)
rst/ - Rhetorical Structure Theory analyses in .rs3 format as used by RSTTool and rstWeb (spaces between words correspond to tokenization in rest of corpus)
xml/ - vertical XML representations with 1 token or tag per line and tab delimited lemmas and POS tags (extended, VVZ style, vanilla and CLAWS5, as well as dependency functions), compatible with the IMS Corpus Workbench (a.k.a. TreeTagger format).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

gum

Directories

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 95 Commits
_build		_build
annis		annis
const		const
coref		coref
dep		dep
paula/GUM		paula/GUM
rst		rst
xml		xml
LICENSE.txt		LICENSE.txt
README.md		README.md

License

logan-siyao-peng/gum

Folders and files

Latest commit

History

Repository files navigation

gum

Directories

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages