-
Notifications
You must be signed in to change notification settings - Fork 6
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
69 additions
and
4 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,74 @@ | ||
# Description | ||
Coming soon. | ||
# AMALGUM | ||
## Download | ||
Latest data for is available under [`amalgum/`](https://github.com/gucorpling/amalgum/tree/master/amalgum) and [`amalgum_balanced`](https://github.com/gucorpling/amalgum/tree/master/amalgum_balanced). (The `_balanced` variant contains nearly 500,000 tokens for each genre, while the unbalanced variant contains slightly more data.) | ||
|
||
Full data is available undder [`amalgum/`](https://github.com/gucorpling/amalgum/tree/master/amalgum). | ||
You may [download the data **without Reddit texts** as a zip](https://drive.google.com/file/d/1StyZjJ6u84vZgJ2bIgsuCb037zc36RXB/view?usp=sharing). The complete corpus, with Reddit data, is available upon request: please email [[email protected]](mailto:[email protected]). | ||
|
||
**NOTE**: `xml/` ([GUM](https://github.com/amir-zeldes/gum)-style TEI XML), `dep/` ([CONLLU](https://universaldependencies.org/format.html)), and `tsv/` ([WebAnno-formatted](https://webanno.github.io/webanno/releases/3.4.5/docs/user-guide.html#sect_webannotsv) entities) are finished, but `rst/` (RS3-formatted Rhetorical Structure Theory) is not yet finished are still coming. (ETA: 6/11/2020.) | ||
## Description | ||
AMALGUM (**A** **M**achine-**A**nnotated **L**ookalike of [**GUM**](https://github.com/amir-zeldes/gum)) is an English web corpus spanning 8 genres with 4,000,000 tokens and several annotation layers. | ||
|
||
### Genres | ||
Source data was scraped from eight different sources containing stylistically distinct text. Each text's source is indicated with a slug in its filename: | ||
|
||
* `academic`: [MDPI](https://www.mdpi.com) | ||
* `bio`: [Wikipedia](http://en.wikipedia.org) | ||
* `fiction`: [Project Gutenberg](https://www.gutenberg.org) | ||
* `interview`: [Wikinews, Interview category](https://en.wikinews.org/wiki/Category:Interview) | ||
* `news`: [Wikinews](https://en.wikinews.org) | ||
* `reddit`: [Reddit](https://www.reddit.com) | ||
* `whow`: [wikiHow](https://www.wikihow.com) | ||
* `voyage`: [wikiVoyage](https://en.wikivoyage.org) | ||
|
||
### Annotations | ||
AMALGUM contains annotations for the following information: | ||
|
||
* Tokenization | ||
* [UD](https://universaldependencies.org/u/pos/) and [Extended PTB](https://corpling.uis.georgetown.edu/ptb_tags.html) part of speech tags | ||
* Lemmas | ||
* [UD](https://universaldependencies.org/u/dep/) dependency parses | ||
* (Non-)named entities | ||
* Coreference | ||
* Rhetorical structure theory | ||
|
||
These annotations are across four file formats: [GUM-style XML](https://github.com/gucorpling/amalgum/blob/master/amalgum/xml/amalgum_bio_doc124.xml), [CONLLU](https://github.com/gucorpling/amalgum/blob/master/amalgum/dep/amalgum_bio_doc124.conllu), [WebAnno TSV](https://github.com/gucorpling/amalgum/blob/master/amalgum/tsv/amalgum_bio_doc124.tsv), and [RS3](https://github.com/gucorpling/amalgum/blob/master/amalgum/rst/amalgum_bio_doc124.rs3). | ||
|
||
### Further Information | ||
Please see [our paper](https://www.aclweb.org/anthology/2020.lrec-1.648.pdf). | ||
|
||
# License | ||
All annotations under the folders `amalgum/` and `amalgum_balanced/` are available under a [Creative Commons Attribution (CC-BY) license, version 4.0](https://creativecommons.org/licenses/by/4.0/). Note that their texts are sourced from the following websites under their own licenses: | ||
|
||
* `academic`: [MDPI](https://www.mdpi.com/about), [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) | ||
* `bio`: [Wikipedia](https://creativecommons.org/licenses/by/4.0/), [CC BY-SA 3.0](https://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License) | ||
* `fiction`: [Project Gutenberg](https://www.gutenberg.org/wiki/Gutenberg:The_Project_Gutenberg_License), [The Project Gutenberg License](https://www.gutenberg.org/wiki/Gutenberg:The_Project_Gutenberg_License#The_Full_Project_Gutenberg_License_in_Legalese_.28normative.29) | ||
* `interview`: [Wikinews](https://en.wikinews.org/wiki/Wikinews:Copyright), [CC BY 2.5](http://creativecommons.org/licenses/by/2.5/) | ||
* `news`: [Wikinews](https://en.wikinews.org/wiki/Wikinews:Copyright), [CC BY 2.5](http://creativecommons.org/licenses/by/2.5/) | ||
* `whow`: [wikiHow](https://www.wikihow.com/), [CC BY-NC-SA 3.0](http://creativecommons.org/licenses/by-nc-sa/3.0/) | ||
* `voyage`: [wikiVoyage](https://en.wikivoyage.org/wiki/Wikivoyage:Dual_licensing), [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) | ||
|
||
# Citation | ||
|
||
**NOTE:** Our [ACL Anthology](https://www.aclweb.org/anthology/2020.lrec-1.648/) listing currently uses an old name for the corpus. **Please use this corrected citation:** | ||
|
||
@inproceedings{gessler-etal-2020-amalgum, | ||
title = "{AMALGUM} {--} A Free, Balanced, and Rich {E}nglish Web Corpus", | ||
author = "Gessler, Luke and | ||
Peng, Siyao and | ||
Liu, Yang and | ||
Zhu, Yilun and | ||
Behzad, Shabnam and | ||
Zeldes, Amir", | ||
booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference", | ||
month = may, | ||
year = "2020", | ||
address = "Marseille, France", | ||
publisher = "European Language Resources Association", | ||
url = "https://www.aclweb.org/anthology/2020.lrec-1.648", | ||
pages = "5267--5275", | ||
abstract = "We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a {``}better than NLP{''} benchmark and evaluate the accuracy of the resulting resource.", | ||
language = "English", | ||
ISBN = "979-10-95546-34-4", | ||
} | ||
|
||
# Build | ||
See [BUILD.md](./BUILD.md). |