Skip to content

Commit

Permalink
Update README.md
Browse files Browse the repository at this point in the history
  • Loading branch information
lgessler authored Jun 18, 2020
1 parent 05024fd commit a703b95
Showing 1 changed file with 69 additions and 4 deletions.
73 changes: 69 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,74 @@
# Description
Coming soon.
# AMALGUM
## Download
Latest data for is available under [`amalgum/`](https://github.com/gucorpling/amalgum/tree/master/amalgum) and [`amalgum_balanced`](https://github.com/gucorpling/amalgum/tree/master/amalgum_balanced). (The `_balanced` variant contains nearly 500,000 tokens for each genre, while the unbalanced variant contains slightly more data.)

Full data is available undder [`amalgum/`](https://github.com/gucorpling/amalgum/tree/master/amalgum).
You may [download the data **without Reddit texts** as a zip](https://drive.google.com/file/d/1StyZjJ6u84vZgJ2bIgsuCb037zc36RXB/view?usp=sharing). The complete corpus, with Reddit data, is available upon request: please email [[email protected]](mailto:[email protected]).

**NOTE**: `xml/` ([GUM](https://github.com/amir-zeldes/gum)-style TEI XML), `dep/` ([CONLLU](https://universaldependencies.org/format.html)), and `tsv/` ([WebAnno-formatted](https://webanno.github.io/webanno/releases/3.4.5/docs/user-guide.html#sect_webannotsv) entities) are finished, but `rst/` (RS3-formatted Rhetorical Structure Theory) is not yet finished are still coming. (ETA: 6/11/2020.)
## Description
AMALGUM (**A** **M**achine-**A**nnotated **L**ookalike of [**GUM**](https://github.com/amir-zeldes/gum)) is an English web corpus spanning 8 genres with 4,000,000 tokens and several annotation layers.

### Genres
Source data was scraped from eight different sources containing stylistically distinct text. Each text's source is indicated with a slug in its filename:

* `academic`: [MDPI](https://www.mdpi.com)
* `bio`: [Wikipedia](http://en.wikipedia.org)
* `fiction`: [Project Gutenberg](https://www.gutenberg.org)
* `interview`: [Wikinews, Interview category](https://en.wikinews.org/wiki/Category:Interview)
* `news`: [Wikinews](https://en.wikinews.org)
* `reddit`: [Reddit](https://www.reddit.com)
* `whow`: [wikiHow](https://www.wikihow.com)
* `voyage`: [wikiVoyage](https://en.wikivoyage.org)

### Annotations
AMALGUM contains annotations for the following information:

* Tokenization
* [UD](https://universaldependencies.org/u/pos/) and [Extended PTB](https://corpling.uis.georgetown.edu/ptb_tags.html) part of speech tags
* Lemmas
* [UD](https://universaldependencies.org/u/dep/) dependency parses
* (Non-)named entities
* Coreference
* Rhetorical structure theory

These annotations are across four file formats: [GUM-style XML](https://github.com/gucorpling/amalgum/blob/master/amalgum/xml/amalgum_bio_doc124.xml), [CONLLU](https://github.com/gucorpling/amalgum/blob/master/amalgum/dep/amalgum_bio_doc124.conllu), [WebAnno TSV](https://github.com/gucorpling/amalgum/blob/master/amalgum/tsv/amalgum_bio_doc124.tsv), and [RS3](https://github.com/gucorpling/amalgum/blob/master/amalgum/rst/amalgum_bio_doc124.rs3).

### Further Information
Please see [our paper](https://www.aclweb.org/anthology/2020.lrec-1.648.pdf).

# License
All annotations under the folders `amalgum/` and `amalgum_balanced/` are available under a [Creative Commons Attribution (CC-BY) license, version 4.0](https://creativecommons.org/licenses/by/4.0/). Note that their texts are sourced from the following websites under their own licenses:

* `academic`: [MDPI](https://www.mdpi.com/about), [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
* `bio`: [Wikipedia](https://creativecommons.org/licenses/by/4.0/), [CC BY-SA 3.0](https://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License)
* `fiction`: [Project Gutenberg](https://www.gutenberg.org/wiki/Gutenberg:The_Project_Gutenberg_License), [The Project Gutenberg License](https://www.gutenberg.org/wiki/Gutenberg:The_Project_Gutenberg_License#The_Full_Project_Gutenberg_License_in_Legalese_.28normative.29)
* `interview`: [Wikinews](https://en.wikinews.org/wiki/Wikinews:Copyright), [CC BY 2.5](http://creativecommons.org/licenses/by/2.5/)
* `news`: [Wikinews](https://en.wikinews.org/wiki/Wikinews:Copyright), [CC BY 2.5](http://creativecommons.org/licenses/by/2.5/)
* `whow`: [wikiHow](https://www.wikihow.com/), [CC BY-NC-SA 3.0](http://creativecommons.org/licenses/by-nc-sa/3.0/)
* `voyage`: [wikiVoyage](https://en.wikivoyage.org/wiki/Wikivoyage:Dual_licensing), [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/)

# Citation

**NOTE:** Our [ACL Anthology](https://www.aclweb.org/anthology/2020.lrec-1.648/) listing currently uses an old name for the corpus. **Please use this corrected citation:**

@inproceedings{gessler-etal-2020-amalgum,
title = "{AMALGUM} {--} A Free, Balanced, and Rich {E}nglish Web Corpus",
author = "Gessler, Luke and
Peng, Siyao and
Liu, Yang and
Zhu, Yilun and
Behzad, Shabnam and
Zeldes, Amir",
booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://www.aclweb.org/anthology/2020.lrec-1.648",
pages = "5267--5275",
abstract = "We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a {``}better than NLP{''} benchmark and evaluate the accuracy of the resulting resource.",
language = "English",
ISBN = "979-10-95546-34-4",
}

# Build
See [BUILD.md](./BUILD.md).

0 comments on commit a703b95

Please sign in to comment.