The ELTE Novel Corpus is a continuously expanding database developed by the Department of Digital Humanities at Eötvös Loránd University. The corpus currently contains 400 Hungarian novels from the 19th century and the first half of the 20th century. In addition to the texts, it provides annotation of structural units and of the grammatical features of words, encoded in TEI XML.
- number of novels: 400
- number of authors: 119
- number of tokens: 26.8 million
- number of words: 21.4 million
The source of the corpus was the collection of the Hungarian Electronic Library.
- The texts from the Hungarian Electronic Library were converted into TEI XML, following the guidelines of the Text Encoding Initiative. The TEI XML files contain the annotation of structural units and the metadata of the novels. Part of the conversion was done manually (level1).
- We then tokenized the novels and annotated the grammatical features of words using e-magyar, an NLP toolchain for Hungarian texts (level2).
<ns1:authorGender/>
: sex of author
M
: male
F
: female
<ns1:size/>
: size of the novel
short
: 10 000 -- 49 999 words
medium
: 50 000 -- 99 999 words
long
: more than 100 000 words
<ns1:canonicity/>
: canonicity level of the novel
low
: 0 or 1 edition after 1979
high
: 2 or more editions after 1979
<ns1:timeSlot/>
: time period of the first edition of the novel
T0
: before 1840
T1
: 1840--1860
T2
: 1860--1880
T3
: 1880--1900
T4
: 1900--1920
T5
: after 1920
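The header categories above can be derived mechanically from a novel's word count, edition count and first-edition year. The following is a minimal sketch, not the code used to build the corpus. The function names are hypothetical, and it makes assumptions the descriptions above do not settle: a boundary year such as 1860 or 1920 is assigned to the later time slot, exactly 100 000 words counts as long, the medium range is read as 50 000 to 99 999 words, and novels under 10 000 words get no size category.

```python
from bisect import bisect_right
from typing import Optional

# Lower bounds of T1..T5. Assumption: a boundary year opens the later slot,
# so 1860 falls in T2 and 1920 falls in T5.
SLOT_STARTS = [1840, 1860, 1880, 1900, 1920]


def size_category(word_count: int) -> Optional[str]:
    """Map a word count to short/medium/long; None below 10 000 words."""
    if word_count < 10_000:
        return None  # no category is documented for this range
    if word_count < 50_000:
        return "short"
    if word_count < 100_000:
        return "medium"
    return "long"  # assumption: exactly 100 000 words also counts as long


def canonicity(editions_after_1979: int) -> str:
    """Return 'high' for 2 or more editions after 1979, otherwise 'low'."""
    return "high" if editions_after_1979 >= 2 else "low"


def time_slot(first_edition_year: int) -> str:
    """Return the slot label T0..T5 for the year of the first edition."""
    return f"T{bisect_right(SLOT_STARTS, first_edition_year)}"


print(size_category(62_000), canonicity(3), time_slot(1887))
# prints: medium high T3
```

Using `bisect_right` keeps the slot boundaries in a single list, so changing the boundary assumption only requires editing `SLOT_STARTS` or switching to `bisect_left`.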
<head>
: title
<div>
: part, chapter
<milestone>
: delimiter of subchapters
<p>
: paragraph
<s>
: sentence
<w>
: word
<pc>
: punctuation mark
@lemma
: lemma
@pos
: part of speech
@msd
: morphosyntactic features (Universal Dependencies)
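As an illustration of how the word-level annotation listed above might be read, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The XML fragment is a hypothetical example, not copied from the corpus, and its attribute values are invented. The sketch assumes that the files use the TEI P5 namespace as their default namespace, that `<w>` elements are direct children of `<s>`, and that `@msd` holds `|`-separated `Feature=Value` pairs; `parse_msd` and `read_sentences` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace; assumed to be the default namespace of the level2 files.
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment for illustration only ("A ház nagy." = "The house is big.").
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div><p>
    <s>
      <w lemma="a" pos="DET" msd="Definite=Def|PronType=Art">A</w>
      <w lemma="ház" pos="NOUN" msd="Case=Nom|Number=Sing">ház</w>
      <w lemma="nagy" pos="ADJ" msd="Case=Nom|Degree=Pos|Number=Sing">nagy</w>
      <pc>.</pc>
    </s>
  </p></div></body></text>
</TEI>"""


def parse_msd(msd):
    """Split a feature string such as 'Case=Nom|Number=Sing' into a dict."""
    if not msd or msd == "_":
        return {}
    return dict(pair.split("=", 1) for pair in msd.split("|"))


def read_sentences(xml_text):
    """Return one list per <s>, holding (form, lemma, pos, features) per <w>."""
    root = ET.fromstring(xml_text)
    return [
        [(w.text, w.get("lemma"), w.get("pos"), parse_msd(w.get("msd")))
         for w in s.iterfind("tei:w", TEI)]
        for s in root.iterfind(".//tei:s", TEI)
    ]


for form, lemma, pos, feats in read_sentences(SAMPLE)[0]:
    print(form, lemma, pos, feats)
```

Punctuation marks in `<pc>` are skipped here because only `<w>` elements are collected; iterating over all children of `<s>` would keep them in sequence.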
The folder contains the level1 and level2 files with headers in the ELTeC format. These files are not valid TEI, so we do not recommend using them.
- Gábor Palkó
- Tímea Borbála Bajzát
- Emma Takács
- Bence Vétek
- Zsófia Fellegi
- Péter Horváth
- Balázs Indig
- Bence Vida
- Botond Szemes
- Eszter Szlávich
The content of the repository is licensed under the CC-BY-SA 4.0 license.
All texts of the corpus are in the public domain.