The Newspaper and Periodical Corpus of the National Library of
Finland, Swedish sub-corpus, 1880–1948, scrambled, VRT
Persistent identifier: http://urn.fi/urn:nbn:fi:lb-2020110302
Licence: CC BY 4.0, https://creativecommons.org/licenses/by/4.0/
IPR holder: The National Library of Finland
Short name: klk-sv-1880-1948-s-vrt
Description
The corpus contains the years 1880–1948 of the Swedish sub-corpus of
the Newspaper and Periodical Corpus of the National Library of Finland
in the VRT (VeRticalized Text) format. The data has been digitized by
the National Library of Finland and converted to the VRT format and
annotated by FIN-CLARIN. The sentences within each page have been
scrambled to a random order for copyright reasons.
For more information, see the corpus metadata record at
http://urn.fi/urn:nbn:fi:lb-2020110302
The data has been annotated with an old version of Språkbanken’s Korp
corpus pipeline, with text-level metadata from the original data.
Please note that the text data has been programmatically recognized
from page images (OCR’d) and annotated without any manual correction,
so its quality varies significantly.
The data for each year is in a single file, named klk-sv-YYYY-s.vrt.
The data is encoded in UTF-8, with Unix-style line endings (LF). The
literal characters &, < and > have been encoded as the XML predefined
entities &amp;, &lt; and &gt;, and in structural attribute annotations
also " as &quot;.
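The entity encoding described above can be reversed when reading the
data. The following is a minimal sketch in Python, not part of the
corpus tooling; the function names decode_token_field and
decode_attribute_value are made up for illustration.

```python
from xml.sax.saxutils import unescape


def decode_token_field(value: str) -> str:
    """Decode &amp;, &lt; and &gt; in a positional-attribute value."""
    # saxutils.unescape replaces &amp; last, so "&amp;lt;" correctly
    # becomes the literal text "&lt;" rather than "<".
    return unescape(value)


def decode_attribute_value(value: str) -> str:
    """Decode a structural attribute value, where " is also encoded."""
    return unescape(value, {"&quot;": '"'})
```

Decoding should be done only after a line has been split into fields,
so that decoded characters cannot be mistaken for markup.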
Each token is on a line of its own, with the token and its annotation
attributes (positional attributes) separated by tabs. The attributes
are the following (in this order, also listed in the
“#vrt positional-attributes” comment at the beginning of the file):
  word: word form
  pos: part-of-speech tag
  msd: morpho-syntactic description
  lemma: base form(s)
  lex: lemgram(s) (lemma + part-of-speech code)
  saldo: lemma(s) with sense information
  prefix: prefix lemgram(s)
  suffix: suffix lemgram(s)
  ref: the number of the token in the sentence
  dephead: the number of the dependency head of the token
  deprel: dependency relation
  ocr: OCR confidence for the token (0.01…1.00)
  style: “_” (normal text), “subscript” or “superscript”
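As an illustration of the token line layout, a line can be split on
tabs and paired with the attribute names in the order listed above.
This is a sketch only: the constant and function names are invented
here, and the attribute order is hard-coded from the list above rather
than read from the “#vrt positional-attributes” comment.

```python
# Attribute names in the order given in this README.
POSITIONAL_ATTRIBUTES = (
    "word", "pos", "msd", "lemma", "lex", "saldo", "prefix",
    "suffix", "ref", "dephead", "deprel", "ocr", "style",
)


def parse_token_line(line: str) -> dict[str, str]:
    """Map one tab-separated token line to a dict keyed by attribute name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(POSITIONAL_ATTRIBUTES):
        raise ValueError(
            f"expected {len(POSITIONAL_ATTRIBUTES)} fields, got {len(fields)}"
        )
    return dict(zip(POSITIONAL_ATTRIBUTES, fields))
```

All values are kept as strings; converting ref or ocr to numbers is
left to the caller, since the OCR'd data may contain unexpected values.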
The attributes lemma, lex, saldo, prefix and suffix are feature-set
(multi-valued) attributes, in which the different values are separated
by vertical bars (|), with a leading and trailing vertical bar. A lone
vertical bar denotes the empty set (no value).
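A feature-set value as described above can be turned into a list of
values as follows. This is a minimal sketch; parse_feature_set is a
hypothetical helper name, and it assumes that individual values do not
themselves contain a vertical bar.

```python
def parse_feature_set(value: str) -> list[str]:
    """Split a feature-set value such as "|a|b|" into its members."""
    if not (value.startswith("|") and value.endswith("|")):
        raise ValueError(f"not a feature-set value: {value!r}")
    if value == "|":
        # A lone vertical bar denotes the empty set.
        return []
    # Drop the leading and trailing bar, then split on the inner bars.
    return value[1:-1].split("|")
```

A list is returned instead of a set so that the original order of the
values is preserved.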
Structural divisions are marked with XML-style tags, with annotations
associated with each structure as attributes in the start tag. The
order of the annotation attributes may vary. The structures and their
annotation attributes are:
text: A single page of a newspaper or magazine
  binding_id: issue identifier used for linking to page images at
    the National Library of Finland
  datefrom: the first date of the date range covering the issue date
    (yyyymmdd): if the issue date is a year, “yyyy0101”; if a month,
    “yyyymm01”
  dateto: the last date of the date range covering the issue date
    (yyyymmdd): e.g., if the issue date is a year, “yyyy1231”
  elec_date: digitization date (yyyy-mm-dd)
  file: original single-page VRT file name
  img_url: template for page image file name
  issue_date: date of the issue in the format [[dd.]mm.]yyyy
  issue_no: number of the issue
  issue_title: title of the issue
  label: name of the publication, issue number and date
  language: two-letter ISO 639-1 language code
  page_id: page identifier
  page_no: page number
  part_name: name of the part of publication (seldom used)
  publ_id: publication identifier: either ISSN or “fk” + number for
    publications without an ISSN
  publ_part: part of publication (number) (seldom used)
  publ_title: name of the publication
  publ_type: type of publication: “sanomalehti” for a newspaper,
    “aikakausi” for a periodical
  sentcount: number of sentences on the page
  timefrom: always “000000” (time information at day granularity)
  timeto: always “235959”
  tokencount: number of tokens on the page
sentence: A sentence
  id: unique identifier of the sentence
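Because the order of the annotation attributes may vary, it is safest
to read a start tag into a name-to-value mapping. The sketch below
assumes that attribute values are enclosed in double quotes (which is
consistent with " being encoded as &quot; inside them); the function
name parse_start_tag is invented for this example.

```python
import re
from xml.sax.saxutils import unescape

# A start tag such as <text a="1" b="2">, with any number of attributes.
_START_TAG = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*>')
_ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def parse_start_tag(line: str):
    """Return (structure name, attribute dict), or None if the line
    is not a start tag (for example an end tag, comment or token)."""
    match = _START_TAG.fullmatch(line.strip())
    if match is None:
        return None
    name, raw_attributes = match.groups()
    attributes = {
        key: unescape(value, {"&quot;": '"'})
        for key, value in _ATTRIBUTE.findall(raw_attributes)
    }
    return name, attributes
```

For example, a line starting a sentence yields the name "sentence" and
a dictionary with the single key "id".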
Note that sentences broken by page breaks have not been concatenated.
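The relationship between issue_date and the datefrom and dateto
attributes described above can be illustrated with the following
sketch. It is not the code used to produce the corpus. The function
name is hypothetical, and the handling of month-only dates in dateto
(the last day of the month) is inferred from the description “the last
date of the date range covering the issue date”.

```python
import calendar


def issue_date_range(issue_date: str) -> tuple[str, str]:
    """Derive (datefrom, dateto) in yyyymmdd form from [[dd.]mm.]yyyy."""
    parts = [int(part) for part in issue_date.split(".")]
    parts.reverse()  # now ordered as year, month, day
    year = parts[0]
    if len(parts) == 1:
        # Only a year is known: the range covers the whole year.
        return f"{year:04d}0101", f"{year:04d}1231"
    month = parts[1]
    if len(parts) == 2:
        # Only year and month are known: the range covers the month.
        last_day = calendar.monthrange(year, month)[1]
        return (f"{year:04d}{month:02d}01",
                f"{year:04d}{month:02d}{last_day:02d}")
    # A full date: the range is that single day.
    stamp = f"{year:04d}{month:02d}{parts[2]:02d}"
    return stamp, stamp
```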
The data also contains some single-line XML-style comments <!-- ... -->
at the beginning and end of each file.
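Putting the pieces together, a file can be read line by line, skipping
comments and grouping token lines by sentence. The following is a
minimal sketch under the assumptions stated in this README; the
function name iter_sentences is made up, and the sketch does not
validate that the tags are properly nested.

```python
from typing import Iterable, Iterator


def iter_sentences(lines: Iterable[str]) -> Iterator[list[list[str]]]:
    """Yield each sentence as a list of token rows (lists of fields)."""
    tokens = None  # None while outside a sentence
    for raw_line in lines:
        line = raw_line.rstrip("\n")
        if line.startswith("<!--"):
            continue  # single-line comment
        if line.startswith("<sentence"):
            tokens = []
        elif line.startswith("</sentence>"):
            if tokens is not None:
                yield tokens
            tokens = None
        elif line.startswith("<"):
            continue  # other structural tags, such as <text> and </text>
        elif tokens is not None and line:
            tokens.append(line.split("\t"))


# Possible usage, with a file name following the klk-sv-YYYY-s.vrt pattern:
#
#     with open("klk-sv-1880-s.vrt", encoding="utf-8", newline="\n") as f:
#         for sentence in iter_sentences(f):
#             ...
```

Treating every line that begins with < as markup is safe here because
a literal < in the token data is encoded as &lt;.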