There are certain XML files in version 2.1 of the treebank that contain Unicode 787 (U+0313, combining comma above) as a marker for elision. I could be misunderstanding something, but this seems like a mistake. This character should be 8125 (U+1FBD, Greek koronis) or 8217 (U+2019, right single quotation mark, i.e. a typographic apostrophe). The combining character is a non-spacing counterpart of the koronis, so instead of occupying its own space it is drawn on top of a neighboring character.
The instances of this character can of course be located and cleaned up most easily with software, but as an example, there is tlg0012.tlg001.perseus-grc1.tb.xml, which has the following for Iliad 2.191:
In the text editors and browser I'm using, what I see here is that the 787 character combines with the double quote after it and is displayed in a way that is clearly incorrect.
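As a rough sketch of how such instances could be located with software, the following Ruby lists every line of a text that contains U+0313. The helper name `lines_with_combining_comma` and the sample strings are made up for illustration; they are not taken from the treebank or from any Perseus tooling.

```ruby
# Hypothetical helper for illustration only.
COMBINING_COMMA_ABOVE = "\u0313" # decimal 787, non-spacing
GREEK_KORONIS         = "\u1FBD" # decimal 8125, spacing

# Returns [line_number, line] pairs for every line that contains U+0313.
def lines_with_combining_comma(text)
  text.each_line.with_index(1)
      .select { |line, _n| line.include?(COMBINING_COMMA_ABOVE) }
      .map { |line, n| [n, line.chomp] }
end

# Made-up sample: line 1 uses the combining character, line 2 the koronis.
sample = "τ#{COMBINING_COMMA_ABOVE}\nτ#{GREEK_KORONIS}\n"
lines_with_combining_comma(sample).each do |n, line|
  # Printing the code points makes the otherwise invisible difference explicit.
  puts "line #{n}: #{line.unpack('U*').inspect}"
end
```

Printing code points rather than the characters themselves avoids depending on how a given editor or terminal renders the combining mark.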
A similar issue occurs on the same line of Homer:
Here the iota+apostrophe in δαιμόνι' is encoded as an iota with smooth breathing. I think this should also be pretty simple to take care of with a computerized search. E.g., if you have a word that's three characters or more in length, and the final character has a breathing mark, then that has to be a mistake. Ditto if the word has a breathing mark but the first character isn't a vowel or ρ.
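A minimal sketch of the two checks just described, assuming each word is passed in as a separate UTF-8 string. The method name `suspicious_breathing?` and the constants are hypothetical. The word is decomposed to NFD so that a precomposed letter such as ἰ and a base letter followed by a separate combining mark are treated the same way.

```ruby
# Hypothetical checker for illustration; names are assumptions.
PSILI = "\u0313" # combining comma above (smooth breathing)
DASIA = "\u0314" # combining reversed comma above (rough breathing)
BREATHINGS = [PSILI, DASIA].freeze
VALID_INITIALS = 'αεηιουωρ'.freeze # vowels plus rho

def suspicious_breathing?(word)
  # After NFD, each grapheme cluster is a base letter plus its combining marks.
  clusters = word.unicode_normalize(:nfd).grapheme_clusters
  breathed = clusters.map { |c| BREATHINGS.any? { |b| c.include?(b) } }
  return false unless breathed.any?

  # Check 1: a word of three or more characters ends in a breathing mark.
  final_breathing = clusters.length >= 3 && breathed.last
  # Check 2: the word has a breathing mark but does not start with a vowel or rho.
  bad_initial = !VALID_INITIALS.include?(clusters.first[0].downcase)
  final_breathing || bad_initial
end

%W[δαιμόν\u1F30 ἀνήρ οὐ τ\u0313].each do |w|
  puts "#{w.unpack('U*').inspect} suspicious: #{suspicious_breathing?(w)}"
end
```

One caveat: the second check would also flag crasis forms such as κἀγώ, where the koronis is normally encoded as a smooth breathing, so hits from it probably need to be reviewed by hand rather than fixed automatically.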
Here is some Ruby code I wrote to postprocess the treebank XML files to deal with these issues and some others:
def clean_up_combining_characters(s)
  combining_comma_above = [787].pack('U') # U+0313, non-spacing
  greek_koronis = [8125].pack('U')        # U+1FBD, the spacing character "᾽"
  # mistaken use of combining comma above rather than the spacing version
  # https://github.com/PerseusDL/treebank_data/issues/31
  s = s.gsub(combining_comma_above, greek_koronis)
  # seeming one-off errors in perseus:
  s2 = s
  # 'U*' packs every code point in the array; a bare 'U' packs only the first,
  # which would leave a doubled vowel (ἥἥ, ἄἄ) behind after the substitution.
  s2 = s2.gsub([8158, 7973].pack('U*'), "ἥ") # spacing dasia and oxia (U+1FDE) before ἥ
  s2 = s2.gsub([8142, 7940].pack('U*'), "ἄ") # spacing psili and oxia (U+1FCE) before ἄ
  s2 = s2.gsub([8142, 7988].pack('U*'), "ἴ") # spacing psili and oxia (U+1FCE) before ἴ
  s2 = s2.gsub(/#{[769].pack('U')}([μτ])/) { $1 } # combining acute before a mu or tau, obvious error
  s2 = s2.gsub(/#{[769].pack('U')}ε/) { 'έ' }     # combining acute before epsilon
  s2 = s2.gsub(/#{[180].pack('U')}([κ])/) { $1 }  # spacing acute (U+00B4) before a kappa, obvious error
  s2 = s2.gsub([834].pack('U'), '') # U+0342, combining Greek perispomeni; stripped
  s2 = s2.gsub(/ʽ([ἁἑἱὁὑἡὡ])/) { $1 } # redundant rough breathing mark
  # smooth breathing on the last character of a long word; this is a mistake in representation of elision
  # https://github.com/PerseusDL/treebank_data/issues/31
  s2 = s2.gsub(/(?<=[[:alpha:]][[:alpha:]])([ἀἐἰὀὐἠὠ])(?![[:alpha:]])/) { $1.tr("ἀἐἰὀὐἠὠ", "αειουηω") + "᾽" }
  if s2 != s
    $stderr.print "cleaning up what appears to be an error in a combining character, #{s} -> #{s2}, unicode #{s.chars.map { |x| x.ord }} -> #{s2.chars.map { |x| x.ord }}\n"
    s = s2
  end
  return s
end