Latin alphabet characters in Greek text. How to find and eliminate? #342

Open
jcowey opened this issue Nov 15, 2022 · 3 comments

@jcowey
Member

jcowey commented Nov 15, 2022

An email from D. Kaltsas has alerted me to a problem in the DDBDP:
"I could not find Μαχάτου or Ζηνοδότωι in P.Genova IV 158, but could find αχατου and ηνοδοτωι, meaning there is something the matter with the Μ and the Ζ: most probably that they are the Roman letters rather than the Greek ones."

I checked that specific text by searching for \u004D (LATIN CAPITAL LETTER M), and the analysis is correct: a Latin capital M is used in Μαχάτου in line 4 of that text, at line 56 of the XML source. Note that in oXygen my search for \u004D did not distinguish between upper and lower case; it returned every "m".
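One way to sidestep the case-insensitive search is to compare code points directly, outside the editor. The following is only a minimal sketch in Python, assuming the file contents have already been read into a string. `find_code_point` is a hypothetical helper written for illustration, not part of any existing DDbDP tooling.

```python
def find_code_point(text, code_point=0x004D):
    """Return (line_number, column) pairs where the exact code point occurs.

    The comparison is on the code point itself, so U+004D (LATIN CAPITAL
    LETTER M) never matches U+006D (m) or U+039C (GREEK CAPITAL LETTER MU).
    """
    target = chr(code_point)
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for column, ch in enumerate(line, start=1):
            if ch == target:
                hits.append((line_number, column))
    return hits


# Illustrative sample, not taken from the actual file: the first word starts
# with a Latin M (U+004D), the second with a Greek Mu (U+039C).
sample = "\u004Dαχάτου\n\u039Cαχάτου mm"
print(find_code_point(sample))
```

Only the Latin capital on the first line is reported; the Greek Mu and the lower-case "m"s are ignored.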

"Experimenting, I got the same result with Μεμφεως (non invenitur) and εμφεως (inventum est) in P.Genova IV 132. "

Is there "some way of estimating the extent of the problem. Just P.Genova IV or more? And just these two letters or other ones shared between the two alphabets, too?"

@hcayless Can you give some guidance as to how best to tackle the problem(s)? One could probably design a regex that matches any Latin character in every section of text marked with xml:lang="grc". Even with some thought, that goes beyond what I can do effectively.
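A plain regex cannot easily follow `xml:lang` inheritance, so a script that walks the XML tree may be a more practical way to estimate the extent of the problem. What follows is a rough sketch in Python using only the standard library. It assumes that Greek text is tagged `grc` and Latin text `la`, and it classifies letters by their Unicode character names. The function names are hypothetical, and any hits would still need human review, since some Latin letters in Greek passages may be legitimate.

```python
import unicodedata
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def script_of(ch):
    """Classify a letter as 'LATIN' or 'GREEK' from its Unicode name, else None."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "GREEK"):
        if name.startswith(script):
            return script
    return None


def find_mixed_script(xml_text):
    """Return (lang, char, snippet) for letters of the wrong script.

    Assumption: 'grc' text should contain no Latin letters and 'la' text
    should contain no Greek letters.
    """
    unexpected = {"grc": "LATIN", "la": "GREEK"}
    hits = []

    def check(text, lang):
        wrong = unexpected.get(lang)
        if not text or wrong is None:
            return
        for i, ch in enumerate(text):
            if script_of(ch) == wrong:
                hits.append((lang, ch, text[max(0, i - 5):i + 6]))

    def walk(elem, inherited):
        lang = elem.get(XML_LANG, inherited)
        check(elem.text, lang)       # text inside the element uses its own language
        for child in elem:
            walk(child, lang)
            check(child.tail, lang)  # text after a child belongs to the parent

    walk(ET.fromstring(xml_text), None)
    return hits


# Illustrative sample only: a Latin M in Greek text and a Greek omicron in Latin text.
sample = (
    '<div xml:lang="grc"><ab>\u004Dαχάτου '
    '<foreign xml:lang="la">c\u03BFnsul</foreign> \u0396ηνοδότωι</ab></div>'
)
for lang, ch, snippet in find_mixed_script(sample):
    print(lang, f"U+{ord(ch):04X}", unicodedata.name(ch), repr(snippet))
```

Run over every file in a directory, the same function would show whether the problem is limited to one volume and which letters are affected.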

@hcayless
Member

Huh. I've just done a search and found a bunch of these. And not even just Latin characters sneaking into Greek. There are Greek chars sneaking into Latin! Very odd. I'll play around with seeing what I can do via find and replace to start with.
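As an illustration of what a find-and-replace pass for the Latin-in-Greek direction might look like, here is a hedged sketch in Python. The look-alike table is an assumption and deliberately incomplete, the function names are hypothetical, and the input is assumed to be the NFC-normalized text of a Greek passage rather than a whole XML file. Words with no Greek letters at all are left untouched, so that genuine Latin words survive.

```python
import re
import unicodedata

# Assumed, incomplete table of look-alike letters (Latin -> Greek).
LATIN_TO_GREEK = str.maketrans({
    "A": "\u0391", "B": "\u0392", "E": "\u0395", "Z": "\u0396",
    "H": "\u0397", "I": "\u0399", "K": "\u039A", "M": "\u039C",
    "N": "\u039D", "O": "\u039F", "P": "\u03A1", "T": "\u03A4",
    "X": "\u03A7", "Y": "\u03A5", "o": "\u03BF",
})


def is_greek(ch):
    """True if the character's Unicode name marks it as Greek."""
    return unicodedata.name(ch, "").startswith("GREEK")


def fix_word(match):
    word = match.group(0)
    # Only touch words that already contain Greek letters.
    if any(is_greek(ch) for ch in word):
        return word.translate(LATIN_TO_GREEK)
    return word


def replace_latin_homoglyphs(text):
    """Swap Latin look-alikes for Greek letters inside otherwise Greek words."""
    return re.sub(r"\w+", fix_word, text)


# Illustrative input: a Latin M (U+004D) heads the Greek word; "Memphis" is real Latin.
fixed = replace_latin_homoglyphs("\u004Dαχάτου Memphis")
print([f"U+{ord(ch):04X}" for ch in fixed[:1]], fixed)
```

The reverse direction (Greek characters in Latin text) would need its own table and the same kind of review before any bulk change is committed.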

@jcowey
Member Author

jcowey commented Nov 15, 2022

Is it just in the P.Genova IV files? https://github.com/papyri/idp.data/tree/master/DDB_EpiDoc_XML/p.genova/p.genova.4

Or is it across a wider range of files?

The reason I ask is that I had to transcode a non-Unicode Greek font to Unicode for the P.Genova IV files. I have just tested the "M" in Μαχάτου in the line quoted above, and in the file the M is indeed a Latin one.

If it is only in those P.Genova IV files then my transcoding (a bad one as it turns out) will be the source of the mess.

The only good thing about that would be that almost all other files should be clean, I hope. In the past I have had to transcode other files to Greek Unicode as well, for example https://github.com/papyri/idp.data/tree/master/DDB_EpiDoc_XML/p.prag/p.prag.3

I hope to goodness that the problem is not more widespread.

@jcowey
Member Author

jcowey commented Jan 23, 2023

jcowey assigned jcowey and Edelweiss and unassigned hcayless on Jan 23, 2023