Publication thread for Fall 2018 #19
RE Besa's letters - there are way more than two - all of MONB.BA and BB, so almost all of the existing letters. We've only processed two so far though, and I'm not sure if we'll run into problems with later ones. |
Yes, I plan to do some more Johannes. I may also have something from Budge from my Coptic reading group at Reed. |
I'm looking at items in Gitdox for publication. Looks like in addition to treebanked material in Mark, 1 Cor, Victor, A22, and AOF we also have Eagerness docs. Is that correct, @amir-zeldes ? Are these newly treebanked Eagerness docs? We also have a doc from Not Because a Fox barks. Are we republishing this corpus? Last, there look to be some validation issues. I will go through them and let you know if I have any problems or questions. (I am skipping the AP, since we will have new AP to publish in the Winter.) |
Eagerness has no treebanked documents, so any edits are presumably sporadically noticed errors (probably no more than a handful). If there are no new documents, maybe we should wait with Eagerness until there is new material - I think there were still new documents coming in the future, right? Similarly NBFB may have some tiny correction, but otherwise nothing new really. I may wait with it until we are closer to 'one click publication'. The rest have considerable changes due to treebanking and should be re-imported, they are much better quality now. |
I do not think there are more documents for Eagerness. I have been done with it for a while and have moved on to Those.
Becky
That's good to know thanks! We could try to squeeze them in, but as I wrote above, the changes are probably minimal, so maybe we should wait until we treebank some of Eagerness. Another question about Mark/1 Cor - I see failed validations due to 'p' missing - do we want to require p? If so in what units? p mainly serves to segment the normalized view for convenience, but for Bible chapters, the verses already do a good job of that, so maybe we can remove this requirement for the Bible? |
Hi. I was making a list of things to go over as I was reviewing the corpora for publication, and "p" was on the list. It relates to our decision in DC to minimize the number of visualizations & viz names, as well. I think we can change the validation to p | vid_n. I am adding vid_n (the cts urns at the verse level) to all corpora as they are re-published. The visualizations break the text at p or at v, right? v is the verse number written as a number, and vid_n is the urn for the verse (same span as v). I would prefer the validation to be p | vid_n to remind us to add those cts urns. |
I'm making a list of things that are coming up, that I'll post when I'm done. But two big ones:
Only chapters 1-6 of Mark are treebanked at the moment, same as 1Cor. Meta should show everything as gold for the treebanked chapters, and pos/seg 'checked' for the rest, parsing 'auto'. The tag/seg/parse metadata is document-wise, so mixed corpora should not be a problem.
RE p-annotations: I went ahead and made the p-check corpus dependent. There is no current way to make one validation check for either/or, so we'd need a separate rule to require vid_n in corpora where that's relevant (all corpora?) |
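For what an either/or rule could eventually look like, here is a hypothetical sketch. None of these function or field names are actual GitDox API; the layer dictionary is a stand-in for however document annotations are exposed internally.

```python
# Hypothetical sketch of an either/or validation rule of the kind
# discussed above: a document passes if it has spans on EITHER the
# 'p' layer OR the 'vid_n' layer. All names are illustrative and
# not part of the actual GitDox code base.

def has_layer(doc_layers, name):
    """True if the document has at least one span on the given layer."""
    return bool(doc_layers.get(name))

def validate_p_or_vid_n(doc_layers):
    """Return a list of validation errors (empty list means the doc passes)."""
    if has_layer(doc_layers, "p") or has_layer(doc_layers, "vid_n"):
        return []
    return ["document has neither 'p' nor 'vid_n' spans"]

# A Bible chapter segmented by verse URNs but with no 'p' layer still passes:
chapter = {"vid_n": ["urn:cts:copticLit:nt.mark:1.1"], "verse": ["1"]}
print(validate_p_or_vid_n(chapter))  # → []
```

This would cover the Bible case above (verse URNs present, no p) without loosening the check for corpora that only use p.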
Can you go in and edit the parsing/tagging/segmenting metadata? There are a number of docs marked “review” or “to publish,” and there is currently no way for me to tell which ones in each corpus are treebanked (since the parsing data is elsewhere). I realize a mixed corpus is fine technically re treebanking, but from an annotating/curating point of view, having me make those edits to the metadata is asking for trouble, because some corpora have both treebanked and non-treebanked docs up for review or publication. I think you need to go in and make the changes since there are mixed treebanked corpora.
I will for sure check the rest of the metadata and add corpus metadata to the corpora (like 1Cor) without. |
I can do the automation meta for Sahidica. It's pretty well documented, though; the list of treebanked documents is in the table here:
Generally the process is most effective and accurate when each annotator adds/edits metadata when they annotate. Otherwise there is a lot of back and forth with the person doing review, or something gets missed. There's no effective way for the person conducting the final editorial review to keep in their head which metadata might change and which might not for each publication thread. The person doing the review (not always me) needs to be able to look at the metadata for obvious errors, like typos or missing fields, but other than version number/date isn't expected to go through each existing field and ask whether the data needs to be changed. I will go ahead and reassign docs back to you for checking the parsing/tagging/segmentation metadata before publication. Thanks!
ⲞⲔ, 1Cor and Mark should be good to go from the NLP metadata perspective. I also corrected any validation errors that are automatically caught, so they're all green, but I'm not sure if there's something we wanted but haven't added a validation for yet. |
? I'm not sure I understand the preceding comment - I have no metadata changes to make that I'm aware of. I'm happy to keep NLP metadata up to date as we treebank in the future, but these are fields that didn't exist when the treebanking happened. Sahidica is now up to date. |
If I have any questions about the other mixed corpora besides the Sahidica ones, I'll let you know. |
Thanks for editing the Sahidica ones! |
@amir-zeldes can you tell me who has been treebanking (and then correcting tagging/segmentation) for the AOF, Victor, A22, Mark, 1 Cor texts? I will add their names to the corpus and document metadata. Thanks! |
Mark+1cor new material is Mitchell. A22 is me and Liz. AOF is just me, Victor is me, Mitchell and the four Israelis listed here: https://github.com/UniversalDependencies/UD_Coptic-Scriptorium/blob/dev/README.md (under acknowledgments) |
Thank you! 1Cor is ready (I also added corpus metadata to GitDox) except for these questions:
A couple of questions about Mark:
Nothing to worry about! A couple of sections were treebanked, so I am adding verses and chapters. It's not a big deal; I'm just updating those two files to current standards since we are republishing.
AOF is done |
Here are the urns that need addressing. We should either put something in the 404 page or somehow redirect. Probably easier to list them on the 404 for now; I think any redirect is complex with this application.
Also @amir-zeldes can you check the parsing/segmentation/tagging metadata for the files that were treebanked? Not sure I got them right.
Mark is as ready as it will be. @amir-zeldes I think we're good to go. |
OK, I went over the AOF exports, they all check out now. The way to see those line numbers is to do a TEI export from the editor, download the file, and find that line number in the file. Usually I look for some word or translation nearby and actually check out the grid instead of trying to figure out the XML, since the error is usually apparent in the grid too. XL still needs versification, it seems - did you say you wanted to just follow the translations or do something else? Once that checks out I think we really are good to go! Oh, and one more thing, is Besa/Vigilance included?
Hi,
I didn't add verses to the part of XL that is not AOF. XL is a florilegium codex. I'm not sure what work that piece of XL is from. I'd have to look it up. The AOF section has verses and verse ids. I did go through just now to make sure the spans coincide with each other. If empty spans are a problem then you can just put in some placeholder like "undetermined".
I have not had time to touch Besa. The other corpora ended up more complicated than I anticipated. Besa is next on my list. If you want to wait for that it may take a week or more, because I need to check the metadata pretty closely and add all the cts urns, and I have to go over the final white paper comments from board members (Heike's final report is due Dec 15, and we want to be sure they are close.)
Only two letters of Besa are marked for review in Gitdox. Can you be sure everything you want reviewed is marked for review? I don't want to miss anything. Thanks so much!
Best,
Carrie
AOF can't be published without filled verse/vid, based on the new schema. I'm happy to put 'undetermined' there, but it does seem odd... If they're another work, then we should consider putting them in another corpus one of these days, since for other works, our corpus objects logically correspond to the works, not codices. I'm fine waiting with Besa, in the meantime I'll start processing the other corpora so we can get the release published. As for review, really all that means is 'was edited', so maybe we should start calling these something else (review sounds like the whole document needs to be reviewed, where really only some tiny changes occurred) |
I can break up XL, but it's one doc in the treebanking corpus so I didn't want to mess with it. |
Yeah, I understand, I'm not disagreeing about changing statuses right now, but I think we should have a conversation about this again sometime - possibly in parallel with or just before training the new DH specialist. I just fixed XL, so it's 'undetermined' for URNs and 'x' for the verse number, that should work for conversion and validation for now. AOF validates now, we are good to go, leaving Besa aside. |
Ok that all sounds good, thanks!! |
OK, everything is up on ANNIS now, excluding new Besa for the moment. Please take a look and if all looks well I can set up the ingest and push to GitHub as well. |
Thanks so much, @amir-zeldes. I will look over this tonight and tomorrow. In the meantime, what do you think about the issue of the urns? (#19 (comment)) |
I had a look at the Excel table - these all seem to be pure URNs, not URLs, so I'm not sure what you mean by 404 above - if they're URLs we can probably set the server's apache to intercept and 404 them, but if we are at the URN level, this is something the repo software needs to handle, no? Does it have any existing functionality to handle redirects or some kind of '404'-like scenario? |
What happens is the web application either 1) takes the URN someone types into a box and spits out the corresponding URL, which is operationalized (maybe wrong word here?) as a list of documents that contain that URN; or 2) takes the URL data.copticscriptorium.org/URN generated by someone clicking on that link somewhere else or typing it into their browser bar and likewise generates a page that is a list of documents that contain that URN. |
Oh hey, the results message is in the index file https://github.com/CopticScriptorium/cts/blob/master/coptic/templates/index.html. I really do not know how the repo would manage a redirect.
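The lookup flow described above could be sketched roughly like this. The URNs and document names are made up, and this is not the actual cts app code; it just shows where a custom message for retired URNs could slot in.

```python
# Rough sketch of the URN lookup flow described above: a query URN is
# resolved to the list of documents filed under it, and anything that
# doesn't match falls through to the "no results" message. The URNs and
# document names below are made up; this is not the actual cts app code.

KNOWN_DOCS = {
    "urn:cts:copticLit:shenoute.fox": ["fox_doc_1"],
    "urn:cts:copticLit:shenoute.eagerness": ["eagerness_doc_1", "eagerness_doc_2"],
}

def resolve_urn(urn):
    """Return all documents whose base URN matches the query (prefix-wise)."""
    matches = []
    for base, docs in KNOWN_DOCS.items():
        if urn == base or urn.startswith(base + ":"):
            matches.extend(docs)
    return matches

def render(urn):
    """Produce the results page body: matched docs, or the fallback message."""
    docs = resolve_urn(urn)
    if not docs:
        # A retired or moved URN could get a custom message (or a redirect
        # target) here instead of the generic "no results" text.
        return "No results for " + urn
    return "\n".join(docs)
```

The point is that the fallback branch is the one place the app already distinguishes "unknown URN" from a real match, so listing the moved URNs there is the cheap fix, while a true redirect would need a lookup table consulted before that branch.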
Meanwhile, these documents look great in ANNIS. Just a couple of things. I made small edits to the following files, so they need to be redone:
Also:
OK, the SNP bug should be resolved. Mark has also been updated, and we have fresh versions of AOF and A22 as well. I've spot checked them, but please take a look as well. |
Regarding the redirect, yes, if it were a URL based system they could be intercepted at the apache level, before the app ever sees the request, but the way it's been built this would require making some code changes. Maybe we should put some renovations to the repo on the agenda for next semester. I think we should talk about prioritization again at some point in January. |
Ok so in the meantime should I modify the text for the “no results...” message?
Mmm... I guess you could, sure. Ultimately I'd like a better solution for this, but it might take some time, so this could be a good band-aid.
Right. We need something in the meantime. I am home sick today but will get on this Monday.
OK release is basically done except for a few behind-the-curtain actions:
@amir-zeldes is there a way for an admin to batch change status "to_publish" to "published" for all docs with that status? Thanks!
(@amir-zeldes also I switched AP 18 and 26 from "to_publish" to "review." I see from GitDox commits they have been treebanked. "to_publish" indicates everything is ready, including updated metadata for version # and date; these will still need some metadata changes. Thx!!!) |
I can change values in the DB using a SQL statement to change statuses en masse, though maybe this is a good feature to have. If you need me to do something like that let me know, in the meantime I'll open an issue. |
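The kind of en masse change described could look like the sketch below, run here against an in-memory SQLite table for illustration. The `docs` table and column names are assumptions, not the real GitDox schema.

```python
# Sketch of a bulk status change like the one described above, run against
# an in-memory SQLite table. The 'docs' table and its columns are assumed
# for illustration; they are not the real GitDox schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (name TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("mark_01", "to_publish"), ("aof_03", "to_publish"), ("ap_18", "review")],
)

# The en masse change: everything marked to_publish becomes published
cur = conn.execute("UPDATE docs SET status = 'published' WHERE status = 'to_publish'")
conn.commit()
print(cur.rowcount)  # → 2 (rows changed; the 'review' doc is untouched)
```

Because the `WHERE` clause keys on the old status, docs already switched back to "review" (like AP 18 and 26 above) are left alone.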
Ok, I think it’s not a high priority if you’re willing to do it yourself. So in the meantime could you please switch everything in GitDox we published to “published”? Should be everything currently labeled to_publish. Thanks!
Done! |
- Shenoute canons 6 (@somiyagawa) MOVED to Publication Thread for Spring 2019 #22
  - needs permission from Heike Behlmer
  - needs metadata (see thread Review metadata for new Canons 6 corpora #20)
- More Johannes canons? (@eplatte)
- Some Kinds of People Sift Dirt (@cluckmarq)
- God Says Through Those Who Are His (@bkrawiec)

For automated corpora we will add info about fully automated tokenization and annotations in the metadata.
Possible:
- AP (if Marina has new AP)
- Besa (@amir-zeldes @somiyagawa) (Besa from two main codices) MOVED to #22
  - needs permission from Heike Behlmer for translation
  - needs to be broken into documents
  - needs to align translation (scraping the translation text will take a few days; the translation then needs to be manually aligned)
  - needs metadata

Before publication when checking metadata: