Hi again @tombogle. After a few months' focus on other things, I've been pondering devoting some time to combining the available datasets (which AFAIK are Glyssen data, Viz.Bible, and STEPBible) into one table (with all the different speaker ids) and then adding a new, improved (ha ha) id column (using English as the "metalanguage" as per your #3 here). I agree that the STEPBible ids containing the first Bible reference are too verbose for human use. And the Viz.Bible ones contain consecutive numbers 1, 2, 3... which are often superfluous when there's only one person with that name.

Some datasets contain things like the age of the speaker (not sure yet how to handle young and old David, etc. -- I'll see what Glyssen did -- it's definitely only one person, but two different voices are required) and some contain lists of relatives, time periods, geographical locations, etc. I guess Glyssen is only interested in people that speak, but some DBs will include all named people and places, etc. Not certain off the top of my head if any of those datasets contains a brief prose description of each person, i.e., something like "The second son of King X of Moab and mother Y who died in a battle at Jericho with no record of any surviving descendants."

The above datasets are open-licensed, and just as importantly, I get the impression the owners are actually happy for others to use and reuse their work. I'm not trying to twist anyone's arm if they're not -- we'll just redo the work if necessary and give it as a gift to the world. (I think @robertrouse said that the DigitalManna work is mostly represented in his Viz.Bible tables, and I've heard no update on the YWAM data becoming available.)

I'm sure that doing this would be much more work than initially meets the eye, but I would hope that the process of inputting and combining all the different formats (XML, JSON, TSV, etc.) and then being able to view the data in one wide table might bring more understanding and clarification to my slow old brain.

Oh, another difference in lists of references, perhaps, is that I'm not interested in knowing that Jesus speaks in Matt 5:3, Matt 5:4, Matt 5:5, Matt 5:6, etc. That should be one reference: Matt 5:3-7:27. (But yes, I realise that some translations will add text like, "And Jesus continued, ...", typically after headings.)

I'm working on other NT projects at the moment, so the NT is more my current focus, and CNTR is soon to release the new, open Statistical Restoration GNT (the website displays RC1), so I would also connect it to that. (I'm also pondering some kind of stand-off markup to go back from Bibles to data tables -- I think USFM simply isn't robust enough for the rich markup that I have in mind.)

My initial thoughts would be to eventually make the master table XML, with Python scripts to derive JSON and TSV tables. Then others could easily just adjust the scripts to remove data fields that are irrelevant to their applications. I would do all the work right here in this repo, although possibly changing the name, since it's not just speakers that some of the rest of us are interested in. (Mostly just thinking aloud at this point -- any comments welcome.)
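
To make the "XML master, scripts derive JSON and TSV" idea a little more concrete, here is a rough sketch of what such a derivation script might look like. Everything in it is an assumption for illustration only: the `<people>`/`<person>` layout, the field names (`glyssenId`, `vizBibleId`, `description`), and the sample values are all made up and do not reflect the real datasets or any agreed schema.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical master table. Element names, ids, and values are invented
# purely for illustration; they are not taken from any real dataset.
MASTER_XML = """\
<people>
  <person id="person-a">
    <glyssenId>G-A</glyssenId>
    <vizBibleId>V-A_1</vizBibleId>
    <description>Made-up description.</description>
  </person>
  <person id="person-b">
    <glyssenId>G-B</glyssenId>
    <description>Another made-up description.</description>
  </person>
</people>
"""


def load_rows(xml_text):
    """Flatten each <person> element into a dict of column name -> text."""
    root = ET.fromstring(xml_text)
    rows = []
    for person in root.findall("person"):
        row = {"id": person.get("id")}
        for child in person:
            row[child.tag] = (child.text or "").strip()
        rows.append(row)
    return rows


def drop_fields(rows, unwanted):
    """Return copies of the rows without the columns named in `unwanted`."""
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]


def to_json(rows):
    """Serialise the rows as a JSON array of objects."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


def to_tsv(rows):
    """Serialise the rows as TSV; columns missing from a row are left empty."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # An application that does not need prose descriptions just drops them.
    slim = drop_fields(load_rows(MASTER_XML), {"description"})
    print(to_json(slim))
    print(to_tsv(slim))
```

The point of the sketch is only that removing a field for a particular application is a one-line change to the set passed to `drop_fields`; the real master format would still need to be agreed on first.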
-
This is probably the big idea and at least initially should be the main focus of our discussions. If we can succeed at doing this, it maximizes the potential for sharing and repurposing data. If we can't (and there are some reasons we might not), then we might need to fall back to discussing standards for exchange and interoperability.