-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dev/issue 8894 premis parsing #20
Dev/issue 8894 premis parsing #20
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the PREMIS work, but this might not be the direction we want to take PREMIS parsing.
Since this is the METS reader-writer, and METS can contain a variety of other metadata standards, it makes sense to leave the parsing of xmlData
contents to something that users can easily add - some sort of plugin system. We could ship a couple of plugins for specific metadata standards ourselves (the ones we use, eg PREMIS & DublinCore), and this is a good candidate for a premis plugin, and so is https://github.com/uchicago-library/uchicagoldr-premiswork
I think this should wait to be merged until a longer term solution for how to handle parsing a variety of metadata format has been discussed, since that's out of scope for this PR.
|
||
def __init__(self, title=None, creator=None, subject=None, description=None, publisher=None, contributor=None, date=None, format=None, identifier=None, source=None, relation=None, language=None, coverage=None, rights=None): | ||
for element in self.DC_ELEMENTS: | ||
setattr(self, element, locals()[element]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great use case for kwargs
instead of accessing locals()
directly. We could replace the list of parameters with **kwargs
, which would allow us to accept an arbitrary number of inputs. kwargs
is a dictionary. Then the loop could be setattr(self, element, kwargs[element])
:raises exceptions.ParseError: If the root is not xmlData or doesn't contain a dublincore element. | ||
""" | ||
if root.tag != utils.lxmlns('mets') + 'xmlData': | ||
raise exceptions.ParseError('DublinCoreXmlData can only parse xmlData elements with mets namespace.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If DC is a plugin and a therefore a separate module, This should be a different, DC-specific exception, possibly that subclasses ParseError.
metsrw/metadata.py
Outdated
|
||
def serialize(self): | ||
nsmap = {'mets': utils.NAMESPACES['mets'], 'xsi': utils.NAMESPACES['xsi'], 'xlink': utils.NAMESPACES['xlink']} | ||
root = etree.Element(utils.lxmlns('mets') + 'xmlData', nsmap=nsmap) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MDWrap already creates the xmlData element when serializing, and it makes sense to have DublinCore only handle elements in its own namespace. This could be merged with _serialize_dublincore below.
parser = etree.XMLParser(remove_blank_text=True) | ||
if isinstance(document, six.string_types): | ||
self.document = etree.fromstring(document, parser=parser) | ||
elif isinstance(document, etree._Element): | ||
self.document = document | ||
self.mdtype = mdtype | ||
self.data = data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than creating a new attribute, this could use document
to store the child document - whether that's a string, ElementTree, or plugin class of the appropriate type.
@@ -291,13 +345,16 @@ class MDWrap(object): | |||
:param str mdtype: The MDTYPE of XML document being enclosed. Examples | |||
include "PREMIS:OBJECT" and "PREMIS:EVENT". | |||
""" | |||
def __init__(self, document, mdtype): | |||
MDTYPE_CLASSES = {'DC': DublinCoreXmlData} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this! It will have to be improved to talk to a plugin interface though.
raise exceptions.ConstructError('event_type argument is required.') | ||
|
||
if event_datetime is None: | ||
raise exceptions.ConstructError('event_datetime argument is required.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels unpythonic - these parameters are required by the function, and if they're None, it will raise an error later.
|
||
event_type_el = etree.Element(utils.lxmlns('premis') + 'eventType') | ||
event_type_el.text = self.event_type | ||
root.append(event_type_el) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tip: These three lines can be etree.SubElement(root, utils.lxmlns('premis') + 'eventType').text = self.event_type
This should be closed in favour of #34, which adds these changes to current master and makes them compatible with the recently-introduced plugin/ dependency injection pattern. Once the comments/ CR from this PR are transferred to PR 34, this can be closed. |
Closed in favour of #34 |
Added PREMIS object/event parsers.