-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update pointer file PREMIS version #466
Conversation
Note that due to archivematica/Issues#655, there's still an error when validating the pointer file: "There must be PREMIS elements inside the METS container". |
bc41906
to
e769a3f
Compare
@ross-spencer FYI these changes may cause conflicts with your work in archivematica/Issues#660 — sorry, I didn't realize you were also working on this area! I think we'll have to see what lands first, but happy to rebase this or help with rebasing, whatever is easier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really great @cole. We've a few pieces of work in this area at the minute, my small change to the package name, @helenst has one cleaning up the way the compression artifacts are recorded in the pointer file too as well as adding to the methods. So it's all quite good, even if the timing is a little awkward at times! 😉
I think it makes sense to merge yours first, and then we can work out what's needed to rebase the other two.
With regards to my comments, the code looks great. I've made some very minor style suggestions. I've asked one question about passing the package into the PREMIS functions. Those questions may or may not result in changes.
storage_service/common/premis.py
Outdated
return metsrw.plugins.premisrw.PREMISEvent(data=event) | ||
|
||
|
||
def get_replication_derivation_relationship( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think beyond the scope of the changes you want to make here. With the refactor, in future, perhaps we can take the time to change to create_
or something approximating that. They don't seem to do a lot of getting! 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, before it was a mix of get and create, and I standardized on get, but I think you're right, that's the wrong one! I've renamed them all to create_.
replication_validation_event = replica_package.get_replication_validation_event( | ||
checksum_report=checksum_report, master_aip_uuid=self.uuid | ||
replication_validation_event = premis.get_replication_validation_event( | ||
replica_package, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency here, I think it would be good to see the form replica_package_uuid=replica_package.uuid,
as with the other args. Is there a good reason to pass the entire object into the function at the moment, instead of just using the uuid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good point! I've changed to pass just the UUID (or other bits of data) for all PREMIS functions.
# and is primarily used for arbitrary strings (e.g. filenames, paths) | ||
# that might not be valid unicode to begin with. | ||
# NOTE: non-DRY from archivematicaCommon/archivematicaFunctions.py | ||
def escape(string): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the circumstances for this code previously, is it the use of unicode_literals
that helps us to manage this when it's output in the new module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. Escape was only used on this line:
detail = escape(
"program=GPG; version={}; key={}".format(gpg_version, key_fingerprint)
)
So previously it would have converted the bytestring to a unicode string, after the substitution of gpg_version and key_fingerprint. Now we're just using a unicode string directly. Both of the substitution vars should be already safe - version should be something like 1.4.16
and key_fingerprint comes from the database, so Django should provide us with a unicode string.
checksum_algorithm, checksum, archive_tool, compression_program_version | ||
premis_object = premis.get_aip_premis_object( | ||
self, | ||
checksum_algorithm, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now the package is being passed into the function, are all of these args needed? E.g. will checksum algorithm be available as package.DEFAULT_CHECKSUM_ALGORITHM
?
Same question for below's replication package object creation.
replica_premis_creation_event = replica_package.get_premis_aip_creation_event( | ||
master_aip_uuid=master_aip_uuid, agents=replica_premis_creation_agents | ||
replica_premis_creation_event = premis.get_premis_aip_creation_event( | ||
replica_package, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replica_package, | |
package=replica_package, |
Though we're really not very consistent across this module.... 😖 🤓
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to have a mix of args and kwargs here (args being required kwargs, being optional), but I've removed passing packages at all since they're not really required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one more thing! (spotted in testing)
def mets_file_now(): | ||
return datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S") | ||
|
||
|
||
def get_ss_premis_agents(inst=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed this, but there are a couple of left-over references to this function in the code, e.g. when creating replicated packages:
/home/ross-spencer/git/artefactual-labs/am/src/archivematica-storage-service/storage_service/common/management/commands/import_aip.py:
380 shutil.rmtree(compressed_aip_parent_path)
381 aip_model_inst.size = utils.recalculate_size(new_full_path)
382: compression_agents = utils.get_ss_premis_agents()
383 compression_event = premis.get_premis_aip_compression_event(
384 details["event_detail"],
/home/ross-spencer/git/artefactual-labs/am/src/archivematica-storage-service/storage_service/locations/models/package.py:
1142
1143 # 3. Construct the pointer file and return it
1144: replica_premis_creation_agents = utils.get_ss_premis_agents()
1145 __, compression_program_version, archive_tool = (
1146 master_compression_event.compression_details
....
1198 old_premis_events = old_fsentry.get_premis_events()
1199 old_premis_agents = old_fsentry.get_premis_agents()
1200: ss_agents = utils.get_ss_premis_agents()
1201 replication_event = premis.get_replication_event(
1202 self, replica_package, event_uuid=replication_event_uuid, agents=ss_agents
....
1238 old_premis_agents = old_fsentry.get_premis_agents()
1239 new_premis_events = list(set(old_premis_events + storage_effects.events))
1240: new_premis_agents = list(set(old_premis_agents + utils.get_ss_premis_agents()))
1241 new_composition_level = old_composition_level
1242 if storage_effects.composition_level_updater:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😱 oops, that's no good. Thanks for catching! Updated now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @cole, the code looks good, but I'm getting one last failure on reingest with this branch. It's showing up as a 500 internal server error in Archivematica and I'm not getting any additional logs in the Storage Service. If you've time can you take a look at that and double check if it is introduced here?
I've rebased on top of 5fb80c5 and tried that commit on its own as well. So I think the vanilla qa/0.x
branch at the moment is okay.
@@ -360,7 +360,7 @@ def compress(aip_model_inst, compression_algorithm): | |||
"""Use the Package model's compress_package method to compress the AIP | |||
being imported, update the Package model's ``size`` attribute, retrieve | |||
PREMIS agents and event for the compression (using the package model's | |||
``get_premis_aip_compression_event`` method) and return a 3-tuple: | |||
``create_premis_aip_compression_event`` method) and return a 3-tuple: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bonus changes!! 🎆 Thanks Cole!
It seems that during reingest processing the PREMIS 3 namespace was causing errors. I've updated the pointer file reingest to use metsrw as well, as I'd missed that and it was resulting in an odd mix of PREMIS 2 and 3. There is a change to the resulting PREMIS output though, due to metsrw's behaviour — instead of being replaced, the old techMD containing the premis:object is now superseded (example output). If we think that's a problem I think I know how to work around it — but it may actually be better? I guess that's worth some discussion. I'm not available tomorrow, and I'm aware this is blocking other work, so @ross-spencer if you need to press on I'd suggest we can accept this PR as is with the understanding that revisions can happen in a separate one (provided reingest is now working for you of course!). Otherwise I can pick it up on the 13th. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 LGTM. Thanks Cole. Will rebase and merge!
5567bff
to
1505229
Compare
This commit updates the storage service to use PREMIS 3 in pointer file generation. Various components have been touched upon including initial ingest, replication, and then reingest. The commit history is summarized as follows: * Update metsrw. * Use PREMIS 3 in pointer METS file. * Consolidate pointer file generation logic. * Timezone aware timestamp added to pointer file generation. * Various code-review and test follow-up changes.
1505229
to
cf231e4
Compare
Relates to: archivematica/Issues#713
Update metsrw to the latest release for PREMIS 3 compatibility, and update PREMIS generation to use the correct format.
The bulk of this PR is consolidating PREMIS tuple construction in
common.premis
— please note when reviewing that most of the changed code has been moved (mostly as is, with a few changes for version 3 and to accommodate different imports and references to packages) to try to make updates like this a bit easier in future. The consolidation isn't "done" by any means; there's still a lot of PREMIS and METS related logic in the Package model.