Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update pointer file PREMIS version #466

Merged
merged 1 commit into from
Jun 13, 2019

Conversation

cole
Copy link
Contributor

@cole cole commented Jun 6, 2019

Relates to: archivematica/Issues#713

Update metsrw to the latest release for PREMIS 3 compatibility, and update PREMIS generation to use the correct format.

The bulk of this PR is consolidating PREMIS tuple construction in common.premis — please note when reviewing that most of the changed code has been moved (mostly as is, with a few changes for version 3 and to accommodate different imports and references to packages) to try to make updates like this a bit easier in future. The consolidation isn't "done" by any means; there's still a lot of PREMIS and METS related logic in the Package model.

@cole
Copy link
Contributor Author

cole commented Jun 6, 2019

Note that due to archivematica/Issues#655, there's still an error when validating the pointer file: "There must be PREMIS elements inside the METS container".

@cole
Copy link
Contributor Author

cole commented Jun 7, 2019

@ross-spencer FYI these changes may cause conflicts with your work in archivematica/Issues#660 — sorry, I didn't realize you were also working on this area! I think we'll have to see what lands first, but happy to rebase this or help with rebasing, whatever is easier.

Copy link
Contributor

@ross-spencer ross-spencer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great @cole. We've a few pieces of work in this area at the minute, my small change to the package name, @helenst has one cleaning up the way the compression artifacts are recorded in the pointer file too as well as adding to the methods. So it's all quite good, even if the timing is a little awkward at times! 😉

I think it makes sense to merge yours first, and then we can work out what's needed to rebase the other two.

With regards to my comments, the code looks great. I've made some very minor style suggestions. I've asked one question about passing the package into the PREMIS functions. Those questions may or may not result in changes.

return metsrw.plugins.premisrw.PREMISEvent(data=event)


def get_replication_derivation_relationship(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think beyond the scope of the changes you want to make here. With the refactor, in future, perhaps we can take the time to change to create_ or something approximating that. They don't seem to do a lot of getting! 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, before it was a mix of get and create, and I standardized on get, but I think you're right, that's the wrong one! I've renamed them all to create_.

replication_validation_event = replica_package.get_replication_validation_event(
checksum_report=checksum_report, master_aip_uuid=self.uuid
replication_validation_event = premis.get_replication_validation_event(
replica_package,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency here, I think it would be good to see the form replica_package_uuid=replica_package.uuid, as with the other args. Is there a good reason to pass the entire object into the function at the moment, instead of just using the uuid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point! I've changed to pass just the UUID (or other bits of data) for all PREMIS functions.

# and is primarily used for arbitrary strings (e.g. filenames, paths)
# that might not be valid unicode to begin with.
# NOTE: non-DRY from archivematicaCommon/archivematicaFunctions.py
def escape(string):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the circumstances for this code previously, is it the use of unicode_literals that helps us to manage this when it's output in the new module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. Escape was only used on this line:

    detail = escape(
        "program=GPG; version={}; key={}".format(gpg_version, key_fingerprint)
    )

So previously it would have converted the bytestring to a unicode string, after the substitution of gpg_version and key_fingerprint. Now we're just using a unicode string directly. Both of the substitution vars should be already safe - version should be something like 1.4.16 and key_fingerprint comes from the database, so Django should provide us with a unicode string.

checksum_algorithm, checksum, archive_tool, compression_program_version
premis_object = premis.get_aip_premis_object(
self,
checksum_algorithm,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the package is being passed into the function, are all of these args needed? E.g. will checksum algorithm be available as package.DEFAULT_CHECKSUM_ALGORITHM?

Same question for below's replication package object creation.

replica_premis_creation_event = replica_package.get_premis_aip_creation_event(
master_aip_uuid=master_aip_uuid, agents=replica_premis_creation_agents
replica_premis_creation_event = premis.get_premis_aip_creation_event(
replica_package,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
replica_package,
package=replica_package,

Though we're really not very consistent across this module.... 😖 🤓

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to have a mix of args and kwargs here (args being required kwargs, being optional), but I've removed passing packages at all since they're not really required.

Copy link
Contributor

@ross-spencer ross-spencer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one more thing! (spotted in testing)

def mets_file_now():
return datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")


def get_ss_premis_agents(inst=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this, but there are a couple of left-over references to this function in the code, e.g. when creating replicated packages:

/home/ross-spencer/git/artefactual-labs/am/src/archivematica-storage-service/storage_service/common/management/commands/import_aip.py:
  380      shutil.rmtree(compressed_aip_parent_path)
  381      aip_model_inst.size = utils.recalculate_size(new_full_path)
  382:     compression_agents = utils.get_ss_premis_agents()
  383      compression_event = premis.get_premis_aip_compression_event(
  384          details["event_detail"],

/home/ross-spencer/git/artefactual-labs/am/src/archivematica-storage-service/storage_service/locations/models/package.py:
 1142  
 1143          # 3. Construct the pointer file and return it
 1144:         replica_premis_creation_agents = utils.get_ss_premis_agents()
 1145          __, compression_program_version, archive_tool = (
 1146              master_compression_event.compression_details
 ....
 1198          old_premis_events = old_fsentry.get_premis_events()
 1199          old_premis_agents = old_fsentry.get_premis_agents()
 1200:         ss_agents = utils.get_ss_premis_agents()
 1201          replication_event = premis.get_replication_event(
 1202              self, replica_package, event_uuid=replication_event_uuid, agents=ss_agents
 ....
 1238          old_premis_agents = old_fsentry.get_premis_agents()
 1239          new_premis_events = list(set(old_premis_events + storage_effects.events))
 1240:         new_premis_agents = list(set(old_premis_agents + utils.get_ss_premis_agents()))
 1241          new_composition_level = old_composition_level
 1242          if storage_effects.composition_level_updater:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😱 oops, that's no good. Thanks for catching! Updated now.

@cole cole requested a review from ross-spencer June 10, 2019 17:35
Copy link
Contributor

@ross-spencer ross-spencer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @cole, the code looks good, but I'm getting one last failure on reingest with this branch. It's showing up as a 500 internal server error in Archivematica and I'm not getting any additional logs in the Storage Service. If you've time can you take a look at that and double check if it is introduced here?

I've rebased on top of 5fb80c5 and tried that commit on its own as well. So I think the vanilla qa/0.x branch at the moment is okay.

@@ -360,7 +360,7 @@ def compress(aip_model_inst, compression_algorithm):
"""Use the Package model's compress_package method to compress the AIP
being imported, update the Package model's ``size`` attribute, retrieve
PREMIS agents and event for the compression (using the package model's
``get_premis_aip_compression_event`` method) and return a 3-tuple:
``create_premis_aip_compression_event`` method) and return a 3-tuple:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bonus changes!! 🎆 Thanks Cole!

@cole
Copy link
Contributor Author

cole commented Jun 12, 2019

It seems that during reingest processing the PREMIS 3 namespace was causing errors. I've updated the pointer file reingest to use metsrw as well, as I'd missed that and it was resulting in an odd mix of PREMIS 2 and 3.

There is a change to the resulting PREMIS output though, due to metsrw's behaviour — instead of being replaced, the old techMD containing the premis:object is now superseded (example output).

If we think that's a problem I think I know how to work around it — but it may actually be better? I guess that's worth some discussion. I'm not available tomorrow, and I'm aware this is blocking other work, so @ross-spencer if you need to press on I'd suggest we can accept this PR as is with the understanding that revisions can happen in a separate one (provided reingest is now working for you of course!). Otherwise I can pick it up on the 13th.

@ross-spencer
Copy link
Contributor

Thanks @cole merging now seems a good idea as we will be trying to bring together a handful of storage service changes. I've logged the issue with the superseded section here, and noticed one possible issue with it if we keep this functionality.

Copy link
Contributor

@ross-spencer ross-spencer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM. Thanks Cole. Will rebase and merge!

@ross-spencer ross-spencer force-pushed the dev/pointer-mets-premis-update branch 2 times, most recently from 5567bff to 1505229 Compare June 12, 2019 10:53
This commit updates the storage service to use PREMIS 3 in pointer
file generation. Various components have been touched upon including
initial ingest, replication, and then reingest.

The commit history is summarized as follows:

* Update metsrw.
* Use PREMIS 3 in pointer METS file.
* Consolidate pointer file generation logic.
* Timezone aware timestamp added to pointer file generation.
* Various code-review and test follow-up changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants