Re-ingesting files seems to cause access errors #475

Open
bseeger opened this issue Jan 13, 2022 · 16 comments

@bseeger

bseeger commented Jan 13, 2022

This started when I saw errors in the Drupal log about Drupal generating its thumbnails, but the same thing seems to happen in the cloud when the external services are running derivatives.

A big distinction to catch here is that Drupal makes thumbnails for its admin-facing pages (specifically the media page). Those thumbnails are different from the thumbnails we make in the houdini container, which are user-facing.

If the Drupal admin-facing ones fail, it's not a big deal, but it looks like houdini is affected by this issue as well. Just something to keep in mind as you read through below.


Note: I'm seeing this message on the cloud server, but not in my dev environment, so I wonder if it's an AWS permission error.

On the test cloud server: upon going to the Media page, I started seeing these errors in the log:

Unable to generate the derived image located at private://styles/thumbnail/private/2022-01/3061-Service File.jpg.

Screen Shot 2022-01-13 at 3 51 22 PM
Screen Shot 2022-01-13 at 3 51 07 PM
Screen Shot 2022-01-13 at 3 51 34 PM

These are Drupal thumbnails, which are distinct from our derivative thumbnails. Drupal wants to create a thumbnail simply to display the image on the Media List page (shown to logged-in users with rights to use the admin interface). An example:

Screen Shot 2022-01-13 at 4 00 31 PM

These are created in the Drupal container by imagemagick (it does not use the deriv containers). I wonder if there's an access error here where Drupal can't retrieve the file from AWS? Or maybe the file can't be saved in AWS once created? Not exactly sure what's going on here.

In my local setup, Drupal creates a styles folder in minio for Drupal's thumbnails, like so:
Screen Shot 2022-01-13 at 4 02 39 PM

Perhaps that's not successfully happening on AWS?

@bseeger bseeger added the cloud label Jan 13, 2022
@jhujasonw

Is there a way to reliably re-create this issue?

@bseeger
Author

bseeger commented Jan 13, 2022

I'm not sure. But I do see the error after I visit the https://test.digital.library.jhu.edu/admin/content/media page - so maybe just visiting it causes the error to be thrown.

@bseeger
Author

bseeger commented Jan 19, 2022

After looking at this a little more, I think what is happening is related to allowing multiple ingests of the same file items, and may be related to allowing one to rename files during ingest.

We allow admins to rename files in the ingest to fix filename structures that might not work for the system. This may or may not be the issue - the issue could simply be allowing re-ingest of files with the old files sticking around.... that might be more likely the issue.

So if we have the following setup in the ingest:

name     | upload name      | new filename
The Name | file_name[0].jpg | filename.jpg

If an admin runs that type of ingest twice, the legitimate file will have a _X appended to its name before the extension (where X is a number). The ingest algorithm adds the _X to disambiguate the files, so it keeps the old one around and creates a new one. So, once this is uploaded twice, the second file (the one actually in use) will be named filename_0.jpg. The File entity's name stays the same. There will now be two File objects with the same name but pointing to two different files.

File     | url            | media_of
The Name | filename_0.jpg | Media One
The Name | filename.jpg   |

(the first file ingested will be disconnected from any media - so it's essentially unused.)

The one with Media One set in the media_of field is the real one and is considered referenced by an object. But the other one (filename.jpg) still exists, and fetching it results in the 403 error we are seeing. That appears to be what Drupal does when creating its own thumbnail here.

Perhaps this error is innocuous, as the file is really unused and doesn't need a thumbnail. However, what should be checked is that the proper File (filename_0.jpg) does get a drupal thumbnail.
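To make the renaming behavior above concrete, here is a minimal sketch of the disambiguation step. It is not the actual ingest or Drupal code: the function name `next_free_name` and the scratch directory are hypothetical, and it assumes every filename has an extension.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the real ingest code): print the first name
# derived from $2 that does not yet exist in directory $1, appending
# _0, _1, ... before the extension. Assumes the name has an extension.
next_free_name() {
    local dir=$1 name=$2
    local base=${name%.*} ext=${name##*.}
    local candidate=$name n=0
    while [ -e "$dir/$candidate" ]; do
        candidate="${base}_${n}.${ext}"
        n=$((n + 1))
    done
    printf '%s\n' "$candidate"
}

# Simulate ingesting the same file twice into a scratch directory.
demo_dir=$(mktemp -d)
first=$(next_free_name "$demo_dir" filename.jpg)
touch "$demo_dir/$first"
second=$(next_free_name "$demo_dir" filename.jpg)
touch "$demo_dir/$second"
echo "first ingest:  $first"
echo "second ingest: $second"
```

The first ingest keeps filename.jpg and the second becomes filename_0.jpg, so both files remain on disk, which matches the two File rows in the table above.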

@bseeger bseeger removed the cloud label Jan 19, 2022
@bseeger
Author

bseeger commented Jan 19, 2022

Provided the Islandora file derivatives function correctly and grab the correct file (which they seem to), this issue is probably minor in the grand scheme of things, and these errors are just noise in the logs (yes, Drupal fails to make its thumbnails, but those are admin-facing). If that's true, the only effect is that logged-in admins will not see a thumbnail on the file list page (/admin/content/media), which is no biggie in terms of the system functioning.

@bseeger
Author

bseeger commented Jan 19, 2022

Actually, I think I am wrong about the scope here after watching the cloud for a while. It appears that the wrong URL is handed to the derivatives as well (or they are somehow fetching the wrong ones). This will be an issue for re-ingest in the cloud services. :(

@bseeger bseeger changed the title Drupal Thumbnail access error in cloud server Re-ingesting files seems to cause access errors Jan 19, 2022
@htpvu htpvu added the vendor label Feb 22, 2022
@jhu-alistair

High priority because it blocks our ability to re-ingest when there are errors in an ingest job.

@jhu-alistair

Possibly, Bethany thought it was a problem in S3.

@htpvu htpvu added the launch label Mar 23, 2022
@DonRichards
Member

I agree this would be classified as a high priority and is likely either an S3 or a production-specific environment config setting.

@jhujasonw

Please re-read her notes on this. She later indicates that this is a file naming issue and NOT an S3 or production-specific thing. This appears to be something happening inside of Drupal.

@DonRichards
Member

@jhujasonw I think her initial comments were on the right track. I see where she changed her mind, but it appears that the URL generated for an ingest works correctly on the first ingest and not on the second. One situation that "could" be the issue is an S3 permission configuration that makes the bucket write-once (S3:PutObject events). I'm speculating: I have no knowledge of the bucket configurations, and I'm not an expert with S3 ACLs. It just seems like a logical way to reproduce the odd behavior of "works the first time but not the second". A simple way of checking this would be either to run the exact same migration locally or to trigger a regenerate derivative event in production and see if it fails. If the migration fails locally in the same manner, then it's safe to say the S3 permissions are not the issue. But if it succeeds locally and the regenerate derivative event fails in production, S3 is likely worth investigating. This is what I thought Bethany had alluded to in her last comment.
There are also other situations that could cause odd behavior in migrations (as she indicated above). If the migration ingests new media files instead of triggering an "update", it will see all of the items as new, which could cause this issue. The migration could also handle the filename collision itself.
On the other hand, a production-specific solution could be to disable derivative generation while migrating and then add a trigger that creates or recreates all thumbnails for a given list of media files, provided the migration isn't causing other issues.

@mjanowiecki

Some random info that may or may not be relevant (librarian, not tech person here so please ignore me if this is all nonsense).

  • I know late in the game, Bethany added a field to the media migrations to let them have the capability to rename the files as they are ingested.
  • I have been doing file-related ingest work on a different Drupal website (not Islandora), and I've noticed that files with duplicated names are supposed to be renamed incrementally, like "file.jpg" and then "file_0.jpg". But sometimes, if you already have a "file_0.jpg" and try to ingest a "file.jpg", Drupal will throw an error because it's trying to name the new file "file_0.jpg". When I was troubleshooting this, I found out that there is apparently a setting, in something called FileSystemInterface, to change the default from renaming to replacing files with the same name (although I just came up with a different solution).
  • As someone who uses the migration, I wouldn't be against adding a Boolean field or something that turns derivative generation on or off if that simplifies a solution.

@DonRichards
Member

This may seem off-topic, but we could avoid the naming collision issue by using unique values as the media's filename. In theory, the original file's hash should not be affected by renaming it. Running a script locally like this could copy the files to a new directory, name them to their hash value, log the original names and the new ones, and output when there's an error.

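#!/usr/bin/env bash
# Copy each image into $destination, named by its SHA-256 hash, and log the mapping.
shopt -s nullglob   # skip extensions that match no files
destination='/processed_images'
: > "$destination/log"   # start with an empty log
for file in *.{jpg,jpeg,png,tif,tiff,jp2}
do
    # sha256sum prints "<hash>  <name>"; keep only the hash
    sum=$(sha256sum "$file" | cut -d ' ' -f 1)
    cp "$file" "$destination/$sum"
    echo "$file $destination/$sum" >> "$destination/log"
    # Verify the copy by re-hashing it and comparing against the source hash
    [ "$sum" = "$(sha256sum < "$destination/$sum" | cut -d ' ' -f 1)" ] || echo "Problem with $destination/$sum"
done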

This should safeguard against the filename collision issue and make identifying duplicates simple. This could always be offloaded to a module instead, something like filehash.
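As a rough illustration of the duplicate check, the sketch below lists files whose contents are identical by comparing hashes, without renaming anything. The function name `list_duplicates` and the demo files are made up for the example, and it assumes GNU coreutils (`uniq -w` and `--all-repeated` are GNU options).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<hash>  <name>" for every file in directory $1
# whose content is identical to at least one other file there.
list_duplicates() {
    local dir=$1
    # sha256sum prints a 64-character hash first; after sorting, identical
    # hashes are adjacent, and uniq compares only those first 64 characters.
    (cd "$dir" && sha256sum -- * | sort | uniq -w 64 --all-repeated=separate)
}

# Small demonstration in a scratch directory.
demo_dir=$(mktemp -d)
printf 'same bytes' > "$demo_dir/filename.jpg"
printf 'same bytes' > "$demo_dir/filename_0.jpg"
printf 'other bytes' > "$demo_dir/unrelated.jpg"
duplicates=$(list_duplicates "$demo_dir")
echo "$duplicates"
```

Here filename.jpg and filename_0.jpg are reported as one group because their hashes match, while unrelated.jpg is left out.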

@mjanowiecki

Unfortunately, the filenames are important for librarians to manage files and keep them associated with the right items, so we can't really change them without stakeholder approval.

@DonRichards
Member

@mjanowiecki This is the case once in Islandora? Or are we talking about offline (preprocessing/reprocessing)?

@mjanowiecki

@DonRichards
I think so. It does help track/verify that the right files are with the right item in an easy way, and it's also an access/usability consideration for the end-users. As an end-user/researcher who might be downloading many different files, it's difficult to organize them when the filename has no recognizable association with the metadata (if that makes sense?).

@jhu-alistair

@DonRichards and @mjanowiecki - please move work and discussion over to Jira. This issue is now at https://jhulibraries.atlassian.net/browse/LAGS-172
