Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check if user can update node #286

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

XueSheng-GIT
Copy link
Contributor

...and change to an alternative user if not (fixes ocr for filedrop and groupfolder acl).

@R0Wi This is just a rough proof of concept to fix #273
In comparison to the discussion at #110 (#116) this approach does first check whether the current user has the permissions to update the node. If not, an alternative user is chosen (but not owner, because this may fail for groupfolders).

This draft does not include a setting, test or further documentation. I'm also wondering whether the manual implementation for createNewFileVersion is really required.

$this->createNewFileVersion($newFilePath, $fileContent, $fileId, $fileMtime);

Wouldn't $node->putContentbe enough?
https://github.com/nextcloud/server/blob/2651d4964ec26696b5f8b1f5db48588a5d09ab01/lib/private/Files/Node/File.php#L49

@R0Wi
Copy link
Contributor

R0Wi commented Dec 17, 2024

Thanks for opening this PR. I like the idea of using a fallback user in case the original user doesn't have the proper permissions to overwrite the file. Although I'd prefer to not pick any random user, this could lead to confusion.

Maybe let's do some testing again with $node->getOwner() instead? I could also imagine having an admin global config option to set a fallback user explicitly.

To answer your question: the implementation for creating a new file version is needed since it ensures that we're not stuck in an endless loop. Also it covers the case where a new file is created instead of creating a new file version (for example when processing jpg to pdf).

@XueSheng-GIT
Copy link
Contributor Author

Maybe let's do some testing again with $node->getOwner() instead? I could also imagine having an admin global config option to set a fallback user explicitly.

I started with getOwner and failed with groupfolders (enabled ACL with restriction to modify). The user provided by getOwner may not have the necessary rights to modify the file.
It seems there's no single owner for groupfolders (with guaranteed access to the file), so that I decided just to go for a random user who has the rights to modify the file.
I totally agree that this is not ideal and could lead to confusion.

To minimize this risk, it could be a two stage fallback.

  1. getOwner
  2. random user

@R0Wi
Copy link
Contributor

R0Wi commented Dec 19, 2024

I'm still not happy with picking a random user tbh

Option 1: we need to introduce a new setting "fallback user", which is globally applied and can be only set by the admin
Option 2: since the user running the OCR process obviously has the permissions to create new files but not to update them, we could also just create a new file next to the original and suffix it with something like <filename>_OCR.pdf.

@XueSheng-GIT
Copy link
Contributor Author

I haven't had Option 2 in mind yet. Could be a good alternative.

Must it be option 1 or 2? I could imagine that each approach has it's advantages and disadvantages so that it may depend on the specific use case which one makes more sense.
Why not adding this admin setting for handling restricted file modification rights to choose between (i) do nothing (log a warning), (ii) create new file (with appendix "_OCR") and (iii) force file versioning with random user?

If you really feel unhappy with this forced versioning (random user) because of potential confusion, it could be helpful to provide some clarity via the label of the current version. Instead of just adding "Before OCR", it could be an addition like "Before OCR (OCR will be done with random user)".

@R0Wi
Copy link
Contributor

R0Wi commented Dec 24, 2024

To keep things simple I would aim for appending a suffix to the file. Otherwise features like for example "notifications" (which are bound to a user) would need to be reworked as well. Could be an option for the future but for now I'd go with the suffix approach.

@XueSheng-GIT would you feel comfortable to implement this?

@XueSheng-GIT
Copy link
Contributor Author

@R0Wi happy new year and sorry for replying late. I'll try to have a look this weekend.

@XueSheng-GIT
Copy link
Contributor Author

I had a quick look at how to add the suffix (e.g. '_OCR') in case the file cannot be updated by user.

My initial idea was to do it in the same way as images are handled right now (because it should be at least a similar situation).

$newFilePath = $originalFileExtension === $newFileExtension ?
$filePath :
$filePath . '.pdf';

Thus, I just wanted to define newFilePath accordingly in case node is not updateable:

if ($originalFileExtension === $newFileExtension) {
        if (!$file->isUpdateable()) {
                // Add suffix '_OCR' if original file cannot be updated
                $fileInfo = pathinfo($filePath);
                $newFilePath = $fileInfo['dirname'] . '/' . $fileInfo['filename'] . '_OCR.' . $originalFileExtension;
        } else {
                $newFilePath = $filePath;
        }
} else {
        $newFilePath = $filePath . '.pdf';
}

But then the workflow ended up in an endless loop because the newly created file triggers the workflow again.
Based on your previous comment (#286 (comment)), I thought that this case is already handled.
I did some additional testing with image files and realised that if you setup workflow_ocr to handle image and pdf files, the workflow will be triggered for the newly created pdf file too (jpg -> ocr -> pdf -> ocr -> pdf).

On the first view, I'm not sure how to prevent this endless loop because the current approach seems to require the fileId which is of course initially unknown in case a new file is created.

private function createNewFileVersion(string $filePath, string $ocrContent, int $fileId, ?int $fileMtime = null) : void {
$dirPath = dirname($filePath);
$filename = basename($filePath);
$this->processingFileAccessor->setCurrentlyProcessedFileId($fileId);
try {
$view = $this->viewFactory->create($dirPath);
// Create new file or file-version with OCR-file
// This will trigger 'postWrite' event which would normally
// add the file to the queue again but this is tackled
// by the processingFileAccessor.
$view->file_put_contents($filename, $ocrContent);
// Restore the original modification time of the non-OCR file
if ($fileMtime !== null) {
$view->touch($filename, $fileMtime);
}
} finally {
$this->processingFileAccessor->setCurrentlyProcessedFileId(null);
}
}

@R0Wi: Is there a reason that workflow_ocr currently does not handle this self-trigger in case of image ocr? Or does ProcessingFileAccessor need some adaption to check the filePath in addition to (or instead of) the fileId?

@R0Wi
Copy link
Contributor

R0Wi commented Jan 4, 2025

@R0Wi: Is there a reason that workflow_ocr currently does not handle this self-trigger in case of image ocr? Or does ProcessingFileAccessor need some adaption to check the filePath in addition to (or instead of) the fileId?

@XueSheng-GIT indeed you're right. This wasn't the intended behavior. Probably no one noticed so far since most of the usecases will handle pdf -> pdf. And even if we go with jpg -> pdf, the newly created pdf will be processed twice but this doesn't cause any harm (even though, it is not necessary).

I guess you're right, maybe ProcessingFileAccessor needs some tweaking. Do you have an idea how we could also handle the "file create" case and break the endless loop there?

...and add suffix '_OCR' to processed file if not.

Signed-off-by: XueSheng-GIT <[email protected]>
@XueSheng-GIT XueSheng-GIT force-pushed the check-update-permissions branch from 6643768 to d7ea3f9 Compare January 5, 2025 06:54
@XueSheng-GIT
Copy link
Contributor Author

I've updated ProcessingFileAccessor to use filePath instead of fileId (while testing I didn't see a good reason to use both). Thus, the endless loop is solved and there's no double processing for image files anymore.

If update permissions for the processed file are missing, the suffix '_OCR' is now added. I've also done some testing with 'Keep original file version' and 'Keep original file modification date'. Everything seems to work well so far.
Of course it would be possible to log a warning in case 'Keep original file version' is set and update permissions are missing.

I've partly updated the unit tests. But at least the following must still be updated to reflect the change from fileId to filePath. Unit tests are quite new to me, thus, I could need some help with this.
https://github.com/R0Wi-DEV/workflow_ocr/blob/3a623a0562d4d03598b23b3ba4c6c6a3b5a7ea02/tests/Unit/Service/OcrServiceTest.php#L447C17-L447C24

@R0Wi would be great if you could have a look at this updated pull and add your comments/changes to it.

@R0Wi
Copy link
Contributor

R0Wi commented Jan 5, 2025

Sure @XueSheng-GIT, thanks for your effort! The solution sounds good to me, I will try to have a closer look as soon as time permits 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OCR fails on file drop (shared upload folder)
2 participants