
File hashes: reading from/writing to extended attributes #1204

Closed
eduarrrd opened this issue Nov 23, 2024 · 11 comments

Comments

@eduarrrd

VERSION INFORMATION

Server Version: 5.0.0.60 (9808915)

LOG FILE

N/A

DESCRIPTION

When using Shoko as a data source, starting the query chain with a file hash (the /File/Hash/* endpoints) is the most reliable method. However, to do so, the file currently needs to be hashed at least twice: once by Shoko and once by the API client. This is wasteful, especially over the network.

As an alternative, I propose that Shoko:

  • optionally read the hash from an extended attribute (xattr),
  • optionally write the calculated hash into an xattr,
  • or both.

This allows for the following workflow:

  1. A rescan is triggered.
  2. Shoko does its thing, including the file hashing.
  3. Shoko does the equivalent of `setfattr -n user.$HASHTYPE -v "$HASHVAL" $FILE`.
  4. The client does a rescan.
  5. The client does the equivalent of `getfattr -n user.$HASHTYPE -e hex $FILE` and finds a value.
  6. (The client does a `GET /File/Hash/$HASHTYPE` and gets a file id to work with.)

The same idea applies vice versa.
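For concreteness, here is a minimal client-side sketch of steps 4-6. It is a sketch under assumptions, not a definitive implementation: it assumes a Linux host, an ED2K hash that Shoko has stored under user.ed2k, a server at localhost:8111, and an illustrative query shape for the /File/Hash endpoint.

```csharp
// Hypothetical client-side sketch (steps 4-6). Assumes Linux, a hash stored
// under user.ed2k, and a Shoko server at localhost:8111; the query shape of
// the /File/Hash endpoint is illustrative, not confirmed.
using System;
using System.Net.Http;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading.Tasks;

class XattrLookup
{
    // getxattr(2): a single syscall to read the stored hash.
    [DllImport("libc", SetLastError = true)]
    static extern long getxattr(string path, string name, byte[] value, ulong size);

    static async Task Main(string[] args)
    {
        var buf = new byte[256];
        long len = getxattr(args[0], "user.ed2k", buf, (ulong)buf.Length); // args[0]: path to the video file
        if (len < 0)
        {
            Console.Error.WriteLine("xattr not set; fall back to hashing the file");
            return;
        }
        string ed2k = Encoding.ASCII.GetString(buf, 0, (int)len);

        // Step 6: resolve the hash to a Shoko file id without re-reading the file.
        using var http = new HttpClient();
        string json = await http.GetStringAsync(
            $"http://localhost:8111/api/v3/File/Hash/ED2K?hash={ed2k}");
        Console.WriteLine(json);
    }
}
```

The point is that the client never reads the file's contents: one getxattr(2) call plus one HTTP request replaces a full rehash.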

FAQ

  • Why even store a hash? It allows the client to forgo maintaining a database and allows for batch-like, stateless operation. Idempotent.
  • Why not have the client look up by path? Path prefixes differ between the client and Shoko due to network mounts; in addition, file names may differ in the case of SMB with mapchars/mapposix (enabled by default).
  • Why xattr? Attaching the hash to the inode gives nice consistency properties: it is preserved on move/hardlink, can be preserved on copy, is transparent to symlinks, and is deleted on delete. There is no risk of forgetting to deal with a "sidecar file", e.g. moving/renaming one but not the other.
  • Is it always consistent? No, a normal write to the file will not update the xattr. Given that these are video files, that is not very likely. Even if it happens, e.g. via the force-default audio/subtitle scripts, it could be considered a benefit, given there won't be a DB entry for the updated hash.

STEPS TO REPRODUCE

N/A

@da3dsoul
Member

What are you talking about? Hash twice? Are you referring to AVDump? That isn't ours. We can't pass data to it. That's kind of the point of it.

@eduarrrd
Author

Can you elaborate on which part of the workflow I described is confusing to you? Is it the naming? If so: a program that makes calls to an API (like Shoko's REST API) is often called a client, short for "API client". That's the term I have used, and the reason I reference Shoko's API endpoints by their paths. I'm not sure why AVDump was brought into the picture.

I'm happy to give more details but I need more than "what are you talking about". Throw me a bone here.

@da3dsoul
Member

I misread. I don't understand the point, though.

@eduarrrd
Author

Can you elaborate on which part of the workflow I described is confusing to you?

I'm happy to give more details but I need more than "what are you talking about". Throw me a bone here.

@Cazzar
Member

Cazzar commented Nov 24, 2024

What issue are you particularly trying to solve that requires the double hashing that you mention in the original issue?

I have a few concerns about this approach, particularly that it is very Linux-specific and not especially portable across the OS platforms we support.

The reason @da3dsoul might have brought AVDump into the question is because that's the only use case within Shoko's code & supported tooling that would cause a file to be hashed a second time, outside of requesting a new hashing operation on an unrecognised file.

Alternatively, we are slowly building out a plugin API that exposes events such as FileHashed, which would allow you to create a plugin that does this on your own, see:

event EventHandler<FileEventArgs> FileHashed;
and you can see a working plugin that one person has done here: https://github.com/fearnlj01/WebhookDump-ShokoPlugin
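As a rough illustration of that route, here is a hypothetical plugin fragment built around the FileHashed event quoted above. Everything except the event signature is an assumption for the sketch: the IShokoEventHandler interface and the FileEventArgs members below are stand-ins, not the actual plugin API surface.

```csharp
// Hypothetical sketch only: the event signature matches the one quoted above,
// but the handler interface and FileEventArgs members are stand-ins, not the
// real Shoko plugin API.
using System;
using System.Runtime.InteropServices;
using System.Text;

interface IShokoEventHandler            // stand-in so the sketch is self-contained
{
    event EventHandler<FileEventArgs> FileHashed;
}

class FileEventArgs : EventArgs         // stand-in with assumed members
{
    public string FullPath { get; set; } = "";
    public string ED2K { get; set; } = "";
}

class XattrWriter
{
    // setxattr(2): a single syscall to persist the hash next to the file data.
    [DllImport("libc", SetLastError = true)]
    static extern int setxattr(string path, string name, byte[] value, ulong size, int flags);

    public XattrWriter(IShokoEventHandler events)
    {
        // Mirror every freshly computed hash into an xattr on the file itself.
        events.FileHashed += (sender, e) =>
        {
            byte[] hash = Encoding.ASCII.GetBytes(e.ED2K);
            setxattr(e.FullPath, "user.ed2k", hash, (ulong)hash.Length, 0);
        };
    }
}
```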

@bigretromike
Contributor

Hash information about a file has always been included in the database (or, if you are old-school like me, in the file name itself). Storing it in extended attributes is a fine idea, but as mentioned before, not as portable.
Hashes are computed as soon as you add the file to Shoko, because without them Shoko wouldn't be able to find the proper AniDB record.

@eduarrrd
Author

Let me reply to both #1204 (comment) and #1204 (comment) in a single comment:

> What issue are you particularly trying to solve that requires the double hashing that you mention in the original issue?

I need to identify a file whose metadata (base path, name, owner, ctime/mtime/atime, etc.) may differ from what Shoko sees: e.g. Shoko sees /mnt/data/title:subtitle.ext while I see /netshares/a/titlesubtitle.ext (yes, that's indeed codepoint U+F022). The only things I can rely on are the contents and attributes that no other tooling is using. Currently, this means that querying the identity by hash is the only way to do this. This is what I'm currently doing ... but to obtain the hash I currently have to execute the hashing operation, which also means reading the entire file. This is what I want to avoid. My approach is to use "attributes that no other tooling is using", e.g. an xattr called user.$HASHTYPE (or maybe user.shoko.$HASHTYPE).

> I have a few concerns about this approach, particularly that it is very Linux-specific and not especially portable across the OS platforms we support.

I proposed xattr since I'm familiar with it and I have a working solution. I believe Windows/NTFS has both extended attributes and Alternate Data Streams; I remember reading that WSL1 implemented Linux FS features on NTFS using them. I don't know how complicated the Windows side is, but on Linux the overhead is one system call for reading and one for writing.

Regarding interop, Samba has the vfs_streams_xattr module, which exposes ADS to non-Windows SMB clients.
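For illustration, a minimal smb.conf fragment enabling that module on a share (the share name and path are hypothetical):

```
[anime]
    path = /mnt/data
    vfs objects = streams_xattr
```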

> The reason @da3dsoul might have brought AVDump into the question is because that's the only use case within Shoko's code & supported tooling that would cause a file to be hashed a second time, outside of requesting a new hashing operation on an unrecognised file.

I see. To reiterate: (as you've observed) my request is for interop with code not in Shoko.

> Alternatively, we are slowly building out a plugin API that exposes events such as FileHashed, which would allow you to create a plugin that does this on your own, see:

Without more info I don't think I can realistically pursue this. A plugin API (a binary one, as far as I can see) would require me to create and maintain code in a language I haven't learned, using APIs and libraries I don't know, and to set up and maintain CI/CD for it and deal with artifact distribution. I don't really see docs or stability guarantees either.

> Hash information about a file has always been included in the database (or, if you are old-school like me, in the file name itself). Storing it in extended attributes is a fine idea, but as mentioned before, not as portable.

I agree that using the filename is the most straightforward option, but I cannot change the original filename/directory structure/etc. for archival reasons.

> Hashes are computed as soon as you add the file to Shoko, because without them Shoko wouldn't be able to find the proper AniDB record.

Certainly.

@maxpiva
Member

maxpiva commented Nov 25, 2024

I already went down this rabbit hole. ;) When I was pursuing cloud filesystem support for Shoko, downloading the whole file for hashing was indeed costly...

Deep in Shoko, probably from before the Command Refactor and while the WPF Server existed, there was a file checker that leveraged the fact that Shoko maintains the ED2K, MD5 and SHA1 of every file in its database; it was probably used only by me to check the health of all the files. ;)

You could use that as a base to create your own tool that connects to the Shoko database and writes the desired video.xattr files, or videofolder.xattr, or an NTFS alternate data stream with such data; .sfv files, for example.

You could in the future leverage what @Cazzar is talking about and create a plugin that handles the FileHashed event and writes such a file every time Shoko sees a new file and hashes it.

Shoko maintains a mapping between name and file in the DB, so moving the file to another folder will not trigger hashing again.

I think @bigretromike used to keep metadata files in his collections for other media centers, to leverage file-based metadata; maybe you could extend such file types.

For example:
https://kodi.wiki/view/NFO_files/Templates

I do find it attractive to maintain a standard file type in every folder (maybe .nfo) containing the hashes and minimal data like the AniDB id. That way, in some dystopian future, if you move such folders or recreate Shoko from scratch, the import could be much faster. Also, interop with other media centers could be more straightforward, since reading such a file is probably easier than calling a custom API, more so if it is a pseudo-standard that is already supported.

In the above nfo case, one could fill the uniqueid tag with SHA1, ED2K, MD5 and/or the AniDB id.
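For illustration, a hypothetical episode nfo fragment along those lines; Kodi's uniqueid element takes a free-form type attribute, so hash types could sit alongside the usual scraper ids (all values below are made up):

```xml
<episodedetails>
  <title>Other Title</title>
  <uniqueid type="anidb" default="true">123456</uniqueid>
  <uniqueid type="ed2k">0123456789abcdef0123456789abcdef</uniqueid>
  <uniqueid type="sha1">0123456789abcdef0123456789abcdef01234567</uniqueid>
</episodedetails>
```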

@eduarrrd
Author

eduarrrd commented Dec 2, 2024

> I already went down this rabbit hole. ;) When I was pursuing cloud filesystem support for Shoko, downloading the whole file for hashing was indeed costly...
>
> Deep in Shoko, probably from before the Command Refactor and while the WPF Server existed, there was a file checker that leveraged the fact that Shoko maintains the ED2K, MD5 and SHA1 of every file in its database; it was probably used only by me to check the health of all the files. ;)
>
> You could use that as a base to create your own tool that connects to the Shoko database and writes the desired video.xattr files, or videofolder.xattr, or an NTFS alternate data stream with such data; .sfv files, for example.

This has two issues:

  1. It implies creating a sidecar container next to the Shoko one, since that's the only way to guarantee visibility of the same paths Shoko uses.
  2. The maintenance work I described in #1204 (comment) changes nature but is still there: now it's the database schema instead of a C# API. Are there stability guarantees for it? For comparison, looking at Kodi (which naturally does a lot more), the video database schema is at version 131.

> You could in the future leverage what @Cazzar is talking about and create a plugin that handles the FileHashed event and writes such a file every time Shoko sees a new file and hashes it.
>
> Shoko maintains a mapping between name and file in the DB, so moving the file to another folder will not trigger hashing again.

This is useful to know, thanks. What happens if files have the same name, e.g. trailer.mp4? Is there disambiguation? Or are moves undetected? Or something like "at most one file among files with the same name can move between scans"?

> I think @bigretromike used to keep metadata files in his collections for other media centers, to leverage file-based metadata; maybe you could extend such file types.
>
> For example: https://kodi.wiki/view/NFO_files/Templates
>
> I do find it attractive to maintain a standard file type in every folder (maybe .nfo) containing the hashes and minimal data like the AniDB id. That way, in some dystopian future, if you move such folders or recreate Shoko from scratch, the import could be much faster. Also, interop with other media centers could be more straightforward, since reading such a file is probably easier than calling a custom API, more so if it is a pseudo-standard that is already supported.
>
> In the above nfo case, one could fill the uniqueid tag with SHA1, ED2K, MD5 and/or the AniDB id.

This is approximately my use case. I wrote a tool to go from "arbitrary unstructured files" to something along the lines of:

by-name -> .by-name-20241201/
.by-name-20241201/
  Show1Title/            # new directory generated using Shoko data
    tvshow.nfo           # new file generated using Shoko data
    poster.jpg           # new file generated using Shoko data
    S0E3 Other Title.ext # hardlinked from /some/path/randomname_special_thing.ext, name generated using Shoko data
    S0E3 Other Title.nfo # new file generated using Shoko data
    S1EY Title.ext       # hardlinked from /some/path/randomname S12EY.ext, name generated using Shoko data
    S1EY Title.nfo       # new file generated using Shoko data
  Show2Title/
    ...
  ...
.by-name-20241130/
   ...

@maxpiva
Member

maxpiva commented Dec 2, 2024

> This is useful to know, thanks. What happens if files have the same name, e.g. trailer.mp4? Is there disambiguation? Or are moves undetected? Or something like "at most one file among files with the same name can move between scans"?

File size is also stored, and both are checked, to bypass a new hash calculation (table FileNameHash).

I think renamers also touch this table: in that case, the import renames the file and both filenames are stored (but I'm not sure at the moment).

The VideoLocal table has all the hashes and the CRC32; VideoLocal_Place has the paths where the file is stored (it might be in multiple places), in conjunction with the import folders' prefix paths.

> This is approximately my use case. I wrote a tool to go from "arbitrary unstructured files" to something along the lines of:
>
> by-name -> .by-name-20241201/
> .by-name-20241201/
>   Show1Title/            # new directory generated using Shoko data
>     tvshow.nfo           # new file generated using Shoko data
>     poster.jpg           # new file generated using Shoko data
>     S0E3 Other Title.ext # hardlinked from /some/path/randomname_special_thing.ext, name generated using Shoko data
>     S0E3 Other Title.nfo # new file generated using Shoko data
>     S1EY Title.ext       # hardlinked from /some/path/randomname S12EY.ext, name generated using Shoko data
>     S1EY Title.nfo       # new file generated using Shoko data
>   Show2Title/
>     ...
>   ...
> .by-name-20241130/
>    ...

Nice. It seems Kodi supports multi-episode nfo files in the tvshow; I don't know about others.

I also think, or remember, that the APIs (probably not v3) have an endpoint, or a combination of them, where you provide the file and, in the end, get all the hashes and all the anime information. That can be leveraged if you don't want to connect to the DB.

If you need a custom API endpoint, that would be possible, if you do the hard work ;) and @ElementalCrisis or the other masters approve it.

And of course, you have a third option: join the team, help @Cazzar with the plugin framework, and create the first plugin that does everything automagically.

@Cazzar
Member

Cazzar commented Dec 2, 2024

To give a final decision on this:

We are not going to implement this in the core of Shoko, as it is out of scope for the project itself and, as discussed above, it seems to be for a very specific use case.

If you would like to implement this yourself, the Plugin API answer has been provided; looking at the Swagger endpoints might also give you some useful information.

Cazzar closed this as not planned on Dec 2, 2024.