
File hashes: reading from/writing to extended attributes #1204

Closed
eduarrrd opened this issue Nov 23, 2024 · 11 comments

Comments

@eduarrrd

VERSION INFORMATION

Server Version: 5.0.0.60 (9808915)

LOG FILE

N/A

DESCRIPTION

When using Shoko as a data source, starting the query chain with a file hash (the /File/Hash/* endpoints) is the most reliable method. However, to do so, the file currently needs to be hashed at least twice: once by Shoko and once by the API client. This is wasteful, especially over the network.

As an alternative, I propose that Shoko:

  • optionally read the hash from an extended attribute (xattr),
  • optionally write the calculated hash into an xattr,
  • or both.

This allows for the following workflow:

  1. A rescan is triggered.
  2. Shoko does its thing, including the file hashing.
  3. Shoko does the equivalent of `setfattr -n user.$HASHTYPE -v "$HASHVAL" $FILE`.
  4. The client does a rescan.
  5. The client does the equivalent of `getfattr -n user.$HASHTYPE -e hex $FILE` and finds a value.
  6. (The client does a `GET /File/Hash/$HASHTYPE` and gets a file id to work with.)

The same idea applies vice versa.
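For concreteness, here is a minimal client-side sketch of steps 4-6. It is a sketch under assumptions, not a definitive implementation: it assumes a Linux host, an ED2K hash that Shoko has stored under user.ed2k, a server at localhost:8111, and an illustrative query shape for the /File/Hash endpoint.

```csharp
// Hypothetical client-side sketch (steps 4-6). Assumes Linux, a hash stored
// under user.ed2k, and a Shoko server at localhost:8111; the query shape of
// the /File/Hash endpoint is illustrative, not confirmed.
using System;
using System.Net.Http;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading.Tasks;

class XattrLookup
{
    // getxattr(2): a single syscall to read the stored hash.
    [DllImport("libc", SetLastError = true)]
    static extern long getxattr(string path, string name, byte[] value, ulong size);

    static async Task Main(string[] args)
    {
        var buf = new byte[256];
        long len = getxattr(args[0], "user.ed2k", buf, (ulong)buf.Length); // args[0]: path to the video file
        if (len < 0)
        {
            Console.Error.WriteLine("xattr not set; fall back to hashing the file");
            return;
        }
        string ed2k = Encoding.ASCII.GetString(buf, 0, (int)len);

        // Step 6: resolve the hash to a Shoko file id without re-reading the file.
        using var http = new HttpClient();
        string json = await http.GetStringAsync(
            $"http://localhost:8111/api/v3/File/Hash/ED2K?hash={ed2k}");
        Console.WriteLine(json);
    }
}
```

The point is that the client never reads the file's contents: one getxattr(2) call plus one HTTP request replaces a full rehash.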

FAQ

  • Why even store a hash? It allows the client to forgo maintaining a database and allows for batch-like, stateless operation. Idempotent.
  • Why not have the client look up by path? Path prefixes differ between the client and Shoko due to network mounts; in addition, file names may differ in the case of SMB with mapchars/mapposix (enabled by default).
  • Why xattr? Attaching the hash to the inode gives nice consistency properties: it is preserved on move/hardlink, can be preserved on copy, is transparent to symlinks, and is deleted on delete. There is no risk of forgetting to deal with a "sidecar file", e.g. moving/renaming one but not the other.
  • Is it always consistent? No, a normal write to the file will not update the xattr. Given that these are video files, that is not very likely. Even if it happens, e.g. via the force-default audio/subtitle scripts, it could be considered a benefit, given there won't be a DB entry for the updated hash.

STEPS TO REPRODUCE

N/A

@da3dsoul
Member

What are you talking about? Hash twice? Are you referring to AVDump? That isn't ours. We can't pass data to it. That's kind of the point of it.

@eduarrrd
Author

Can you elaborate on which part of the workflow I described is confusing to you? Is it the naming? If so: a program that makes calls to an API (like Shoko's REST API) is often called a client, short for "API client". That's the term I have used, and the reason I reference Shoko's API endpoints by their paths. I'm not sure why AVDump was brought into the picture.

I'm happy to give more details but I need more than "what are you talking about". Throw me a bone here.

@da3dsoul
Member

I misread. I don't understand the point, though.

@eduarrrd
Author

Can you elaborate on which part of the workflow I described is confusing to you?

I'm happy to give more details but I need more than "what are you talking about". Throw me a bone here.

@Cazzar
Member

Cazzar commented Nov 24, 2024

What issue are you particularly trying to solve that requires the double hashing that you mention in the original issue?

I have a few concerns about this approach, particularly that it is very Linux-specific and not especially portable across the OS platforms we support.

The reason @da3dsoul might have brought AVDump into the question is because that's the only use case within Shoko's code & supported tooling that would cause a file to be hashed a second time, outside of requesting a new hashing operation on an unrecognised file.

Alternatively, we are slowly building out a plugin API that exposes events such as FileHashed, which would allow you to create a plugin that does this on your own, see:

event EventHandler<FileEventArgs> FileHashed;
and you can see a working plugin that one person has done here: https://github.com/fearnlj01/WebhookDump-ShokoPlugin
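As a rough illustration of that route, here is a hypothetical plugin fragment built around the FileHashed event quoted above. Everything except the event signature is an assumption for the sketch: the IShokoEventHandler interface and the FileEventArgs members below are stand-ins, not the actual plugin API surface.

```csharp
// Hypothetical sketch only: the event signature matches the one quoted above,
// but the handler interface and FileEventArgs members are stand-ins, not the
// real Shoko plugin API.
using System;
using System.Runtime.InteropServices;
using System.Text;

interface IShokoEventHandler            // stand-in so the sketch is self-contained
{
    event EventHandler<FileEventArgs> FileHashed;
}

class FileEventArgs : EventArgs         // stand-in with assumed members
{
    public string FullPath { get; set; } = "";
    public string ED2K { get; set; } = "";
}

class XattrWriter
{
    // setxattr(2): a single syscall to persist the hash next to the file data.
    [DllImport("libc", SetLastError = true)]
    static extern int setxattr(string path, string name, byte[] value, ulong size, int flags);

    public XattrWriter(IShokoEventHandler events)
    {
        // Mirror every freshly computed hash into an xattr on the file itself.
        events.FileHashed += (sender, e) =>
        {
            byte[] hash = Encoding.ASCII.GetBytes(e.ED2K);
            setxattr(e.FullPath, "user.ed2k", hash, (ulong)hash.Length, 0);
        };
    }
}
```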

@bigretromike
Contributor

Hash information about a file has always been included in the database (or, if you are old-school like me, in the file name itself). Storing it in extended attributes is a fine idea, but as mentioned before, not as portable.
Hashes are computed as soon as you add the file to Shoko, because without them Shoko wouldn't be able to find the proper AniDB record.

@eduarrrd
Author

Let me reply to both #1204 (comment) and #1204 (comment) in a single comment:

> What issue are you particularly trying to solve that requires the double hashing that you mention in the original issue?

I need to identify a file whose metadata (base path, name, owner, ctime/mtime/atime, etc.) may differ from what Shoko sees: e.g. Shoko sees /mnt/data/title:subtitle.ext while I see /netshares/a/titlesubtitle.ext (yes, that's indeed codepoint U+F022). The only things I can rely on are the contents and attributes that no other tooling is using. Currently, this means that querying the identity by hash is the only way to do this. This is what I'm currently doing ... but to obtain the hash I currently have to execute the hashing operation, which also means reading the entire file. This is what I want to avoid. My approach is to use "attributes that no other tooling is using", e.g. an xattr called user.$HASHTYPE (or maybe user.shoko.$HASHTYPE).

> I have a few concerns about this approach, particularly that it is very Linux-specific and not especially portable across the OS platforms we support.

I proposed xattr since I'm familiar with it and I have a working solution. I believe Windows/NTFS has both extended attributes and Alternate Data Streams; I remember reading that WSL1 implemented Linux FS features on NTFS using them. I don't know how complicated the Windows side is, but on Linux the overhead is one system call for reading and one for writing.

Regarding interop, Samba has the vfs_streams_xattr module, which exposes ADS to non-Windows SMB clients.
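For illustration, a minimal smb.conf fragment enabling that module on a share (the share name and path are hypothetical):

```
[anime]
    path = /mnt/data
    vfs objects = streams_xattr
```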

> The reason @da3dsoul might have brought AVDump into the question is because that's the only use case within Shoko's code & supported tooling that would cause a file to be hashed a second time, outside of requesting a new hashing operation on an unrecognised file.

I see. To reiterate: (as you've observed) my request is for interop with code not in Shoko.

> Alternatively, we are slowly building out a plugin API that exposes events such as FileHashed, which would allow you to create a plugin that does this on your own, see:

Without more info I don't think I can realistically pursue this. A plugin API (a binary one, as far as I can see) would require me to create and maintain code in a language I haven't learned, using APIs and libraries I don't know, and to set up and maintain CI/CD for it and deal with artifact distribution. I don't really see docs or stability guarantees either.

> Hash information about a file has always been included in the database (or, if you are old-school like me, in the file name itself). Storing it in extended attributes is a fine idea, but as mentioned before, not as portable.

I agree that using the filename is the most straightforward option, but I cannot change the original filename/directory structure/etc. for archival reasons.

> Hashes are computed as soon as you add the file to Shoko, because without them Shoko wouldn't be able to find the proper AniDB record.

Certainly.

@maxpiva
Member

maxpiva commented Nov 25, 2024

I already went down this rabbit hole. ;) When I was pursuing cloud filesystem support for Shoko, downloading the whole file for hashing was indeed costly...

Deep in Shoko, probably from before the Command Refactor and while the WPF Server existed, there was a file checker that leveraged the fact that Shoko maintains the ED2K, MD5 and SHA1 of every file in its database; it was probably used only by me to check the health of all the files. ;)

You could use that as a base to create your own tool that connects to the Shoko database and writes the desired video.xattr files, or videofolder.xattr, or an NTFS alternate data stream with such data; .sfv files, for example.

You could in the future leverage what @Cazzar is talking about and create a plugin that handles the FileHashed event and writes such a file every time Shoko sees a new file and hashes it.

Shoko maintains a mapping between name and file in the DB, so moving the file to another folder will not trigger hashing again.

I think @bigretromike used to keep metadata files in his collections for other media centers, to leverage file-based metadata; maybe you could extend such file types.

For example:
https://kodi.wiki/view/NFO_files/Templates

I do find it attractive to maintain a standard file type in every folder (maybe .nfo) containing the hashes and minimal data like the AniDB id. That way, in some dystopian future, if you move such folders or recreate Shoko from scratch, the import could be much faster. Also, interop with other media centers could be more straightforward, since reading such a file is probably easier than calling a custom API, more so if it is a pseudo-standard that is already supported.

In the above nfo case, one could fill the uniqueid tag with SHA1, ED2K, MD5 and/or the AniDB id.
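For illustration, a hypothetical episode nfo fragment along those lines; Kodi's uniqueid element takes a free-form type attribute, so hash types could sit alongside the usual scraper ids (all values below are made up):

```xml
<episodedetails>
  <title>Other Title</title>
  <uniqueid type="anidb" default="true">123456</uniqueid>
  <uniqueid type="ed2k">0123456789abcdef0123456789abcdef</uniqueid>
  <uniqueid type="sha1">0123456789abcdef0123456789abcdef01234567</uniqueid>
</episodedetails>
```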

@eduarrrd
Author

eduarrrd commented Dec 2, 2024

> I already went down this rabbit hole. ;) When I was pursuing cloud filesystem support for Shoko, downloading the whole file for hashing was indeed costly...
>
> Deep in Shoko, probably from before the Command Refactor and while the WPF Server existed, there was a file checker that leveraged the fact that Shoko maintains the ED2K, MD5 and SHA1 of every file in its database; it was probably used only by me to check the health of all the files. ;)
>
> You could use that as a base to create your own tool that connects to the Shoko database and writes the desired video.xattr files, or videofolder.xattr, or an NTFS alternate data stream with such data; .sfv files, for example.

This has two issues:

  1. It implies creating a sidecar container next to the Shoko one, since that's the only way to guarantee visibility of the same paths Shoko uses.
  2. The maintenance work I described in #1204 (comment) changes nature but is still there: now it's the database schema instead of a C# API. Are there stability guarantees for it? For comparison, looking at Kodi (which naturally does a lot more), the video database schema is at version 131.

> You could in the future leverage what @Cazzar is talking about and create a plugin that handles the FileHashed event and writes such a file every time Shoko sees a new file and hashes it.
>
> Shoko maintains a mapping between name and file in the DB, so moving the file to another folder will not trigger hashing again.

This is useful to know, thanks. What happens if files have the same name, e.g. trailer.mp4? Is there disambiguation? Or are moves undetected? Or something like "at most one file among files with the same name can move between scans"?

> I think @bigretromike used to keep metadata files in his collections for other media centers, to leverage file-based metadata; maybe you could extend such file types.
>
> For example: https://kodi.wiki/view/NFO_files/Templates
>
> I do find it attractive to maintain a standard file type in every folder (maybe .nfo) containing the hashes and minimal data like the AniDB id. That way, in some dystopian future, if you move such folders or recreate Shoko from scratch, the import could be much faster. Also, interop with other media centers could be more straightforward, since reading such a file is probably easier than calling a custom API, more so if it is a pseudo-standard that is already supported.
>
> In the above nfo case, one could fill the uniqueid tag with SHA1, ED2K, MD5 and/or the AniDB id.

This is approximately my use case. I wrote a tool to go from "arbitrary unstructured files" to something along the lines of:

by-name -> .by-name-20241201/
.by-name-20241201/
  Show1Title/            # new directory generated using Shoko data
    tvshow.nfo           # new file generated using Shoko data
    poster.jpg           # new file generated using Shoko data
    S0E3 Other Title.ext # hardlinked from /some/path/randomname_special_thing.ext, name generated using Shoko data
    S0E3 Other Title.nfo # new file generated using Shoko data
    S1EY Title.ext       # hardlinked from /some/path/randomname S12EY.ext, name generated using Shoko data
    S1EY Title.nfo       # new file generated using Shoko data
  Show2Title/
    ...
  ...
.by-name-20241130/
   ...

@maxpiva
Member

maxpiva commented Dec 2, 2024

> This is useful to know, thanks. What happens if files have the same name, e.g. trailer.mp4? Is there disambiguation? Or are moves undetected? Or something like "at most one file among files with the same name can move between scans"?

File size is also stored, and both are checked, to bypass a new hash calculation (table FileNameHash).

I think renamers also touch this table: in that case, the import renames the file and both filenames are stored (but I'm not sure at the moment).

The VideoLocal table has all the hashes and the CRC32; VideoLocal_Place has the paths where the file is stored (it might be in multiple places), in conjunction with the import folders' prefix paths.

> This is approximately my use case. I wrote a tool to go from "arbitrary unstructured files" to something along the lines of:
>
> by-name -> .by-name-20241201/
> .by-name-20241201/
>   Show1Title/            # new directory generated using Shoko data
>     tvshow.nfo           # new file generated using Shoko data
>     poster.jpg           # new file generated using Shoko data
>     S0E3 Other Title.ext # hardlinked from /some/path/randomname_special_thing.ext, name generated using Shoko data
>     S0E3 Other Title.nfo # new file generated using Shoko data
>     S1EY Title.ext       # hardlinked from /some/path/randomname S12EY.ext, name generated using Shoko data
>     S1EY Title.nfo       # new file generated using Shoko data
>   Show2Title/
>     ...
>   ...
> .by-name-20241130/
>    ...

Nice. It seems Kodi supports multi-episode nfo files in the tvshow; I don't know about others.

I also think, or remember, that the APIs (probably not v3) have an endpoint, or a combination of them, where you provide the file and, in the end, get all the hashes and all the anime information. That can be leveraged if you don't want to connect to the DB.

If you need a custom API endpoint, that would be possible, if you do the hard work ;) and @ElementalCrisis or the other masters approve it.

And of course, you have a third option: join the team, help @Cazzar with the plugin framework, and create the first plugin that does everything automagically.

@Cazzar
Member

Cazzar commented Dec 2, 2024

To give a final decision on this:

We are not going to implement this in the core of Shoko, as it is out of scope for the project itself and, as discussed above, it seems to be for a very specific use case.

If you would like to implement this yourself, the Plugin API answer has been provided; looking at the Swagger endpoints might also give you some useful information.

Cazzar closed this as not planned on Dec 2, 2024.