Scene hashing and identification #6
I would vote for OpenSubtitles: it apparently hashes the first 64 KiB, the last 64 KiB, and the file size to identify files. This could be problematic for releases with identical intros and outros, leaving the file size as effectively the only unique data, but it might still be better than a plain MD5, or preferably at least a SHA.
As for the other discussion regarding similar scene data, we were discussing various perceptual hashing options: options for comparing data across transcodes and potentially even identifying content in compilations, which would be incredible for people seeking to identify performers in compilations with no cited sources. However, perceptual hashing is still a fairly new research field, so there are limited options. pHash was one that was mentioned, as was TMK, which Facebook uses to identify offensive or illegal content for its ThreatDB. TMK might be the ideal option; however, it seems more suited to confirming a match than to lookups. I am not an expert in perceptual hashing, so I don't know whether a pHash encompasses the whole video in an efficient way, or whether there is a good open source lookup methodology for scenes.
I have another proposal. The issue with hashing the entire file, per stash's current implementation, and with OpenSubtitles' approach, is that merely changing the file's metadata is enough to change the resulting hash under both algorithms. I propose that we use ffmpeg's md5 muxer to hash only the video stream. I found that running the md5 muxer over the whole video was reasonably slow, but after scaling the video stream to 32x32 (from here) I got a result a lot quicker.
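The approach described above might look something like the following. The exact flags are an assumption on my part (the comment doesn't give its command), but the idea is to decode only the video stream, shrink it so decoding is cheap, and hash the raw frames rather than the container:

```shell
# Hash only the decoded video frames, downscaled to 32x32 for speed.
# Because the digest covers pixel data, editing container metadata
# (tags, titles, etc.) should not change the result.
ffmpeg -i input.mp4 -map 0:v:0 -an -vf scale=32:32 -f md5 - 2>/dev/null
# prints a line of the form: MD5=<32 hex digits>
```

Note this still breaks down for re-encodes, since a different codec or quality setting changes the decoded frames slightly.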
I also propose that we request a parity token for submitted hashes, to reduce the possibility of garbage data. This will require extra work from the client to calculate, but it raises the barrier to submitting garbage checksums.
I would love to help test this. Could we all test a common/popular file?
I've tried the OpenSubtitles hash function (the C and Go versions) and I have to say it's very fast and doesn't seem to produce duplicates, even with identical intro/ending scenes. Apart from reading the binary at the start and end of the file, it also adds the file size to the hash, which might be why. There's a list of sample videos here, for example: https://gist.github.com/jsturgis/3b19447b304616f18657
Even if two scenes have the exact same outro/intro, I find it highly unlikely the OSO hash will produce a collision. The video header contains all kinds of info about frame count, bitrate, size, encoder version, streams, metadata, creation date, etc. It should be plenty to guarantee uniqueness. The advantage of OSO is also that it's an established algorithm with an implementation in any language you can imagine, and it doesn't require ffmpeg. It is somewhat resistant to corruption, since it simply ignores the majority of the data. Most importantly, it's also lightning fast.
Regarding hashing the video stream itself, I've had the same thought, and I think it's a good idea. The only thing I would want to test is how well it holds up when metadata is modified and the video stream is remuxed. There's also the question of whether we'd want to store straight SHA-1 hashes of the entire file. That would be useful for validating file integrity, which neither of the other two alternatives can do.
Imho, the more hashes we support the better. It doesn't hurt to have some more fields and to match on whatever you have available.
https://privatezero.github.io/amiapresentation2017/ is a quick 101 I discovered on perceptual hashing with FFmpeg.
FFmpeg hashing is far too slow if you hash the full video: it takes 5-10 times the length of the video to calculate. The same goes for perceptual hashing, which is even slower.
The challenge with oshash, or any other exact hash, is that it doesn't recognize re-encodes, so you'll end up with dozens of hashes for almost everything. WithoutPants has implemented a dupe detector based on a perceptual hash of the sprite, which should recognize re-encodes and different resolutions without issue. It might not work if the lengths differ, but it would still be immensely useful. I'm very keen to try it out in stash-box once time allows.
There has been a bit of discussion on Discord about scene hashing, and I'd like to get my head around how people would expect it to work in the central db.
Stash currently uses MD5 hashing (via crypto/md5). It hashes the entire file. I've little to no experience with file hashing, so I'm not sure how the various alternatives compare in terms of collisions and performance. There is also the topic of perceptual hashing, which may need its own issue to tease out.
OpenSubtitles has its own hashing algorithm that might be worth investigating.
The initial model I was targeting was for a scene to have a number of MD5 hashes associated with it, in order to be able to identify a scene by its hash. It sounded like there may have been talk of handling hashes of different types?
Anyway, it'd be nice to get some decisions on this for an initial prototype.