Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom "format detection" Attribute #14

Open
darkstar opened this issue Apr 2, 2018 · 3 comments
Open

Custom "format detection" Attribute #14

darkstar opened this issue Apr 2, 2018 · 3 comments

Comments

@darkstar
Copy link
Contributor

darkstar commented Apr 2, 2018

Some file formats lack a reliable magic number at the start, and have no clear file name pattern.

In these cases it would be good to have an attribute that takes a delegate which can be used to check, given a Stream (or EndianBinaryReader), if the format would be able to process the file in question., for example by reading in a header and doing a sanity check on the values.

xdanieldzd added a commit that referenced this issue Apr 5, 2018
Added new FormatDetectionAttribute for file formats, allows specifying a function to be executed when Scarlet attempts detection, ex. to sanity check header values when no clear magic numbers are available, see issue #14; made P5BustupBIN container format use new attribute
@xdanieldzd
Copy link
Owner

Just added some functionality for this in the form of the FormatDetectionAttribute, plus an example in the P5BustupBIN container format.

You can now specify a static method to be executed during file detection, which should return either true when the file appears to be valid (header values appear sane, etc.), or false when it appears invalid (ex. actual filesize is less than what the header specifies). This should work fine in conjunction with the existing attributes as well. I'm open for suggestions on how to improve this, too, it might be a bit "quick and dirty" right now.

@darkstar
Copy link
Contributor Author

darkstar commented Apr 18, 2018

Yes, it seems to work, however, the FormatDetection should not be run as alternative ("OR") to the other detections, but as additional step ("AND").

Two examples:

If the FilenamePattern indicates a match, and the FormatDetection function indicates no match, then the end result should be no match. I ran into this when trying to define a match based on file name pattern AND a detection format

If the filename does not match, but the FormatDetection indicates a match, the end result should be "no match" (or rather, in this case, the FormatDetection should probably not be run at all). This happens with the P5bustupBIN giving false positives on some Disgaea3 files, which have a totally different extension (.pac instead of .bin/.dds2) but the P5 ContainerFormat still tries to unpack it because the (rather simple/generic) format detection flags it as a match. See my PR for a slight improvement to the detection function (still not perfect -- for the PAC files I chose to do a full file header verification to reduce false positives to a minimum)

So, to summarize, I think it should work like that:

  • The magic number attribute is a MUST. If it does not match, there is no use in trying to process the file with the plugin in question
  • The detection function is a MUST. If it indicates a non-match, there is no need to try and process the file. It should be written to filter out "false positives" as much as possible
  • The file name match is more like a hint to the program, that not every plugin should be tried on every file. It "pre-selects" the possible plugins

The rationale is this: There is probably little use in trying to, say, decompress an archive where you know the magic number is incorrect, or where you know (by some heuristic in the detection function) that it "looks" invalid, because the developer could not possibly have foreseen how such a file should be handled (otherwise he would have put the correct magic number in, or changed his detection function)

@darkstar
Copy link
Contributor Author

After thinking about it a bit more, this might be a simpler heuristic:
First, check every MagicNumber attribute. At least one of them must match. If there is no such attribute, default to "matched".
Then, check every DetectionFunction attribute. At least one of them must match. If there is no such attribute, default to "matched".
If the AND of these two results is a match, check the filename pattern. If it also matches, use the corresponding plugin. If it does not match, "fail" the match but let the user override this fail into a "matched" by providing a command line argument, e.g "-ignoreFilenamePatterns". Or, maybe even better, automatically ignore the FilenamePattern if none of the plugins matches all three attribute types and just try every plugin that matches at least the first two (MagicNumber, DetectionFunction)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants