feat: YARA-X dumper module #50
Conversation
The API proposed for the […] The thought process should be quite the opposite: I need to dump the information produced by a list of module names. Which abstractions or building blocks do I need for implementing this into […]? The most obvious abstraction to me is: given a […]. This is the right abstraction: a crate that is able to produce YAML output for […]. The YAML protobuf customization options don't need to be in the […].

Regarding the protobuf customization options, the boolean […]

```proto
enum MyFlags {
  Foo = 0x01;
  Bar = 0x02;
  Baz = 0x04;
}

message MyStruct {
  MyFlags flags = 1 [(yara.field_options).yaml_fmt = "f:MyFlags"];
}
```

The YAML output for this field could be:

```yaml
flags: 0x3 # Foo | Bar
```

Notice the additional comment dissecting the flags. Of course, this crate would also support colorful YAML output using ANSI escape codes, which is something independent from the formatting options (we may want a human-friendly YAML output without colors for saving it to a text file). With these building blocks, implementing the […]
Let's go back to the drawing board... The only thing we really need here is a crate (let's call it […]). This crate doesn't need to produce JSON, or anything else; it's just a protobuf -> YAML converter. The two special features we need in this crate are: […]

Once we have this, the remaining bits are very straightforward.
Thank you for the suggestions. Basically I agree with everything.

I have removed all other forms of serialization and cleaned the crate to provide only protobuf -> YAML conversion. This is implemented by […]. I have done this as you suggested and used one unified […]. In case you have any other notes/suggestions, feel free to let me know.
This is much better now, but it still needs a bit of polishing. I'm refactoring a few things to provide a mechanism for passing arbitrary data to any module by name, without having to use the trick of creating a dummy rule that imports that module. Once I have it we will be able to remove that part of the code and make it cleaner. In the meantime I'll review other aspects of the code.
yara-x-cli/src/commands/dump.rs
Outdated
```rust
.help("Name of the module or comma-separated list of modules to be used for parsing")
.required(false)
.action(ArgAction::Append)
.value_parser(get_builtin_modules_names()),
```
Not all modules are able to produce a result for the dump command; for instance, the `time` module doesn't produce anything. So, including the whole list of built-in modules is not the best option here. We'd better remove the `get_builtin_modules_names` function and include here a fixed list of allowed modules, which should contain only the modules that parse some file format and produce metadata about it.
I have thought about this, but isn't maintaining another list of possible modules an unnecessary addition? Right now the modules that do not produce any output are filtered by `mod_output.compute_size_dyn() != 0`. I am not sure if it is a performance issue; if so, sure, I can make a list of just the modules that make sense.
In this case I think it's better to be explicit about the list of modules that are supported by the `dump` command at the CLI level, instead of taking the whole list of built-in modules from `get_builtin_modules_names`. There is a variety of reasons for that; for example, there are modules only for testing, like `test_proto2` and `test_proto3`. Those actually produce results, but we don't want them as possible options for the `dump` command. Also, we may at some point have modules that exist but are not stable enough for public use. So, it's much simpler if the possible arguments for the `dump` command are explicitly controlled where the command is implemented.
yara-x-cli/src/commands/dump.rs
Outdated
```rust
// # Returns
//
// * `true` if the module output is valid, `false` otherwise.
fn module_is_valid(mod_output: &dyn MessageDyn) -> bool {
```
I'm not quite convinced about all this logic that determines if the output of a module is valid. It requires that you mark a field in the proto, and makes things more complex with too little gain. If the user uses the command-line for dumping the result of the `pe` module and passes a non-PE file, the result will be a very small YAML/JSON where the `is_pe` field is set to `false`; that's ok to me.

The "auto" option, which allows passing only the file and letting the CLI figure out which modules make sense to be taken into account and which not, is a nice addition, but it can be implemented by putting the logic in the CLI tool itself, instead of having to bake all that validity checking into `yara-x` itself.
The steps the CLI would take are:

- Given some `&[u8]` with the content of the file, ask `yara-x` to produce the output for module "pe" (or any other module name). This is the logic I'm implementing now, and it will return a protobuf.
- As the CLI already has access to the protobuf, the validation logic can be executed by the CLI before passing the protobuf to the serializer for obtaining a text representation of it.
- Based on the output format requested, use `protobuf_json_mapping::print_to_string` or our own YAML serializer.
> If the user uses the command-line for dumping the result of the pe module and passes a non-PE file, the result will be a very small YAML/JSON where the is_pe field is set to false; that's ok to me.
This is already implemented if the user manually specifies the module he wants to use. The output will be from the given module even when `is_pe` or any other flag is set to false or not set at all. All the complexity comes from the automatic selection. Some modules, such as `pe` or `lnk`, have a built-in validity flag (`is_pe` or `is_lnk`), but `macho` (and others) don't have this option. Therefore it is right now a bit module-specific, and I wanted a universal solution: marking a certain field as a flag. This is unified across all modules and easy to validate. But I am open to other solutions that would not require adding another specification to the protobuf itself.
> As the CLI already has access to the protobuf, the validation logic can be executed by the CLI before passing the protobuf to the serializer for obtaining a text representation of it.
I am not sure what the dependency on `yara-x` is right now. The `module_is_valid()` function takes a protobuf and just checks the flag. I am not sure what another option would be. Without this flag I could manually traverse the protobuf and search for some kind of field that would tell me "ok, this is valid", but as it differs across modules it would be a bit complex, and with a new module there is a chance that something would have to be added into the code.
> This is already implemented if the user manually specifies the module he wants to use. The output will be from the given module even when is_pe or any other flag is set to false or not set at all. All the complexity comes from the automatic selection. Some modules, such as pe or lnk, have a built-in validity flag (is_pe or is_lnk), but macho (and others) don't have this option. Therefore it is right now a bit module-specific, and I wanted a universal solution: marking a certain field as a flag. This is unified across all modules and easy to validate. But I am open to other solutions that would not require adding another specification to the protobuf itself.
I like the feature. I like that the `dump` command produces only the output for the file types that make sense, instead of blindly dumping the results of all modules. What I don't like is that this functionality is embedded in `yara-x`, by means of having to declare in the `.proto` a field that determines whether the output is valid or not.

I believe that all those checks also correspond to the CLI tool itself, not to the `yara-x` library.
> I am not sure what the dependency on yara-x is right now. The module_is_valid() function takes a protobuf and just checks the flag. I am not sure what another option would be. Without this flag I could manually traverse the protobuf and search for some kind of field that would tell me "ok, this is valid", but as it differs across modules it would be a bit complex, and with a new module there is a chance that something would have to be added into the code.
What I mean by a dependency on `yara-x` is that the CLI tool, in order to determine which modules it should include in the output, is relying on the validity flag provided by the `yara-x` library itself. This validity logic is something that probably only the CLI will use, and it's better implemented at the CLI level. The problem you have now is that it's impossible to inspect the result produced by each module at the CLI level, and therefore you don't have access to the `is_pe` or `is_lnk` fields, but that's one of the problems I plan to solve in the API I'm working on.
Take a look at this PR: #52. It introduces two new functions that allow invoking a module without having to create a dummy YARA rule. Basically you would do something like:

```rust
let elf_info = yara_x::mods::invoke_mod::<yara_x::mods::ELF>(data)?;
let lnk_info = yara_x::mods::invoke_mod::<yara_x::mods::Lnk>(data)?;
```

With that you will get the Rust structure corresponding to each module. There's another version that returns […]:

```rust
let elf_info = yara_x::mods::invoke_mod_dyn::<yara_x::mods::ELF>(data)?;
let lnk_info = yara_x::mods::invoke_mod_dyn::<yara_x::mods::Lnk>(data)?;
```
Force-pushed (…o dylib versions changed to strings and tests added) from 8c25e68 to 2209400.
Thank you, I have just pushed the latest version, which makes use of this new API and should incorporate all of your suggestions.
I've been reviewing this PR more in-depth and found some other issues. The most important issue is that nested structures are not rendered correctly. For example, if I add another nested structure to `OptionalNested`, like this...

```proto
message OptionalNested {
  optional uint32 onested1 = 1;
  optional uint64 onested2 = 2 [(dumper.field_options).yaml_fmt = "x"];
  map<string, string> map_string_string = 3;
  optional Nested2 nested = 4;
}

message Nested2 {
  optional string foo = 1;
}
```
The produced YAML is not correct:

```yaml
field1: 0x7b
field2: "test"
field3: "test\ntest"
segments:
  - nested1: 456
    nested2: 0x315
    timestamp: 123456789 # 1973-11-29 21:33:09 UTC
  - nested1: 100000
    nested2: 0x30d40
    timestamp: 999999999 # 2001-09-09 01:46:39 UTC
optional:
  onested1: 123
  onested2: 0x1c8
  map_string_string:
    "foo\nfoo": "bar\nbar"
  nested:
  foo: "foo"
```
Notice that the `foo` field (which belongs to `nested`) is not correctly aligned under `nested`. In light of these findings I decided to give it a try and drafted my own implementation of the protobuf to YAML serializer:
https://github.com/VirusTotal/yara-x/commits/yaml-serializer

My implementation is only an incomplete draft, but it solves the issue with nested structures and tries to address some other problems with the API design. For example, it uses a writer instead of a `String` for obtaining the final YAML. This is more flexible and can be used for outputting the YAML directly to a file, without having to allocate a string containing the whole YAML.
Hi Victor, thank you for your review. I have addressed your issues and added all the features into your template in #53. We can continue the discussion there. Thank you!
Merged in #53
Dumping module for YARA-X. This module can be run by `yr dump <binary-file> <options>`, where the options are not required and can be one of: `--modules/-m=<module>` and `--output-format/-o=<format>`. Currently supported output formats are: […]. Supported modules are all built-in modules that are retrieved in `build.rs`.

This module takes a binary file and dumps the parsed information to STDOUT. The modules used for parsing can either be selected explicitly with the `--modules/-m=<module>` option or, if that option is left empty, chosen automatically via the `validity_flag`. This flag determines whether a module's output is valid and therefore considered successfully parsed. The validity flag is described in `Module Developer's Guide.md`.

Various display formats are supported, and the user can also specify multiple modules for parsing, in which case the output of all of them will be shown. The format can be selected via `--output-format/-o=<format>`. If left empty, the `human-readable` format is selected automatically. This format is basically valid YAML with additional colors and comments. It also supports an additional integer representation, which can be selected via a protobuf field descriptor and is also described in `Module Developer's Guide.md`.

Together with this I have also added parametrized tests for end-to-end testing and made a minor change in the `macho` module, where the dylib version was represented as an integer. It makes much more sense to have this as a string (the last two digits represent the patch version, the previous two the minor version, and the rest the major version number). cc @latonis

So far I have added the validity flag into the `macho` and `lnk` modules. Integer field representation descriptors were added only into the `macho` module. I think module authors should do both while developing a module, so feel free to add them to other modules that produce output, or change them in existing modules.