
feat: YARA-X dumper module #50

Closed
wants to merge 26 commits

Conversation

TommYDeeee
Contributor

Dumping module for YARA-X. This module can be run with yr dump <binary-file> <options>, where the options are not required and can be any of --modules/-m=<module> and --output-format/-o=<format>.
Currently supported output formats are:

  • json
  • xml
  • toml
  • yaml
  • human-readable

Supported modules are all built-in modules retrieved in build.rs.
This command takes a binary file and dumps the parsed information to STDOUT. The modules used for parsing can either be selected explicitly with the --modules/-m=<module> option (multiple modules may be given, in which case the output of all of them is shown), or, if the option is omitted, chosen automatically via the validity_flag. This flag determines whether a module's output is valid and therefore counts as successfully parsed; it is described in Module Developer's Guide.md. The output format can be selected with --output-format/-o=<format>; if omitted, the human-readable format is used. This format is basically valid YAML with additional colors and comments. It also supports alternative integer representations, selected via a protobuf field descriptor and also described in Module Developer's Guide.md.

Together with this I have also added parametrized tests for end-to-end testing, and made a minor change in the macho module, where the dylib version was represented as an integer. It makes much more sense to have this as a string (the last two digits represent the patch version, the previous two the minor version, and the rest the major version number). cc @latonis
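For context, a minimal sketch of that conversion, assuming the standard Mach-O packing (major version in the high 16 bits, minor and patch in one byte each; the helper name is hypothetical):

```rust
// Hypothetical helper: unpack a Mach-O dylib version integer into the
// "major.minor.patch" string form proposed above. Mach-O stores the
// version as major << 16 | minor << 8 | patch.
fn dylib_version_to_string(v: u32) -> String {
    format!("{}.{}.{}", v >> 16, (v >> 8) & 0xff, v & 0xff)
}

fn main() {
    // 0x00010203 unpacks to major 1, minor 2, patch 3.
    println!("{}", dylib_version_to_string(0x0001_0203)); // 1.2.3
}
```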

So far I have added the validity flag to the macho and lnk modules. Integer field representation descriptors were added only to the macho module. I think module authors should do both while developing a module, so feel free to add them to other modules that produce output, or to change existing modules.

@plusvic
Member

plusvic commented Nov 7, 2023

The API proposed for the yara-x-dump crate is not the most appropriate one in my opinion, as it creates a dependency on the yara-x crate itself, because it needs the compiler and scanner for producing the module's output. It looks like the thought process was: I need to dump the information produced by a list of module names, let's create an API that does exactly that, put it as an independent crate, and call it from yara-cli.

The thought process should be quite the opposite: I need to dump the information produced by a list of module names; which abstractions or building blocks do I need for implementing this in yara-cli?

The most obvious abstraction to me is: given a &dyn MessageDyn containing the data and specification for a protobuf message, produce a text representation of it in one of the supported formats. For the JSON format it's straightforward, as we already have the protobuf_json_mapping crate. For very basic YAML output we could simply use serde_yaml, but we want a human-friendly YAML output that can be controlled from the corresponding .proto files, so that's where our crate should specialise.

This is the right abstraction: a crate that is able to produce YAML output for &dyn MessageDyn messages produced by the protobuf crate, while allowing customization of the produced YAML via protobuf options. This has the potential of becoming a fully standalone crate that anyone can use in their own projects!

The YAML protobuf customization options don't need to be in the yara.proto file (in the yara-x-proto crate). They could reside in some other yaml.proto file located in this independent crate, which would allow implementing it as a truly independent crate that doesn't depend on YARA-X at all, not even on the tiny yara-x-proto crate. But for now we could simply use the yara.proto file.

Regarding the protobuf customization options, the boolean hex_value is ok ([(yara.field_options).hex_value = true]), but I'm thinking of a more flexible mechanism, like [(yara.field_options).yaml_fmt = "x"]. Another format option could be "t" for a timestamp, and "f:SomeEnumType" for fields that are a set of flags represented by a protobuf enum named SomeEnumType. Example:

enum MyFlags {
   Foo = 0x01; 
   Bar = 0x02;
   Baz = 0x04;
}

message MyStruct {
   MyFlags flags = 1 [(yara.field_options).yaml_fmt = "f:MyFlags"];
}

The YAML output for this field could be:

flags: 0x3   #  Foo | Bar

Notice the additional comment dissecting the flags.
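A stdlib-only sketch of what such an "f:SomeEnumType" renderer could do. The (bit, name) table here is written by hand; the real crate would derive it from the protobuf enum named after the colon:

```rust
// Render a flags field as hex, plus a comment naming each set bit.
// The (bit, name) table is hypothetical; in practice it would come
// from the protobuf enum descriptor referenced by the format option.
fn format_flags(value: u32, flags: &[(u32, &str)]) -> String {
    let names: Vec<&str> = flags
        .iter()
        .filter(|(bit, _)| value & bit != 0)
        .map(|(_, name)| *name)
        .collect();
    if names.is_empty() {
        format!("0x{:x}", value)
    } else {
        format!("0x{:x}  # {}", value, names.join(" | "))
    }
}

fn main() {
    let my_flags = [(0x01, "Foo"), (0x02, "Bar"), (0x04, "Baz")];
    println!("flags: {}", format_flags(0x3, &my_flags));
    // flags: 0x3  # Foo | Bar
}
```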

Of course, this crate would also support colorful YAML output using ANSI escape codes, which is something independent from the formatting options (we may want a human-friendly YAML output without colors for saving it to a text file).

With these building blocks, implementing the yr dump command in the CLI is very easy, but we need to implement the building blocks first, and they should be implemented in a way that is as independent from YARA as possible.

@plusvic plusvic closed this Nov 7, 2023
@plusvic plusvic reopened this Nov 7, 2023
@plusvic
Member

plusvic commented Nov 8, 2023

Let's go back to the drawing board...

The only thing we really need here is a crate (let's call it yara-x-proto-yaml) that does exactly one thing: take a &dyn MessageDyn and return a String with the message encoded in YAML format. Nothing else.

This crate doesn't need to produce JSON, or anything else, it's just a protobuf -> YAML converter. The two special features we need in this crate are:

  1. Optional colors in the YAML output (no colors by default)
  2. Your .proto files can use options for controlling some aspects of the YAML generation, like whether an integer should be rendered in hex, or whether a field is a timestamp that should be accompanied by a comment with the date and time in a human-readable format.
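Feature 2 could be sketched as follows. This is only an illustration of the timestamp option (the function name is hypothetical), using a hand-rolled days-to-date conversion (Howard Hinnant's civil-from-days algorithm) so that nothing beyond the standard library is needed:

```rust
// Illustration of a "t" format option: print the raw integer and append
// a comment with the UTC date and time, without pulling in a date crate.
fn format_timestamp(ts: i64) -> String {
    let days = ts.div_euclid(86_400);
    let secs = ts.rem_euclid(86_400);
    // Convert days since 1970-01-01 into year/month/day
    // (Howard Hinnant's civil-from-days algorithm).
    let z = days + 719_468;
    let era = z.div_euclid(146_097);
    let doe = z.rem_euclid(146_097);
    let yoe = (doe - doe / 1460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
    let mp = (5 * doy + 2) / 153;
    let d = doy - (153 * mp + 2) / 5 + 1;
    let m = if mp < 10 { mp + 3 } else { mp - 9 };
    let y = yoe + era * 400 + if m <= 2 { 1 } else { 0 };
    format!(
        "{} # {:04}-{:02}-{:02} {:02}:{:02}:{:02} UTC",
        ts, y, m, d, secs / 3600, (secs / 60) % 60, secs % 60
    )
}

fn main() {
    println!("timestamp: {}", format_timestamp(123_456_789));
    // timestamp: 123456789 # 1973-11-29 21:33:09 UTC
}
```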

Once we have this, the remaining bits are very straightforward.

@TommYDeeee
Contributor Author

Let's go back to the drawing board...

The only thing we really need here is a crate (let's call it yara-x-proto-yaml) that does exactly one thing: take a &dyn MessageDyn and return a String with the message encoded in YAML format. Nothing else.

This crate doesn't need to produce JSON, or anything else, it's just a protobuf -> YAML converter. The two special features we need in this crate are:

  1. Optional colors in the YAML output (no colors by default)
  2. Your .proto files can use options for controlling some aspects of the YAML generation, like whether an integer should be rendered in hex, or whether a field is a timestamp that should be accompanied by a comment with the date and time in a human-readable format.

Once we have this, the remaining bits are very straightforward.

Thank you for the suggestions. I basically agree with everything.

This crate doesn't need to produce JSON, or anything else, it's just a protobuf -> YAML converter

I have removed all other forms of serialization and cleaned up the crate to provide only protobuf -> YAML conversion.

Optional colors in the YAML output (no colors by default)

This is implemented via the yansi::Paint::disable() method, which turns off all ANSI colors; it works for both YAML and JSON. I have also implemented colors for JSON using an external library just for this; it also pretty-prints the string, so even with colors disabled the output is more readable.

Your .proto files can use options for controlling some aspects of the YAML generation, like whether an integer should be rendered in hex, or whether a field is a timestamp that should be accompanied by a comment with the date and time in a human-readable format.

I have done this as you suggested and used one unified (yara.field_options).yaml_fmt descriptor that takes a string as input. Right now I only look for "x" for hexadecimal and "t" for timestamps; this can be easily extended in the future. The hexadecimal value replaces the original integer one, and the timestamp is printed as a comment next to the integer version. In the future I would also like to implement the flags, but at this point I need to figure out how to get an enum with a specified name from the code itself; it looks like there is only a getter for fields.
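As an illustration, the dispatch over the yaml_fmt string could be as simple as this (function name hypothetical, "t" elided):

```rust
// Hypothetical dispatcher for the yaml_fmt option: "x" renders the
// integer in hex, "t" would append a human-readable UTC comment
// (omitted here), anything else falls back to plain decimal.
fn apply_yaml_fmt(fmt: &str, value: u64) -> String {
    match fmt {
        "x" => format!("0x{:x}", value),
        _ => value.to_string(),
    }
}

fn main() {
    println!("{}", apply_yaml_fmt("x", 456)); // 0x1c8
    println!("{}", apply_yaml_fmt("", 456)); // 456
}
```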

In case you have any other notes/suggestions feel free to let me know.

@plusvic
Member

plusvic commented Nov 9, 2023

This is much better now, but it still needs a bit of polishing. I'm refactoring a few things to provide a mechanism for passing arbitrary data to any module by name, without having to use the trick of creating a dummy rule that imports the module. Once I have it, we will be able to remove that part of the code and make it cleaner.

In the meantime I'll review other aspects of the code.

.help("Name of the module or comma-separated list of modules to be used for parsing")
.required(false)
.action(ArgAction::Append)
.value_parser(get_builtin_modules_names()),
Member

@plusvic plusvic Nov 9, 2023


Not all modules are able to produce a result for the dump command; for instance, the time module doesn't produce anything. So, including the whole list of built-in modules is not the best option here. We'd better remove the get_builtin_modules_names function and include here a fixed list of allowed modules, which should contain only the modules that parse some file format and produce metadata about it.

Contributor Author


I have thought about this, but isn't maintaining another list of possible modules an unnecessary addition? Right now the modules that do not produce any output are filtered by mod_output.compute_size_dyn() != 0. I am not sure whether this is a performance issue; if it is, I can make a list of just the modules that make sense.

Member


In this case I think it's better to be explicit about the list of modules that are supported by the dump command at the CLI level, instead of taking the whole list of built-in modules from get_builtin_modules_names. There are a variety of reasons for that; for example, there are modules only for testing, like test_proto2 and test_proto3. Those actually produce results, but we don't want them as possible options for the dump command. Also, we may at some point have modules that are there but are not stable enough for public use. So, it's much simpler if the possible arguments for the dump command are explicitly controlled where the command is implemented.

// # Returns
//
// * `true` if the module output is valid, `false` otherwise.
fn module_is_valid(mod_output: &dyn MessageDyn) -> bool {
Member


I'm not quite convinced about all this logic that determines whether the output of a module is valid. It requires that you mark a field in the proto, and makes things more complex with too little gain. If the user uses the command line for dumping the result of the pe module and passes a non-PE file, the result will be a very small YAML/JSON where the is_pe field is set to false; that's ok to me.

The "auto" option, which allows passing only the file and letting the CLI figure out which modules make sense to be taken into account and which not, is a nice addition, but it can be implemented by putting the logic in the CLI tool itself, instead of having to bake all that validity checking into yara-x itself.

The steps the CLI would take are:

  1. Given some &[u8] with the content of the file, ask yara-x to produce the output for module "pe" (or any other module name). This is the logic I'm implementing now, and it will return a protobuf.
  2. As the CLI already has access to the protobuf, the validation logic can be executed by the CLI before passing the protobuf to the serializer to obtain a text representation of it.
  3. Based on the requested output format, use protobuf_json_mapping::print_to_string or our own YAML serializer.

Contributor Author


If the user uses the command line for dumping the result of the pe module and passes a non-PE file, the result will be a very small YAML/JSON where the is_pe field is set to false; that's ok to me.

This is already implemented if the user manually specifies the module he wants to use. The output will be from the given module even when is_pe or any other flag is set to false or not set at all. All the complexity comes from automatic selection. Some modules such as pe or lnk have a built-in validity flag (is_pe or is_lnk), but macho (and others) don't have this option. Therefore it is right now a bit module-specific, and I wanted a universal solution that works by marking a certain field as a flag. This is unified across all modules and easy to validate. But I am open to other solutions that don't require adding another specification to the protobuf itself.

As the CLI already have access to the protobuf, it the validation logic can be executed by the CLI before passing the protobuf to the serializer of obtaining a text representation of it.

I am not sure what the dependency on yara-x is right now. The module_is_valid() function takes a protobuf and just checks the flag. I am not sure what another option would be. Without this flag I could manually traverse the protobuf and search for some kind of field that would tell me "ok, this is valid", but as it differs across modules it would be a bit complex, and with every new module there would be a chance that something has to be added to the code.

Member

@plusvic plusvic Nov 10, 2023


This is already implemented if the user manually specifies the module he wants to use. The output will be from the given module even when is_pe or any other flag is set to false or not set at all. All the complexity comes from automatic selection. Some modules such as pe or lnk have a built-in validity flag (is_pe or is_lnk), but macho (and others) don't have this option. Therefore it is right now a bit module-specific, and I wanted a universal solution that works by marking a certain field as a flag. This is unified across all modules and easy to validate. But I am open to other solutions that don't require adding another specification to the protobuf itself.

I like the feature. I like that the dump command produces output only for the file types that make sense, instead of blindly dumping the results of all modules. What I don't like is that this functionality is embedded in yara-x, by means of having to declare in the .proto a field that determines whether the output is valid or not.
I believe that all those checks also belong in the CLI tool itself, not in the yara-x library.

I am not sure what the dependency on yara-x is right now. The module_is_valid() function takes a protobuf and just checks the flag. I am not sure what another option would be. Without this flag I could manually traverse the protobuf and search for some kind of field that would tell me "ok, this is valid", but as it differs across modules it would be a bit complex, and with every new module there would be a chance that something has to be added to the code.

What I mean by a dependency on yara-x is that the CLI tool, in order to determine which modules it should include in the output, relies on the validity flag provided by the yara-x library itself. This validity logic is something that probably only the CLI will use, and it's better implemented at the CLI level. The problem you have now is that it's impossible to inspect the result produced by each module at the CLI level, and therefore you don't have access to the is_pe or is_lnk fields, but that's one of the problems I plan to solve in the API I'm working on.

@plusvic
Member

plusvic commented Nov 10, 2023

Take a look at this PR: #52

It introduces two new functions that allow invoking a module without having to create a dummy YARA rule. Basically you would do something like:

let elf_info = yara_x::mods::invoke_mod::<yara_x::mods::ELF>(data)?;
let lnk_info = yara_x::mods::invoke_mod::<yara_x::mods::Lnk>(data)?;

With that you will get the Rust structure corresponding to each module. There's another version that returns Box<&dyn MessageDyn>

let elf_info = yara_x::mods::invoke_mod_dyn::<yara_x::mods::ELF>(data)?;
let lnk_info = yara_x::mods::invoke_mod_dyn::<yara_x::mods::Lnk>(data)?;

@TommYDeeee
Contributor Author

Take a look at this PR: #52

It introduces two new functions that allow invoking a module without having to create a dummy YARA rule. Basically you would do something like:

let elf_info = yara_x::mods::invoke_mod::<yara_x::mods::ELF>(data)?;
let lnk_info = yara_x::mods::invoke_mod::<yara_x::mods::Lnk>(data)?;

With that you will get the Rust structure corresponding to each module. There's another version that returns Box<&dyn MessageDyn>

let elf_info = yara_x::mods::invoke_mod_dyn::<yara_x::mods::ELF>(data)?;
let lnk_info = yara_x::mods::invoke_mod_dyn::<yara_x::mods::Lnk>(data)?;

Thank you, I have just pushed the latest version, which makes use of this new API and should incorporate all of your suggestions.

Member

@plusvic plusvic left a comment


I've been reviewing this PR more in-depth and found some other issues. The most important issue is that nested structures are not rendered correctly. For example, if I add another nested structure to OptionalNested, like this...

message OptionalNested {
  optional uint32 onested1 = 1;
  optional uint64 onested2 = 2 [(dumper.field_options).yaml_fmt = "x"];
  map<string, string> map_string_string = 3;
  optional Nested2 nested = 4;
}

message Nested2 {
  optional string foo = 1;
}

The produced YAML is not correct:

field1: 0x7b
field2: "test"
field3: "test\ntest"
segments:
  - nested1: 456
    nested2: 0x315
    timestamp: 123456789 # 1973-11-29 21:33:09 UTC
  - nested1: 100000
    nested2: 0x30d40
    timestamp: 999999999 # 2001-09-09 01:46:39 UTC
optional:
onested1: 123
    onested2: 0x1c8
    map_string_string:
        "foo\nfoo": "bar\nbar"
    nested:
foo: "foo"

Notice that the foo field (which belongs to nested) is not correctly aligned under nested. In light of these findings I decided to give it a try and draft my own implementation of the protobuf-to-YAML serializer:

https://github.com/VirusTotal/yara-x/commits/yaml-serializer

My implementation is only an incomplete draft, but it solves the issue with nested structures and tries to address some other problems with the API design, for example using a writer instead of a String for obtaining the final YAML. This is more flexible and can be used for outputting the YAML directly to a file without having to allocate a string containing the whole YAML.
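The two API ideas mentioned above, writing into any io::Write and tracking the indentation level across nested messages, can be sketched with a stand-in value type (this is an illustration under those assumptions, not the draft's actual API):

```rust
use std::io::{self, Write};

// Stand-in for the protobuf reflection values the real serializer walks.
enum Value {
    Str(String),
    Map(Vec<(String, Value)>),
}

// Serialize into any io::Write (a Vec<u8>, a file, stdout...), carrying
// the current indentation level so nested fields stay aligned.
fn write_map<W: Write>(w: &mut W, fields: &[(String, Value)], indent: usize) -> io::Result<()> {
    for (name, value) in fields {
        write!(w, "{}{}:", " ".repeat(indent), name)?;
        match value {
            Value::Str(s) => writeln!(w, " \"{}\"", s)?,
            Value::Map(nested) => {
                writeln!(w)?; // key on its own line, children indented below
                write_map(w, nested, indent + 4)?;
            }
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let msg = vec![(
        "optional".to_string(),
        Value::Map(vec![
            ("onested1".to_string(), Value::Str("123".to_string())),
            (
                "nested".to_string(),
                Value::Map(vec![("foo".to_string(), Value::Str("foo".to_string()))]),
            ),
        ]),
    )];
    let mut out = Vec::new();
    write_map(&mut out, &msg, 0)?;
    io::stdout().write_all(&out)
}
```

Here foo is emitted one level deeper than nested, which is exactly the alignment that was broken in the output above.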

@TommYDeeee TommYDeeee mentioned this pull request Nov 22, 2023
@TommYDeeee
Contributor Author

Hi Victor, thank you for your review. I have addressed your issues and added all the features into your template in #53. We can continue to discuss it there. Thank you!

@TommYDeeee
Contributor Author

Merged in #53

@TommYDeeee TommYDeeee closed this Nov 23, 2023