
feat: YARA-X dumper module #50

Closed
wants to merge 26 commits

Conversation

TommYDeeee
Contributor

Dumping module for YARA-X. This module can be run with yr dump <binary-file> <options>, where the options are not required and can be any of --modules/-m=<module> and --output-format/-o=<format>.
Currently supported output formats are:

  • json
  • xml
  • toml
  • yaml
  • human-readable

Supported modules are all built-in modules retrieved in build.rs.
This command takes a binary file and dumps the parsed information to STDOUT. The modules used for parsing can either be selected explicitly with the --modules/-m=<module> option (multiple modules may be given, in which case the output of all of them is shown), or, if the option is omitted, chosen automatically via the validity_flag. This flag determines whether a module's output is valid and therefore counts as successfully parsed; it is described in Module Developer's Guide.md. The output format can be selected with --output-format/-o=<format>; if omitted, the human-readable format is used. This format is basically valid YAML with additional colors and comments. It also supports alternative integer representations, selected via a protobuf field descriptor and also described in Module Developer's Guide.md.

Together with this I have also added parametrized tests for end-to-end testing, and made a minor change in the macho module, where the dylib version was represented as an integer. It makes much more sense to have this as a string (the last two digits represent the patch version, the previous two the minor version, and the rest the major version number). cc @latonis
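For context, a minimal sketch of that conversion, assuming the standard Mach-O packing (major version in the high 16 bits, minor and patch in one byte each; the helper name is hypothetical):

```rust
// Hypothetical helper: unpack a Mach-O dylib version integer into the
// "major.minor.patch" string form proposed above. Mach-O stores the
// version as major << 16 | minor << 8 | patch.
fn dylib_version_to_string(v: u32) -> String {
    format!("{}.{}.{}", v >> 16, (v >> 8) & 0xff, v & 0xff)
}

fn main() {
    // 0x00010203 unpacks to major 1, minor 2, patch 3.
    println!("{}", dylib_version_to_string(0x0001_0203)); // 1.2.3
}
```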

So far I have added the validity flag to the macho and lnk modules. Integer field representation descriptors were added only to the macho module. I think module authors should do both while developing a module, so feel free to add them to other modules that produce output, or to change existing modules.

@plusvic
Member

plusvic commented Nov 7, 2023

The API proposed for the yara-x-dump crate is not the most appropriate one in my opinion, as it creates a dependency on the yara-x crate itself, because it needs the compiler and scanner for producing the module's output. It looks like the thought process was: I need to dump the information produced by a list of module names, let's create an API that does exactly that, put it as an independent crate, and call it from yara-cli.

The thought process should be quite the opposite: I need to dump the information produced by a list of module names; which abstractions or building blocks do I need for implementing this in yara-cli?

The most obvious abstraction to me is: given a &dyn MessageDyn containing the data and specification for a protobuf message, produce a text representation of it in one of the supported formats. For the JSON format it's straightforward, as we already have the protobuf_json_mapping crate. For very basic YAML output we could simply use serde_yaml, but we want a human-friendly YAML output that can be controlled from the corresponding .proto files, so that's where our crate should specialise.

This is the right abstraction: a crate that is able to produce YAML output for &dyn MessageDyn messages produced by the protobuf crate, while allowing customization of the produced YAML via protobuf options. This has the potential of becoming a fully standalone crate that anyone can use in their own projects!

The YAML protobuf customization options don't need to be in the yara.proto file (in the yara-x-proto crate). They could reside in some other yaml.proto file located in this independent crate, which would allow implementing it as a truly independent crate that doesn't depend on YARA-X at all, not even on the tiny yara-x-proto crate. But for now we could simply use the yara.proto file.

Regarding the protobuf customization options, the boolean hex_value is ok ([(yara.field_options).hex_value = true]), but I'm thinking of a more flexible mechanism, like [(yara.field_options).yaml_fmt = "x"]. Another format option could be "t" for a timestamp, and "f:SomeEnumType" for fields that are a set of flags represented by a protobuf enum named SomeEnumType. Example:

enum MyFlags {
   Foo = 0x01; 
   Bar = 0x02;
   Baz = 0x04;
}

message MyStruct {
   MyFlags flags = 1 [(yara.field_options).yaml_fmt = "f:MyFlags"];
}

The YAML output for this field could be:

flags: 0x3   #  Foo | Bar

Notice the additional comment dissecting the flags.
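A stdlib-only sketch of what such an "f:SomeEnumType" renderer could do. The (bit, name) table here is written by hand; the real crate would derive it from the protobuf enum named after the colon:

```rust
// Render a flags field as hex, plus a comment naming each set bit.
// The (bit, name) table is hypothetical; in practice it would come
// from the protobuf enum descriptor referenced by the format option.
fn format_flags(value: u32, flags: &[(u32, &str)]) -> String {
    let names: Vec<&str> = flags
        .iter()
        .filter(|(bit, _)| value & bit != 0)
        .map(|(_, name)| *name)
        .collect();
    if names.is_empty() {
        format!("0x{:x}", value)
    } else {
        format!("0x{:x}  # {}", value, names.join(" | "))
    }
}

fn main() {
    let my_flags = [(0x01, "Foo"), (0x02, "Bar"), (0x04, "Baz")];
    println!("flags: {}", format_flags(0x3, &my_flags));
    // flags: 0x3  # Foo | Bar
}
```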

Of course, this crate would also support colorful YAML output using ANSI escape codes, which is something independent from the formatting options (we may want a human-friendly YAML output without colors for saving it to a text file).

With these building blocks, implementing the yr dump command in the CLI is very easy, but we need to implement the building blocks first, and they should be implemented in a way that is as independent from YARA as possible.

@plusvic plusvic closed this Nov 7, 2023
@plusvic plusvic reopened this Nov 7, 2023
@plusvic
Member

plusvic commented Nov 8, 2023

Let's go back to the drawing board...

The only thing we really need here is a crate (let's call it yara-x-proto-yaml) that does exactly one thing: take a &dyn MessageDyn and return a String with the message encoded in YAML format. Nothing else.

This crate doesn't need to produce JSON, or anything else, it's just a protobuf -> YAML converter. The two special features we need in this crate are:

  1. Optional colors in the YAML output (no colors by default)
  2. Your .proto files can use options for controlling some aspects of the YAML generation, like whether an integer should be rendered in hex, or whether a field is a timestamp that should be accompanied by a comment with the date and time in a human-readable format.
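Feature 2 could be sketched as follows. This is only an illustration of the timestamp option (the function name is hypothetical), using a hand-rolled days-to-date conversion (Howard Hinnant's civil-from-days algorithm) so that nothing beyond the standard library is needed:

```rust
// Illustration of a "t" format option: print the raw integer and append
// a comment with the UTC date and time, without pulling in a date crate.
fn format_timestamp(ts: i64) -> String {
    let days = ts.div_euclid(86_400);
    let secs = ts.rem_euclid(86_400);
    // Convert days since 1970-01-01 into year/month/day
    // (Howard Hinnant's civil-from-days algorithm).
    let z = days + 719_468;
    let era = z.div_euclid(146_097);
    let doe = z.rem_euclid(146_097);
    let yoe = (doe - doe / 1460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
    let mp = (5 * doy + 2) / 153;
    let d = doy - (153 * mp + 2) / 5 + 1;
    let m = if mp < 10 { mp + 3 } else { mp - 9 };
    let y = yoe + era * 400 + if m <= 2 { 1 } else { 0 };
    format!(
        "{} # {:04}-{:02}-{:02} {:02}:{:02}:{:02} UTC",
        ts, y, m, d, secs / 3600, (secs / 60) % 60, secs % 60
    )
}

fn main() {
    println!("timestamp: {}", format_timestamp(123_456_789));
    // timestamp: 123456789 # 1973-11-29 21:33:09 UTC
}
```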

Once we have this, the remaining bits are very straightforward.

@TommYDeeee
Contributor Author

Let's go back to the drawing board...

The only thing we really need here is a crate (let's call it yara-x-proto-yaml) that does exactly one thing: take a &dyn MessageDyn and return a String with the message encoded in YAML format. Nothing else.

This crate doesn't need to produce JSON, or anything else, it's just a protobuf -> YAML converter. The two special features we need in this crate are:

  1. Optional colors in the YAML output (no colors by default)
  2. Your .proto files can use options for controlling some aspects of the YAML generation, like whether an integer should be rendered in hex, or whether a field is a timestamp that should be accompanied by a comment with the date and time in a human-readable format.

Once we have this, the remaining bits are very straightforward.

Thank you for the suggestions. I basically agree with everything.

This crate doesn't need to produce JSON, or anything else, it's just a protobuf -> YAML converter

I have removed all other forms of serialization and cleaned up the crate to provide only protobuf -> YAML conversion.

Optional colors in the YAML output (no colors by default)

This is implemented via the yansi::Paint::disable() method, which turns off all ANSI colors; it works for both YAML and JSON. I have also implemented colors for JSON using an external library just for this; it also pretty-prints the string, so even with colors disabled the output is more readable.

Your .proto files can use options for controlling some aspects of the YAML generation, like whether an integer should be rendered in hex, or whether a field is a timestamp that should be accompanied by a comment with the date and time in a human-readable format.

I have done this as you suggested and used one unified (yara.field_options).yaml_fmt descriptor that takes a string as input. Right now I only look for "x" for hexadecimal and "t" for timestamps; this can be easily extended in the future. The hexadecimal value replaces the original integer one, and the timestamp is printed as a comment next to the integer version. In the future I would also like to implement the flags, but at this point I need to figure out how to get an enum with a specified name from the code itself; it looks like there is only a getter for fields.
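As an illustration, the dispatch over the yaml_fmt string could be as simple as this (function name hypothetical, "t" elided):

```rust
// Hypothetical dispatcher for the yaml_fmt option: "x" renders the
// integer in hex, "t" would append a human-readable UTC comment
// (omitted here), anything else falls back to plain decimal.
fn apply_yaml_fmt(fmt: &str, value: u64) -> String {
    match fmt {
        "x" => format!("0x{:x}", value),
        _ => value.to_string(),
    }
}

fn main() {
    println!("{}", apply_yaml_fmt("x", 456)); // 0x1c8
    println!("{}", apply_yaml_fmt("", 456)); // 456
}
```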

In case you have any other notes/suggestions feel free to let me know.

@plusvic
Member

plusvic commented Nov 9, 2023

This is much better now, but it still needs a bit of polishing. I'm refactoring a few things to provide a mechanism for passing arbitrary data to any module by name, without having to use the trick of creating a dummy rule that imports the module. Once I have it, we will be able to remove that part of the code and make it cleaner.

In the meantime I'll review other aspects of the code.

.help("Name of the module or comma-separated list of modules to be used for parsing")
.required(false)
.action(ArgAction::Append)
.value_parser(get_builtin_modules_names()),
Member

@plusvic plusvic Nov 9, 2023


Not all modules are able to produce a result for the dump command; for instance, the time module doesn't produce anything. So, including the whole list of built-in modules is not the best option here. We'd better remove the get_builtin_modules_names function and include here a fixed list of allowed modules, which should contain only the modules that parse some file format and produce metadata about it.

Contributor Author


I have thought about this, but isn't maintaining another list of possible modules an unnecessary addition? Right now the modules that do not produce any output are filtered by mod_output.compute_size_dyn() != 0. I am not sure whether this is a performance issue; if it is, I can make a list of just the modules that make sense.

Member


In this case I think it's better to be explicit about the list of modules that are supported by the dump command at the CLI level, instead of taking the whole list of built-in modules from get_builtin_modules_names. There are a variety of reasons for that; for example, there are modules only for testing, like test_proto2 and test_proto3. Those actually produce results, but we don't want them as possible options for the dump command. Also, we may at some point have modules that are there but are not stable enough for public use. So, it's much simpler if the possible arguments for the dump command are explicitly controlled where the command is implemented.

// # Returns
//
// * `true` if the module output is valid, `false` otherwise.
fn module_is_valid(mod_output: &dyn MessageDyn) -> bool {
Member


I'm not quite convinced about all this logic that determines whether the output of a module is valid. It requires that you mark a field in the proto, and makes things more complex with too little gain. If the user uses the command line for dumping the result of the pe module and passes a non-PE file, the result will be a very small YAML/JSON where the is_pe field is set to false; that's ok to me.

The "auto" option, which allows passing only the file and letting the CLI figure out which modules make sense to be taken into account and which not, is a nice addition, but it can be implemented by putting the logic in the CLI tool itself, instead of having to bake all that validity checking into yara-x itself.

The steps the CLI would take are:

  1. Given some &[u8] with the content of the file, ask yara-x to produce the output for module "pe" (or any other module name). This is the logic I'm implementing now, and it will return a protobuf.
  2. As the CLI already has access to the protobuf, the validation logic can be executed by the CLI before passing the protobuf to the serializer to obtain a text representation of it.
  3. Based on the requested output format, use protobuf_json_mapping::print_to_string or our own YAML serializer.

Contributor Author


If the user uses the command line for dumping the result of the pe module and passes a non-PE file, the result will be a very small YAML/JSON where the is_pe field is set to false; that's ok to me.

This is already implemented if the user manually specifies the module he wants to use. The output will be from the given module even when is_pe or any other flag is set to false or not set at all. All the complexity comes from automatic selection. Some modules such as pe or lnk have a built-in validity flag (is_pe or is_lnk), but macho (and others) don't have this option. Therefore it is right now a bit module-specific, and I wanted a universal solution that works by marking a certain field as a flag. This is unified across all modules and easy to validate. But I am open to other solutions that don't require adding another specification to the protobuf itself.

As the CLI already have access to the protobuf, it the validation logic can be executed by the CLI before passing the protobuf to the serializer of obtaining a text representation of it.

I am not sure what the dependency on yara-x is right now. The module_is_valid() function takes a protobuf and just checks the flag. I am not sure what another option would be. Without this flag I could manually traverse the protobuf and search for some kind of field that would tell me "ok, this is valid", but as it differs across modules it would be a bit complex, and with every new module there would be a chance that something has to be added to the code.

Member

@plusvic plusvic Nov 10, 2023


This is already implemented if the user manually specifies the module he wants to use. The output will be from the given module even when is_pe or any other flag is set to false or not set at all. All the complexity comes from automatic selection. Some modules such as pe or lnk have a built-in validity flag (is_pe or is_lnk), but macho (and others) don't have this option. Therefore it is right now a bit module-specific, and I wanted a universal solution that works by marking a certain field as a flag. This is unified across all modules and easy to validate. But I am open to other solutions that don't require adding another specification to the protobuf itself.

I like the feature. I like that the dump command produces output only for the file types that make sense, instead of blindly dumping the results of all modules. What I don't like is that this functionality is embedded in yara-x, by means of having to declare in the .proto a field that determines whether the output is valid or not.
I believe that all those checks also belong in the CLI tool itself, not in the yara-x library.

I am not sure what the dependency on yara-x is right now. The module_is_valid() function takes a protobuf and just checks the flag. I am not sure what another option would be. Without this flag I could manually traverse the protobuf and search for some kind of field that would tell me "ok, this is valid", but as it differs across modules it would be a bit complex, and with every new module there would be a chance that something has to be added to the code.

What I mean by a dependency on yara-x is that the CLI tool, in order to determine which modules it should include in the output, relies on the validity flag provided by the yara-x library itself. This validity logic is something that probably only the CLI will use, and it's better implemented at the CLI level. The problem you have now is that it's impossible to inspect the result produced by each module at the CLI level, and therefore you don't have access to the is_pe or is_lnk fields, but that's one of the problems I plan to solve in the API I'm working on.

@plusvic
Member

plusvic commented Nov 10, 2023

Take a look at this PR: #52

It introduces two new functions that allow invoking a module without having to create a dummy YARA rule. Basically you would do something like:

let elf_info = yara_x::mods::invoke_mod::<yara_x::mods::ELF>(data)?;
let lnk_info = yara_x::mods::invoke_mod::<yara_x::mods::Lnk>(data)?;

With that you will get the Rust structure corresponding to each module. There's another version that returns Box<&dyn MessageDyn>

let elf_info = yara_x::mods::invoke_mod_dyn::<yara_x::mods::ELF>(data)?;
let lnk_info = yara_x::mods::invoke_mod_dyn::<yara_x::mods::Lnk>(data)?;

@TommYDeeee
Contributor Author

Take a look at this PR: #52

It introduces two new functions that allow invoking a module without having to create a dummy YARA rule. Basically you would do something like:

let elf_info = yara_x::mods::invoke_mod::<yara_x::mods::ELF>(data)?;
let lnk_info = yara_x::mods::invoke_mod::<yara_x::mods::Lnk>(data)?;

With that you will get the Rust structure corresponding to each module. There's another version that returns Box<&dyn MessageDyn>

let elf_info = yara_x::mods::invoke_mod_dyn::<yara_x::mods::ELF>(data)?;
let lnk_info = yara_x::mods::invoke_mod_dyn::<yara_x::mods::Lnk>(data)?;

Thank you, I have just pushed the latest version, which makes use of this new API and should incorporate all of your suggestions.

Member

@plusvic plusvic left a comment


I've been reviewing this PR more in-depth and found some other issues. The most important issue is that nested structures are not rendered correctly. For example, if I add another nested structure to OptionalNested, like this...

message OptionalNested {
  optional uint32 onested1 = 1;
  optional uint64 onested2 = 2 [(dumper.field_options).yaml_fmt = "x"];
  map<string, string> map_string_string = 3;
  optional Nested2 nested = 4;
}

message Nested2 {
  optional string foo = 1;
}

The produced YAML is not correct:

field1: 0x7b
field2: "test"
field3: "test\ntest"
segments:
  - nested1: 456
    nested2: 0x315
    timestamp: 123456789 # 1973-11-29 21:33:09 UTC
  - nested1: 100000
    nested2: 0x30d40
    timestamp: 999999999 # 2001-09-09 01:46:39 UTC
optional:
onested1: 123
    onested2: 0x1c8
    map_string_string:
        "foo\nfoo": "bar\nbar"
    nested:
foo: "foo"

Notice that the foo field (which belongs to nested) is not correctly aligned under nested. In light of these findings I decided to give it a try and draft my own implementation of the protobuf-to-YAML serializer:

https://github.com/VirusTotal/yara-x/commits/yaml-serializer

My implementation is only an incomplete draft, but it solves the issue with nested structures and tries to address some other problems with the API design, for example using a writer instead of a String for obtaining the final YAML. This is more flexible and can be used for outputting the YAML directly to a file without having to allocate a string containing the whole YAML.
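The two API ideas mentioned above, writing into any io::Write and tracking the indentation level across nested messages, can be sketched with a stand-in value type (this is an illustration under those assumptions, not the draft's actual API):

```rust
use std::io::{self, Write};

// Stand-in for the protobuf reflection values the real serializer walks.
enum Value {
    Str(String),
    Map(Vec<(String, Value)>),
}

// Serialize into any io::Write (a Vec<u8>, a file, stdout...), carrying
// the current indentation level so nested fields stay aligned.
fn write_map<W: Write>(w: &mut W, fields: &[(String, Value)], indent: usize) -> io::Result<()> {
    for (name, value) in fields {
        write!(w, "{}{}:", " ".repeat(indent), name)?;
        match value {
            Value::Str(s) => writeln!(w, " \"{}\"", s)?,
            Value::Map(nested) => {
                writeln!(w)?; // key on its own line, children indented below
                write_map(w, nested, indent + 4)?;
            }
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let msg = vec![(
        "optional".to_string(),
        Value::Map(vec![
            ("onested1".to_string(), Value::Str("123".to_string())),
            (
                "nested".to_string(),
                Value::Map(vec![("foo".to_string(), Value::Str("foo".to_string()))]),
            ),
        ]),
    )];
    let mut out = Vec::new();
    write_map(&mut out, &msg, 0)?;
    io::stdout().write_all(&out)
}
```

Here foo is emitted one level deeper than nested, which is exactly the alignment that was broken in the output above.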

@TommYDeeee TommYDeeee mentioned this pull request Nov 22, 2023
@TommYDeeee
Contributor Author

Hi Victor, thank you for your review. I have addressed your issues and added all the features into your template in #53. We can continue to discuss it there. Thank you!

@TommYDeeee
Contributor Author

Merged in #53

@TommYDeeee TommYDeeee closed this Nov 23, 2023