doc: improve documentation for the re module.
plusvic committed Sep 3, 2023
1 parent 70aab17 commit e970c9e
Showing 4 changed files with 63 additions and 20 deletions.
29 changes: 16 additions & 13 deletions yara-x/src/re/fast/compiler.rs
@@ -10,18 +10,18 @@ use crate::re;
 use crate::re::fast::instr::Instr;
 use crate::re::{BckCodeLoc, Error, FwdCodeLoc, RegexpAtom};

-/// A compiler that takes a [`re::hir::Hir`] and produces code for the
-/// VM represented by [`re::fast::FastVM`].
-///
-/// This compiler accepts only a subset of the regular expressions.
-///
+/// A compiler that takes a [`re::hir::Hir`] and produces code for
+/// [`re::fast::FastVM`].
 pub(crate) struct Compiler {}

 impl Compiler {
+    /// Creates a new compiler.
     pub fn new() -> Self {
         Self {}
     }

+    /// Compiles the regular expression represented by the given [`Hir`]
+    /// and appends the produced code to a vector.
     pub fn compile(
         mut self,
         hir: &re::hir::Hir,
@@ -214,15 +214,16 @@ impl Compiler {

 /// Represents the pieces in which patterns are decomposed during compilation.
 ///
-/// Patterns accepted by the Fast VM can be decomposed into a sequence of
-/// pieces where each piece is either a literal, a masked literal, or a jump.
-/// For example, the pattern `{ 01 02 03 [0-2] 04 0? 06 }` is decomposed into
-/// the sequence:
+/// Patterns accepted by the Fast VM can be decomposed into a sequence of pieces
+/// where each piece is either a literal, a masked literal, an alternation, or a
+/// jump.
+///
+/// For instance, the pattern `{ 01 02 03 [0-2] 04 0? 06 }` is decomposed into:
 ///
 /// ```text
-/// Literal([01, 02, 03])
+/// Pattern(Literal([01, 02, 03]))
 /// Jump(0,2)
-/// MaskedLiteral([04, 00, 06], [FF, F0, FF])
+/// Pattern(Masked([04, 00, 06], [FF, F0, FF]))
 /// ```
 enum PatternPiece {
     Pattern(Pattern),
@@ -236,7 +237,8 @@ enum Pattern {
     Masked(Vec<u8>, Vec<u8>),
 }

-/// Given the HIR for a regexp pattern, decomposed it in [`PatternPiece`]s.
+/// Given the [`Hir`] for a regexp pattern, decomposes it into
+/// [`PatternPiece`]s.
 struct PatternSplitter {
     bytes: Vec<u8>,
     mask: Vec<u8>,
@@ -400,7 +402,8 @@ impl Visitor for PatternSplitter {
     }
 }

-/// A sequence of instructions for the Fast VM.
+/// Helper type for emitting a sequence of instructions for
+/// [`re::fast::fastvm::FastVM`].
 #[derive(Default)]
 struct InstrSeq {
     seq: Cursor<Vec<u8>>,
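
The decomposition described in the `PatternPiece` documentation can be illustrated with a small standalone sketch. It is not the `PatternSplitter` from this commit, which walks the HIR through a visitor and is not shown in the diff. The names `Token`, `Piece`, `flush`, `split` and `demo_pieces` are hypothetical, alternations are left out, and the `(value, mask)` layout is assumed from the `Masked([04, 00, 06], [FF, F0, FF])` example in the doc comment.

```rust
/// Hypothetical input for this sketch: one element of a hex pattern.
#[derive(Debug, Clone, Copy)]
enum Token {
    /// A fully specified byte such as `04`.
    Byte(u8),
    /// A byte with wildcard nibbles as (value, mask); `0?` is (0x00, 0xF0).
    MaskedByte(u8, u8),
    /// A jump such as `[0-2]` as (min, max).
    Jump(u16, u16),
}

/// Simplified stand-in for the `PatternPiece`/`Pattern` pair in the diff.
#[derive(Debug, PartialEq)]
enum Piece {
    Literal(Vec<u8>),
    Masked(Vec<u8>, Vec<u8>),
    Jump(u16, u16),
}

/// Emits the bytes accumulated so far as a single piece, if there are any.
fn flush(bytes: &mut Vec<u8>, mask: &mut Vec<u8>, out: &mut Vec<Piece>) {
    if bytes.is_empty() {
        return;
    }
    let b = std::mem::take(bytes);
    let m = std::mem::take(mask);
    // A run whose mask is all 0xFF carries no wildcards: plain literal.
    if m.iter().all(|&x| x == 0xFF) {
        out.push(Piece::Literal(b));
    } else {
        out.push(Piece::Masked(b, m));
    }
}

/// Splits a token sequence into literal, masked and jump pieces.
fn split(tokens: &[Token]) -> Vec<Piece> {
    let mut bytes = Vec::new();
    let mut mask = Vec::new();
    let mut out = Vec::new();
    for &token in tokens {
        match token {
            Token::Byte(b) => {
                bytes.push(b);
                mask.push(0xFF);
            }
            Token::MaskedByte(value, m) => {
                bytes.push(value & m);
                mask.push(m);
            }
            // A jump ends the current run of bytes.
            Token::Jump(min, max) => {
                flush(&mut bytes, &mut mask, &mut out);
                out.push(Piece::Jump(min, max));
            }
        }
    }
    flush(&mut bytes, &mut mask, &mut out);
    out
}

/// The pattern `{ 01 02 03 [0-2] 04 0? 06 }` from the doc comment.
fn demo_pieces() -> Vec<Piece> {
    split(&[
        Token::Byte(0x01),
        Token::Byte(0x02),
        Token::Byte(0x03),
        Token::Jump(0, 2),
        Token::Byte(0x04),
        Token::MaskedByte(0x00, 0xF0),
        Token::Byte(0x06),
    ])
}
```

Under these assumptions `demo_pieces()` yields a literal, a jump and a masked piece, in the same order as the `text` block in the doc comment.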
15 changes: 10 additions & 5 deletions yara-x/src/re/fast/fastvm.rs
@@ -8,19 +8,24 @@ use memx::memeq;
 use crate::re::fast::instr::{Instr, InstrParser};
 use crate::re::{Action, CodeLoc, DEFAULT_SCAN_LIMIT};

-/// Represents a faster alternative to [crate::re::thompson::pikevm::PikeVM]
+/// A faster but less general alternative to [PikeVM].
 ///
-/// A FastVM is similar to a PikeVM, but it is limited to a subset of the
-/// regular expressions.
+/// `FastVM` is a virtual machine that executes bytecode that evaluates
+/// regular expressions, similarly to [PikeVM]. `FastVM` is faster, but
+/// only supports a subset of the regular expressions supported by [PikeVM]
+/// (see more details in the [`crate::re::fast`] module's documentation).
 ///
-/// TODO: finish
+/// [PikeVM]: crate::re::thompson::pikevm::PikeVM
 pub(crate) struct FastVM<'r> {
     /// The code for the VM. Produced by [`crate::re::fast::Compiler`].
     code: &'r [u8],
     /// Maximum number of bytes to scan. The VM will abort after ingesting
     /// this number of bytes from the input.
     scan_limit: usize,
-    /// A set with all the positions currently tracked.
+    /// A set with all the positions within the data that are matching so
+    /// far. `IndexSet` is used instead of `HashSet` because insertion order
+    /// needs to be maintained while iterating the positions and `HashSet`
+    /// doesn't make any guarantees about iteration order.
     positions: IndexSet<usize>,
 }
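
The ordering requirement stated for the `positions` field can be illustrated without the `indexmap` crate. The sketch below is a std-only stand-in written for this note, not the `IndexSet` type itself; `OrderedSet` and `tracked_positions` are hypothetical names. It shows the two properties the doc comment relies on: duplicates are ignored, and iteration follows insertion order.

```rust
use std::collections::HashSet;

/// Std-only stand-in for an insertion-ordered set of positions.
/// (Hypothetical type for this sketch; the commit uses `IndexSet<usize>`.)
#[derive(Default)]
struct OrderedSet {
    /// Fast membership test.
    seen: HashSet<usize>,
    /// Positions in the order they were first inserted.
    order: Vec<usize>,
}

impl OrderedSet {
    /// Inserts `pos`, returning `true` if it was not already present.
    fn insert(&mut self, pos: usize) -> bool {
        let is_new = self.seen.insert(pos);
        if is_new {
            self.order.push(pos);
        }
        is_new
    }

    /// Iterates over the positions in insertion order.
    fn iter(&self) -> std::slice::Iter<'_, usize> {
        self.order.iter()
    }
}

/// Tracks a few positions, some of them repeated, and returns them in the
/// order in which they would be visited.
fn tracked_positions() -> Vec<usize> {
    let mut positions = OrderedSet::default();
    for pos in [7, 3, 7, 12, 3] {
        positions.insert(pos);
    }
    positions.iter().copied().collect()
}
```

With a plain `HashSet<usize>` the same loop would keep the same three positions, but the order in which they come back is unspecified.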

36 changes: 36 additions & 0 deletions yara-x/src/re/fast/mod.rs
@@ -1,3 +1,39 @@
+/*! This module implements [FastVM], a faster but less general alternative
+to [PikeVM], accompanied by a compiler designed to generate code for it.
+
+[FastVM] closely resembles [PikeVM], albeit with certain limitations. It
+exclusively supports regular expressions adhering to the following rules:
+
+- No repetitions are allowed, except when the repeated pattern is any byte. So,
+  `.*` and `.{1,3}` are permitted, but `a*` and `a{1,3}` are not.
+
+- Character classes are disallowed unless they can be represented as masked
+  bytes. For example, `[a-z]` is not supported, but `[Aa]` is, as it can be
+  expressed as `0x41` masked with `0x20` (where `0x41` corresponds to `A`,
+  and applying the mask `0x20` yields `0x61`, representing `a`).
+
+- Alternatives are accepted, provided that the options consist only of literals
+  or character classes equivalent to masked bytes. For example, `(foo|bar)`
+  is supported because both options are literals, and `[Ff]oo|[Bb]ar` is also
+  supported since the byte classes can be expressed as masked bytes.
+
+- Nested alternations are not permitted.
+
+Most regular expressions derived from YARA hex patterns (which are simply a
+subset of regular expressions) are compatible with [FastVM], except when they
+contain alternations that contain variable length jumps
+(e.g., `{ (01 02 03 [1-4] 05 | 06 07 08) }`).
+
+Many standard regular expressions also work with [FastVM].
+
+YARA prioritizes compiling regular expressions for [FastVM] and only resorts
+to [PikeVM] if the compilation fails due to incompatible constructs in the
+regular expression.
+
+[FastVM]: crate::re::fast::fastvm::FastVM
+[PikeVM]: crate::re::thompson::pikevm::PikeVM
+*/
+
 pub(crate) mod fastvm;

 mod compiler;
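
The rules listed in the module documentation, together with the fall-back to PikeVM, can be sketched on a toy regexp tree. Everything here is hypothetical and written for this note: `Node`, `class_to_masked_byte`, `fast_vm_accepts`, `Engine` and `choose_engine` are not part of the crate, and the real compiler reports incompatibility as a compilation error rather than a boolean. The sketch uses the `(value, mask)` convention of the `Masked` example in compiler.rs, where set mask bits must match, so `[Aa]` becomes value `0x41` with mask `0xDF` (every bit except `0x20`). Treating `.` as acceptable inside an alternation is an assumption of the sketch.

```rust
/// Toy regexp tree used only in this sketch; it is not the crate's `Hir`.
#[allow(dead_code)]
enum Node {
    Literal(Vec<u8>),
    /// A byte class given by its members, e.g. `[Aa]` is `b"Aa".to_vec()`.
    Class(Vec<u8>),
    /// Any byte (`.`).
    AnyByte,
    /// Any repetition (`*`, `{1,3}`, ...) of the inner node.
    Repeat(Box<Node>),
    Concat(Vec<Node>),
    Alternation(Vec<Node>),
}

/// Returns `(value, mask)` if the class is exactly the set of bytes `b`
/// with `b & mask == value`, and `None` otherwise.
fn class_to_masked_byte(class: &[u8]) -> Option<(u8, u8)> {
    let mut members = class.to_vec();
    members.sort_unstable();
    members.dedup();
    let first = *members.first()?;
    let and_all = members.iter().fold(first, |acc, &b| acc & b);
    let or_all = members.iter().fold(first, |acc, &b| acc | b);
    // Bits that differ between members must be ignored by the mask.
    let varying = and_all ^ or_all;
    // A masked byte with n ignored bits matches exactly 2^n values, so the
    // class is representable only if it has that many members.
    if members.len() == 1usize << varying.count_ones() {
        Some((and_all, !varying))
    } else {
        None
    }
}

/// Applies the documented rules as read by this sketch. `in_alt` is true
/// while visiting the options of an alternation.
fn fast_vm_accepts(node: &Node, in_alt: bool) -> bool {
    match node {
        Node::Literal(_) => true,
        Node::Class(members) => class_to_masked_byte(members).is_some(),
        // Assumption: `.` behaves like a masked byte with mask 0x00.
        Node::AnyByte => true,
        // Only "any byte" may be repeated, and not inside an alternation.
        Node::Repeat(inner) => !in_alt && matches!(**inner, Node::AnyByte),
        Node::Concat(items) => items.iter().all(|n| fast_vm_accepts(n, in_alt)),
        // Nested alternations are not permitted.
        Node::Alternation(options) => {
            !in_alt && options.iter().all(|n| fast_vm_accepts(n, true))
        }
    }
}

#[derive(Debug, PartialEq)]
enum Engine {
    FastVm,
    PikeVm,
}

/// Prefers the fast engine and falls back to the general one.
fn choose_engine(node: &Node) -> Engine {
    if fast_vm_accepts(node, false) {
        Engine::FastVm
    } else {
        Engine::PikeVm
    }
}
```

In this model `(foo|bar)`, `.*` and `[Aa]` select `Engine::FastVm`, while `a*`, `[a-z]` and an alternation nested inside another select `Engine::PikeVm`.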
3 changes: 1 addition & 2 deletions yara-x/src/re/thompson/mod.rs
@@ -1,5 +1,4 @@
-/*!
-A regexp compiler using the [Thompson's construction][1] algorithm that
+/*! A regexp compiler using the [Thompson's construction][1] algorithm that
 produces code for the Pike VM described in Russ Cox's article
 [Regular Expression Matching: the Virtual Machine Approach][2].