refactor: improve implementation of regex repetitions #221

Merged
merged 7 commits into from
Oct 9, 2024
6 changes: 6 additions & 0 deletions lib/src/compiler/tests/testdata/errors/138.in
@@ -0,0 +1,6 @@
rule test {
strings:
$a = /abcd((efg){0,10000}){0,10000}/
condition:
$a
}
6 changes: 6 additions & 0 deletions lib/src/compiler/tests/testdata/errors/138.out
@@ -0,0 +1,6 @@
error[E014]: invalid regular expression
--> line:3:3
|
3 | $a = /abcd((efg){0,10000}){0,10000}/
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ regexp is too large
|
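
A rough illustration of why this test pattern is a useful stress case: nesting one bounded repetition inside another multiplies the bounds, so the longest string the pattern can match is already hundreds of millions of bytes. The sketch below only does that arithmetic. The function name `max_match_len` is hypothetical, and the diff does not show how the compiler actually measures regexp size or what its limit is, so this is not a description of the real check.

```rust
/// Longest match, in bytes, of a pattern shaped like
/// `prefix(unit{0,inner}){0,outer}`.
/// Returns `None` if the arithmetic overflows `u64`.
/// Hypothetical helper for illustration only; not part of the PR.
fn max_match_len(prefix: u64, unit: u64, inner: u64, outer: u64) -> Option<u64> {
    unit.checked_mul(inner)?.checked_mul(outer)?.checked_add(prefix)
}

fn main() {
    // /abcd((efg){0,10000}){0,10000}/ : "abcd" is 4 bytes, "efg" is 3 bytes.
    let len = max_match_len(4, 3, 10_000, 10_000);
    println!("{:?}", len);
}
```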
200 changes: 113 additions & 87 deletions lib/src/re/bitmapset.rs
@@ -1,109 +1,126 @@
use bitvec::vec::BitVec;
use rustc_hash::FxHashSet;
use std::hash::Hash;

/// A high-performance set of `usize` values.
/// A high-performance set of (`usize`, T) pairs.
///
/// As in any set, the values are guaranteed to be unique, the `insert`
/// operation is a no-op if the new value already exists in the set.
/// Additionally, this type supports iterating the values in insertion order.
/// As in any set, the pairs are guaranteed to be unique, the `insert`
/// operation is a no-op if the new pair already exists in the set.
/// Additionally, this type supports iterating the pairs in insertion order.
///
/// The distinguishing feature of this set lies in its utilization of bitmaps
/// for efficient membership checks. However, practical limitations prevent
/// having a bitmap with one bit per possible `usize` value, spanning from 0 to
/// `usize::MAX`. Instead, positions in the bitmap are determined relative to
/// the initial value inserted in the set. For instance, if the first value is
/// `1234`, the first bitmap bit corresponds to `1234`, the second to `1235`,
/// the third to `1236`, and so on. A separate bitmap is maintained for values
/// lower than the initial one, with `1233` represented as the first bit in
/// this other bitmap. Both bitmaps dynamically expand to accommodate newly
/// inserted values.
/// for checking if the `usize` key in a pair already exists in the set.
/// However, practical limitations prevent having a bitmap with one bit per
/// possible `usize` value, spanning from 0 to `usize::MAX`. Instead, positions
/// in the bitmap are determined relative to the initial key inserted in the
/// set. For instance, if the first value is (`1234`, T), the first bitmap bit
/// corresponds to key `1234`, the second to key `1235`, the third to key
/// `1236`, and so on. A separate bitmap is maintained for keys lower than
/// the initial one, with `1233` represented as the first bit in this other
/// bitmap. Both bitmaps dynamically expand to accommodate newly inserted
/// values.
///
/// `BitmapSet` works well with values that are close to each other. Outliers
/// `BitmapSet` works well with keys that are close to each other. Outliers
/// can make the memory required for storing the bitmaps to grow very quickly.
/// Another property of this type is that values inserted in the set can be
/// iterated in insertion order.
#[derive(Debug, PartialEq, Default)]
pub(crate) struct BitmapSet {
// Vector that contains the values in the set, in insertion order.
values: Vec<usize>,
// First value inserted in the set.
initial_value: usize,
// Bitmap for values that are > initial_value.
#[derive(Debug, Default)]
pub(crate) struct BitmapSet<T>
where
T: Default + Copy + PartialEq + Eq + Hash,
{
// Vector that contains the (key,value) pairs in the set, in insertion
// order.
items: Vec<(usize, T)>,
// Set that contains the (key,value) pairs.
set: FxHashSet<(usize, T)>,
// Bitmap for keys that are > initial_key.
p_bitmap: BitVec<usize>,
// Bitmap for values that are < initial_value.
// Bitmap for keys that are < initial_key.
n_bitmap: BitVec<usize>,
}

impl BitmapSet {
impl<T> BitmapSet<T>
where
T: Default + Copy + PartialEq + Eq + Hash,
{
pub const MAX_OFFSET: usize = 524288;

pub fn new() -> Self {
Self {
values: Vec::new(),
initial_value: 0,
items: Vec::new(),
set: FxHashSet::default(),
p_bitmap: BitVec::repeat(false, 1024),
n_bitmap: BitVec::repeat(false, 1024),
}
}

/// Adds a value to the set.
/// Adds a (key,value) pair to the set.
///
/// Returns `true` if the value didn't exist in the set and was added, and
/// `false` if the value already existed.
/// Returns `true` if the (key,value) pair didn't exist in the set and was
/// added, and `false` if the pair already existed.
///
/// # Panics
///
/// If `value` is too far from the first value added to the set.
/// Specifically, it panics when `abs(value - initial_value) >= MAX_OFFSET`
/// If `key` is too far from the first key added to the set.
/// Specifically, it panics when `abs(key - initial_key) >= MAX_OFFSET`
///
#[inline]
pub fn insert(&mut self, value: usize) -> bool {
// Special case when the set is totally empty.
if self.values.is_empty() {
self.initial_value = value;
self.values.push(value);
return true;
}
// Special case where the new value is equal to the first value
// added to the set. We don't need to spare a bit on this value.
if self.initial_value == value {
pub fn insert(&mut self, key: usize, value: T) -> bool {
let first = match self.items.first() {
Some(first) => first,
None => {
// The set is empty, store the first item and return.
self.items.push((key, value));
return true;
}
};

// Special case when the new (key,value) pair is equal to the
// first one added to the set.
if first.0 == key && first.1 == value {
return false;
}

let offset = value as isize - self.initial_value as isize;
let offset = key as isize - first.0 as isize;

match offset {
offset if offset < 0 => {
let offset = -offset as usize;
let offset = (-offset as usize) - 1;
unsafe {
if self.n_bitmap.len() <= offset {
assert!(offset < Self::MAX_OFFSET);
self.n_bitmap.resize(offset + 1, false);
self.n_bitmap.set_unchecked(offset, true);
self.values.push(value);
true
self.items.push((key, value));
self.set.insert((key, value))
} else if !*self.n_bitmap.get_unchecked(offset) {
self.n_bitmap.set_unchecked(offset, true);
self.values.push(value);
self.items.push((key, value));
self.set.insert((key, value))
} else if self.set.insert((key, value)) {
self.items.push((key, value));
true
} else {
false
}
}
}
offset => {
// At this point `offset` cannot be zero, it's safe to subtract
// 1 so that the first bit in the `p_bitmap` is used.
let offset = offset as usize - 1;
let offset = offset as usize;
unsafe {
if self.p_bitmap.len() <= offset {
assert!(offset < Self::MAX_OFFSET);
self.p_bitmap.resize(offset + 1, false);
self.p_bitmap.set_unchecked(offset, true);
self.values.push(value);
true
self.items.push((key, value));
self.set.insert((key, value))
} else if !*self.p_bitmap.get_unchecked(offset) {
self.p_bitmap.set_unchecked(offset, true);
self.values.push(value);
self.items.push((key, value));
self.set.insert((key, value))
} else if self.set.insert((key, value)) {
self.items.push((key, value));
true
} else {
false
@@ -115,39 +132,35 @@ impl BitmapSet {

#[inline]
pub fn is_empty(&self) -> bool {
self.values.is_empty()
self.items.is_empty()
}

/// Removes all values in the set.
#[inline]
pub fn clear(&mut self) {
for thread in self.values.drain(0..) {
let offset = thread as isize - self.initial_value as isize;
let first_key = match self.items.first() {
Some(first) => first.0,
None => return,
};
for (key, _) in self.items.drain(0..) {
let offset = key as isize - first_key as isize;
match offset {
offset if offset > 0 => {
self.p_bitmap.set((offset - 1) as usize, false);
}
offset if offset < 0 => {
self.n_bitmap.set((-offset) as usize, false);
self.n_bitmap.set(((-offset) as usize) - 1, false);
}
_ => {
// when `offset` is 0 there's no bit to clear, the initial
// value doesn't have a bit in neither of the bitmaps.
offset => {
self.p_bitmap.set(offset as usize, false);
}
}
}
self.set.clear();
}

/// Returns an iterator for the items in the set.
///
/// Items are returned in insertion order.
pub fn iter(&self) -> impl Iterator<Item = &usize> {
self.values.iter()
}

#[cfg(test)]
pub fn into_vec(self) -> Vec<usize> {
self.values
pub fn iter(&self) -> impl Iterator<Item = &(usize, T)> {
self.items.iter()
}
}

@@ -159,28 +172,41 @@ mod tests {
fn thread_set() {
let mut s = BitmapSet::new();

assert!(s.insert(4));
assert!(s.insert(2));
assert!(s.insert(10));
assert!(s.insert(0));
assert!(s.insert(2000));

assert!(!s.insert(4));
assert!(!s.insert(2));
assert!(!s.insert(10));
assert!(!s.insert(0));
assert!(!s.insert(2000));

assert_eq!(s.values, vec![4, 2, 10, 0, 2000]);
assert!(s.insert(4, 0));
assert!(s.insert(2, 0));
assert!(s.insert(3, 0));
assert!(s.insert(10, 0));
assert!(s.insert(0, 0));
assert!(s.insert(2000, 0));

assert!(!s.insert(4, 0));
assert!(!s.insert(2, 0));
assert!(!s.insert(3, 0));
assert!(!s.insert(10, 0));
assert!(!s.insert(0, 0));
assert!(!s.insert(2000, 0));
assert!(s.insert(4, 1));
assert!(!s.insert(4, 1));

assert_eq!(
s.items,
vec![(4, 0), (2, 0), (3, 0), (10, 0), (0, 0), (2000, 0), (4, 1)]
);

s.clear();

assert!(s.insert(200));
assert!(s.insert(2));
assert!(s.insert(10));
assert!(s.insert(300));
assert!(s.insert(250));
assert_eq!(s.p_bitmap.count_ones(), 0);
assert_eq!(s.n_bitmap.count_ones(), 0);

assert!(s.insert(200, 0));
assert!(s.insert(3, 0));
assert!(s.insert(10, 0));
assert!(s.insert(300, 0));
assert!(s.insert(250, 0));

assert_eq!(s.values, vec![200, 2, 10, 300, 250]);
assert_eq!(
s.items,
vec![(200, 0), (3, 0), (10, 0), (300, 0), (250, 0)]
);
}
}
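
Because the old and new lines are interleaved above, the behaviour of the new `insert` can be hard to follow. The following is a minimal sketch of the new-side logic only, under several simplifying assumptions: `SketchSet` is a hypothetical name, `Vec<bool>` stands in for `BitVec`, the standard `HashSet` stands in for `FxHashSet`, and the `unsafe` unchecked accesses and the `MAX_OFFSET` assertion are left out. It is meant to show how keys map to bitmap positions relative to the first key, not to reproduce the PR's implementation.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Simplified stand-in for the new `BitmapSet<T>` (hypothetical type).
struct SketchSet<T: Copy + Eq + Hash> {
    items: Vec<(usize, T)>,     // pairs in insertion order
    set: HashSet<(usize, T)>,   // pairs inserted after the first one
    p_bitmap: Vec<bool>,        // bit i <=> key == first_key + i
    n_bitmap: Vec<bool>,        // bit i <=> key == first_key - 1 - i
}

impl<T: Copy + Eq + Hash> SketchSet<T> {
    fn new() -> Self {
        Self {
            items: Vec::new(),
            set: HashSet::new(),
            p_bitmap: Vec::new(),
            n_bitmap: Vec::new(),
        }
    }

    fn insert(&mut self, key: usize, value: T) -> bool {
        let first = match self.items.first() {
            Some(first) => *first,
            None => {
                // Empty set: the first pair is stored without touching
                // the bitmaps or the hash set.
                self.items.push((key, value));
                return true;
            }
        };
        // The first pair is not in `set`, so it is compared directly.
        if first == (key, value) {
            return false;
        }
        // Pick the bitmap and the bit position relative to the first key.
        let (bitmap, bit) = if key < first.0 {
            (&mut self.n_bitmap, first.0 - key - 1)
        } else {
            (&mut self.p_bitmap, key - first.0)
        };
        if bitmap.len() <= bit {
            bitmap.resize(bit + 1, false);
        }
        let key_seen = bitmap[bit];
        bitmap[bit] = true;

        if !key_seen {
            // Unseen key: the pair cannot be a duplicate.
            self.set.insert((key, value));
            self.items.push((key, value));
            true
        } else if self.set.insert((key, value)) {
            // Known key paired with a new value.
            self.items.push((key, value));
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut s = SketchSet::new();
    assert!(s.insert(4, 0));
    assert!(s.insert(2, 0));
    assert!(!s.insert(2, 0));
    assert!(s.insert(4, 1)); // same key as the first pair, different value
    assert!(!s.insert(4, 1));
    println!("{:?}", s.items);
}
```

Note that in this sketch, as in the new code above, a key equal to the first key uses bit 0 of `p_bitmap`, while the key just below it uses bit 0 of `n_bitmap`.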