Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance: Consider replacing lookup tables with match statements or binary search in single byte index #125

Open
john-parton opened this issue Aug 12, 2020 · 0 comments

Comments

@john-parton
Copy link

john-parton commented Aug 12, 2020

The current technique for building the single byte "forward" and "backward" function is to generate lookup tables using gen_index.py

Here's an example generated file: https://github.com/lifthrasiir/rust-encoding/blob/master/src/index/singlebyte/windows_1252.rs

There are some benchmarks that are generated, but they're micro-benchmarks with synthetic data, and I'm not sure they adequately capture how the library would be used in the wild.

So I wrote a few tiny benchmarks that exercise the encoder and decoder at the level they're typically used.

/// Some Latin-1 text to test
//
// the first few sentences of the article "An Ghaeilge" from Irish Wikipedia.
// https://ga.wikipedia.org/wiki/An_Ghaeilge
pub static IRISH_TEXT: &'static str =
    "Is ceann de na teangacha Ceilteacha í an Ghaeilge (nó Gaeilge na hÉireann mar a thugtar \
     uirthi corruair), agus ceann den dtrí cinn de theangacha Ceilteacha ar a dtugtar na \
     teangacha Gaelacha (.i. an Ghaeilge, Gaeilge na hAlban agus Gaeilge Mhanann) go háirithe. \
     Labhraítear in Éirinn go príomha í, ach tá cainteoirí Gaeilge ina gcónaí in áiteanna eile ar \
     fud an domhain. Is í an teanga náisiúnta nó dhúchais agus an phríomhtheanga oifigiúil i \
     bPoblacht na hÉireann í an Ghaeilge. Tá an Béarla luaite sa Bhunreacht mar theanga oifigiúil \
     eile. Tá aitheantas oifigiúil aici chomh maith i dTuaisceart Éireann, atá mar chuid den \
     Ríocht Aontaithe. Ar an 13 Meitheamh 2005 d'aontaigh airí gnóthaí eachtracha an Aontais \
     Eorpaigh glacadh leis an nGaeilge mar theanga oifigiúil oibre san AE";

pub static RUSSIAN_TEXT: &'static str =
    "Ру?сский язы?к Информация о файле слушать)[~ 3] один из восточнославянских языков, \
     национальный язык русского народа. Является одним из наиболее распространённых языков мира \
     шестым среди всех языков мира по общей численности говорящих и восьмым по численности \
     владеющих им как родным[9]. Русский является также самым распространённым славянским \
     языком[10] и самым распространённым языком в Европе ? географически и по числу носителей \
     языка как родного[7]. Русский язык ? государственный язык Российской Федерации, один из \
     двух государственных языков Белоруссии, один из официальных языков Казахстана, Киргизии и \
     некоторых других стран, основной язык международного общения в Центральной Евразии, в \
     Восточной Европе, в странах бывшего Советского Союза, один из шести рабочих языков ООН, \
     ЮНЕСКО и других международных организаций[11][12][13].";


#[bench]
fn bench_encode_irish(bencher: &mut test::Bencher) {
    bencher.bytes = IRISH_TEXT.len() as u64;
    bencher.iter(|| {
        test::black_box(
            WINDOWS_1252.encode(&ASCII_TEXT, EncoderTrap::Strict)
        )
    })
}

#[bench]
fn bench_decode_irish(bencher: &mut test::Bencher) {
    let bytes = WINDOWS_1252.encode(IRISH_TEXT, EncoderTrap::Strict).unwrap();
    
    bencher.bytes = bytes.len() as u64;
    bencher.iter(|| {
        test::black_box(
            WINDOWS_1252.decode(&bytes, DecoderTrap::Strict)
        )
    })
}

#[bench]
fn bench_encode_russian(bencher: &mut test::Bencher) {
    bencher.bytes = RUSSIAN_TEXT.len() as u64;
    bencher.iter(|| {
        test::black_box(
            ISO_8859_5.encode(&RUSSIAN_TEXT, EncoderTrap::Strict)
        )
    })
}

#[bench]
fn bench_decode_russian(bencher: &mut test::Bencher) {
    let bytes = ISO_8859_5.encode(RUSSIAN_TEXT, EncoderTrap::Strict).unwrap();
    
    bencher.bytes = bytes.len() as u64;
    bencher.iter(|| {
        test::black_box(
            ISO_8859_5.decode(&bytes, DecoderTrap::Strict)
        )
    })
}

I picked the windows-1252 encoding because it's similar to the old latin-1 standard and can encode the special characters in the Irish text I grabbed, and iso-8859-5 for similar reasons for the Russian test.

I rewrote gen_index.py to create match statements instead of building a lookup table. You get something like this:

// AUTOGENERATED FROM index-windows-1252.txt, ORIGINAL COMMENT FOLLOWS:
//
// For details on index index-windows-1252.txt see the Encoding Standard
// https://encoding.spec.whatwg.org/
//
// Identifier: e56d49d9176e9a412283cf29ac9bd613f5620462f2a080a84eceaf974cfa18b7
// Date: 2018-01-06
#[inline]
pub fn forward(code: u8) -> Option<u16> {
    match code {
        128 => Some(8364),
        129 => Some(129),
        130 => Some(8218),
        131 => Some(402),
        132 => Some(8222),
        133 => Some(8230),
        134 => Some(8224),
        135 => Some(8225),
        136 => Some(710),
        137 => Some(8240),
        //  a bunch more items
        250 => Some(250),
        251 => Some(251),
        252 => Some(252),
        253 => Some(253),
        254 => Some(254),
        255 => Some(255),
        _ => None
    }
}

#[inline]
pub fn backward(code: u32) -> Option<u8> {
    match code {
        8364 => Some(128),
        129 => Some(129),
        8218 => Some(130),
        402 => Some(131),
        8222 => Some(132),
        8230 => Some(133),
        8224 => Some(134),
        8225 => Some(135),
        710 => Some(136),
        8240 => Some(137),
        352 => Some(138),
        8249 => Some(139),
        338 => Some(140),
        141 => Some(141),
        381 => Some(142),
        //  a bunch more items
        251 => Some(251),
        252 => Some(252),
        253 => Some(253),
        254 => Some(254),
        255 => Some(255),
        _ => None
    }
}

Note that I changed the function signature to return an Option instead of a sentinel value. That wasn't strictly required, and didn't have a large effect on performance, but makes the code more idiomatic, I think.

I also generated a version that uses a binary search. It's pretty simple.

const BACKWARD_KEYS: &'static [u32] = &[
    128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
    147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 162, 163, 164, 165, 166,
    167, 168, 169, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 187,
    188, 189, 190, 215, 247, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1495, 1496, 1497, 1498, 1499,
    1500, 1501, 1502, 1503, 1504, 1505, 1506, 1507, 1508, 1509, 1510, 1511, 1512, 1513, 1514, 8206,
    8207, 8215
];

const BACKWARD_VALUES: &'static [u8] = &[
    128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
    147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 162, 163, 164, 165, 166,
    167, 168, 169, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 187,
    188, 189, 190, 170, 186, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237,
    238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 253, 254, 223
];

#[inline]
pub fn backward(code: u32) -> u8 {
    if let Ok(index) = BACKWARD_KEYS.binary_search(&code) {
        BACKWARD_VALUES[index]
    } else {
        0
    }
}

Here's a table comparing the three techniques (scroll to see entire table):

test master       match         binary search        
codec::singlebyte::tests::bench_decode_irish 3246 ns/iter 240 MB/s 3171 ns/iter 245 MB/s 2.08%          
codec::singlebyte::tests::bench_decode_russian 8508 ns/iter 98 MB/s 8890 ns/iter 94 MB/s -4.08%          
codec::singlebyte::tests::bench_encode_irish 2622 ns/iter 310 MB/s 1688 ns/iter 482 MB/s 55.48% 2243 ns/iter 363 MB/s 17.10%
codec::singlebyte::tests::bench_encode_russian 6692 ns/iter 228 MB/s 10578 ns/iter 144 MB/s -36.84% 10019 ns/iter 152 MB/s -33.33%

Obviously the Irish / Windows-1252 case is improved with both alternative techniques, but the Russian case is degraded.

It looks like the decode method isn't changed much, and that makes sense, because the match expressions are contiguous integers, I bet that LLVM is optimizing that down to a lookup table anyways.

I'll try running some more tests.

@john-parton john-parton changed the title Performance: Lookup Performance: Consider replacing lookup tables with match statements in single byte index Aug 12, 2020
@john-parton john-parton changed the title Performance: Consider replacing lookup tables with match statements in single byte index Performance: Consider replacing lookup tables with match statements or binary search in single byte index Aug 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant