Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Complete rewrite #16

Open
wants to merge 26 commits into
base: devel
Choose a base branch
from
Open

Complete rewrite #16

wants to merge 26 commits into from

Conversation

dvtomas
Copy link

@dvtomas dvtomas commented Oct 5, 2018

Implement #15

@przygienda
Copy link
Collaborator

All right, reviewing the diff now

  1. quote: "In all cases, make decisions based only on the LOGGER keys and values, don't take the message KVs into account" ? all the KV should be taken into account, i.e. inheritance of logs implemented. It seems you're doing it down? mistype?
  2. the stuff you wrote will be just way too slow @ scale. All Matches has to be way more efficient, i.e. first do intersection of keys and the keys are not present not even both to walk all those lists you are building.
  3. anynonmatch has to be implemented as well.

Basically you have collapse the logic you build for AllMatches logic to have

Hashkey(allkeys that need match) and interesect with keys available => if not all keys around, just fail. If keys here, then keep hashset of all values & go look for presence in it. this will take all the orders to LOG complexity instead of naive linear complexity you have everywhere

implement the NotAnyMatch in same way, i.e.

Hashkey(keys that should not match) and interesect with keys available => resulting keys need same set lookup on values.

rest is good addition IMO, the not/or/and can be a nice way to generically combine stuff. Allow possibly a futures implementation even over time to get parallelization on evaluation.

@dvtomas
Copy link
Author

dvtomas commented Oct 7, 2018

ad 1.

Yeah, this particular sentence was copy-pasted from test_complex_example_2. It describes the effect of EvaluationOrder::LoggerOnly, which was not actually used in the module-level example, you're right it didn't make sense without the context. I've added the construction of kv_filter in d19f5cd to the example to demonstrate it, so it should make sense now. Log inheritance is implemented, as shown in the tests.

ad 2.

This is complicated.

First of all, but slightly unrelated:

I'm still not sure about the semantics of your original implementation, as I found the comments kind of confusing. Does the filter accept when, for just one key, all the values match, or any value matches? I've implemented the "all values match" and you don't seem to disagree after the code review, but after looking into your original implementation again, it seems to me that maybe the intent is actually "any value matches"? Can you please state what behavior you need?

Now to the actual problem:

I do agree my solution doesn't scale. On the other hand, for simple filters, it will be more performant as it avoids all the bookkeeping with allocating HashMaps/Sets, computing hashes etc. I'm not sure if there will be usages more complicated than yours. That is why I specifically asked you about your typical logging scenario. You said you have 5-6 keys, and I presumed that a linear find of a string in 5-6 strings wouldn't be much slower - if slower at all - than having to compute the hash of the string and doing a HashSet lookup. Most of the times, strcmp of two strings will fail on the first character, as alphabet has 26 different characters. But if you use a HashSet lookup, you have to use all the key characters to compute the hash.

When the appropriate key is found, looking for the value in the MatchAllValues HashSet is O(1), that
one couldn't be made any (asymptotically) faster and that is also exactly the same as what you did in your implementation. You said there are > 10 values, so using a HashSet probably makes sense here, and that is also what I've implemented.

There will be some overhead associated with allocating the intermediate structures both with your implementation and my implementation, and various possibilities to mitigate it. I guess my structures have a bigger overhead for more complicated filters as there's more mallocing going on, but that would need to be measured to be sure. I also have some ideas on how to make the structures less alloc hungry, if that indeed proves to be a problem.

I think at this point it would be too bold to suggest any one implementation to "overperform them all", as different scenarios ask for different "best" solutions. We could seriously complicate the implementation while actually harming the performance for the common use-cases.

If you do seriously care about performance, I suggest that you implement a criterion benchmark for you original implementation, with the scenario you want to benchmark. IIRC that would be 5-6 keys, 10 values for each, carefully selecting the keys and values to have similar characteristics to the ones you use in production with respect to length and comparison performance (that is, if your production keys are different words like "car", "mouse", "remote", don't use keys like "key1", "key2", "key3" in the benchmark as those take longer to compare. Try to use string lengths similar to the ones you use in production). I'll then port the benchmark to the new implementation so we can compare the performances based on hard data and not just speculations. If there is indeed a serious performance regression, we'll try to find ways to make it better.

ad 3

Ok, let's solve that after 2. is resolved.

@przygienda
Copy link
Collaborator

Hey Tomas, so I rewrote the comment (or actually I think maybe even David wrote it @ the end.

/// * a key is ignored until present in filters, otherwise an entry must
/// match for all the keys present in filters for any of the values given
/// for the key to pass the filter.
/// * an entry that hits any value of any negative filter key is filtered, this
/// takes precedence over filters

should be really clear. filter takes set of keys, each holding seet of values and only passes if all keys have at least one matching value

neg_filter takes same but rejects if any of the keys have at least one matching value

feel free to add according criterion on the main and submit another pull and then I agree, good idea to run both over the criterion ... yes, scale of about 5-6 keys with 5-6 values each is about the envelope I think would be good to test. And I hear you, having a pre-compute hash of keys in front (I didn't see that in your implementation when looking @ the diff) to get the set of keys that need testing is good and then I see the value of what you say, i.e. strcmp being fast compared to compute hashes. Let's look @ perf and then maybe we can fall back on something like prefix trees to try things ...

@dvtomas
Copy link
Author

dvtomas commented Oct 7, 2018

Ok, I was kind of hoping you could do the Criterion part for the master and I then migrate it into complete-rewrite, but if you don't want to do it i guess I can do both, I've used Criterion before. I've got some pressing work to do, hope I'll get back to this on Wednesday or Thursday...

@przygienda
Copy link
Collaborator

unfortunately, same here, I wouldn't get to it to couple weeks. If you don't find time I'll do it then ...

@dvtomas
Copy link
Author

dvtomas commented Oct 14, 2018

OK, finally got to do it.

I've implemented the benchmark on both the original KVFilter (see here: https://github.com/dvtomas/kvfilter/tree/criterion), and the complete-rewrite branch (see here: see here: https://github.com/dvtomas/kvfilter/tree/complete-rewrite)

I've implemented a simple AND benchmark with just two KVs, and a more complicated one with

  • positive filter with 5 keys 5 values each
  • negative filter with 2 keys 5 values each

The results for the original:

Benchmarking simple AND
Benchmarking simple AND: Warming up for 3.0000 s
Benchmarking simple AND: Collecting 100 samples in estimated 5.0021 s (8.2M iterations)
Benchmarking simple AND: Analyzing
simple AND              time:   [610.17 ns 611.03 ns 611.91 ns]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild

Benchmarking przygienda
Benchmarking przygienda: Warming up for 3.0000 s
Benchmarking przygienda: Collecting 100 samples in estimated 5.0132 s (1.1M iterations)
Benchmarking przygienda: Analyzing
przygienda              time:   [4.5742 us 4.5826 us 4.5914 us]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

and for complete-rewrite


Benchmarking simple AND
Benchmarking simple AND: Warming up for 3.0000 s
Benchmarking simple AND: Collecting 100 samples in estimated 5.0000 s (5.1M iterations)
Benchmarking simple AND: Analyzing
simple AND              time:   [962.23 ns 963.43 ns 964.67 ns]
                        change: [+57.292% +57.720% +58.192%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

Benchmarking przygienda
Benchmarking przygienda: Warming up for 3.0000 s
Benchmarking przygienda: Collecting 100 samples in estimated 5.0348 s (485k iterations)
Benchmarking przygienda: Analyzing
przygienda              time:   [10.387 us 10.406 us 10.426 us]
                        change: [+126.40% +127.04% +127.70%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) low mild
  1 (1.00%) high severe

So, yes, the performance for your use-case has regressed, from roughly 4.5 us to 10.5 us per three log messages for your use case. In my opinion, that's not too bad, and I really like the new opportunities the new KVFilter has (basically, for me, it turns something that just can't do what I need into something that's actually useful to me).

What do you think?

@dvtomas
Copy link
Author

dvtomas commented Oct 18, 2018

@przygienda

So, Tony, did you take a look at the results?

@przygienda
Copy link
Collaborator

thomas, on my plate, after this weekend. sorry, drafts to submit & code to chuck ...

@dvtomas
Copy link
Author

dvtomas commented Oct 31, 2018

@przygienda Hey, Tony, still busy coding? Any chance for a moment of free time to take peek at the benchmarks in the near future?

@przygienda
Copy link
Collaborator

przygienda commented Oct 31, 2018 via email

@seanpianka
Copy link

Can this PR be merged?

@dvtomas
Copy link
Author

dvtomas commented Mar 10, 2021

I've been using this branch personally for the last two years to my satisfaction, but I guess it is a breaking change (different API), and the performance might be somewhat worse for some specific cases. IMO the much more powerful design makes up for it, but opinions of others on whether it is worth the change may differ.

@przygienda
Copy link
Collaborator

przygienda commented Mar 10, 2021 via email

@seanpianka
Copy link

seanpianka commented Mar 10, 2021

@dvtomas Do you have time to update the documentation, or some basic examples, for how to use your branch?

I've been trying to setup a KVFilter to discard messages where a key is present and contains "bad" values, or if the key is not present... some semantics like this:

// discard/filter out before logging to drain
info!(logger, ""; "err" => None);
// discard/filter out before logging to drain
info!(logger, ""; "err" => ""); 
// discard/filter out before logging to drain
info!(logger, ""; "key" => "value"); 

// do not discard, log to the drain
info!(logger, ""; "err" => "error!!!");

However, I haven't been able to figure out how to achieve this with standard KVFilter. How would I express these logging semantics using your branch, @dvtomas?

Edit: Here's the current KVFilter code I've been using to no success:

let filter = KVFilter::new(drain, Level::Debug)
    .always_suppress_any(Some(
        vec![(
            "err".to_string(),
            HashSet::from_iter(vec!["None".to_string(), "".to_string()]),
        )]
        .into_iter()
        .collect(),
    ))
    .only_pass_any_on_all_keys(Some(vec![("err".to_string(), HashSet::new())].into_iter().collect()));

@dvtomas
Copy link
Author

dvtomas commented Mar 10, 2021

@seanpianka

The only documentation for now is the unit tests, let me know if they are too difficult to understand.

There is currently no filter with the semantics "Key exists". That is what you want, because with it you would be able to express your filter as (pseudocode) "Pass message if key_exists(key) && not(key_matches(key, bad_value)) && not(key_matches(key, ""))".

The only missing piece is key_exists, the rest is already implemented. If you are really interested, I can implement key_exists for you and provide a complete example, it should be fairly simple to do that as the new architecture is easily extensible.

@seanpianka
Copy link

seanpianka commented Mar 10, 2021

There is currently no filter with the semantics "Key exists".

I think this is the major piece I'm looking for, @dvtomas. I don't know if @przygienda has plans to implement this, but I'd be open to using either the main project or your fork if you're willing/have the spare time to implement something like key_exists(key) for the filter.

The only documentation for now is the unit tests, let me know if they are too difficult to understand.

I wouldn't say they're too difficult to understand, but I just have a hard time/have to spend a lot of time to extract info. on how to use the library just from reading unit tests... 😅

The only missing piece is key_exists, the rest is already implemented. If you are really interested, I can implement key_exists for you and provide a complete example, it should be fairly simple to do that as the new architecture is easily extensible.

I'd very much appreciate you doing this! key_exists, along with an example of how to use it, would be very useful to me, given the semantics I've been trying (and failing) to achieve.

@przygienda
Copy link
Collaborator

przygienda commented Mar 10, 2021 via email

@seanpianka
Copy link

@przygienda That'd be the only thing! 😁 I'm surprised no one has requested key_exists so far, perhaps the semantics with key_exists is an uncommon use-case?

Anyways, do you want me to open a new issue for this, @przygienda?

@przygienda
Copy link
Collaborator

przygienda commented Mar 10, 2021 via email

@przygienda
Copy link
Collaborator

przygienda commented Mar 10, 2021 via email

@przygienda
Copy link
Collaborator

przygienda commented Mar 10, 2021 via email

@seanpianka
Copy link

seanpianka commented Mar 11, 2021

I'm still seeing some issues in a unit-test I'm writing for my custom filter. Here's a function which builds a filter that should match the semantics from above: "Pass message if key_exists(key) && not(key_matches(key, bad_value)) && not(key_matches(key, ""))".

fn filter<D>(drain: D, level: Level) -> KVFilter<D> where D: Drain {
    KVFilter::new(drain, level)
        .only_pass_on_any_key_present(["err".to_string()].iter())
        .always_suppress_any(Some(
            HashMap::from_iter(
                vec![(
                    "err".to_string(),
                    HashSet::from_iter(vec!["None".to_string(), "".to_string()]),
                )]
            )
        ))
}

#[cfg(test)]
mod tests {
    use slog::{error, debug, info, Logger, Level, Drain};
    use super::filter;

    #[test]
    fn should_not_log_info_messages() {
        let decorator = slog_term::PlainSyncDecorator::new(slog_term::TestStdoutWriter);
        let drain = slog_term::FullFormat::new(decorator).build().fuse();
        let filter = filter(drain, Level::Debug).fuse();
        let logger = Logger::root_typed(filter, o!());

        // should discard
        info!(logger, "NO: test info");
        info!(logger, "NO: test info"; "count" => 10);
        info!(logger, "NO: test error"; "err" => "None");
        info!(logger, "NO: test error"; "err" => "");
        debug!(logger, "NO: test debug");

        // should log to drain
        info!(logger, "YES: test error"; "err" => "Panic!");
        error!(logger, "YES: test error");
    }
}

The output from this test is:

Mar 10 18:58:03.032 INFO NO: test info
Mar 10 18:58:03.033 INFO NO: test info, count: 10
Mar 10 18:58:03.033 INFO NO: test error, err: None
Mar 10 18:58:03.033 INFO NO: test error, err: 
Mar 10 18:58:03.033 INFO YES: test error, err: Panic!
Mar 10 18:58:03.033 ERRO YES: test error

I'm not sure why any of the messages are being filtered. Am I building the logger incorrectly, or am I mis-using the code you just added?

@dvtomas
Copy link
Author

dvtomas commented Mar 11, 2021

@seanpianka Ok, I've implemented what you need. See latest https://github.com/dvtomas/kvfilter/tree/complete-rewrite.

Your exact use-case is implemented here: https://github.com/dvtomas/kvfilter/blob/f856384d9b6bbfbeb531152d439aabc4f3a4afe9/src/lib.rs#L1213

The logic is Pass if key.exists() && key != "None" && key != "" plus some level filtering, resulting (your unit test pasted almost verbatim) in

 #[test]
    fn test_complex_example_3() {
        // Implements https://github.com/slog-rs/kvfilter/pull/16#issuecomment-795856834

        let tester = Tester::new(
            LevelAtLeast(Level::Warning).or(
                LevelAtLeast(Level::Info).and(
                    FilterSpec::key_exists("err")
                        .and(FilterSpec::match_kv("err", "None").not())
                        .and(FilterSpec::match_kv("err", "").not())
                )),
            EvaluationOrder::LoggerAndMessage,
        );

        // should discard
        info!(tester.log, "REJECT: test info");
        info!(tester.log, "REJECT: test info"; "count" => 10);
        info!(tester.log, "REJECT: test error"; "err" => "None");
        info!(tester.log, "REJECT: test error"; "err" => "");
        debug!(tester.log, "REJECT: test debug");

        // should log to drain
        info!(tester.log, "ACCEPT: test error"; "err" => "Panic!");
        error!(tester.log, "ACCEPT: test error");
        tester.assert_accepted(2);
    }

@przygienda
Copy link
Collaborator

przygienda commented Mar 11, 2021 via email

@przygienda
Copy link
Collaborator

przygienda commented Mar 11, 2021 via email

@seanpianka
Copy link

I wrote a unit test that fails, @przygienda. Did you see my message above?

@przygienda
Copy link
Collaborator

przygienda commented Mar 11, 2021 via email

@dvtomas
Copy link
Author

dvtomas commented Mar 11, 2021

@przygienda @seanpianka

Your implementation and my implementation differ in handling the following statement (not included in the original @seanpianka unit test, but possibly important):

debug!(logger, "NO: test debug"; "err" => "Panic!");

My implementation rejects the message (see my newest commit), because it is at the debug level. Your implementation accepts it, because it passes the has err key with a non-supressed value. The question is, which behavior @seanpianka prefers. My guess is that the intent is to reject debug messages altogether, so my implementation would be a better fit, but of course it is up to @seanpianka to decide. Just to let you know, it is quite easy to change the filter configuration in my implementation to give the alternative accept debug behavior.

@przygienda
Copy link
Collaborator

przygienda commented Mar 11, 2021 via email

@dvtomas
Copy link
Author

dvtomas commented Mar 11, 2021

Yeah, I usually do the same. @seanpianka seems to have a different use-case, though, judging by him wanting to have info!(logger, "NO: test info"); rejected. I made my implementation general so as to support any use-case you can think of, everybody has different needs.

Also, your implementation accepts the debug messages, which is something you usually don't want, as you've just claimed yourself. Is there a way to configure your filter to pass @seanpianka 's unit test, but reject debug messages?

przygienda and others added 2 commits April 23, 2021 07:46
# Conflicts:
#	Cargo.toml
#	README.md
#	src/lib.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants