
[WIP] new(proposal): drop support for syscall enter events #2068

Open
wants to merge 2 commits into base: master

Conversation

Andreagit97
Member

What type of PR is this?

/kind design

Any specific area of the project related to this PR?

/area proposals

Does this PR require a change in the driver versions?

No

What this PR does / why we need it:

This document proposes removing support for syscall enter events in our codebase. The primary reason behind this proposal is to reduce the throughput of events that our userspace needs to process. As the syscall rate increases, userspace can no longer keep up and we start dropping events. See the full proposal for more details!

I'm curious to hear some feedback on this; we've talked about it many times, and maybe now is the right time to tackle it!

Some related discussions:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@poiana
Contributor

poiana commented Sep 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Andreagit97

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


codecov bot commented Sep 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.69%. Comparing base (e2c5174) to head (0a14985).
Report is 131 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2068      +/-   ##
==========================================
- Coverage   74.31%   73.69%   -0.62%     
==========================================
  Files         253      253              
  Lines       30966    31914     +948     
  Branches     5403     5608     +205     
==========================================
+ Hits        23011    23520     +509     
- Misses       7942     8363     +421     
- Partials       13       31      +18     
Flag Coverage Δ
libsinsp 73.69% <ø> (-0.62%) ⬇️

Flags with carried forward coverage won't be shown.



github-actions bot commented Sep 13, 2024

Perf diff from master - unit tests

     4.20%     +1.68%  [.] next
     3.89%     +0.88%  [.] sinsp_evt::load_params
     4.49%     -0.84%  [.] sinsp_parser::process_event
    11.01%     -0.67%  [.] sinsp_parser::reset
     0.50%     +0.58%  [.] std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, libsinsp::state::dynamic_struct::field_info>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, libsinsp::state::dynamic_struct::field_info> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::find
     0.83%     -0.55%  [.] std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, libsinsp::state::dynamic_struct::field_info>, true> > >::_M_deallocate_nodes
     1.09%     -0.55%  [.] std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Identity, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::find
     6.13%     -0.52%  [.] sinsp::next
     1.60%     +0.50%  [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release
     3.45%     -0.48%  [.] sinsp_thread_manager::get_thread_ref

Heap diff from master - unit tests

peak heap memory consumption: 0B
peak RSS (including heaptrack overhead): 0B
total memory leaked: 0B

Heap diff from master - scap file

peak heap memory consumption: 0B
peak RSS (including heaptrack overhead): 0B
total memory leaked: 0B

Benchmarks diff from master

Comparing gbench_data.json to /root/actions-runner/_work/libs/libs/build/gbench_data.json
Benchmark                                                         Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------
BM_sinsp_split_mean                                            -0.0075         -0.0075           153           152           153           152
BM_sinsp_split_median                                          -0.0135         -0.0135           153           151           153           151
BM_sinsp_split_stddev                                          +0.0518         +0.0524             1             1             1             1
BM_sinsp_split_cv                                              +0.0597         +0.0604             0             0             0             0
BM_sinsp_concatenate_paths_relative_path_mean                  -0.0238         -0.0238            63            62            63            62
BM_sinsp_concatenate_paths_relative_path_median                -0.0294         -0.0294            63            62            63            62
BM_sinsp_concatenate_paths_relative_path_stddev                +0.5897         +0.5890             0             1             0             1
BM_sinsp_concatenate_paths_relative_path_cv                    +0.6285         +0.6278             0             0             0             0
BM_sinsp_concatenate_paths_empty_path_mean                     -0.0061         -0.0061            25            25            25            25
BM_sinsp_concatenate_paths_empty_path_median                   -0.0049         -0.0049            25            25            25            25
BM_sinsp_concatenate_paths_empty_path_stddev                   +0.0329         +0.0354             0             0             0             0
BM_sinsp_concatenate_paths_empty_path_cv                       +0.0393         +0.0418             0             0             0             0
BM_sinsp_concatenate_paths_absolute_path_mean                  +0.0420         +0.0420            61            64            61            64
BM_sinsp_concatenate_paths_absolute_path_median                +0.0498         +0.0498            61            64            61            64
BM_sinsp_concatenate_paths_absolute_path_stddev                -0.5577         -0.5576             1             1             1             1
BM_sinsp_concatenate_paths_absolute_path_cv                    -0.5755         -0.5754             0             0             0             0
BM_sinsp_split_container_image_mean                            +0.0072         +0.0072           396           399           396           399
BM_sinsp_split_container_image_median                          +0.0057         +0.0057           396           398           396           398
BM_sinsp_split_container_image_stddev                          -0.1906         -0.1861             3             3             3             3
BM_sinsp_split_container_image_cv                              -0.1964         -0.1919             0             0             0             0

@Andreagit97
Member Author

Any feedback on this? I keep seeing many users with a lot of drops...

@Andreagit97
Member Author

@falcosecurity/libs-maintainers I will try to create a branch ASAP to show you some concrete code on this

@FedeDP
Contributor

FedeDP commented Oct 8, 2024

I would love to join you for this enormous task 🚀

Contributor

@Molter73 left a comment

I'm not 100% sold on the idea. As I mention in another comment, this proposal is basically saying that we should rewrite all drivers and parsing logic. Even if that is not the case, it will take tons of changes in them, which can cause bugs, regressions and what-have-yous. The proposal states that we will be sending half the data to userspace in some parts, but if all data is moved to exit events, that is not actually true: we will send half the events, but they will be almost twice as big, meaning we will only be saving the length of the enter event header, maybe a bit more if we find redundant data on some events.

I also don't see the need to move ALL syscalls to use only exit events. I would argue most of the performance impact comes from syscalls like read/open/send/recv and their relatives, since they are the most frequently triggered (the syscall fingerprint even shows these are triggered at least twice as much as other syscalls), so why not try changing only some of those and see if that is good enough?

IDK, it all seems like a huge amount of effort and a big risk, and I don't see how the enormous gains the proposal is being sold on will manifest themselves. I'd be happy to be proven wrong though.

Comment on lines +322 to +327
* Update our three drivers to collect all parameters in the exit fillers and remove the enter ones.
* Update scap-file management to convert all enter events generated over the years into new exit events.
* Update our userspace engines to support only exit events.
* Update the sinsp parsing and extraction logic to use only exit events.
* Update all tests (drivers, libscap, unit, e2e) to use only exit events.
* Update all documentation.
Contributor

This is a lot of work, and I mean A LOT of work. From where I stand, this is basically a rewrite of the parsing logic of sinsp and all 3 drivers. This is a ton of risk and I'm not convinced moving everything to exit events will have the huge impact you are expecting it to have. In my mind, for me to sign off on this, you'd need to have CPU and data savings of way above 50%, otherwise it's too much risk of having regressions.

Member Author

Well, I can fully understand you. This is a lot of work.
The point of this proposal is that today we don't scale on machines with more than 96 CPUs (an example here: https://gist.github.com/stevenbrz/69391aa71b22d205b6add88ae10fb905#file-gistfile1-txt-L126-L166) and we need to find a solution. What is proposed here is not the final solution, but at least I see it as a first step toward it. IMO, going from 10 Mevt/s to 5 Mevt/s is already something.

you'd need to have CPU and data savings of way above 50%

Yes, we probably won't reach something like this, but we will process half of the events in userspace, and that already sounds like something good. Unfortunately, having precise data before doing the change would be very complicated...

@FedeDP
Contributor

FedeDP commented Oct 8, 2024

@Molter73
As far as i see it, we will:

  • send and then parse half of the events, which means we can increase the throughput in userspace and have far fewer drops. Even if not doubling, a 30/40% increase alone would be huge and IMHO worth the big effort.
  • by changing the size of some types (e.g. fd and errno to 32 bits) we will save lots of precious bytes for every event sent to userspace
  • reach a more consistent codebase, since we will have fewer parsers and less parser magic (e.g. store_evt()/retrieve_old_evt())
  • no more syscall-enter-related wake-ups for our drivers
  • halve the glue code needed to support new syscalls: no more 2 events in the event_table, 2 fillers, 2 driver tests, 2 parsers, and so on. This is a nice gain for the future of the project.

I know andre has more data to back his proposal for sure :)

@Molter73
Contributor

Molter73 commented Oct 8, 2024

  • send and then parse half of the events

Fair, but the amount of data and fields being parsed would still be almost the same, right? What I mean is, you would only skip the enter header and then go on to parse effectively the same number of fields, just in one go, so the total processing done in userspace would be the same.

  • Even if not doubling, a 30/40% increase alone would be huge and IMHO worth the big effort.

You are very optimistic about the amount of gains we'll get, I don't think it's gonna get even close to that, but again, I'd be happy to be proven wrong.

  • by changing the size of some type (eg: fd and errno to 32bits) we will save lots of precious bytes for every event sent to userspace

I don't see what this has to do with moving all fields to the exit side. Why couldn't we do this with the existing model?

  • reach a more consistent codebase, since we will have less parsers, less parsers' magic (eg: store_evt()/retrieve_old_evt())

  • no more syscall_enter related wake-ups for our drivers

  • halve the glue code to support new syscalls: no more 2 events in the event_table, 2 fillers, 2 driver tests, 2 parsers, and so on. This is a nice gain for the futurity of the project.

These 3 are fair points, though they still require a massive rewrite that I'm not fully OK with.

If you guys can show me how we would get a 30% performance improvement from just shuffling things around, I'll gladly eat my words. For now, all I see is a massive rewrite that will be a pain to go through and put us in a position for a bunch of new bugs to pop up.

@Andreagit97
Member Author

Andreagit97 commented Oct 8, 2024

Fair, but the amount of data and fields being parsed would still be almost the same, right? What I mean is, you would only skip the enter header, then go on to parse the same number of fields effectively, you'd just be doing that in one go, so the total processing done in userspace would be the same.

Uhm, more or less. Today, for the most frequent events, we store the whole event in memory and reuse it in the exit.
Just to give you a pointer:

case PPME_SYSCALL_OPEN_E:

This is not only a cost (in scenarios where we have 4 million open/s we save 2 million OPEN_E events per second; same for close, socket, etc.), but it's also a useless complication and a risk: sometimes, if we drop the enter event, we are not able to correctly process the exit one, so we end up doing a sort of approximation without any reason. Why should the information be shared between enter and exit events?
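To make the pattern concrete, here is a schematic sketch of the store-on-enter / retrieve-on-exit flow described above. The names and containers are illustrative only, not the real sinsp API:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <utility>
#include <vector>

using raw_event = std::vector<uint8_t>;

struct thread_state {
    std::optional<raw_event> stored_enter; // copied on every enter event
};

// Hypothetical per-thread table, keyed by tid.
std::map<int64_t, thread_state> threads;

// On e.g. PPME_SYSCALL_OPEN_E: pay a copy even if the exit never arrives.
void store_enter(int64_t tid, const raw_event& evt) {
    threads[tid].stored_enter = evt;
}

// On e.g. PPME_SYSCALL_OPEN_X: if the enter event was dropped, today's
// parser has to approximate; with all params on the exit event this
// whole branch disappears.
std::optional<raw_event> retrieve_enter(int64_t tid) {
    auto it = threads.find(tid);
    if (it == threads.end() || !it->second.stored_enter) {
        return std::nullopt; // enter drop -> degraded parsing today
    }
    std::optional<raw_event> evt = std::move(it->second.stored_enter);
    it->second.stored_enter.reset();
    return evt;
}
```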

so the total processing done in userspace would be the same.

Moreover, we need to consider that there is not only the parsing phase: after parsing, the event is sent to all the filter checks, and we need to evaluate all of them on events that don't carry a real value...

You are very optimistic about the amount of gains we'll get, I don't think it's gonna get even close to that, but again, I'd be happy to be proven wrong.

Well, that's true: we have no data to discuss this, and unfortunately it's very difficult to obtain it without implementing the solution...

by changing the size of some type (eg: fd and errno to 32bits) we will save lots of precious bytes for every event sent to userspace

I don't see what this has to do with moving all fields to the exit side. Why couldn't we do this with the existing model?

This would require generating new event pairs for almost all the events that have at least one fd or pid field. It makes sense to do it as part of a broader refactor like the one we are talking about here.

If you guys can show me how we would get 30% performance improvement from just shuffling things around, I'll gladly eat my words, for now, all I see is a massive rewrite that will be a pain to go through and put us in a position for a bunch of new bugs to pop up.

Well, I see your point, really.
On the driver side, I'm not so worried: we have tests for each syscall and we don't need to change the logic, we just need to move it into a single filler.
The userspace side is for sure more worrying, but having done it for the open syscall I can say that fixing the tests accordingly is more of a burden than updating the parsers... Enter events are just stored in memory and reused in the exit; they don't have a big role in the core parsing logic, so having all the parameters in the exit just simplifies the flow without adding complexity, or at least this is how I see it. Moreover, many events are not parsed at all in userspace, so for those we need to change nothing.

Doing this work for just one event, I've already spotted some inconsistencies/bugs. So it's true that we will surely introduce some new bugs, but today we already have many of them hidden somewhere; at the end of the work, I'm not sure what the balance of introduced/fixed bugs would be...

BTW in any case I would proceed with a separate branch, doing PR after PR would be very difficult and dangerous as you correctly highlighted.

Apart from the huge amount of changes, I don't see many other downsides; if anything, I see upsides:

  • we should achieve better instrumentation time kernel side
  • less CPU time spent in userspace
  • cleaner and more intuitive userspace logic

But yes I see how these changes can be scary...

Signed-off-by: Andrea Terzolo <[email protected]>
Co-authored-by: Mauro Ezequiel Moltrasio <[email protected]>
@Andreagit97
Member Author

Since we talked about byte savings, I would add a bit of information. This is not strictly related to this initiative, but if we decide to go this way we could think about it...

Today's event schema

/*
 * SCAP EVENT FORMAT:
 *
 * +------------------------------------------+
 * |  Header	| u16 | u16 | param1 | param2 |
 * +------------------------------------------+
 * ^        	   ^                             ^
 * |        	   |                             |
 * ppm_hdr  	   |                             |
 * lengths_arr--+                             |
 * raw_params---------------------------------+
 */

// 26 bytes
struct ppm_evt_hdr {
	uint64_t ts; /* timestamp, in nanoseconds from epoch */
	uint64_t tid; /* the tid of the thread that generated this event */
	uint32_t len; /* the event len, including the header */
	uint16_t type; /* the event type */
	uint32_t nparams; /* the number of parameters of the event */
};
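As a sanity check on the 26-byte figure: the sum of the member sizes is 8 + 8 + 4 + 2 + 4 = 26, which only holds if the on-wire header is packed with no padding (a plain struct with these members would typically be padded to 32 bytes). A minimal sketch, assuming GCC/Clang packing syntax:

```cpp
#include <cstdint>

// Packed mirror of the header above; without the packed attribute the
// compiler would insert padding after `len` and after `type`.
struct __attribute__((packed)) ppm_evt_hdr_packed {
    uint64_t ts;      /* timestamp, in nanoseconds from epoch */
    uint64_t tid;     /* the tid of the thread that generated this event */
    uint32_t len;     /* the event len, including the header */
    uint16_t type;    /* the event type */
    uint32_t nparams; /* the number of parameters of the event */
};

static_assert(sizeof(ppm_evt_hdr_packed) == 26,
              "header is 26 bytes only when packed");
```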

Why don't we do something like this?

/* +----------+
 * |  Header  |
 * +----------+
 */

typedef uint16_t evt_hdr;


/* Concrete example */
[PPME_SYSCALL_CLOSE_X] = {"close",
                              	EC_IO_OTHER | EC_SYSCALL,
                              	EF_DESTROYS_FD | EF_USES_FD | EF_MODIFIES_STATE,
                              	2,
                              	{{"res", PT_ERRNO, PF_DEC},
                              	 {"fd", PT_FD, PF_DEC}}},



/* +--------------------------------------------+
 * |  TYPE_CLOSE_X  | len | tid | ts | res | fd |
 * +--------------------------------------------+
 *
 * - Total len: 16 + 16 + 32 + 64 + 32 + 32 -> we save 4 extra bytes because we don't need the len of `res` and `fd` (note that this event can fire 2/3 million times per second)
 */

/* +----------------------------------------------------------------------------------------------+
 * |  TYPE_OPEN_X  | len | tid | ts | res | fd | flags | mode | dev | ino | len_name (u16) | name |
 * +----------------------------------------------------------------------------------------------+
 *
 * - Total len: 16 + 16 + 32 + 64 + 32 + 32 + 32 + 32 + 32 + 64 + 16 + strlen(name) -> we save 12 extra bytes
 *
 */

/* +----------------------------------------------------------------------------------------------+
 * |  TYPE_MMAP_X  | len | tid | ts | res | ... |
 * +----------------------------------------------------------------------------------------------+
 *
 * - This design would allow us to handle special cases: for example, mmap has a 64-bit return value, and we can easily handle it because sinsp will know that when it parses a `TYPE_MMAP_X` event
 *
 */
  • The header is only the type, on 16 bits.
  • Every type can have its own schema.
  • We don't have the len array anymore: we just push the total len, plus a len before each variable-size parameter. For fixed-size parameters, the len is already codified in the event table by the parameter's type.
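The length-handling rule in the last bullet could be sketched like this (illustrative names, not real libscap code): fixed-size parameter lengths come from the event-table schema, so only variable-size parameters carry an inline u16 length:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

struct param_desc {
    uint16_t fixed_len; // 0 => variable-size parameter
};

// Walks the raw parameter bytes of one event. Returns the number of
// bytes consumed, or 0 on a truncated buffer. Each output entry is a
// (pointer, length) view into the buffer.
size_t decode_params(const uint8_t* buf, size_t buf_len,
                     const std::vector<param_desc>& schema,
                     std::vector<std::pair<const uint8_t*, uint16_t>>& out) {
    size_t off = 0;
    for (const auto& p : schema) {
        uint16_t len = p.fixed_len;
        if (len == 0) { // variable-size: read the inline u16 length
            if (off + 2 > buf_len) return 0;
            std::memcpy(&len, buf + off, 2);
            off += 2;
        }
        if (off + len > buf_len) return 0;
        out.emplace_back(buf + off, len);
        off += len;
    }
    return off;
}
```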

Just to give a complete example of a possible saving, consider the PPME_SYSCALL_OPENAT_2_X event:

 [PPME_SYSCALL_OPENAT_2_X] = {"openat",
                               EC_FILE | EC_SYSCALL,
                               EF_CREATES_FD | EF_MODIFIES_STATE,
                               7,
                               {{"fd", PT_FD, PF_DEC},
                                {"dirfd", PT_FD, PF_DEC},
                                {"name", PT_FSRELPATH, PF_NA, DIRFD_PARAM(1)},
                                {"flags", PT_FLAGS32, PF_HEX, file_flags},
                                {"mode", PT_UINT32, PF_OCT},
                                {"dev", PT_UINT32, PF_HEX},
                                {"ino", PT_UINT64, PF_DEC}}},

Today: 26 (header) + 14 (len array) + 8 (fd on 8 bytes) + 8 (dirfd) + X (for the name) + 4 + 4 + 4 + 8 = 76 bytes + X

With the proposed approach: 16 (type+len+tid+ts) + 4 (fd on 4 bytes) + 4 (dirfd on 4 bytes) + 4 + 4 + 4 + 8 + 2 (only the name field needs the len) + X = 46 bytes + X

We save 30 bytes!
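The arithmetic above can be double-checked with a small sketch. The field sizes are the ones assumed in this comment, not taken from the real driver code:

```cpp
#include <cstddef>

// Today: 26-byte ppm_evt_hdr + one u16 length per parameter + raw
// params, with fds on 8 bytes.
constexpr size_t openat_today(size_t name_len) {
    return 26                 /* ppm_evt_hdr */
         + 7 * 2              /* u16 length array, 7 params */
         + 8 + 8              /* fd, dirfd (64-bit today) */
         + name_len           /* name */
         + 4 + 4 + 4 + 8;     /* flags, mode, dev, ino */
}

// Proposed: 16-byte prefix (type u16 + len u16 + tid u32 + ts u64),
// 32-bit fds, and an inline u16 length only for variable-size params.
constexpr size_t openat_proposed(size_t name_len) {
    return 2 + 2 + 4 + 8      /* type, len, tid, ts */
         + 4 + 4              /* fd, dirfd (32-bit) */
         + 2 + name_len       /* u16 len + name */
         + 4 + 4 + 4 + 8;     /* flags, mode, dev, ino */
}

static_assert(openat_today(0) == 76, "matches the 76 bytes + X above");
static_assert(openat_proposed(0) == 46, "matches the 46 bytes + X above");
static_assert(openat_today(0) - openat_proposed(0) == 30, "30 bytes saved");
```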

@gnosek
Contributor

gnosek commented Oct 9, 2024

Taking a step back: okay, we can't handle $MILLIONS evt/sec right now, but why exactly? Is the bottleneck in scap_next (and dropping the enter events will help a lot) or is it in the sinsp parsers (and dropping half the events will help only a little bit, as we save on dispatch but the data fields we process are basically the same)?

If you happen to have a 96 core machine to play with, it might be interesting to see how fast we can get with just scap (e.g. scap-open with any sort of printing disabled) vs sinsp (sinsp-example, again, with printing disabled to keep the playing field level).

My expectation is that raw scap will scale to an absurd extent (no reason it wouldn't be faster than memcpy, it's basically an atomic fetch and add per event batch) and it's sinsp that's slow, but I'm ready to be surprised.

(if you do run this test, please gather perf profiles and make some pretty flame graphs :))

Note that we have a mostly random sleep inside (kmod's) scap_next, so on less than full load the perf won't be ideal.
BTW, did we ever consider a futex (equivalent) instead of the sleep?

@Andreagit97 I like the proposed new layout a lot, though I'm mildly worried about having a variable-length res field. I'd sleep minimally better if it was just native register size (8 bytes these days) unconditionally.

@Molter73
Contributor

Molter73 commented Oct 9, 2024

Uhm more or less. Today for the most frequent events we store the whole event in memory and we reuse it in the exit. just to give you a pointer

case PPME_SYSCALL_OPEN_E:

Yes, I know about the storing logic; it's not ideal, but it's basically a memcpy, and I'm sure playing around with allocators could give better performance (arena allocators, using something like tcmalloc, etc.); again, this doesn't need a large refactoring. I do, on the other hand, strongly agree that we shouldn't drop entire events when we drop the enter side; this is a much stronger selling point for the proposal than performance, IMO.

Moreover, we need to consider that there is not only the parsing phase, after the parsing the event is sent to all the filter checks and we need to evaluate all the filter checks on events that don't carry a real value...

I'm not super familiar with filterchecks, but I would argue this is a mistake in the filtercheck logic, which should ignore events that are not useful, not in the driver and parsing logic.

Doing this work just for one event I've already spotted some inconsistencies/bugs, so it's true that we will introduce for sure some new bugs but today we already have many of them hidden somewhere. So at the end of the work, I'm not sure what would be the balance of introduced/fixed bugs...

I recently read an interesting article on how new code is way more likely to have bugs and vulnerabilities than old code, so if you've already spotted bugs on one syscall, I think we can assume we are gonna come out the other end with a negative balance. I'm not trying to completely push back here, but we do need to be aware of the risk of doing such a large refactoring.

BTW in any case I would proceed with a separate branch, doing PR after PR would be very difficult and dangerous as you correctly highlighted.

Just a suggestion on the dev workflow here, I wouldn't use a single branch with all the changes, that will be a nightmare to review. I suggest you have a feature branch and open PRs against that branch with small changes (maybe one or two syscalls per PR?), that way the changes can be thoroughly reviewed and the final merge to master should be less painful for everyone.

  • every type can have its own schema.

How do you handle driver <-> scap compatibility like this? If the schema changes in any way, you will no longer be able to parse an event from an older driver, right? If a field changes from 32 to 64 bits, how would scap know to only read 32 bits when the older driver is being used?

I agree with everything in @gnosek's comment BTW, running the tests he proposes would be very interesting.

@Andreagit97
Member Author

Taking a step back: okay, we can't handle $MILLIONS evt/sec right now, but why exactly? Is the bottleneck in scap_next (and dropping the enter events will help a lot) or is it in the sinsp parsers (and dropping half the events will help only a little bit, as we save on dispatch but the data fields we process are basically the same)?

If you happen to have a 96 core machine to play with, it might be interesting to see how fast we can get with just scap (e.g. scap-open with any sort of printing disabled) vs sinsp (sinsp-example, again, with printing disabled to keep the playing field level).

My expectation is that raw scap will scale to an absurd extent (no reason it wouldn't be faster than memcpy, it's basically an atomic fetch and add per event batch) and it's sinsp that's slow, but I'm ready to be surprised.

I took some data on 80 CPUs in the past for a completely different reason (ringbuffer scraping) but I think they still report something relevant. https://docs.google.com/document/d/1CLIP4cdAGa6MQnMffQKeVT7Z18rzAyX6XkLZ-HFns1E/edit

  1. It shows the difference between parsing an openat and a fstat in sinsp on 16 CPUs. Storing failed openat calls in memory probably has a non-negligible cost when we do it 3 M/s times.
  2. It shows that the bottleneck is in sinsp, because with scap alone we can handle huge throughputs (~14 Mevt/s on 80 CPUs with a sequential buffer approach).
  3. It shows that on 80 CPUs we hit some limits when we try to handle more than 2.4 Mevt/s in sinsp. Please note that this means 1.2 Msyscall/s, because we have enter+exit. As stated in the document, we say that we support throughputs of 2.4 Mevt/s, but this depends on:
    • How the traffic is distributed among the available CPUs
    • How the syscalls are distributed in time (e.g. 5 syscalls + 40 usec of sleep)

TL;DR: the solution proposed here is not a definitive one, but it should allow us to gain some time while we come up with a definitive solution (multi-threading, kernel filtering, whatever...). Moreover, apart from halving the number of events we send to userspace, this solution brings some important optimizations:

  • it allows us to reduce kernel-side instrumentation time (something we've always tried to do - see the document here eBPF vs. KMod Performance #267).
  • It allows us to avoid storing events in memory, as in the case of openat.
  • If we apply optimizations to the event schema, it allows us to save a considerable number of bytes per event.
  • It also allows us to simplify the parsing flow a bit, reducing inconsistencies and complex logic that we currently have.

To be honest, aside from the major refactor, I see only possible improvements...

@Andreagit97
Member Author

@Andreagit97 I like the proposed new layout a lot, though I'm mildly worried about having a variable-length res field. I'd sleep minimally better if it was just native register size (8 bytes these days) unconditionally.

That makes sense, but instead of using a common PT_ERRNO type for all return values, we could assign the precise return-value type directly in the event table: for open/openat we would have PT_FD, for mmap PT_POINTER, and so on. This way the event table carries the information on the return value's size.

        [PPME_SYSCALL_OPEN_X] = {"open", ..., {{"res", PT_FD, PF_DEC},
        [PPME_SYSCALL_MMAP_X] = {"mmap", ..., {{"res", PT_POINTER, PF_HEX},

And so on, instead of a generic PT_ERRNO
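A minimal sketch of this idea (names and layout are illustrative, not the real ppm_event_info structure): each exit event carries a precise return-value type in the event table, so the size of `res` can be derived per event instead of using a one-size-fits-all PT_ERRNO:

```cpp
#include <cstddef>

enum param_type { PT_FD, PT_POINTER, PT_ERRNO };

// Assumed sizes from the discussion: fds and errno fit in 32 bits,
// pointers are the native 64-bit register size.
constexpr size_t param_size(param_type t) {
    return t == PT_POINTER ? 8 : 4;
}

struct evt_info {
    const char* name;
    param_type res_type; // the precise type of the return value
};

// A slice of the hypothetical event table from the comment above.
constexpr evt_info g_event_table[] = {
    {"open", PT_FD},      // PPME_SYSCALL_OPEN_X: res is an fd
    {"mmap", PT_POINTER}, // PPME_SYSCALL_MMAP_X: res is a pointer
};
```

Note gnosek's concern below still applies: this requires a schema lookup before the return value can even be read.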

@gnosek
Contributor

gnosek commented Oct 9, 2024

What I'm concerned about is having to look up the event schema before we even know if the event is successful or not. Maybe it's not a real issue, idk 🤷

@Andreagit97
Member Author

How do you handle driver <-> scap compatibility like this? If the schema changes in any way, you will no longer be able to parse an event from an older driver, right? If a field changes from 32 to 64 bits, how would scap know to only read 32 bits when the older driver is being used?

That's a good point.
If we decide to change the schema, we will certainly bump the major of the SCHEMA VERSION, so new libs versions won't be able to use old driver versions. This covers live captures. For replaying a capture (scap-files), we need a converter, but this should be quite easy to implement since we have versions for the event blocks.

@Andreagit97
Member Author

Andreagit97 commented Oct 9, 2024

What I'm concerned about is having to look up the event schema before we even know if the event is successful or not. Maybe it's not a real issue, idk 🤷

If we see this is an issue I agree with you that we should fall back to something standard like always 64 bits

@gnosek
Contributor

gnosek commented Oct 9, 2024

Just a suggestion on the dev workflow here, I wouldn't use a single branch with all the changes, that will be a nightmare to review. I suggest you have a feature branch and open PRs against that branch with small changes (maybe one or two syscalls per PR?), that way the changes can be thoroughly reviewed and the final merge to master should be less painful for everyone.

Alternatively, we can have a global (per sinsp) flag to drop the enter events and introduce support for it gradually (merging each step to master, with passing tests for both scenarios). This won't require a long-lived side branch that will be a pain to maintain before it's merged (ask me how I know). In the end, we can remove the flag (permanently dropping enter events) or leave it as is if we want.

@leogr
Member

leogr commented Oct 9, 2024

Since the discussion is becoming interesting, I'm joining in with my two cents 😸

I've been talking with Andrea about this proposal for a while, and even though I tend to be conservative, I think there's an opportunity for a substantial improvement here.

I guess the core point here is the issue of drops. We have been evaluating many solutions for years, and honestly, anything we have approached has just mitigated the situation. Still, we have never seen a real improvement. The root cause is likely the legacy design, which we have never tried to change radically.

This proposal favors structural changes that may result in a substantial improvement. So, I'm not against it unless the numbers show something different.

Regarding bugs, I'm not so worried since we are already in a situation where it's tough to discover and fix them (primarily due to the legacy design). We have to deal with bugs anyway, so this should not be a blocker. Somewhat the new solution should reduce the moving parts a bit, which might even result in easier maintenance and fewer bugs in the long term.

That said, I would like to give more reasons once we have a PoC. Now that we don't plan to introduce significant new features, I would definitely avoid stagnation.
