Flow timeout timing/v16.2 stream fixes/v6.1 #12341
Conversation
When a thread fails to spawn, include the thread name in the error message.
No longer init then deinit part of the engine at startup of the unix socket mode.
Timeout checks would access certain fields w/o locking, which could lead to thread safety issues.
Can be used to log when the tcp session reuse logic triggers.
Rename to be consistent with other naming: STREAM_PKT_FLAG_TCP_PORT_REUSE -> STREAM_PKT_FLAG_TCP_SESSION_REUSE
Use a more precise calculation for timing out flows, using both the seconds and the microseconds. Ticket: OISF#7455.
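To make the seconds-plus-microseconds comparison concrete, here is a minimal standalone sketch; the type and function names are invented for illustration and are not Suricata's actual time/flow API:

#include <stdbool.h>
#include <stdint.h>

/* illustrative timestamp type; Suricata has its own timestamp representation */
typedef struct {
    uint64_t secs;
    uint64_t usecs; /* 0..999999 */
} TsExample;

/* A flow is timed out once 'now' is at or past (lastts + timeout),
 * compared on both seconds and microseconds rather than seconds alone. */
static bool ExampleFlowTimedOut(TsExample lastts, uint32_t timeout_s, TsExample now)
{
    uint64_t expire_secs = lastts.secs + timeout_s;
    if (now.secs != expire_secs)
        return now.secs > expire_secs;
    return now.usecs >= lastts.usecs;
}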
The flow worker needs to get the opportunity to run the flow update before globally making its current timestamp available. This is to avoid another thread using the time to evict the flow that is about to get a legitimate update. Ticket: OISF#7455.
If a thread doesn't receive packets for a while, the packet timestamp will no longer be used to determine a reasonable minimum timestamp for flow timeout handling. To avoid the minimum timestamp being set a bit too aggressively, increase the time a thread can be inactive.
Flow Manager skips rows based on a minimized tracker that tracks the next second at which the first flow may time out. If the seconds match, a flow can still be timing out.
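Roughly, the row-skipping idea looks like this sketch (field and function names are made up for illustration, not the actual flow manager code):

#include <stdbool.h>
#include <stdint.h>

/* illustrative hash-row tracker; the real field lives in the flow hash bucket */
typedef struct {
    uint32_t next_ts; /* earliest second at which a flow in this row may time out */
} ExampleFlowRow;

static bool ExampleRowNeedsTimeoutCheck(const ExampleFlowRow *row, uint32_t now_secs)
{
    /* Skip the row while 'now' is strictly before the tracked second.
     * When the seconds are equal the row must still be checked, since
     * a flow can be timing out within that same second. */
    return now_secs >= row->next_ts;
}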
When timing out flows, use the timestamp from the "owning" thread. This avoids problems with threads being out of sync with each other. Ticket: OISF#7455.
As this may mean that a thread's ts is a bit ahead of the minimum time the flow manager normally uses, it can evict flows a bit faster. Ticket: OISF#7455.
Until now many accesses to the Thread structure required taking a global lock, leading to performance issues. In practice this only happened in offline mode. This patch adds a finer grained locking scheme. It assumes that the Thread object itself cannot disappear, and adds a spinlock to protect updates to the structure. Additionally, the `pktts` field is made an atomic, so that it can be read w/o taking the spinlock. Updates to it are still done under lock.
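A hedged sketch of that locking scheme, with invented names rather than the real Thread struct: updates go through a per-object spinlock, while the hot `pktts` read is a lock-free atomic load.

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* illustrative only; not the actual Thread struct */
typedef struct ExampleThread_ {
    pthread_spinlock_t lock; /* protects updates to this one object; assume
                              * pthread_spin_init() was called at setup */
    _Atomic uint64_t pktts;  /* hot field: readable without the spinlock */
    const char *name;
} ExampleThread;

/* writers still serialize through the per-object spinlock */
static void ExampleThreadSetPktts(ExampleThread *t, uint64_t ts)
{
    pthread_spin_lock(&t->lock);
    atomic_store(&t->pktts, ts);
    pthread_spin_unlock(&t->lock);
}

/* readers (e.g. the flow manager) load the atomic lock-free */
static uint64_t ExampleThreadGetPktts(ExampleThread *t)
{
    return atomic_load(&t->pktts);
}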
The idea of sealing the thread store is that its members can be accessed w/o holding a lock to the whole store at runtime.
Since `Thread` objects are part of a big allocation, more than one Thread could be on a single cache line, leading to false sharing. Atomic updates to one `Thread` could then lead to poor performance accessing another `Thread`. Align to CLS (cache line size) to avoid this.
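As an illustration of the false-sharing fix (a sketch; the CLS value of 64 is an assumption here, the real build derives it): aligning the struct to the cache line size also pads its size to a multiple of CLS, so adjacent elements of the array never share a line.

#include <stdatomic.h>
#include <stdint.h>

#define CLS 64 /* assumed cache line size for this sketch */

/* illustrative struct, not the real Thread */
typedef struct ExamplePaddedThread_ {
    _Atomic uint64_t pktts;
    /* ... other fields ... */
} __attribute__((aligned(CLS))) ExamplePaddedThread;

/* sizeof is rounded up to a multiple of the alignment */
_Static_assert(sizeof(ExamplePaddedThread) % CLS == 0, "one object per cache line");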
Some checks can be done w/o holding a lock:
- seeing if the flow matches the packet
- seeing if the hash row needs a timeout check
This patch skips taking a lock in these conditions.
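A condensed, self-contained sketch of the lock-avoidance on lookup (all names invented, not the real flow code): the immutable flow key is compared without the flow lock, and the lock is only taken for the flow that actually matches.

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* simplified flow with an immutable key and a per-flow lock */
typedef struct ExampleFlow_ {
    uint32_t src, dst;
    uint16_t sp, dp;
    uint8_t proto;            /* key fields never change after insertion */
    pthread_mutex_t m;
    struct ExampleFlow_ *next;
} ExampleFlow;

typedef struct {
    uint32_t src, dst;
    uint16_t sp, dp;
    uint8_t proto;
} ExampleKey;

static bool ExampleFlowMatches(const ExampleFlow *f, const ExampleKey *k)
{
    return f->src == k->src && f->dst == k->dst &&
           f->sp == k->sp && f->dp == k->dp && f->proto == k->proto;
}

/* The key compare runs without the flow lock; the lock is only taken for
 * the flow that matches the packet. The caller unlocks. */
static ExampleFlow *ExampleRowLookup(ExampleFlow *head, const ExampleKey *k)
{
    for (ExampleFlow *f = head; f != NULL; f = f->next) {
        if (!ExampleFlowMatches(f, k))
            continue;
        pthread_mutex_lock(&f->m);
        return f;
    }
    return NULL;
}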
Explain the meaning of `ts` in the flow manager's main loop.
Since forever (1578ef1) a valid RST would update the internal `last_ack` representation to include all unACK'd data. This was originally done to make sure the unACK'd data was inspected/processed at flow timeout. It was observed, however, that if GAPs existed in this unACK'd data, a GAP could be reported in the stats and a GAP event would be raised. This doesn't make sense, as missing segments in the unACK'd part of the stream are completely normal. Segments simply do not all arrive in order. It turns out that the original behavior of updating `last_ack` to include all unACK'd data is no longer needed. For raw stream inspection, the detection engine will already include the unACK'd data on flow end. For app-layer updates the unACK'd data is often harmful, as the data often has GAPs. Parsers like the http parser would report these GAPs and could also get confused about the post-GAP data being a new transaction including a file. This led to many reported errors and fantom txs and files. Since the GAP detection uses `last_ack` to determine GAPs, not moving `last_ack` addresses the GAP false positives. Ticket: OISF#7422.
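To make the reasoning concrete, a tiny simplified sketch (it ignores sequence number wraparound and is not the real reassembly code): a hole only counts as a GAP when the missing range is already covered by `last_ack`, so raising `last_ack` past unACK'd data turns normal out-of-order holes into reported GAPs.

#include <stdbool.h>
#include <stdint.h>

/* A missing range before 'seg_seq' is only a real GAP when the peer already
 * ACK'd it (last_ack reaches past it); holes in unACK'd data are just
 * segments that have not arrived yet. */
static bool ExampleIsGap(uint32_t next_expected_seq, uint32_t seg_seq, uint32_t last_ack)
{
    bool data_missing = seg_seq > next_expected_seq;
    bool missing_range_was_acked = last_ack >= seg_seq;
    return data_missing && missing_range_was_acked;
}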
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #12341 +/- ##
==========================================
- Coverage 83.23% 83.21% -0.02%
==========================================
Files 912 912
Lines 257647 257676 +29
==========================================
- Hits 214450 214434 -16
- Misses 43197 43242 +45
Flags with carried forward coverage won't be shown.
Information: ERROR: QA failed on SURI_TLPW2_autofp_suri_time. ERROR: QA failed on SURI_TLPW1_files_sha256.
Pipeline 24079
These are pretty consistent with #12300 (comment) and #12313 (comment). The SURI_TLPW1_files_sha256 failure is explained here: #12186 (comment)
CI : 🟢 and QA explained
Code : ok
Commits segmentation : ok, but
- does f01d903 (eve/flow: log tcp reuse as 'reason') need its own ticket/test? cf Flow timeout timing/v15 #12278 (review)
Commit messages : ok, thanks for expanding the acronyms (is "fantom" correct English, or should it be "phantom"?)
Git ID set : looks fine to me
CLA : you already contributed
Doc update : not needed
Redmine ticket : ok https://redmine.openinfosecfoundation.org/issues/7455 and https://redmine.openinfosecfoundation.org/issues/7422
Rustfmt : no rust
Tests : I think they are ok even if I am not sure I understand them all
Dependencies added: none
thread_store_sealed = true;
SCMutexUnlock(&thread_store_lock);
}
Sealing appears to be only checked under DEBUG_VALIDATE. Is this mainly a development debugging tool to make sure thread fields are not accessed by multiple threads before they are ready to be?
Not entirely sure what you're asking, but the goal is to index the array safely after all threads have been set up. Since the array is dynamically allocated and grown (realloc) as threads are spawning, we need to know for sure when it is ready. The debug validation is there to make sure we catch it if the order is wrong, i.e. if we access w/o a lock while still unsealed.
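To illustrate the pattern being discussed (a hedged sketch, not the actual ThreadStore code; error handling omitted): registration reallocs the array under the store lock while unsealed, and lock-free indexing is only legal after sealing, when the array can no longer move.

#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

static struct {
    void **threads;
    size_t cnt;
    bool sealed;
    pthread_mutex_t lock;
} example_store = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* registration: only while unsealed, under the store lock (array may move) */
static void ExampleStoreAdd(void *t)
{
    pthread_mutex_lock(&example_store.lock);
    assert(!example_store.sealed); /* stand-in for DEBUG_VALIDATE_BUG_ON */
    example_store.threads = realloc(example_store.threads,
            (example_store.cnt + 1) * sizeof(void *));
    example_store.threads[example_store.cnt++] = t;
    pthread_mutex_unlock(&example_store.lock);
}

static void ExampleStoreSeal(void)
{
    pthread_mutex_lock(&example_store.lock);
    example_store.sealed = true;
    pthread_mutex_unlock(&example_store.lock);
}

/* runtime readers index lock-free, which is only safe once sealed */
static void *ExampleStoreGet(size_t idx)
{
    assert(example_store.sealed); /* catches wrong ordering during development */
    return example_store.threads[idx];
}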
It appears the "sealing" only sets a flag, and nothing else. And that flag is only checked with DEBUG_VALIDATE_BUG_ON. Is there more to it?
No, the debug validation check is there to make sure it is respected.
Ok, as threads are only registered at startup, why not always check it with a fatal error instead of having to opt in with debug-validation?
Unix socket mode spawns and despawns threads in its loop.
Still just on setup and teardown, right? I guess my line of thinking is: if it's important enough to seal, it's important enough to check - always.
replaced by #12370
SV_BRANCH=OISF/suricata-verify#2215
Rebased and combined #12300 and #12313, since both require a baseline update.