From e42706b2837108a399aea96beec012b4e884ce8f Mon Sep 17 00:00:00 2001 From: mstopa-splunk Date: Mon, 9 Sep 2024 08:40:40 +0000 Subject: [PATCH 1/7] docs: add Architecture and Load Balancers --- docs/architecture.md | 56 -------- docs/architecture/index.md | 54 +++++++ docs/architecture/lb/index.md | 17 +++ docs/architecture/lb/nginx-os.md | 190 +++++++++++++++++++++++++ docs/architecture/performance-tests.md | 129 +++++++++++++++++ docs/architecture/tcp-optimization.md | 72 ++++++++++ docs/architecture/udp-optimization.md | 75 ++++++++++ docs/lb.md | 13 -- docs/performance.md | 56 -------- mkdocs.yml | 13 +- 10 files changed, 547 insertions(+), 128 deletions(-) delete mode 100644 docs/architecture.md create mode 100644 docs/architecture/index.md create mode 100644 docs/architecture/lb/index.md create mode 100644 docs/architecture/lb/nginx-os.md create mode 100644 docs/architecture/performance-tests.md create mode 100644 docs/architecture/tcp-optimization.md create mode 100644 docs/architecture/udp-optimization.md delete mode 100644 docs/lb.md delete mode 100644 docs/performance.md diff --git a/docs/architecture.md b/docs/architecture.md deleted file mode 100644 index 1fe98b048d..0000000000 --- a/docs/architecture.md +++ /dev/null @@ -1,56 +0,0 @@ -# SC4S Architectural Considerations - -SC4S provides performant and reliable syslog data collection. When you are planning your configuration, review the following architectural considerations. These recommendations pertain to the Syslog protocol and age, and are not specific to Splunk Connect for Syslog. - -## The syslog Protocol - -The syslog protocol design prioritizes speed and efficiency, which can occur at the expense of resiliency and reliability. User Data Protocol (UDP) provides the ability to "send and forget" events over the network without regard to or acknowledgment of receipt. Transport Layer Secuirty (TLS) and Secure Sockets Layer (SSL) protocols are also supported, though UDP prevails as the preferred syslog transport for most data centers. - -Because of these tradeoffs, traditional methods to provide scale and resiliency do not necessarily transfer to syslog. - -## IP protocol - -By default, SC4S listens on ports using IPv4. IPv6 is also supported, see `SC4S_IPV6_ENABLE` in [source configuration options](https://splunk.github.io/splunk-connect-for-syslog/main/configuration/#syslog-source-configuration). - -## Collector Location - -Since syslog is a "send and forget" protocol, it does not perform well when routed through substantial network infrastructure. This -includes front-side load balancers and WAN. The most reliable way to collect syslog traffic is to provide for edge -collection rather than centralized collection. If you centrally locate your syslog server, the UDP and (stateless) -TCP traffic cannot adjust and data loss will occur. - -## syslog Data Collection at Scale -As a best practice, do not co-locate syslog-ng servers for horizontal scale and load balance to them with a front-side load balancer: - -* Attempting to load balance for scale can cause more data loss due to normal device operations -and attendant buffer loss. A simple, robust single server or shared-IP cluster provides the best performance. - -* Front-side load balancing causes inadequate data distribution on the upstream side, leading to uneven data load on the indexers. - -## High availability considerations and challenges - -Load balancing for high availability does not work well for stateless, unacknowledged syslog traffic. 
More data is preserved when you use a more simple design such as vMotioned VMs. With syslog, the protocol itself is prone to loss, and syslog data collection can be made "mostly available" at best. - -## UDP vs. TCP - -Run your syslog configuration on UDP rather than TCP. - -The syslogd daemon optimally uses UDP for log forwarding to reduce overhead. This is because UDP's streaming method does not require the overhead of establishing a network session. -UDP reduces network load on the network stream with no required receipt verification or window adjustment. - -TCP uses Acknowledgement Signals (ACKS) to avoid data loss, however, loss can still occur when: - -* The TCP session is closed: Events published while the system is creating a new session are lost. -* The remote side is busy and cannot send an acknowledgement signal fast enough: Events are lost due to a full local buffer. -* A single acknowledgement signal is lost by the network and the client closes the connection: Local and remote buffer are lost. -* The remote server restarts for any reason: Local buffer is lost. -* The remote server restarts without closing the connection: Local buffer plus timeout time are lost. -* The client side restarts without closing the connection. -* Increased overhead on the network can lead to loss. - -Use TCP if the syslog event is larger than the maximum size of the UDP packet on your network typically limited to Web Proxy, DLP, and IDs type sources. -To mitigate the drawbacks of TCP you can use TLS over TCP: - -* The TLS can continue a session over a broken TCP to reduce buffer loss conditions. -* The TLS fills packets for more efficient use of memory. -* The TLS compresses data in most cases. \ No newline at end of file diff --git a/docs/architecture/index.md b/docs/architecture/index.md new file mode 100644 index 0000000000..ad1203b374 --- /dev/null +++ b/docs/architecture/index.md @@ -0,0 +1,54 @@ +# Architectural Considerations + +Building a syslog ingestion architecture is complex and requires careful planning. The syslog protocol prioritizes speed and efficiency, often at the expense of resiliency and reliability. Due to these trade-offs, traditional scaling methods may not be directly applicable to syslog. + +This document outlines recommended architectural solutions, along with alternative or unsupported methods that some users have found viable. + +## Edge vs. Centralized Collection + +While TCP and TLS are supported, UDP remains the dominant protocol for syslog transport in many data centers. Since syslog is a "send and forget" protocol, it performs poorly when routed through complex network infrastructures, including front-end load balancers and WAN. + +### Recommendation: Use Edge Collection + +The most reliable way to gather syslog traffic is through edge collection rather than centralized collection. If your syslog server is centrally located, UDP and stateless TCP traffic cannot adapt, leading to data loss. + +## Avoid Load Balancing for Syslog + +For optimal performance, scale vertically by fine-tuning a single, robust server. Key tools and methods for enhancing performance on your SC4S server are documented in: + +1. [Fine-tune for TCP](tcp-optimization.md) +2. [Fine-tune for UDP](udp-optimization.md) + +We advise against co-locating syslog-ng servers for horizontal scaling with load balancers. The challenges of load balancing for horizontal scaling are outlined in the [Load Balancer's Overview](lb/index.md) section. 
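+
+One quick way to confirm that a single, vertically scaled instance keeps up with your peak load is to watch the syslog-ng destination counters from inside the SC4S container. A minimal sketch, assuming the container is named `SC4S` and runs under podman:
+
+```bash
+# Steadily growing "queued" counters mean the instance needs further tuning or more capacity
+sudo podman exec SC4S syslog-ng-ctl stats | grep queued
+```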
+ +## High Availability (HA) Considerations + +Syslog, being prone to data loss, can only achieve "mostly available" data collection. + +### HA Without Load Balancers + +Load balancing does not suit syslog’s stateless, unacknowledged traffic. More data is preserved with simpler designs, such as vMotioned VMs. + +The optimal deployment model for high availability uses a [Microk8s](https://microk8s.io/) setup with MetalLB in BGP mode. This method implements load balancing through destination network translation, providing better HA results. + +## UDP vs. TCP + +Syslog optimally uses UDP for log forwarding due to its low overhead and simplicity. UDP's streaming nature eliminates the need for network session establishment, which reduces network strain and avoids complex verification processes. + +### Drawbacks of TCP + +While TCP uses acknowledgement signals (ACKS) to mitigate data loss, issues still arise, such as: + +- Loss of events during TCP session establishment +- Slow acknowledgment signals leading to buffer overflows +- Lost acknowledgments causing closed connections +- Data loss during server restarts + +### When to Use UDP vs. TCP + +Use UDP by default for syslog forwarding, switching to TCP for larger syslog events that exceed UDP packet limits (common with Web Proxy, DLP, and IDS sources). + +The following resources will help you choose the best protocol for your setup: + +1. [Run performance tests for TCP](performance-tests.md#check-your-tcp-performance) +2. [Run performance tests for UDP](performance-tests.md#check-your-udp-performance) \ No newline at end of file diff --git a/docs/architecture/lb/index.md b/docs/architecture/lb/index.md new file mode 100644 index 0000000000..7fc63e2296 --- /dev/null +++ b/docs/architecture/lb/index.md @@ -0,0 +1,17 @@ +# Load Balancers Are Not a Best Practice for SC4S + +Be aware of the following issues that may arise from load balancing syslog traffic: +- Load balancing for scale can lead to increased data loss due to normal device operations and buffer overflows. +- Front-side load balancing often results in uneven data distribution on the upstream side. +- The default behavior of Layer 4 (L4) load balancers is to overwrite the client's source IP with their own. Preserving the real source IP requires additional configuration. + +### Recommendations for Using Load Balancers: +- Preserve the actual source IP of the sending device. +- Avoid using load balancers without High Availability (HA) mode. +- TCP/TLS load balancers often do not account for the load on individual connections and may favor one instance over others. Ensure all members in a resource pool are vertically scaled to handle the full workload. + +For **TCP/TLS**, you can use either a DNAT configuration or SNAT with the "PROXY" protocol enabled by setting `SC4S_SOURCE_PROXYCONNECT=yes`. +For **UDP**, traffic can only pass through a load balancer using DNAT. + +This section of the documentation discusses various load balancing solutions and potential configurations, along with known issues. +Please note that load balancing syslog traffic in front of SC4S is not supported by Splunk, and additional support from the load balancer vendor may be required. 
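+
+For reference, enabling the PROXY protocol support mentioned above is a single SC4S setting; a minimal sketch of the relevant `/opt/sc4s/env_file` entry:
+
+```conf
+# Accept the PROXY protocol header added by an SNAT load balancer (TCP only)
+SC4S_SOURCE_PROXYCONNECT=yes
+```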
\ No newline at end of file diff --git a/docs/architecture/lb/nginx-os.md b/docs/architecture/lb/nginx-os.md new file mode 100644 index 0000000000..91c73fd73f --- /dev/null +++ b/docs/architecture/lb/nginx-os.md @@ -0,0 +1,190 @@ +# Nginx Open Source + +This section of the documentation describes the challenges of load balancing syslog traffic using Nginx Open Source. + +There are several key disadvantages to using Nginx Open Source for this purpose: +- Nginx Open Source does not provide active health checking, which is essential for UDP DSR (Direct Server Return) load balancing. +- Even with round-robin load balancing, traffic distribution can often be uneven, leading to overloaded instances in the pool. This results in growing queues, causing delays, data drops, and potential memory or disk issues. +- Without High Availability, an Nginx Open Source load balancer becomes a new single point of failure. + +**Please note that Splunk only supports SC4S**. If issues arise due to the load balancer, please reach out to the Nginx support team. + +## Install Nginx + +1. Refer to the Nginx documentation for instructions on installing Nginx **with the stream module**, which is required for TCP/UDP load balancing. For example, on Ubuntu: +```bash +sudo apt update +sudo apt -y install nginx libnginx-mod-stream +``` + +2. (Optionally) Refer to the Nginx documentation for instructions on fine-tuning Nginx performance. For example, you can update the `events` section in your Nginx configuration file: + +`/etc/nginx/nginx.conf` +```conf +events { + worker_connections 20480; + multi_accept on; + use epoll; +} +``` +Please note that actual load balancer fine-tuning is beyond the scope of the SC4S team's responsibility. + +## Preserving Source IP +The default behavior of Nginx is to overwrite the source IP with the LB's IP. While some users accept this behavior, it is recommended to preserve the original source IP of the message. + +Nginx offers three methods to preserve the source IP: + +| Method | Protocol | +|-----------------------------|------------| +| PROXY protocol | TCP* | +| Transparent IP | TCP/TLS | +| Direct Server Return (DSR) | UDP | + +* TLS PROXY protocol support in SC4S is scheduled for implementation. + +Examples for setting up Nginx with the PROXY protocol and DSR are provided below. The Transparent IP method requires complex network configuration. For more details, refer to [this Nginx blog post](https://www.f5.com/company/blog/nginx/ip-transparency-direct-server-return-nginx-plus-transparent-proxy). + + +## Option 1: Configure Nginx Open Source with the PROXY Protocol + +### Advantages: +- Easy to set up + +### Disadvantages: +- Available only for TCP, not for UDP or TLS +- Overwriting the source IP in SC4S is not ideal; the `SOURCEIP` is a hard macro and only `HOST` can be overwritten +- Overwriting the source IP is available only in SC4S versions greater than 3.4.0 + +### Configuration + +1. 
On your load balancer (LB) node, add a configuration similar to the following: +`/etc/nginx/modules-enabled/sc4s.conf` +```conf +stream { + # Define upstream for each of SC4S hosts and ports + # Default SC4S TCP ports are 514, 601 + # Include your custom ports if applicable + upstream stream_syslog_514 { + server :514; + server :514; + } + upstream stream_syslog_601 { + server :601; + server :601; + } + + # Define a common configuration block for all servers + map $server_port $upstream_name { + 514 stream_syslog_514; + 601 stream_syslog_601; + } + + # Define a virtual server for each upstream connection + # Ensure 'proxy_protocol' is set to 'on' + server { + listen 514; + listen 601; + proxy_pass $upstream_name; + + proxy_timeout 3s; + proxy_connect_timeout 3s; + + proxy_protocol on; + } +} +``` + +3. Refer to the Nginx documentation to find the command to reload the service, for example: +```bash +sudo nginx -s reload +``` + +4. Add the following parameter to the SC4S configuration and restart your instances: +`/opt/sc4s/env_file` +```conf +SC4S_SOURCE_PROXYCONNECT=yes +``` + +### Test Your Setup +Send TCP messages to the load balancer and verify that they are correctly received in Splunk with the host set to your source IP, not the LB's IP: + +```bash +# Test message without IETF frame for port 514/TCP: +echo "hello world" | netcat 514 +# Test message with IETF frame for port 601/TCP: +echo "11 hello world" | netcat 601 +``` + +3. Run performance tests based on the [Check TCP Performance](tcp_performance_tests.md) section. + +| Receiver | Performance | +|---------------------------|--------------------------------| +| Single SC4S Server | 4,341,000 (71,738.98 msg/sec) | +| Load Balancer + 2 Servers | 5,996,000 (99,089.03 msg/sec) | + +Please note that load balancer fine-tuning is beyond the scope of the SC4S team's responsibility. For assistance in increasing the TCP throughput of your load balancer instance, contact the Nginx support team. + +## Option 2: Configure Nginx with DSR (Direct Server Return) + +### Advantages: +- Works for UDP +- Saves one hop and additional wrapping + +### Disadvantages: +- DSR setup requires active health checks because the load balancer cannot expect responses from the upstream. Active health checks are not available in Nginx Open Source, so switch to Nginx Plus or implement your own active health checking. +- Requires superuser privileges. +- For cloud users, this might require disabling `Source/Destination Checking` (tested with AWS). + +1. In the main Nginx configuration, update the `user` to root: +`/etc/nginx/nginx.conf` +```conf +user root; +``` + +2. Add a configuration similar to the following in: +`/etc/nginx/modules-enabled/sc4s.conf` +```conf +stream { + # Define upstream for each of SC4S hosts and ports + # Default SC4S UDP port is 514 + # Include your custom ports if applicable + upstream stream_syslog_514 { + server :514; + server :514; + } + + # Define connections to each of your upstreams. + # Ensure to include `proxy_bind` and `proxy_responses 0`. + server { + listen 514 udp; + proxy_pass stream_syslog_514; + + proxy_bind $remote_addr:$remote_port transparent; + proxy_responses 0; + } +} +``` + +3. Refer to the Nginx documentation to find the command to reload the service, for example: +```bash +sudo nginx -s reload +``` + +4. Ensure that you disable `Source/Destination Checking` on your load balancer's host if you are working on AWS. + +### Test Your Setup +1. 
Send UDP messages to the load balancer and verify that they are correctly received in Splunk with the correct host IP: +```bash +echo "hello world" > /dev/udp//514 +``` + +2. Run performance tests + +| Receiver / Drops Rate for EPS (msgs/sec) | 4,500 | 9,000 | 27,000 | 50,000 | 150,000 | 300,000 | +|------------------------------------------|--------|--------|--------|--------|---------|---------| +| Single SC4S Server | 0.33% | 1.24% | 52.31% | 74.71% | -- | -- | +| Load Balancer + 2 Servers | 1% | 1.19% | 6.11% | 47.64% | -- | -- | +| Single Finetuned SC4S Server | 0% | 0% | 0% | 0% | 47.37% | -- | +| Load Balancer + 2 Finetuned Servers | 0.98% | 1.14% | 1.05% | 1.16% | 3.56% | 55.54% | + +Please note that load balancer fine-tuning is beyond the scope of the SC4S team's responsibility. For assistance in minimizing UDP drops on the load balancer side, contact the Nginx support team. \ No newline at end of file diff --git a/docs/architecture/performance-tests.md b/docs/architecture/performance-tests.md new file mode 100644 index 0000000000..f86215d670 --- /dev/null +++ b/docs/architecture/performance-tests.md @@ -0,0 +1,129 @@ +# Performance Tests + +## Run Your Own Performance Tests +The performance of the log ingestion system depends on several custom factors: +- Protocols (UDP/TCP/TLS) +- Network bandwidth between the source, syslog server, and backend +- Number of Splunk indexers and third-party SIEMs, including their number and capacity +- SC4S host's hardware specifications and software configurations +- The number of syslog sources, the size of their logs, and whether they are well-formed and syslog compliant +- Customizations + +Since actual performance heavily depends on these custom factors, the SC4S team cannot provide general estimates. Therefore, you will need to conduct your own performance tests. + +## When to Run Performance Tests +- To estimate single-instance capacity. The size of the instance must be larger than the absolute anticipated input data peak to prevent data loss. +- To compare different hardware setups. +- To evaluate the impact of updating the SC4S configuration on performance. + +## Install Loggen +Loggen is a testing utility distributed with syslog-ng and is also available in SC4S. + +### Example: Install Loggen through syslog-ng +Refer to your syslog-ng documentation for installation instructions. For example, for Ubuntu: + +```bash +wget -qO - https://ose-repo.syslog-ng.com/apt/syslog-ng-ose-pub.asc | sudo apt-key add - +# Update distribution name +echo "deb https://ose-repo.syslog-ng.com/apt/ stable ubuntu-noble" | sudo tee -a /etc/apt/sources.list.d/syslog-ng-ose.list + +apt-get update +apt-get install syslog-ng-core +``` + +```bash +loggen -help +Usage: + loggen [OPTION?] target port +``` + +### Example: Use from Your SC4S Container +```bash +sudo podman exec -it SC4S bash +loggen --help +Usage: + loggen [OPTION*] target port +``` + +# Choose Your Hardware +Here is a reference example of performance testing using our lab configuration on various types of AWS EC2 machines. 
+ +## Tested Configuration +* Loggen (syslog-ng 3.25.1) - m5zn.3xlarge +* SC4S(2.30.0) + podman (4.0.2) - m5zn family +* SC4S_DEST_SPLUNK_HEC_DEFAULT_WORKERS=10 (default) +* Splunk Cloud Noah 8.2.2203.2 - 3SH + 3IDX + +## Command +```bash +/opt/syslog-ng/bin/loggen -i --rate=100000 --interval=1800 -P -F --sdata="[test name=\"stress17\"]" -s 800 --active-connections=10 +``` + +| SC4S instance | root networking | slirp4netns networking | +|---------------|---------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------| +| m5zn.large | average rate = 21109.66 msg/sec, count=38023708, time=1801.25, (average) msg size=800, bandwidth=16491.92 kB/sec | average rate = 20738.39 msg/sec, count=37344765, time=1800.75, (average) msg size=800, bandwidth=16201.87 kB/sec | +| m5zn.xlarge | average rate = 34820.94 msg/sec, count=62687563, time=1800.28, (average) msg size=800, bandwidth=27203.86 kB/sec | average rate = 35329.28 msg/sec, count=63619825, time=1800.77, (average) msg size=800, bandwidth=27601.00 kB/sec | +| m5zn.2xlarge | average rate = 71929.91 msg/sec, count=129492418, time=1800.26, (average) msg size=800, bandwidth=56195.24 kB/sec | average rate = 70894.84 msg/sec, count=127630166, time=1800.27, (average) msg size=800, bandwidth=55386.60 kB/sec | +| m5zn.2xlarge | average rate = 85419.09 msg/sec, count=153778825, time=1800.29, (average) msg size=800, bandwidth=66733.66 kB/sec | average rate = 84733.71 msg/sec, count=152542466, time=1800.26, (average) msg size=800, bandwidth=66198.21 kB/sec | + +# Watch Out for Queues +While comparing loggen results can be sufficient for A/B testing, it is not enough to accurately estimate the syslog ingestion throughput of the entire system. + +In the following example, loggen was able to send 4.3 mln messages in one minute; however, Splunk indexers required an additional two minutes to process these messages. During that time, SC4S processed the messages and stored them in a queue while waiting for the HEC endpoint to accept new batches. + +| Splunk Indexers | Total Processing Time (4.3 mln messages) | Estimated Max EPS | +|-----------------|------------------------------------------|-------------------| +| 3 | 3 min | 22K | +| 30 | 1 min (no delay) | 72K | + +When running your tests, make sure to monitor the queues. The easiest way to do this is by accessing your server’s SC4S container and running: +```bash +watch "syslog-ng-ctl stats | grep '^dst.\+\(processed\|queued\|dropped\|written\)'" +``` + +If the destination is undersized or connections are slow, the number of queued events will increase, potentially reaching thousands or millions. Buffering is an effective solution for handling temporary data peaks, but constant input overflows will eventually fill up the buffers, leading to disk or memory issues or dropped messages. Ensure that you assess your SC4S capacity based on the number of messages that can be processed without putting undue pressure on the buffers. + +# Check Your TCP Performance +Run the following command: +``` +loggen --interval 60 --rate 120000 -s 800 --no-framing --inet --active-connections=10 514 +``` +Over a span of 60 seconds, loggen will establish 10 concurrent TCP connections to SC4S and attempt to generate up to 120,000 messages per second for each connection, with each message being 800 bytes in size. 
The more efficient the SC4S instance, the higher the average rate. + +Example results: + +* Loggen - c5.2xlarge +* SC4S(3.29.0) + podman - c5.4xlarge +* default configuration +* Splunk Cloud 9.2.2403.105 - 3IDX/30IDX + +| Metric | Default SC4S | Finetuned SC4S | +|--------------|---------------------|---------------------| +| Average Rate | 72,153.75 msg/sec | 115,276.92 msg/sec | + +For more information, refer to [Finetune SC4S for TCP](tcp-optimization.md). + +# Check Your UDP Performance +Run the following command: +```bash +loggen --interval 60 --rate 22000 -s 800 --no-framing --dgram 514 +``` + +Over a span of 60 seconds, loggen will attempt to generate 20,000 logs per second, each 800 bytes in size, which will be sent via UDP. + +After running the command, count the number of events that reached Splunk. Since UDP is a lossy protocol, messages can be lost anywhere along the path. + +| Receiver / Drops Rate for EPS (msgs/sec) | 4,500 | 9,000 | 27,000 | 50,000 | 150,000 | +|------------------------------------------|--------|--------|--------|--------|---------| +| Default SC4S | 0.33% | 1.24% | 52.31% | 74.71% | -- | +| Finetuned SC4S | 0% | 0% | 0% | 0% | 47.37% | + +When running your tests, make sure to verify that Splunk indexed the total number of sent messages without delays. + +In simple setups, where the source sends logs directly to the SC4S server, messages may be dropped from the port buffer. You can check the number of packets that encountered receive errors by running: +```bash +sudo netstat -ausn +``` +The number of errors should match the number of missing messages in Splunk. + +For more details on how to minimize message drops, refer to [Finetune SC4S for UDP](udp-optimization.md) to minimize the drop. \ No newline at end of file diff --git a/docs/architecture/tcp-optimization.md b/docs/architecture/tcp-optimization.md new file mode 100644 index 0000000000..3d1a270122 --- /dev/null +++ b/docs/architecture/tcp-optimization.md @@ -0,0 +1,72 @@ +# Finetune SC4S for TCP Traffic +This section provides guidance on improving SC4S performance by tuning configuration settings. + +### Tested Configuration: +- **Loggen** - c5.2xlarge +- **SC4S** (3.29.0) + podman - c5.4xlarge +- **Splunk Cloud** 9.2.2403.105 - 30IDX + +| Setting | EPS (Events per Second) | +|-------------------------------|-------------------------| +| default | 71,327 | +| SC4S_SOURCE_TCP_SO_RCVBUFF | 99,207 | +| SC4S_ENABLE_PARALLELIZE | 101,700 | +| SC4S_SOURCE_TCP_IW_USE | 115,276 | + +You can apply these settings to your infrastructure to improve SC4S performance. After making adjustments, run the [performance tests](performance-tests.md#check-your-tcp-performance) and retain the changes that result in performance improvements. + +## Finetune Your TCP Buffer +1. Update `/etc/sysctl.conf` + +From default SC4S buffer size: +``` +net.core.rmem_default = 17039360 +net.core.rmem_max = 17039360 +``` + +to 512MB: +``` +net.core.rmem_default = 536870912 +net.core.rmem_max = 536870912 +``` + +And apply changes: +``` +sudo sysctl -p +``` + +2. Update `/opt/sc4s/env_file` +``` +SC4S_SOURCE_TCP_SO_RCVBUFF=536870912 +``` + +3. Restart SC4S + +## Parallelize TCP Processing +1. Update `/opt/sc4s/env_file` and restart SC4S. +``` +SC4S_ENABLE_PARALLELIZE=yes +SC4S_PARALLELIZE_NO_PARTITION=4 +``` + +The benefits of using the parallelize mechanism for TCP may be particularly noticeable in production environments with a single high-volume TCP source. 
This is because parallelize distributes messages from a single TCP stream across multiple concurrent threads. + +| SC4S Parallelize | Loggen TCP Connections | %Cpu(s) us | Average Rate (msg/sec) | +|---------------------|--------------------------------|------------|------------------------| +| off | 1 | 9.0 | 14,144.10 | +| off | 10 | 59.3 | 73,743.32 | +| on (10 threads) | 1 | 58.4 | 77,842.18 | + +## Finetune SC4S IW Size +1. Update `/opt/sc4s/env_file` and restart SC4S. +``` +SC4S_SOURCE_TCP_IW_USE=yes +SC4S_SOURCE_TCP_IW_SIZE=1000000 +``` + +## Switch to SC4S Lite + +Parsing syslog messages is a CPU-intensive task with varying complexity. During the parsing process, each syslog message goes through multiple parsing rules until a match is found. Some log messages follow longer parsing paths than others, and some parsers use regular expressions, which can be slow. + +If you are familiar with your log sources, perform an A/B test and switch to SC4S Lite, which includes only the parsers for your required vendors. While artificial performance tests may not fully capture the impact of this change, you could notice an increase in the capacity of your syslog layer in production environments. + diff --git a/docs/architecture/udp-optimization.md b/docs/architecture/udp-optimization.md new file mode 100644 index 0000000000..8cafb42da7 --- /dev/null +++ b/docs/architecture/udp-optimization.md @@ -0,0 +1,75 @@ +# Finetune SC4S for UDP Traffic +This section demonstrates how SC4S can be vertically scaled by adjusting configuration parameters to significantly reduce UDP packet drops. + +### Tested Configuration: +- **Loggen** - c5.2xlarge +- **SC4S** (3.29.0) + podman - c5.4xlarge +- **Splunk Cloud** 9.2.2403.105 - 30IDX + +| Setup for 67,000 EPS (Events per Second) | % Loss | +|------------------------------------------|--------| +| Default | 77.88 | +| OS Kernel Tuning | 24.38 | +| Increasing the Number of UDP Sockets | 22.95 | +| eBPF | 0 | + +Consider applying these changes to your infrastructure. After each adjustment, run the [performance tests](performance-tests.md#check-your-udp-performance) and retain the changes that result in improvements. + +## Increase OS Kernel + +1. Update `/etc/sysctl.conf` + +Change the default SC4S buffer size from: +```conf +net.core.rmem_default = 17039360 +net.core.rmem_max = 17039360 +``` + +to 512MB: +```conf +net.core.rmem_default = 536870912 +net.core.rmem_max = 536870912 +``` + +And apply changes: +```bash +sudo sysctl -p +``` + +2. Update `/opt/sc4s/env_file` +``` +SC4S_SOURCE_UDP_SO_RCVBUFF=536870912 +``` + +3. Restart SC4S + +## Finetune SC4S UDP Fetch Limit +`/opt/sc4s/env_file` +``` +SC4S_SOURCE_UDP_FETCH_LIMIT=1000000 +``` + +## Finetune SC4S UDP Fetch Limit +`/opt/sc4s/env_file`: +```bash +SC4S_SOURCE_UDP_FETCH_LIMIT=1000000 +``` + +## Increase the Number of UDP Sockets +Update the following setting in `/opt/sc4s/env_file`: +```bash +SC4S_SOURCE_LISTEN_UDP_SOCKETS=32 +``` + +In synthetic performance tests, increasing the number of sockets may not show improvement because all messages originate from a single UDP stream, and they are still processed by only one CPU core. However, if you have multiple UDP sources in your production environment, this feature can provide significant performance improvements. + +## Enable eBPF + +1. Ensure your container is running in privileged mode. +2. Verify that your host supports eBPF. +3. 
Update the configuration in `/opt/sc4s/env_file`: +```bash +SC4S_SOURCE_LISTEN_UDP_SOCKETS=32 +SC4S_ENABLE_EBPF=yes +SC4S_EBPF_NO_SOCKETS=32 +``` \ No newline at end of file diff --git a/docs/lb.md b/docs/lb.md deleted file mode 100644 index 38522a010f..0000000000 --- a/docs/lb.md +++ /dev/null @@ -1,13 +0,0 @@ -# About using load balancers - -Load balancers are not a best practice for SC4S. The exception to this is a narrow use case where the syslog server is exposed to untrusted clients on the internet, for example, with Palo Alto Cortex. - -## Considerations - -* UDP can only pass a load balancer using DNAT and source IP must be preserved. If you use this configuration, the load balancer becomes a new single point of failure. -* TCP/TLS can use either a DNAT configuration or SNAT with "PROXY" Protocol enabled `SC4S_SOURCE_PROXYCONNECT=yes`. -* TCP/TLS load balancers do not consider the weight of individual connection load and are frequently biased to one instance. Vertically scale all members in a single resource pool to accommodate the full workload. - -## Alternatives - -The best deployment model for high availability is a [Microk8s](https://microk8s.io/) based deployment with MetalLB in BGP mode. This model uses a special class of load balancer that is implemented as destination network translation. \ No newline at end of file diff --git a/docs/performance.md b/docs/performance.md deleted file mode 100644 index 1d3a72b795..0000000000 --- a/docs/performance.md +++ /dev/null @@ -1,56 +0,0 @@ -# Performance and Sizing -Performance testing against our lab configuration produces the following results and limitations. - -## Tested Configurations - -### Splunk Cloud Noah -#### Environment - -* Loggen (syslog-ng 3.25.1) - m5zn.3xlarge -* SC4S(2.30.0) + podman (4.0.2) - m5zn family -* SC4S_DEST_SPLUNK_HEC_DEFAULT_WORKERS=10 (default) -* Splunk Cloud Noah 8.2.2203.2 - 3SH + 3IDX - -```bash -/opt/syslog-ng/bin/loggen -i --rate=100000 --interval=1800 -P -F --sdata="[test name=\"stress17\"]" -s 800 --active-connections=10 -``` -#### Result - -| SC4S instance | root networking | slirp4netns networking | -|---------------|---------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------| -| m5zn.large | average rate = 21109.66 msg/sec, count=38023708, time=1801.25, (average) msg size=800, bandwidth=16491.92 kB/sec | average rate = 20738.39 msg/sec, count=37344765, time=1800.75, (average) msg size=800, bandwidth=16201.87 kB/sec | -| m5zn.xlarge | average rate = 34820.94 msg/sec, count=62687563, time=1800.28, (average) msg size=800, bandwidth=27203.86 kB/sec | average rate = 35329.28 msg/sec, count=63619825, time=1800.77, (average) msg size=800, bandwidth=27601.00 kB/sec | -| m5zn.2xlarge | average rate = 71929.91 msg/sec, count=129492418, time=1800.26, (average) msg size=800, bandwidth=56195.24 kB/sec | average rate = 70894.84 msg/sec, count=127630166, time=1800.27, (average) msg size=800, bandwidth=55386.60 kB/sec | -| m5zn.2xlarge | average rate = 85419.09 msg/sec, count=153778825, time=1800.29, (average) msg size=800, bandwidth=66733.66 kB/sec | average rate = 84733.71 msg/sec, count=152542466, time=1800.26, (average) msg size=800, bandwidth=66198.21 kB/sec | - - - - -### Splunk Enterprise -#### Environment - -* Loggen (syslog-ng 3.25.1) - m5zn.large -* SC4S(2.30.0) + podman (4.0.2) - m5zn family -* 
SC4S_DEST_SPLUNK_HEC_DEFAULT_WORKERS=10 (default) -* Splunk Enterprise 9.0.0 Standalone - -```bash -/opt/syslog-ng/bin/loggen -i --rate=100000 --interval=600 -P -F --sdata="[test name=\"stress17\"]" -s 800 --active-connections=10 -``` -#### Result - -| SC4S instance | root networking | slirp4netns networking | -|---------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------| -| m5zn.large | average rate = 21511.69 msg/sec, count=12930565, time=601.095, (average) msg size=800, bandwidth=16806.01 kB/sec
average rate = 21583.13 msg/sec, count=12973491, time=601.094, (average) msg size=800, bandwidth=16861.82 kB/sec | average rate = 20738.39 msg/sec, count=37344765, time=1800.75, (average) msg size=800, bandwidth=16201.87 kB/sec | -| m5zn.xlarge | average rate = 37514.29 msg/sec, count=22530855, time=600.594, (average) msg size=800, bandwidth=29308.04 kB/sec
average rate = 37549.86 msg/sec, count=22552210, time=600.594, (average) msg size=800, bandwidth=29335.83 kB/sec | average rate = 35329.28 msg/sec, count=63619825, time=1800.77, (average) msg size=800, bandwidth=27601.00 kB/sec | -| m5zn.2xlarge | average rate = 98580.10 msg/sec, count=59157495, time=600.096, (average) msg size=800, bandwidth=77015.70 kB/sec
average rate = 99463.10 msg/sec, count=59687310, time=600.095, (average) msg size=800, bandwidth=77705.55 kB/sec | average rate = 84733.71 msg/sec, count=152542466, time=1800.26, (average) msg size=800, bandwidth=66198.21 kB/sec | - - - -## Guidance on sizing hardware - -* Though vCPU (hyper threading) was used in these examples, syslog processing is a CPU-intensive task and resource oversubscription through sharing is not advised. -* The size of the instance must be larger than the absolute peak to prevent data loss; most sources cannot buffer during traffic congestion. -* CPU Speed is critical; slower or faster CPUs will impact throughput. -* Not all sources are equal in resource utilization. Well-formed Legacy BSD syslog messages were used in this test, but many sources are not syslog compliant and will require additional resources to process. - diff --git a/mkdocs.yml b/mkdocs.yml index 54f426b628..60437868ce 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -33,8 +33,16 @@ theme: nav: - Home: "index.md" - - Architectural Considerations: "architecture.md" - - Load Balancers: "lb.md" + - Architecture and Load Balancers: + - Read First: "architecture/index.md" + - Scaling Solutions: + - Performance Tests: "architecture/performance-tests.md" + - Fine-Tuning: + - TCP Optimization: "architecture/finetune-tcp.md" + - UDP Optimization: "architecture/finetune-udp.md" + - Load Balancers: + - Overview: "architecture/lb/index.md" + - Nginx Open Source: "architecture/lb/nginx-os.md" - Getting Started: - Read First: "gettingstarted/index.md" - Quickstart Guide: "gettingstarted/quickstart_guide.md" @@ -59,7 +67,6 @@ nav: - Read First: "sources/index.md" - Basic Onboarding: "sources/base" - Known Vendors: "sources/vendor" - - Performance: "performance.md" - SC4S Lite (Experimental): - Intro: "lite.md" - Pluggable modules: "pluggable_modules.md" From 4a658b563ab30f555a974a79e0261cb7e68d8c6c Mon Sep 17 00:00:00 2001 From: mstopa-splunk Date: Wed, 11 Sep 2024 05:31:51 +0000 Subject: [PATCH 2/7] Add nginx plus description --- docs/architecture/lb/nginx-os.md | 123 +++++++++++++++++++++++++------ 1 file changed, 101 insertions(+), 22 deletions(-) diff --git a/docs/architecture/lb/nginx-os.md b/docs/architecture/lb/nginx-os.md index 91c73fd73f..7abc82af8e 100644 --- a/docs/architecture/lb/nginx-os.md +++ b/docs/architecture/lb/nginx-os.md @@ -1,23 +1,53 @@ -# Nginx Open Source +# NGINX -This section of the documentation describes the challenges of load balancing syslog traffic using Nginx Open Source. +NGINX is a popular solution, but there are important disadvantages to consider when using it for scaling syslog ingestion: -There are several key disadvantages to using Nginx Open Source for this purpose: -- Nginx Open Source does not provide active health checking, which is essential for UDP DSR (Direct Server Return) load balancing. -- Even with round-robin load balancing, traffic distribution can often be uneven, leading to overloaded instances in the pool. This results in growing queues, causing delays, data drops, and potential memory or disk issues. -- Without High Availability, an Nginx Open Source load balancer becomes a new single point of failure. +- **Uneven TCP traffic distribution**: Even with round-robin load balancing, TCP traffic may not be evenly distributed, leading to overloaded instances. This can cause growing queues, delays, data loss, and potential memory or disk issues. 
+ +- **UDP limitations**: As UDP is a lossy protocol, it’s recommended to send data directly from the source to a nearby syslog server. Using a load balancer can introduce another point of data loss. + +- **Lack of active health checking**: NGINX Open Source does not provide active health checking, which is important for UDP Direct Server Return (DSR) load balancing. NGINX Plus offers active health checking but requires a paid license. + +- **No built-in High Availability (HA)**: NGINX Open Source lacks native support for High Availability. Without HA, your NGINX load balancer could become a single point of failure. NGINX Plus includes built-in HA support, but it is part of the paid offering. -**Please note that Splunk only supports SC4S**. If issues arise due to the load balancer, please reach out to the Nginx support team. -## Install Nginx +**Please note that Splunk only supports SC4S**. If issues arise due to the load balancer, please reach out to the NGINX support team. -1. Refer to the Nginx documentation for instructions on installing Nginx **with the stream module**, which is required for TCP/UDP load balancing. For example, on Ubuntu: +## Install NGINX Open Source + +Refer to the NGINX documentation for instructions on installing NGINX **with the stream module**, which is required for TCP/UDP load balancing. For example, on Ubuntu: ```bash sudo apt update sudo apt -y install nginx libnginx-mod-stream ``` -2. (Optionally) Refer to the Nginx documentation for instructions on fine-tuning Nginx performance. For example, you can update the `events` section in your Nginx configuration file: +## Install NGINX Plus + +Refer to the NGINX documentation for instructions on purchasing the license and installation. For example, on Ubuntu: +```bash +sudo mkdir -p /etc/ssl/nginx + +sudo apt update +sudo apt-get install apt-transport-https lsb-release ca-certificates wget gnupg2 ubuntu-keyring + +# Subscribe to NGINX Plus to obtain the following nginx-repo.key and nginx-repo.crt +sudo cp nginx-repo.key nginx-repo.crt /etc/ssl/nginx/ + +wget -qO - https://cs.nginx.com/static/keys/nginx_signing.key | gpg --dearmor | sudo tee /usr/share/keyrings/nginx-archive-keyring.gpg >/dev/null +printf "deb [signed-by=/usr/share/keyrings/nginx-archive-keyring.gpg] https://pkgs.nginx.com/plus/ubuntu `lsb_release -cs` nginx-plus\n" | sudo tee /etc/apt/sources.list.d/nginx-plus.list + +sudo wget -P /etc/apt/apt.conf.d https://cs.nginx.com/static/files/90pkgs-nginx + +sudo apt-get update +sudo apt-get install nginx-plus +``` + +```bash +nginx -v +``` + +## Fine-tune NGINX +2. (Optional) Refer to the NGINX documentation for instructions on fine-tuning NGINX performance. For example, you can update the `events` section in your NGINX configuration file: `/etc/nginx/nginx.conf` ```conf @@ -30,9 +60,9 @@ events { Please note that actual load balancer fine-tuning is beyond the scope of the SC4S team's responsibility. ## Preserving Source IP -The default behavior of Nginx is to overwrite the source IP with the LB's IP. While some users accept this behavior, it is recommended to preserve the original source IP of the message. +By default, NGINX overwrites the source IP with the load balancer's IP. While some users accept this behavior, it is recommended to preserve the original source IP of the message. 
-Nginx offers three methods to preserve the source IP: +NGINX provides three methods to preserve the source IP: | Method | Protocol | |-----------------------------|------------| @@ -42,10 +72,10 @@ Nginx offers three methods to preserve the source IP: * TLS PROXY protocol support in SC4S is scheduled for implementation. -Examples for setting up Nginx with the PROXY protocol and DSR are provided below. The Transparent IP method requires complex network configuration. For more details, refer to [this Nginx blog post](https://www.f5.com/company/blog/nginx/ip-transparency-direct-server-return-nginx-plus-transparent-proxy). +Examples for setting up NGINX with the PROXY protocol and DSR are provided below. The Transparent IP method requires complex network configuration. For more details, refer to [this NGINX blog post](https://www.f5.com/company/blog/nginx/ip-transparency-direct-server-return-nginx-plus-transparent-proxy). -## Option 1: Configure Nginx Open Source with the PROXY Protocol +## Option 1: Configure NGINX with the PROXY Protocol ### Advantages: - Easy to set up @@ -94,7 +124,7 @@ stream { } ``` -3. Refer to the Nginx documentation to find the command to reload the service, for example: +3. Refer to the NGINX documentation to find the command to reload the service, for example: ```bash sudo nginx -s reload ``` @@ -122,26 +152,28 @@ echo "11 hello world" | netcat 601 | Single SC4S Server | 4,341,000 (71,738.98 msg/sec) | | Load Balancer + 2 Servers | 5,996,000 (99,089.03 msg/sec) | -Please note that load balancer fine-tuning is beyond the scope of the SC4S team's responsibility. For assistance in increasing the TCP throughput of your load balancer instance, contact the Nginx support team. +Please note that load balancer fine-tuning is beyond the scope of the SC4S team's responsibility. For assistance in increasing the TCP throughput of your load balancer instance, contact the NGINX support team. -## Option 2: Configure Nginx with DSR (Direct Server Return) +## Option 2: Configure NGINX with DSR (Direct Server Return) ### Advantages: - Works for UDP -- Saves one hop and additional wrapping +- Reduced latency ### Disadvantages: -- DSR setup requires active health checks because the load balancer cannot expect responses from the upstream. Active health checks are not available in Nginx Open Source, so switch to Nginx Plus or implement your own active health checking. +- DSR setup requires active health checks because the load balancer cannot expect responses from the upstream. Active health checks are not available in NGINX, so switch to NGINX Plus or implement your own active health checking. - Requires superuser privileges. - For cloud users, this might require disabling `Source/Destination Checking` (tested with AWS). -1. In the main Nginx configuration, update the `user` to root: +1. In the main NGINX configuration, update the `user` to root: `/etc/nginx/nginx.conf` ```conf user root; ``` 2. Add a configuration similar to the following in: + +**For NGINX Open Source:** `/etc/nginx/modules-enabled/sc4s.conf` ```conf stream { @@ -165,7 +197,54 @@ stream { } ``` -3. Refer to the Nginx documentation to find the command to reload the service, for example: +**For NGINX Plus:** +1. 
Add the following configuration to `/etc/nginx/nginx.conf`: +```bash +stream { + # Define upstream for each of SC4S hosts and ports + # Default SC4S UDP port is 514 + # Include your custom ports if applicable + upstream stream_syslog_514 { + zone stream_syslog_514 64k; + server :514; + server :514; + } + + # Define connections to each of your upstreams. + # Ensure to include `proxy_bind` and `proxy_responses 0`. + server { + listen 514 udp; + proxy_pass stream_syslog_514; + + proxy_bind $remote_addr:$remote_port transparent; + + health_check udp; + } +} +``` + +NGINX will actively check the health of your upstream servers by sending UDP messages to port 514. Optionally, you can filter them out at the destination. + +2. (Optional) Add the following local post-filter to each of your SC4S instances to prevent SC4S from sending health check messages to Splunk. +`/opt/sc4s/local/config/app_parsers/nginx_healthcheck-postfiler.conf` +```conf +block parser nginx_healthcheck-postfiler() { + channel { + rewrite(r_set_dest_splunk_null_queue); + }; +}; + +application nginx_healthcheck-postfiler[sc4s-postfilter] { + filter { + "${fields.sc4s_vendor}" eq "splunk" and + "${fields.sc4s_product}" eq "sc4s" + and message('nginx health check' type(string)); + }; + parser { nginx_healthcheck-postfiler(); }; +}; +``` + +3. Refer to the NGINX documentation to find the command to reload the service, for example: ```bash sudo nginx -s reload ``` @@ -187,4 +266,4 @@ echo "hello world" > /dev/udp//514 | Single Finetuned SC4S Server | 0% | 0% | 0% | 0% | 47.37% | -- | | Load Balancer + 2 Finetuned Servers | 0.98% | 1.14% | 1.05% | 1.16% | 3.56% | 55.54% | -Please note that load balancer fine-tuning is beyond the scope of the SC4S team's responsibility. For assistance in minimizing UDP drops on the load balancer side, contact the Nginx support team. \ No newline at end of file +Please note that load balancer fine-tuning is beyond the scope of the SC4S team's responsibility. For assistance in minimizing UDP drops on the load balancer side, contact the NGINX support team. \ No newline at end of file From 13651676271d363f0b71f3e97b4a6ba2d86c080c Mon Sep 17 00:00:00 2001 From: mstopa-splunk Date: Wed, 11 Sep 2024 07:35:19 +0000 Subject: [PATCH 3/7] post-review fixes --- docs/architecture/index.md | 12 ++++--- docs/architecture/lb/index.md | 7 ++-- .../architecture/lb/{nginx-os.md => nginx.md} | 29 +++++++++-------- docs/architecture/performance-tests.md | 32 ++++++++++--------- docs/architecture/tcp-optimization.md | 9 +++--- docs/architecture/udp-optimization.md | 24 ++++++-------- mkdocs.yml | 6 ++-- 7 files changed, 60 insertions(+), 59 deletions(-) rename docs/architecture/lb/{nginx-os.md => nginx.md} (91%) diff --git a/docs/architecture/index.md b/docs/architecture/index.md index ad1203b374..74ae68b2c1 100644 --- a/docs/architecture/index.md +++ b/docs/architecture/index.md @@ -8,9 +8,9 @@ This document outlines recommended architectural solutions, along with alternati While TCP and TLS are supported, UDP remains the dominant protocol for syslog transport in many data centers. Since syslog is a "send and forget" protocol, it performs poorly when routed through complex network infrastructures, including front-end load balancers and WAN. -### Recommendation: Use Edge Collection +The most reliable way to gather syslog traffic is through edge collection rather than centralized collection. When the syslog server is centrally located, UDP and stateless TCP traffic cannot adapt, leading to potential data loss. 
-The most reliable way to gather syslog traffic is through edge collection rather than centralized collection. If your syslog server is centrally located, UDP and stateless TCP traffic cannot adapt, leading to data loss. +For optimal reliability, deploy SC4S instances in the same VLAN as the source devices. ## Avoid Load Balancing for Syslog @@ -46,9 +46,11 @@ While TCP uses acknowledgement signals (ACKS) to mitigate data loss, issues stil ### When to Use UDP vs. TCP -Use UDP by default for syslog forwarding, switching to TCP for larger syslog events that exceed UDP packet limits (common with Web Proxy, DLP, and IDS sources). +SC4S supports syslog ingestion via UDP, TCP/TLS, or a combination of both, leaving the choice to the system administrator. -The following resources will help you choose the best protocol for your setup: +While UDP can be used by default for syslog forwarding, it’s not mandatory. TCP is often preferable for larger syslog events that exceed UDP packet limits, such as those from Web Proxy, DLP, or IDS sources. + +The following resources can help you determine the best protocol for your setup: 1. [Run performance tests for TCP](performance-tests.md#check-your-tcp-performance) -2. [Run performance tests for UDP](performance-tests.md#check-your-udp-performance) \ No newline at end of file +2. [Run performance tests for UDP](performance-tests.md#check-your-udp-performance) diff --git a/docs/architecture/lb/index.md b/docs/architecture/lb/index.md index 7fc63e2296..5ecf577c29 100644 --- a/docs/architecture/lb/index.md +++ b/docs/architecture/lb/index.md @@ -3,15 +3,16 @@ Be aware of the following issues that may arise from load balancing syslog traffic: - Load balancing for scale can lead to increased data loss due to normal device operations and buffer overflows. - Front-side load balancing often results in uneven data distribution on the upstream side. -- The default behavior of Layer 4 (L4) load balancers is to overwrite the client's source IP with their own. Preserving the real source IP requires additional configuration. +- The default behavior of many load balancers is to overwrite the client's source IP with their own. Preserving the real source IP requires additional configuration. ### Recommendations for Using Load Balancers: - Preserve the actual source IP of the sending device. - Avoid using load balancers without High Availability (HA) mode. - TCP/TLS load balancers often do not account for the load on individual connections and may favor one instance over others. Ensure all members in a resource pool are vertically scaled to handle the full workload. -For **TCP/TLS**, you can use either a DNAT configuration or SNAT with the "PROXY" protocol enabled by setting `SC4S_SOURCE_PROXYCONNECT=yes`. +For **TCP**, you can use either a DNAT configuration or SNAT with the "PROXY" protocol enabled by setting `SC4S_SOURCE_PROXYCONNECT=yes`. For **UDP**, traffic can only pass through a load balancer using DNAT. -This section of the documentation discusses various load balancing solutions and potential configurations, along with known issues. +This section of the documentation discusses various load balancing solutions and example configurations, along with known issues. + Please note that load balancing syslog traffic in front of SC4S is not supported by Splunk, and additional support from the load balancer vendor may be required. 
\ No newline at end of file diff --git a/docs/architecture/lb/nginx-os.md b/docs/architecture/lb/nginx.md similarity index 91% rename from docs/architecture/lb/nginx-os.md rename to docs/architecture/lb/nginx.md index 7abc82af8e..bced188c75 100644 --- a/docs/architecture/lb/nginx-os.md +++ b/docs/architecture/lb/nginx.md @@ -8,8 +8,7 @@ NGINX is a popular solution, but there are important disadvantages to consider w - **Lack of active health checking**: NGINX Open Source does not provide active health checking, which is important for UDP Direct Server Return (DSR) load balancing. NGINX Plus offers active health checking but requires a paid license. -- **No built-in High Availability (HA)**: NGINX Open Source lacks native support for High Availability. Without HA, your NGINX load balancer could become a single point of failure. NGINX Plus includes built-in HA support, but it is part of the paid offering. - +- **No built-in High Availability (HA)**: NGINX Open Source lacks native support for High Availability. Without HA, your NGINX load balancer becomes a single point of failure. NGINX Plus includes built-in HA support, but it is part of the paid offering. **Please note that Splunk only supports SC4S**. If issues arise due to the load balancer, please reach out to the NGINX support team. @@ -30,7 +29,7 @@ sudo mkdir -p /etc/ssl/nginx sudo apt update sudo apt-get install apt-transport-https lsb-release ca-certificates wget gnupg2 ubuntu-keyring -# Subscribe to NGINX Plus to obtain the following nginx-repo.key and nginx-repo.crt +# Subscribe to NGINX Plus to obtain nginx-repo.key and nginx-repo.crt sudo cp nginx-repo.key nginx-repo.crt /etc/ssl/nginx/ wget -qO - https://cs.nginx.com/static/keys/nginx_signing.key | gpg --dearmor | sudo tee /usr/share/keyrings/nginx-archive-keyring.gpg >/dev/null @@ -145,12 +144,12 @@ echo "hello world" | netcat 514 echo "11 hello world" | netcat 601 ``` -3. Run performance tests based on the [Check TCP Performance](tcp_performance_tests.md) section. +3. Run performance tests based on the [Check TCP Performance](performance-tests.md#check-your-tcp-performance) section. -| Receiver | Performance | -|---------------------------|--------------------------------| -| Single SC4S Server | 4,341,000 (71,738.98 msg/sec) | -| Load Balancer + 2 Servers | 5,996,000 (99,089.03 msg/sec) | +| Receiver | Performance | +|----------------------------|--------------------| +| Single SC4S Server | 71,738.98 msg/sec | +| Load Balancer + 2 Servers | 99,089.03 msg/sec | Please note that load balancer fine-tuning is beyond the scope of the SC4S team's responsibility. For assistance in increasing the TCP throughput of your load balancer instance, contact the NGINX support team. @@ -162,7 +161,7 @@ Please note that load balancer fine-tuning is beyond the scope of the SC4S team' ### Disadvantages: - DSR setup requires active health checks because the load balancer cannot expect responses from the upstream. Active health checks are not available in NGINX, so switch to NGINX Plus or implement your own active health checking. -- Requires superuser privileges. +- Requires switching to the `root` user. - For cloud users, this might require disabling `Source/Destination Checking` (tested with AWS). 1. In the main NGINX configuration, update the `user` to root: @@ -174,6 +173,7 @@ user root; 2. Add a configuration similar to the following in: **For NGINX Open Source:** + `/etc/nginx/modules-enabled/sc4s.conf` ```conf stream { @@ -198,8 +198,9 @@ stream { ``` **For NGINX Plus:** -1. 
Add the following configuration to `/etc/nginx/nginx.conf`: -```bash + +1. Add the following configuration block to `/etc/nginx/nginx.conf`: +```conf stream { # Define upstream for each of SC4S hosts and ports # Default SC4S UDP port is 514 @@ -211,7 +212,7 @@ stream { } # Define connections to each of your upstreams. - # Ensure to include `proxy_bind` and `proxy_responses 0`. + # Ensure to include `proxy_bind` and `health_check`. server { listen 514 udp; proxy_pass stream_syslog_514; @@ -223,9 +224,9 @@ stream { } ``` -NGINX will actively check the health of your upstream servers by sending UDP messages to port 514. Optionally, you can filter them out at the destination. +NGINX will actively check the health of your upstream servers by sending UDP messages to port 514. -2. (Optional) Add the following local post-filter to each of your SC4S instances to prevent SC4S from sending health check messages to Splunk. +2. (Optional) Add the following local post-filter to each of your SC4S instances to prevent SC4S from forwarding health check messages to Splunk and other destinations. `/opt/sc4s/local/config/app_parsers/nginx_healthcheck-postfiler.conf` ```conf block parser nginx_healthcheck-postfiler() { diff --git a/docs/architecture/performance-tests.md b/docs/architecture/performance-tests.md index f86215d670..7b112a1b6b 100644 --- a/docs/architecture/performance-tests.md +++ b/docs/architecture/performance-tests.md @@ -1,7 +1,8 @@ # Performance Tests -## Run Your Own Performance Tests +### Run Your Own Performance Tests The performance of the log ingestion system depends on several custom factors: + - Protocols (UDP/TCP/TLS) - Network bandwidth between the source, syslog server, and backend - Number of Splunk indexers and third-party SIEMs, including their number and capacity @@ -11,24 +12,25 @@ The performance of the log ingestion system depends on several custom factors: Since actual performance heavily depends on these custom factors, the SC4S team cannot provide general estimates. Therefore, you will need to conduct your own performance tests. -## When to Run Performance Tests +### When to Run Performance Tests - To estimate single-instance capacity. The size of the instance must be larger than the absolute anticipated input data peak to prevent data loss. - To compare different hardware setups. - To evaluate the impact of updating the SC4S configuration on performance. -## Install Loggen +### Install Loggen Loggen is a testing utility distributed with syslog-ng and is also available in SC4S. -### Example: Install Loggen through syslog-ng +#### Example: Install Loggen through syslog-ng Refer to your syslog-ng documentation for installation instructions. For example, for Ubuntu: ```bash wget -qO - https://ose-repo.syslog-ng.com/apt/syslog-ng-ose-pub.asc | sudo apt-key add - + # Update distribution name echo "deb https://ose-repo.syslog-ng.com/apt/ stable ubuntu-noble" | sudo tee -a /etc/apt/sources.list.d/syslog-ng-ose.list -apt-get update -apt-get install syslog-ng-core +sudo apt-get update +sudo apt-get install syslog-ng-core ``` ```bash @@ -37,7 +39,7 @@ Usage: loggen [OPTION?] target port ``` -### Example: Use from Your SC4S Container +#### Example: Use from Your SC4S Container ```bash sudo podman exec -it SC4S bash loggen --help @@ -45,16 +47,16 @@ Usage: loggen [OPTION*] target port ``` -# Choose Your Hardware +## Choose Your Hardware Here is a reference example of performance testing using our lab configuration on various types of AWS EC2 machines. 
-## Tested Configuration +### Tested Configuration * Loggen (syslog-ng 3.25.1) - m5zn.3xlarge * SC4S(2.30.0) + podman (4.0.2) - m5zn family * SC4S_DEST_SPLUNK_HEC_DEFAULT_WORKERS=10 (default) * Splunk Cloud Noah 8.2.2203.2 - 3SH + 3IDX -## Command +### Command ```bash /opt/syslog-ng/bin/loggen -i --rate=100000 --interval=1800 -P -F --sdata="[test name=\"stress17\"]" -s 800 --active-connections=10 ``` @@ -66,7 +68,7 @@ Here is a reference example of performance testing using our lab configuration o | m5zn.2xlarge | average rate = 71929.91 msg/sec, count=129492418, time=1800.26, (average) msg size=800, bandwidth=56195.24 kB/sec | average rate = 70894.84 msg/sec, count=127630166, time=1800.27, (average) msg size=800, bandwidth=55386.60 kB/sec | | m5zn.2xlarge | average rate = 85419.09 msg/sec, count=153778825, time=1800.29, (average) msg size=800, bandwidth=66733.66 kB/sec | average rate = 84733.71 msg/sec, count=152542466, time=1800.26, (average) msg size=800, bandwidth=66198.21 kB/sec | -# Watch Out for Queues +## Watch Out for Queues While comparing loggen results can be sufficient for A/B testing, it is not enough to accurately estimate the syslog ingestion throughput of the entire system. In the following example, loggen was able to send 4.3 mln messages in one minute; however, Splunk indexers required an additional two minutes to process these messages. During that time, SC4S processed the messages and stored them in a queue while waiting for the HEC endpoint to accept new batches. @@ -83,7 +85,7 @@ watch "syslog-ng-ctl stats | grep '^dst.\+\(processed\|queued\|dropped\|written\ If the destination is undersized or connections are slow, the number of queued events will increase, potentially reaching thousands or millions. Buffering is an effective solution for handling temporary data peaks, but constant input overflows will eventually fill up the buffers, leading to disk or memory issues or dropped messages. Ensure that you assess your SC4S capacity based on the number of messages that can be processed without putting undue pressure on the buffers. -# Check Your TCP Performance +## Check Your TCP Performance Run the following command: ``` loggen --interval 60 --rate 120000 -s 800 --no-framing --inet --active-connections=10 514 @@ -95,7 +97,7 @@ Example results: * Loggen - c5.2xlarge * SC4S(3.29.0) + podman - c5.4xlarge * default configuration -* Splunk Cloud 9.2.2403.105 - 3IDX/30IDX +* Splunk Cloud 9.2.2403.105 - 30IDX | Metric | Default SC4S | Finetuned SC4S | |--------------|---------------------|---------------------| @@ -103,7 +105,7 @@ Example results: For more information, refer to [Finetune SC4S for TCP](tcp-optimization.md). -# Check Your UDP Performance +## Check Your UDP Performance Run the following command: ```bash loggen --interval 60 --rate 22000 -s 800 --no-framing --dgram 514 @@ -126,4 +128,4 @@ sudo netstat -ausn ``` The number of errors should match the number of missing messages in Splunk. -For more details on how to minimize message drops, refer to [Finetune SC4S for UDP](udp-optimization.md) to minimize the drop. \ No newline at end of file +For more details on how to minimize message drops, refer to [Finetune SC4S for UDP](udp-optimization.md). 
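To quantify drops for a single run, it can also help to snapshot the kernel's UDP error counter on the SC4S host immediately before and after the test and report the difference. The snippet below is only a sketch: it assumes a Linux `netstat -su` output that includes a "packet receive errors" line and a 60-second test window matching the loggen interval used above.

```bash
# Sketch (run on the SC4S host): snapshot the UDP error counter around a test run.
# Assumes Linux netstat output containing a "packet receive errors" line.
before=$(netstat -su | awk '/packet receive errors/ {print $1}')
sleep 60   # run loggen from your load-generator machine during this window
after=$(netstat -su | awk '/packet receive errors/ {print $1}')
echo "UDP receive errors during the test window: $((after - before))"
```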
\ No newline at end of file diff --git a/docs/architecture/tcp-optimization.md b/docs/architecture/tcp-optimization.md index 3d1a270122..206419f6e8 100644 --- a/docs/architecture/tcp-optimization.md +++ b/docs/architecture/tcp-optimization.md @@ -15,10 +15,10 @@ This section provides guidance on improving SC4S performance by tuning configura You can apply these settings to your infrastructure to improve SC4S performance. After making adjustments, run the [performance tests](performance-tests.md#check-your-tcp-performance) and retain the changes that result in performance improvements. -## Finetune Your TCP Buffer +## Tune Your Receive Buffer 1. Update `/etc/sysctl.conf` -From default SC4S buffer size: +From default buffer size: ``` net.core.rmem_default = 17039360 net.core.rmem_max = 17039360 @@ -57,7 +57,7 @@ The benefits of using the parallelize mechanism for TCP may be particularly noti | off | 10 | 59.3 | 73,743.32 | | on (10 threads) | 1 | 58.4 | 77,842.18 | -## Finetune SC4S IW Size +## Tune static window size 1. Update `/opt/sc4s/env_file` and restart SC4S. ``` SC4S_SOURCE_TCP_IW_USE=yes @@ -65,8 +65,7 @@ SC4S_SOURCE_TCP_IW_SIZE=1000000 ``` ## Switch to SC4S Lite - Parsing syslog messages is a CPU-intensive task with varying complexity. During the parsing process, each syslog message goes through multiple parsing rules until a match is found. Some log messages follow longer parsing paths than others, and some parsers use regular expressions, which can be slow. -If you are familiar with your log sources, perform an A/B test and switch to SC4S Lite, which includes only the parsers for your required vendors. While artificial performance tests may not fully capture the impact of this change, you could notice an increase in the capacity of your syslog layer in production environments. +If you are familiar with your log sources, consider performing an A/B test and switching to SC4S Lite, which includes only the parsers for the vendors you require. Although artificial performance tests may not fully reflect the impact of this change, you may observe an increase in the capacity of your syslog layer when operating with real-world data. diff --git a/docs/architecture/udp-optimization.md b/docs/architecture/udp-optimization.md index 8cafb42da7..bd2963b815 100644 --- a/docs/architecture/udp-optimization.md +++ b/docs/architecture/udp-optimization.md @@ -15,11 +15,11 @@ This section demonstrates how SC4S can be vertically scaled by adjusting configu Consider applying these changes to your infrastructure. After each adjustment, run the [performance tests](performance-tests.md#check-your-udp-performance) and retain the changes that result in improvements. -## Increase OS Kernel +## Tune Your Receive Buffer 1. Update `/etc/sysctl.conf` -Change the default SC4S buffer size from: +Change the default buffer size from: ```conf net.core.rmem_default = 17039360 net.core.rmem_max = 17039360 @@ -37,36 +37,32 @@ sudo sysctl -p ``` 2. Update `/opt/sc4s/env_file` -``` +```bash SC4S_SOURCE_UDP_SO_RCVBUFF=536870912 ``` 3. 
Restart SC4S -## Finetune SC4S UDP Fetch Limit -`/opt/sc4s/env_file` -``` -SC4S_SOURCE_UDP_FETCH_LIMIT=1000000 -``` - -## Finetune SC4S UDP Fetch Limit +## Tune UDP Fetch Limit `/opt/sc4s/env_file`: ```bash SC4S_SOURCE_UDP_FETCH_LIMIT=1000000 ``` ## Increase the Number of UDP Sockets -Update the following setting in `/opt/sc4s/env_file`: +`/opt/sc4s/env_file`: ```bash SC4S_SOURCE_LISTEN_UDP_SOCKETS=32 ``` -In synthetic performance tests, increasing the number of sockets may not show improvement because all messages originate from a single UDP stream, and they are still processed by only one CPU core. However, if you have multiple UDP sources in your production environment, this feature can provide significant performance improvements. +In synthetic performance tests, increasing the number of sockets may not show improvement because all messages originate from a single UDP source, and they are still processed by only one CPU core. However, if you have multiple UDP sources in your production environment, this feature can provide significant performance improvements. ## Enable eBPF -1. Ensure your container is running in privileged mode. -2. Verify that your host supports eBPF. +Find more in the [About eBPF](../configuration/#about-ebpf) section. + +1. Verify that your host supports eBPF. +2. Ensure your container is running in privileged mode. 3. Update the configuration in `/opt/sc4s/env_file`: ```bash SC4S_SOURCE_LISTEN_UDP_SOCKETS=32 diff --git a/mkdocs.yml b/mkdocs.yml index 60437868ce..5c963a5928 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -38,11 +38,11 @@ nav: - Scaling Solutions: - Performance Tests: "architecture/performance-tests.md" - Fine-Tuning: - - TCP Optimization: "architecture/finetune-tcp.md" - - UDP Optimization: "architecture/finetune-udp.md" + - TCP Optimization: "architecture/tcp-optimization.md" + - UDP Optimization: "architecture/udp-optimization.md" - Load Balancers: - Overview: "architecture/lb/index.md" - - Nginx Open Source: "architecture/lb/nginx-os.md" + - Nginx Open Source: "architecture/lb/nginx.md" - Getting Started: - Read First: "gettingstarted/index.md" - Quickstart Guide: "gettingstarted/quickstart_guide.md" From fe12c49612ceabe2289ee9d451aa3bdd280954c6 Mon Sep 17 00:00:00 2001 From: mstopa-splunk Date: Wed, 11 Sep 2024 07:52:25 +0000 Subject: [PATCH 4/7] Fix formatting --- docs/architecture/lb/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/architecture/lb/index.md b/docs/architecture/lb/index.md index 5ecf577c29..356605af60 100644 --- a/docs/architecture/lb/index.md +++ b/docs/architecture/lb/index.md @@ -1,6 +1,6 @@ # Load Balancers Are Not a Best Practice for SC4S - Be aware of the following issues that may arise from load balancing syslog traffic: + - Load balancing for scale can lead to increased data loss due to normal device operations and buffer overflows. - Front-side load balancing often results in uneven data distribution on the upstream side. - The default behavior of many load balancers is to overwrite the client's source IP with their own. Preserving the real source IP requires additional configuration. 
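For example, with an NGINX stream proxy the original sender address is only preserved when the upstream connection is bound transparently; something along these lines is needed in addition to a plain `proxy_pass`. This is a sketch only: the upstream name `sc4s_udp_514` is illustrative, and the full setup, including privileges, routing, and health checking, is covered on the Nginx page.

```conf
# Sketch: transparent binding so SC4S sees the original client IP instead of the load balancer's.
server {
    listen 514 udp;
    proxy_pass sc4s_udp_514;                 # illustrative upstream name
    proxy_bind $remote_addr transparent;     # keep the sender's source address
    proxy_responses 0;                       # syslog senders do not expect replies
}
```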
From de3483c8c3a4825678dae8862d4d6d0caf86026f Mon Sep 17 00:00:00 2001 From: mstopa-splunk Date: Wed, 11 Sep 2024 07:53:34 +0000 Subject: [PATCH 5/7] Fix nginx link --- mkdocs.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mkdocs.yml b/mkdocs.yml index 5c963a5928..1209127bba 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -42,7 +42,7 @@ nav: - UDP Optimization: "architecture/udp-optimization.md" - Load Balancers: - Overview: "architecture/lb/index.md" - - Nginx Open Source: "architecture/lb/nginx.md" + - Nginx: "architecture/lb/nginx.md" - Getting Started: - Read First: "gettingstarted/index.md" - Quickstart Guide: "gettingstarted/quickstart_guide.md" From 2c281815d64cc65e4deac5182395596172615945 Mon Sep 17 00:00:00 2001 From: mstopa-splunk Date: Fri, 27 Sep 2024 18:36:49 +0000 Subject: [PATCH 6/7] Update reference link --- docs/architecture/udp-optimization.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/architecture/udp-optimization.md b/docs/architecture/udp-optimization.md index bd2963b815..042a80edf1 100644 --- a/docs/architecture/udp-optimization.md +++ b/docs/architecture/udp-optimization.md @@ -59,7 +59,7 @@ In synthetic performance tests, increasing the number of sockets may not show im ## Enable eBPF -Find more in the [About eBPF](../configuration/#about-ebpf) section. +Find more in the [About eBPF](../../configuration/#about-ebpf) section. 1. Verify that your host supports eBPF. 2. Ensure your container is running in privileged mode. From 1c06948f8f0ef4435be65da5a3803a781b24ca38 Mon Sep 17 00:00:00 2001 From: mstopa-splunk <139441697+mstopa-splunk@users.noreply.github.com> Date: Thu, 17 Oct 2024 11:02:39 +0200 Subject: [PATCH 7/7] Update nginx.md --- docs/architecture/lb/nginx.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/architecture/lb/nginx.md b/docs/architecture/lb/nginx.md index bced188c75..fabf28cae6 100644 --- a/docs/architecture/lb/nginx.md +++ b/docs/architecture/lb/nginx.md @@ -82,7 +82,7 @@ Examples for setting up NGINX with the PROXY protocol and DSR are provided below ### Disadvantages: - Available only for TCP, not for UDP or TLS - Overwriting the source IP in SC4S is not ideal; the `SOURCEIP` is a hard macro and only `HOST` can be overwritten -- Overwriting the source IP is available only in SC4S versions greater than 3.4.0 +- Overwriting the source IP is available only in SC4S versions greater than 3.31.0 ### Configuration @@ -267,4 +267,4 @@ echo "hello world" > /dev/udp//514 | Single Finetuned SC4S Server | 0% | 0% | 0% | 0% | 47.37% | -- | | Load Balancer + 2 Finetuned Servers | 0.98% | 1.14% | 1.05% | 1.16% | 3.56% | 55.54% | -Please note that load balancer fine-tuning is beyond the scope of the SC4S team's responsibility. For assistance in minimizing UDP drops on the load balancer side, contact the NGINX support team. \ No newline at end of file +Please note that load balancer fine-tuning is beyond the scope of the SC4S team's responsibility. For assistance in minimizing UDP drops on the load balancer side, contact the NGINX support team.
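When validating a UDP DSR setup end to end, it can also be useful to send a batch of uniquely numbered test messages through the load balancer and compare the resulting event count in Splunk with the number sent. The loop below is only a sketch: it relies on bash's `/dev/udp` pseudo-device, and `LB_IP` is a placeholder for your load balancer address.

```bash
# Sketch: send 1000 numbered UDP test messages through the load balancer,
# then search Splunk for "lb-dsr-test" and compare the event count with 1000.
LB_IP=203.0.113.10   # placeholder - replace with your load balancer address
for i in $(seq 1 1000); do
  echo "<134>lb-dsr-test message $i" > /dev/udp/$LB_IP/514
done
```

A shortfall in the Splunk event count points at drops on either the load balancer or the SC4S hosts; the kernel counters described in the performance tests page help to tell the two apart.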