From e287bbc768979e115ea96cd1197e029a199a0a56 Mon Sep 17 00:00:00 2001
From: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
Date: Thu, 23 May 2024 09:44:19 +0200
Subject: [PATCH] Add information about Bayes per-user sharding (#749)

* Update statistic.md

---------

Co-authored-by: Alexander Moisseev <moiseev@mezonplus.ru>
---
 doc/configuration/statistic.md | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/doc/configuration/statistic.md b/doc/configuration/statistic.md
index 3198a40bb..8cc1d6014 100644
--- a/doc/configuration/statistic.md
+++ b/doc/configuration/statistic.md
@@ -86,6 +86,28 @@ To enable per-user statistics, you can add the `per_user = true` property to the
 
 It's worth noting that Rspamd prioritizes SMTP recipients over MIME ones and gives preference to the special LDA header called `Delivered-To`, which can be appended using the `-d` option for `rspamc`. This allows for more accurate per-user statistics in your configuration.
 
+#### Sharding
+
+Starting from version 3.9, per-user statistics can be sharded across different Redis servers using the [hash algorithm]({{ site.baseurl }}/doc/configuration/upstream.html#hash-algorithm).
+
+Example of using two stand-alone master shards without read replicas:
+~~~hcl
+servers = "hash:bayes-peruser-0-master,bayes-peruser-1-master";
+~~~
+
+Example of using a setup with three master-replica shards:
+~~~hcl
+write_servers = "hash:bayes-peruser-0-master,bayes-peruser-1-master,bayes-peruser-2-master";
+read_servers = "hash:bayes-peruser-0-replica,bayes-peruser-1-replica,bayes-peruser-2-replica";
+~~~
+
+Important notes:
+1. Changing the shard count requires dropping all Bayes statistics, so choose the number of shards carefully before deployment.
+2. Each replica must have the same position in `read_servers` as its master in `write_servers`; otherwise, the read and write hash slot assignments will be misaligned.
+3. You can't use more than one replica per master in a sharded setup; doing so will also result in misaligned read-write hash slot assignments.
+4. Redis Sentinel cannot be used for a sharded setup.
+5. In the controller, the `Bayesian statistics` section will show incorrect counts of learns and users.
+
 ### Classifier and headers
 
 The classifier in Rspamd learns headers that are specifically defined in the `classify_headers` section of the `options.inc `file. Therefore, there is no need to remove any additional headers (e.g., X-Spam) before the learning process, as these headers will not be utilized for classification purposes. Rspamd also takes into account the `Subject` header, which is tokenized according to the aforementioned rules. Additionally, Rspamd considers various meta-tokens, such as message size or the number of attachments, which are extracted from the messages for further analysis.