Update statistics with per-domain example (#752)

Co-authored-by: Alexander Moisseev <[email protected]>
rspamd · May 26, 2024 · 209feb7
1 parent bd16d3a
commit 209feb7
Showing 1 changed file with 54 additions and 36 deletions.
diff --git a/doc/configuration/statistic.md b/doc/configuration/statistic.md
@@ -80,34 +80,6 @@ For most of setups where there is only one classifier is used - `classifier-baye
 
 If you need to define multiple classifiers, create `local.d/statistic.conf` and describe each classifier section there. Each classifier **must** have its own `name` and include all options from the default config, as there is no fallback. A common use case is when the first classifier is `per_user` and the second is not.
 
-### Per-user statistics
-
-To enable per-user statistics, you can add the `per_user = true` property to the configuration of the classifier. However, it is *important* to ensure that Rspamd is called at the final delivery stage (e.g., LDA mode) to avoid issues with multi-recipient messages. When dealing with multi-recipient messages, Rspamd will use the first recipient for user-based statistics. 
-
-It's worth noting that Rspamd prioritizes SMTP recipients over MIME ones and gives preference to the special LDA header called `Delivered-To`, which can be appended using the `-d` option for `rspamc`. This allows for more accurate per-user statistics in your configuration.
-
-#### Sharding
-
-Starting from version 3.9, per-user statistics can be sharded across different Redis servers using the [hash algorithm]({{ site.baseurl }}/doc/configuration/upstream.html#hash-algorithm).
-
-Example of using two stand-alone master shards without read replicas:
-~~~hcl
-servers = "hash:bayes-peruser-0-master,bayes-peruser-1-master";
-~~~
-
-Example of using a setup with three master-replica shards:
-~~~hcl
-write_servers = "hash:bayes-peruser-0-master,bayes-peruser-1-master,bayes-peruser-2-master";
-read_servers = "hash:bayes-peruser-0-replica,bayes-peruser-1-replica,bayes-peruser-2-replica";
-~~~
-
-Important notes:
-1. Changing the shard count requires dropping all Bayes statistics, so please make decisions wisely.
-2. Each replica should have the same position in `read_servers` as its master in `write_servers`; otherwise, this will result in misaligned read-write hash slot assignments.
-3. You can't use more than one replica per master in a sharded setup; this will result in misaligned read-write hash slot assignments.
-4. Redis Sentinel cannot be used for a sharded setup.
-5. In the controller, you will see incorrect `Bayesian statistics` for the count of learns and users.
-
 ### Classifier and headers
 
 The classifier in Rspamd learns headers that are specifically defined in the `classify_headers` section of the `options.inc` file. Therefore, there is no need to remove any additional headers (e.g., X-Spam) before the learning process, as these headers will not be utilized for classification purposes. Rspamd also takes into account the `Subject` header, which is tokenized according to the aforementioned rules. Additionally, Rspamd considers various meta-tokens, such as message size or the number of attachments, which are extracted from the messages for further analysis.
@@ -116,19 +88,21 @@ The classifier in Rspamd learns headers that are specifically defined in the `cl
 
 Supported parameters for the Redis backend are:
 
-- `tokenizer`: leave it as shown for now. Currently, only OSB is supported
+- `name`: unique name of the classifier; must be set when multiple classifiers are defined, otherwise optional
+- `tokenizer`: currently only OSB is supported; must be set as shown in the default configuration
 - `new_schema`: must be set to `true`
-- `backend`: set it to Redis
+- `backend`: must be set to `"redis"`
+- `learn_condition`: Lua function that verifies that learning is needed. The default function **must** be set if you have not written your own; omitting `learn_condition` from `statistic.conf` will lead to losing protection from overlearning
 - `servers`: IP or hostname with a port for the Redis server. Use an IP for the loopback interface, if you have defined localhost in /etc/hosts for IPv4 and IPv6, or your Redis server will not be found!
-- `write_servers` (optional): If needed, define dedicated servers for learning
-- `password` (optional): Password for the Redis server
-- `db` (optional): Database to use (though it is recommended to use dedicated Redis instances and not databases in Redis)
+- `write_servers` (optional): Redis servers used only for writes (usually masters)
+- `read_servers` (optional): Redis servers used only for reads (usually replicas)
+- `password` (optional): password for the Redis server
+- `db` (optional): database to use (though it is recommended to use dedicated Redis instances and not databases in Redis)
 - `min_tokens`: minimum number of words required for statistics processing
 - `min_learns` (optional): minimum learn count for **both** spam and ham classes to perform classification
-- `learn_condition`: Lua function that verifies that learning is needed. Default function **must** be set if you not wrote your own, omniting `learn_condition` from `statistic.conf` will lead to loosing protection from overlearning
 - `autolearn` (optional): for more details see Autolearning section
-- `per_user` (optional): enable perusers statistics. See above
-- `statfile`: Define keys for spam and ham mails
+- `per_user` (optional): for more details see the Per-user statistics section
+- `statfile`: defines keys for spam and ham mails
 - `cache_prefix` (optional): prefix used to create keys where to store hashes of already learned ids, defaults to `"learned_ids"`
 - `cache_max_elt` (optional): amount of elements to store in one `learned_ids` key
 - `cache_max_keys` (optional): amount of `learned_ids` keys to store
@@ -145,3 +119,47 @@ There are three options available for specifying autolearning:
 * `autolearn = "return function(task) ... end"`: use the following Lua function to detect if autolearn is needed (function should return 'ham' if learn as ham is needed and string 'spam' if learn as spam is needed, if no learning is needed then a function can return anything including `nil`)
 
 Redis backend is highly recommended for autolearning purposes due to its ability to handle high concurrency levels when multiple writers are synchronized properly. Using Redis as the backend ensures efficient and reliable autolearning functionality.
+
+### Per-user statistics
+
+To enable per-user statistics, you can add the `per_user = true` property to the configuration of the classifier. However, it is *important* to ensure that Rspamd is called at the final delivery stage (e.g., LDA mode) to avoid issues with multi-recipient messages. When dealing with multi-recipient messages, Rspamd will use the first recipient for user-based statistics. 
+
+Rspamd prioritizes SMTP recipients over MIME ones and gives preference to the special LDA header called `Delivered-To`, which can be appended using the `-d` option for `rspamc`. This allows for more accurate per-user statistics in your configuration.
+
+You can switch from per-user to per-domain statistics (or to any other key) by setting `per_user` to a Lua function. The function should return the user as a string, or `nil` as a fallback. For example:
+~~~lua
+per_user = <<EOD
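+return function(task)
+  -- Look at the first recipient only, as per-user statistics do
+  local rcpt = task:get_recipients('any')
+  if rcpt and rcpt[1] then
+    local domain = rcpt[1]['domain']
+    if domain and domain ~= '' then
+      -- The returned string is used as the statistics "user"
+      return domain
+    end
+  end
+  -- Fallback when no recipient domain is available
+  return nil
+end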
+EOD
+~~~
+
+#### Sharding
+
+Starting from version 3.9, per-user statistics can be sharded across different Redis servers using the [hash algorithm]({{ site.baseurl }}/doc/configuration/upstream.html#hash-algorithm).
+
+Example of using two stand-alone master shards without read replicas:
+~~~hcl
+servers = "hash:bayes-peruser-0-master,bayes-peruser-1-master";
+~~~
+
+Example of using a setup with three master-replica shards:
+~~~hcl
+write_servers = "hash:bayes-peruser-0-master,bayes-peruser-1-master,bayes-peruser-2-master";
+read_servers = "hash:bayes-peruser-0-replica,bayes-peruser-1-replica,bayes-peruser-2-replica";
+~~~
+
+Important notes:
+1. Changing the shard count requires dropping all Bayes statistics, so please make decisions wisely.
+2. Each replica should have the same position in `read_servers` as its master in `write_servers`; otherwise, this will result in misaligned read-write hash slot assignments.
+3. You can't use more than one replica per master in a sharded setup; this will result in misaligned read-write hash slot assignments.
+4. Redis Sentinel cannot be used for a sharded setup.
+5. In the controller, you will see incorrect `Bayesian statistics` for the count of learns and users.
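
For reference, the options documented in this change could be combined in `local.d/statistic.conf` roughly as follows. This is a minimal sketch, not part of the commit: the classifier name `bayes_user` is hypothetical, and the `min_tokens`, `min_learns`, `statfile` symbols and `learn_condition` values are assumed to match the stock configuration and should be copied from your installed `statistic.conf` rather than from here.

~~~hcl
# Sketch only: verify every value against your default statistic.conf
classifier "bayes" {
  name = "bayes_user";          # hypothetical; required once several classifiers exist
  tokenizer {
    name = "osb";               # only OSB is supported
  }
  backend = "redis";
  new_schema = true;
  min_tokens = 11;              # assumed default
  min_learns = 200;             # assumed default
  per_user = true;              # or a Lua function returning the user/domain string

  # Three master-replica shards; replica order must mirror master order
  write_servers = "hash:bayes-peruser-0-master,bayes-peruser-1-master,bayes-peruser-2-master";
  read_servers = "hash:bayes-peruser-0-replica,bayes-peruser-1-replica,bayes-peruser-2-replica";

  statfile {
    symbol = "BAYES_HAM";       # assumed stock symbol name
    spam = false;
  }
  statfile {
    symbol = "BAYES_SPAM";      # assumed stock symbol name
    spam = true;
  }

  # Assumed to be the stock learn condition; omitting it removes overlearning protection
  learn_condition = 'return require("lua_bayes_learn").can_learn';
}
~~~

Because there is no fallback to the default config once classifiers are described in `local.d/statistic.conf`, every option has to be spelled out here, including `learn_condition`.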