
Despite its high cost, this advanced model is suitable, for example, for low-traffic personal email. It demonstrates significantly lower error rates than GPT-3.5, even with a similarly low-quality sample corpus.

### Using GPT to train Bayes

Another interesting option is to use GPT to supervise the learning of the Bayes engine. In this case, we get the best of both worlds: GPT can work without training, and Bayes can afterwards catch up and take over from GPT (or at least serve as a cost-saving option).
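
As a minimal sketch of this idea (not the actual plugin code, which lives inside Rspamd; all function names and thresholds below are purely illustrative), the supervision loop can look like this:

~~~python
# Illustrative sketch of GPT-supervised Bayes training. All names and
# thresholds here are hypothetical; the real logic lives in the Rspamd plugin.

def supervise_bayes(message, gpt_classify, bayes_learn_spam, bayes_learn_ham,
                    spam_threshold=0.9, ham_threshold=0.1):
    """Ask GPT for a spam probability and, when it is confident enough,
    use its verdict as a training signal for the Bayes classifier."""
    spam_probability = gpt_classify(message)  # a value between 0.0 and 1.0

    if spam_probability >= spam_threshold:
        bayes_learn_spam(message)   # confident spam verdict -> train as spam
    elif spam_probability <= ham_threshold:
        bayes_learn_ham(message)    # confident ham verdict -> train as ham
    # Ambiguous verdicts are not used for training at all.

    return spam_probability
~~~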

So we have tested GPT-supervised Bayes training and compared the efficiency using the same methodology.

GPT results:

~~~
Metric Value
------------------------------
True Positives 128
False Positives 13
True Negatives 301
False Negatives 68
Accuracy 0.84
Precision 0.91
Recall 0.65
F1 Score 0.76
Classified (%) 97.89
Elapsed time (seconds) 341.77
~~~
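
For reference, these aggregate metrics follow from the confusion-matrix counts above in the standard way; the snippet below (a standalone check, not part of our test harness) reproduces them:

~~~python
# Recompute the aggregate metrics from the raw confusion-matrix counts above.
tp, fp, tn, fn = 128, 13, 301, 68

accuracy = (tp + tn) / (tp + fp + tn + fn)          # 0.84
precision = tp / (tp + fp)                          # 0.91
recall = tp / (tp + fn)                             # 0.65
f1 = 2 * precision * recall / (precision + recall)  # 0.76

print(f"Accuracy {accuracy:.2f}, Precision {precision:.2f}, "
      f"Recall {recall:.2f}, F1 {f1:.2f}")
~~~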

And here are the results from the Bayes classifier trained by GPT in the previous test iteration:

~~~
Metric Value
------------------------------
True Positives 19
False Positives 43
True Negatives 269
False Negatives 9
Accuracy 0.85
Precision 0.31
Recall 0.68
F1 Score 0.42
Classified (%) 65.26
Elapsed time (seconds) 29.18
~~~

As we can see, Bayes is still not very confident in its classification and has somewhat more false positives than GPT. On the other hand, this could be further improved by autolearning and by selecting a better test corpus (our corpus, indeed, has a lot of ham emails that look like spam even to a human).

## Plugin design

The GPT plugin has the following operation logic (a rough sketch of this flow follows the list):

* It selects messages that pass several pre-checks:
  - they must not have any symbols from the `excluded` set (e.g. Fuzzy/Bayes spam/whitelists)
  - they must not be apparent ham or spam (e.g. messages with the reject action, or with no action and a high negative score)
  - they should have enough text tokens in the meaningful displayed part
* If a message passes these checks, Rspamd selects the displayed part (e.g. HTML) and sends the following content to GPT:
  - the text part content as a single-line string (honoring limits if necessary)
  - the message's subject
  - the displayed `From` address
  - some information about URLs (e.g. domains)
* This data is combined with a prompt that asks GPT to evaluate the probability that the email is spam and to output the result as JSON (other output formats sometimes let GPT reply in free-form, human-readable text that is very difficult to parse)
* After all these operations, a corresponding symbol with a confidence score is inserted
* If autolearning is enabled, Rspamd also trains the supervised classifier (i.e. Bayes)
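
Put together, the flow looks roughly like the sketch below. This is not the plugin's real code; every function name, symbol name, and threshold here is hypothetical and only illustrates the steps above:

~~~python
import json

def gpt_check(task, call_gpt, excluded_symbols, min_tokens=32):
    """Hypothetical sketch of the GPT plugin flow described above."""
    # Pre-checks: skip messages already covered by excluded symbols.
    if any(s in task["symbols"] for s in excluded_symbols):
        return
    # Skip apparent spam (reject action) and apparent ham (no action, very low score).
    if task["action"] == "reject":
        return
    if task["action"] == "no action" and task["score"] < -5:  # threshold is illustrative
        return
    # Require enough text tokens in the displayed part.
    text = task["displayed_text_part"]
    if len(text.split()) < min_tokens:
        return

    # Content sent to GPT: text as a single line, subject, displayed From, URL domains.
    content = {
        "text": " ".join(text.split())[:4096],  # honor a size limit
        "subject": task["subject"],
        "from": task["displayed_from"],
        "url_domains": task["url_domains"],
    }
    prompt = ("Evaluate the probability that this email is spam. "
              'Reply with JSON only, e.g. {"probability": 0.87}.')

    reply = call_gpt(prompt, content)                # e.g. a chat completion request
    probability = json.loads(reply)["probability"]   # JSON keeps parsing trivial

    # Insert a symbol with the confidence score and optionally train Bayes.
    task["insert_symbol"]("GPT_SPAM" if probability > 0.5 else "GPT_HAM", probability)
    if task.get("autolearn"):
        task["learn_bayes"](is_spam=probability > 0.5)
~~~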

## Pricing considerations
