Start post about GPT

rspamd · Jul 3, 2024 · ba12c67
1 parent 04baedc
commit ba12c67
Showing 1 changed file with 38 additions and 0 deletions.
diff --git a/_posts/2024-07-03-gpt.md b/_posts/2024-07-03-gpt.md
@@ -0,0 +1,38 @@
+---
+layout: post
+title:  "Rspamd and GPT integration"
+categories: misc
+---
+
+## Preface
+
+Historically, Bayes was our only text classification method. It is still a powerful statistical method that delivers decent accuracy once it has been trained on enough messages. However, Bayes has two main disadvantages:
+
+* It requires a lot of balanced and well-designed training
+* It cannot produce a verdict when its confidence is too low, especially for highly varied spam
+
+Large language models (LLMs) can help address both issues, as such models can perform deep introspection with some sort of "understanding" of the context. However, LLMs require a lot of computational resources (usually a GPU), so it is not practical to scan all email with them. It is also beneficial to run these models separately from the scan engine to avoid contention for resources.
+
+## Rspamd GPT plugin
+
+In Rspamd 3.9, I experimented with the OpenAI API to find out whether it is useful for spam filtering. Here are the basic ideas behind this plugin:
+
+* We select the displayed text part, extract the text from it and ask the GPT API for the probability that it is spam
+* We also add some more information from the message, such as the subject, the displayed From, and URL information
+* We ask GPT to produce JSON output, since we can parse JSON but cannot, in general, parse human-readable GPT output
+* We exclude messages that have certain symbols from GPT scanning (e.g. `BAYES_SPAM`, `FUZZY_DENIED`, as well as `REPLY` and other similar symbols)
+* We also exclude obvious spam and ham from the checks
+
+The last two points are meant to reduce the GPT load for messages that we already know about, where GPT has nothing to add to the evaluation. We also use GPT as just one of the classifiers, meaning that we do not rely on the GPT evaluation alone.
+
+## Evaluation results
+
+TBD
+
+## Pricing concerns
+
+TBD
+
+## Conclusions
+
+TBD