Skip to content

Commit

Permalink
Start post about GPT
Browse files Browse the repository at this point in the history
  • Loading branch information
vstakhov committed Jul 3, 2024
1 parent 04baedc commit ba12c67
Showing 1 changed file with 38 additions and 0 deletions.
38 changes: 38 additions & 0 deletions _posts/2024-07-03-gpt.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
---
layout: post
title: "Rspamd and GPT integration"
categories: misc
---

## Preface

Historically, we had only Bayes as the text classification method. Bayes is still quite a powerful statistical method that can grant quite a decent performance with enough of learning. There are two main disadvantages of the Bayes:

* It requires a lot of balanced and well designed learning
* It cannot work when there is not enough confidence, especially for high variety of spam

Using of a large language models (LLM) can help address both of the issues, as such models can perform deep intorspection with some sort of "understanding" of the context. However, LLM models require quite a lot of computational resources (usually GPU), so it is not practical to scan all email via them. It is also quite beneficial to run these models separately from a scan engine to avoid resources racing.

## Rspamd GPT plugin

In Rspamd 3.9, I have tried to play with OpenAI API and decide if it is useful in spam filtering or not. Here are some basic ideas behind this plugin:

* We select a displayed text part, extract text from it and ask GPT API for probability of it to be spam
* We also add some more information from the message, such as subject, displayed from, url information
* Then we ask GPT to make JSON output as we can parse JSON and we cannot parse human readable GPT output (in general)
* We exclude some specific symbols from being scan on GPT (e.g. `BAYES_SPAM`, `FUZZY_DENIED` as well as `REPLY` and other similar symbols)
* We also exclude apparent spam and ham from the checks

The former two points is done to reduce GPT load for something that we already know about and there is nothing that GPT can add in the evaluation. We also use GPT as one of the classifiers, meaning that we do not rely on GPT evaluation only.

## Evaluation results

TBD

## Pricing concerns

TBD

## Conclusions

TBD

0 comments on commit ba12c67

Please sign in to comment.