Commit

Improve GPT integration post
moisseev committed Jul 4, 2024
1 parent ba12c67 commit c0cacd9
Showing 1 changed file with 14 additions and 14 deletions.
28 changes: 14 additions & 14 deletions _posts/2024-07-03-gpt.md
@@ -1,38 +1,38 @@
 ---
 layout: post
-title: "Rspamd and GPT integration"
+title: "Integrating Rspamd with GPT"
 categories: misc
 ---

 ## Preface

-Historically, we had only Bayes as the text classification method. Bayes is still quite a powerful statistical method that can grant quite a decent performance with enough of learning. There are two main disadvantages of the Bayes:
+Historically, our only text classification method has been Bayes, a powerful statistical method that performs well with sufficient training. However, Bayes has its limitations:

-* It requires a lot of balanced and well designed learning
-* It cannot work when there is not enough confidence, especially for high variety of spam
+* It requires thorough and well-balanced training
+* It cannot work with low confidence levels, especially when dealing with a wide variety of spam

-Using of a large language models (LLM) can help address both of the issues, as such models can perform deep intorspection with some sort of "understanding" of the context. However, LLM models require quite a lot of computational resources (usually GPU), so it is not practical to scan all email via them. It is also quite beneficial to run these models separately from a scan engine to avoid resources racing.
+Large Language Models (LLMs) offer promising solutions to these challenges. These models can perform deep introspection with some sort of contextual "understanding". However, their high computational demands (typically requiring GPUs) make scanning all emails impractical. Separating LLM execution from the scanning engine mitigates resource competition.

 ## Rspamd GPT plugin

-In Rspamd 3.9, I have tried to play with OpenAI API and decide if it is useful in spam filtering or not. Here are some basic ideas behind this plugin:
+In Rspamd 3.9, I have tried to integrate the OpenAI GPT API for spam filtering and assess its usefulness. Here are some basic ideas behind this plugin:

-* We select a displayed text part, extract text from it and ask GPT API for probability of it to be spam
-* We also add some more information from the message, such as subject, displayed from, url information
-* Then we ask GPT to make JSON output as we can parse JSON and we cannot parse human readable GPT output (in general)
-* We exclude some specific symbols from being scan on GPT (e.g. `BAYES_SPAM`, `FUZZY_DENIED` as well as `REPLY` and other similar symbols)
-* We also exclude apparent spam and ham from the checks
+* The selected displayed text part is extracted and submitted to the GPT API for spam probability assessment
+* Additional message details such as Subject, Display From, and URLs are also included in the assessment
+* Then we ask GPT to provide results in JSON format since human-readable GPT output cannot be parsed (in general)
+* Some specific symbols (`BAYES_SPAM`, `FUZZY_DENIED`, `REPLY`, etc.) are excluded from the GPT scan
+* Obvious spam and ham are also excluded from the GPT evaluation

-The former two points is done to reduce GPT load for something that we already know about and there is nothing that GPT can add in the evaluation. We also use GPT as one of the classifiers, meaning that we do not rely on GPT evaluation only.
+The former two points reduce the GPT workload for something that is already known, where GPT cannot add any value in the evaluation. We also use GPT as one of the classifiers, meaning that we do not rely solely on GPT evaluation.

 ## Evaluation results

 TBD

-## Pricing concerns
+## Pricing considerations

 TBD

 ## Conclusions

-TBD
+TBD
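
For readers of this commit, the plugin flow listed in the post (skip messages that other checks have already decided, send the displayed text plus a few message details, and insist on a JSON reply) can be sketched roughly as follows. This is an illustrative Python sketch, not the plugin's actual Lua code: every function name, the score thresholds, and the `probability` key in the JSON reply are assumptions made for the example, and no real API call is made.

```python
import json

# Hypothetical names throughout: this is not the plugin's Lua code or its configuration.
EXCLUDED_SYMBOLS = {"BAYES_SPAM", "FUZZY_DENIED", "REPLY"}


def should_query_gpt(symbols, score, ham_below=-5.0, spam_above=15.0):
    """Return True when the message is worth sending to GPT.

    The score thresholds are illustrative assumptions, not values from the post.
    """
    if EXCLUDED_SYMBOLS & set(symbols):
        return False  # another check has already decided this message
    return ham_below < score < spam_above  # skip obvious ham and obvious spam


def build_prompt(text, subject, display_from, urls):
    """Combine the displayed text with the extra message details."""
    return "\n".join([
        'Reply only with JSON of the form {"probability": <number from 0 to 1>}.',
        f"Subject: {subject}",
        f"From: {display_from}",
        f"URLs: {', '.join(urls)}",
        "",
        text,
    ])


def parse_reply(reply):
    """Extract the spam probability from the reply, or None if it is unusable."""
    try:
        value = float(json.loads(reply)["probability"])
    except (ValueError, KeyError, TypeError):
        return None  # free-form or malformed output cannot be trusted
    return value if 0.0 <= value <= 1.0 else None
```

Returning `None` for anything that is not well-formed JSON mirrors the post's point that human-readable GPT output cannot be parsed in general. A caller would then simply skip the GPT verdict for that message and rely on the other classifiers.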
