diff --git a/_posts/2024-07-03-gpt.md b/_posts/2024-07-03-gpt.md index 11ad37164..8484c83dd 100644 --- a/_posts/2024-07-03-gpt.md +++ b/_posts/2024-07-03-gpt.md @@ -18,7 +18,7 @@ Large Language Models (LLMs) offer promising solutions to these challenges. Thes In Rspamd 3.9, I have tried to integrate the OpenAI GPT API for spam filtering and assess its usefulness. Here are some basic ideas behind this plugin: * The selected displayed text part is extracted and submitted to the GPT API for spam probability assessment -* Additional message details such as Subject, Display From, and URLs are also included in the assessment +* Additional message details such as Subject, displayed From, and URLs are also included in the assessment * Then we ask GPT to provide results in JSON format since human-readable GPT output cannot be parsed (in general) * Some specific symbols (`BAYES_SPAM`, `FUZZY_DENIED`, `REPLY`, etc.) are excluded from the GPT scan * Obvious spam and ham are also excluded from the GPT evaluation @@ -101,11 +101,11 @@ Elapsed time (seconds) 279.08 Despite its high cost, this advanced model is suitable, for example, for low-traffic personal email. It demonstrates significantly lower error rates compared to GPT-3.5, even with a similar low-quality sample corpus. -### Using GPT to learn Bayes +### Using GPT to train Bayes classifier -Another interesting option is to use GPT to supervise Bayes engine learning. In this case, we get the best of two worlds: GPT can work without training and Bayes can afterwards catch up and perform instead of GPT (or at least act as a cost saving option). +Another interesting approach involves using GPT to supervise Bayes engine training. In this case, we benefit from the best of both worlds: GPT can operate without training, while Bayes can catch up afterward and perform instead of GPT (or at least serve as a cost-saving alternative). -So we have tested GPT training Bayes and compared efficiency using the same methodics. +So we tested GPT training Bayes and compared efficiency using the same methodologies. GPT results: @@ -124,7 +124,7 @@ Classified (%) 97.89 Elapsed time (seconds) 341.77 ~~~ -And here are the results from the Bayes trained by GPT in the previous test iteration: +Bayes classifier results (trained by GPT in the previous test iteration): ~~~ Metric Value @@ -141,31 +141,31 @@ Classified (%) 65.26 Elapsed time (seconds) 29.18 ~~~ -As we can see, Bayes is still not very confident in classification and has somehow more FP than GPT. On the other hand, this could be further improved by autolearning and by selecting a better corpus to test (our corpus has, indeed, a lot of HAM emails that look like spam even for a human) +Bayes still exhibits uncertainty in classification, with more false positives than GPT. Improvement could be achieved through autolearning and by refining the corpus used for testing (our corpus contains many ham emails that look like spam even for human evaluators). ## Plugin design -GPT plugin has the following operation logic: +The GPT plugin operates as follows: * It selects messages that qualify several pre-checks: - - they must not have any symbols from the `excluded` set (e.g. Fuzzy/Bayes spam/Whitelists) - - they must not be apparent ham or spam (e.g. with reject action or with no action with high negative score) + - they must not contain any symbols from the `excluded` set (e.g. Fuzzy/Bayes spam/Whitelists) + - they must not clearly appear as ham or spam (e.g. with reject action or no action with a high negative score) - they should have enough text tokens in the meaningful displayed part -* If a message satisfies the checks, Rspamd selects the displayed part (e.g. HTML) and uses the following content to send to GPT: - - text part content as one line string (honoring limits if necessary) - - message's subject - - displayed from - - some information about URLs (e.g. domains) -* This data is also merged with a prompt to GPT that orders to evaluate a probability of such an email to be spam, and output the result as JSON (other output types can sometimes allow GPT to use a human readable text that is very difficult to parse) -* After all these operations, a corresponding symbol with confidence score is inserted -* If autolearning is enabled, then Rspamd also learns the supervised classifier (meaning Bayes) +* If a message satisfies these checks, Rspamd selects the displayed part (e.g. HTML) and sends the following content to GPT: + - text part content as a single-line string (honoring limits if necessary) + - message subject + - displayed From + - some details about URLs (e.g. domains) +* This data is merged with a prompt to GPT requesting an evaluation of the email's spam probability, with the output returned in JSON format (other output types may sometimes allow GPT to provide human-readable text that is very difficult to parse) +* After these steps, a corresponding symbol with a confidence score is inserted +* With autolearning enabled, Rspamd also trains the supervised classifier (Bayes) ## Pricing considerations and conclusions -OpenAI provides API for the requests and it costs some money (there is no free tier so far). However, if you plan to use it for a personal email or if you just want to train your Bayes without manual classification, GPT might be a good option to consider. As a concrete example, for my personal email (that is quite a loaded one), the cost of gpt-3.5 is around $0.05 per day (for like 100k tokens). +OpenAI provides an API for these requests, incurring costs (currently no free tier available). However, for personal email usage or automated Bayes training without manual intervention, GPT presents a viable option. For instance, processing a substantial volume of personal emails with GPT-3.5 costs approximately $0.05 daily (for about 100k tokens). -For the large scale email systems, it is probably better to get some other LLM (e.g. llama) and use it internally on a system with some GPU power. The existing plugin is designed to work with other LLM types without significant modifications. This method has also another advantage by providing more privacy of your data as you do not send the content of your emails to some 3-rd party service, such as OpenAI (however, they claim that their models are not learned on API requests). +For large-scale email systems, it may be preferable to use another LLM (e.g. llama) internally on a GPU-powered platform. The current plugin is designed to integrate with different LLM types without significant modifications. This approach also enhances data privacy by avoiding sending email content to a third-party service (though OpenAI claims their models do not learn from API requests). -Despite of not being 100% accurate, GPT plugin provides efficiency that is roughly about the efficiency of a human filtered email. We plan to work further on accuracy improvements by adding more metadata to GPT engine trying to stay efficient in terms of tokens usage. There are other plans for better usage of the LLM knowlege in Rspamd, for example, for better fine-grained classification. +Despite not achieving 100% accuracy, the GPT plugin demonstrates efficiency comparable to human-filtered email. Future enhancements will focus on improving accuracy through additional metadata integration into the GPT engine, while optimizing token usage efficiency. There are also plans to better utilize LLM knowledge in Rspamd, particularly for better fine-grained classification. -GPT plugin will be available from Rspamd 3.9, and you still need to apply for API key from OpenAI (and invest some funds there) to use the ChatGPT. \ No newline at end of file +The GPT plugin will be available starting from Rspamd 3.9, requiring an OpenAI API key and financial commitment for accessing ChatGPT services.