Author: Xueyan Hu
Date: 28/4/2023
This package is a spam email detector. It continuously listens to the email address provided by the user and prints a message whenever a new email arrives, indicating whether the email is spam or normal.
Note that the package is designed primarily for Chinese text. When applied to English text, its performance may degrade considerably.
Install this package from the command line:
pip install SpamEmailDetector
from SpamEmailDetector.EmailListener import EmailListener
myEmail = EmailListener(email=yourEmailAddress,password=yourPasscode,detectorName=nameOfDetector)
Here, yourPasscode is the authorization code of your email account. If you don't know what yourPasscode is, look it up on the settings page of your email provider's website. Note that it is not your login password. The parameter detectorName is the name of the spam/normal detector model. Two models are provided in this package: "BOWSpamDetector" and "TfIdfSpamDetector". Pass either string to select the model you need.
Calling the method .startListening(10) starts the listener; its argument is the listening interval in seconds. That is, every 10 seconds the program checks whether a new email has arrived in your mailbox and judges whether it is spam or normal.
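The polling behaviour described above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: `poll`, `fetch_new`, `handle`, and `max_checks` are hypothetical names, and the mailbox is replaced by a stand-in iterator so the sketch runs without an email account.

```python
import time
from datetime import datetime


def poll(fetch_new, handle, interval_seconds, max_checks, sleep=time.sleep):
    """Call fetch_new() once per interval and pass each new item to handle().

    max_checks bounds the loop for demonstration purposes only (hypothetical).
    """
    for check in range(max_checks):
        print("Listening at {}".format(datetime.now()))
        for item in fetch_new():
            handle(item)
        if check < max_checks - 1:
            sleep(interval_seconds)  # wait before checking the mailbox again


# Stand-in mailbox: nothing arrives on the first check, one email on the second.
inbox = iter([[], ["hello"]])
received = []
poll(lambda: next(inbox), received.append, interval_seconds=10,
     max_checks=2, sleep=lambda seconds: None)  # no real waiting in the demo
```

In this sketch the sleep function is injected so the demo finishes immediately; with the default `time.sleep`, the loop would pause for `interval_seconds` between checks.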
Once you start listening to your mailbox, the following message is displayed on the screen:
Start Listening:
Listening at 2023-04-28 11:39:51.582455
Listening at 2023-04-28 11:40:02.373163
Listening at 2023-04-28 11:40:12.511100
Once a new email is detected, a message tells you whether it is spam or a normal email. For example, if it is a normal email:
Listening at 2023-04-28 11:40:22.642931
A new e-mail received!
This is a normal email!
To stop listening to your email box, just stop running the program.
This package also includes two models that can determine whether a given text is more likely to be a spam email or a normal email: BOWSpamDetector and TfIdfSpamDetector. For example, to use BOWSpamDetector, include the code below in your Python script:
from SpamEmailDetector.BOWSpamDetector import BOWSpamDetector
# create your instance of the class
myInstance = BOWSpamDetector(normalTextFilePATH='./Dataset/normal.txt',spamTextFilePATH='./Dataset/spam.txt', stopWordsTextFilePATH='./Dataset/stopwords_master/baidu_stopwords.txt')
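# For a given text TEXT (string)
TEXT = "..."
# transform TEXT into a vector; wrapped in a list on the assumption that the vectorizer, like scikit-learn's, expects an iterable of texts
vectorizedText = myInstance.countVectorizerModel.transform([TEXT])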
result = myInstance.naiveBayesModel.predict(vectorizedText) # result=0 or 1
If you want to use TfIdfSpamDetector, replace BOWSpamDetector with the corresponding name.
First, start listening to the given mailbox by running the code below:
from SpamEmailDetector.EmailListener import EmailListener
myEmail = EmailListener(
    email='xueyanhu******@*****', # hidden
    password='*********', # hidden
)
myEmail.startListening(10) # Listen to my email box every 10 seconds
Second, send myself ten emails from my own account, one every 10 seconds, by running the following code (sensitive information is hidden). In the variable testText, only number 7 is a spam email.
import datetime
from pprint import pprint
from SpamEmailDetector.EmailListener import EmailListener
import zmail
import time
if __name__ == '__main__':
    testText = [
"参加线上讲座的开发团队,可在讲座当天报名参与无障碍适配挑战活动,通过审核后我们将邀请你参加 5 月 18 日在上海设计与开发加速器举办的无障碍宣传日线下活动,在线下你将了解到更多无障碍开发技术,以及与其他开发者进行交流和互动。我们还将邀请使用无障碍功能的用户来分享他们的故事,了解 App 是如何赋能他们的日常生活;以及有经验的开发者来分享他们的工程实践,看如何在产品内部推进无障碍适配。你还可获得一对一咨询和深度辅导,获得针对你 App 的无障碍优化建议。",
"结果显示 ** 写入性能最大达到 ** 的 6.7 倍,InfluxDB 的 10.6 倍。此外,** 在写入过程中消耗了最少计算(CPU)资源和磁盘 IO 开销;相同落盘数据规模下,** 存储空间只有 InfluxDB 的 25%,只有 TimescaleDB 的 4%。此外,对于大多数查询类型,** 的性能均优于 InfluxDB 和 TimescaleDB,在 Complex queries 类型的查询中展现出巨大的优势——** 的 Complex queries 查询性能最高达到了 InfluxDB 的 37 倍、 TimescaleDB 的 28.6 倍。",
"在过去一个月美国储户的“存款大迁徙”中,货币市场基金显然成为了大赢家。 面对银行业持续动荡,寻求更高收益率的投资者大批涌入了美国货币市场基金,这导致货币市场基金的资产规模一路飙升至了创纪录的水平。货币市场基金自身具有的避险吸引力和远远超过银行存款的收益率,吸引了大量的投资者。",
"感谢您注册参加**线上外汇交易讲座,本期讲座时间为2023年4月25日北京时间晚 8:30-9:30 pm。 本次讲座将使用腾讯会议,建议您提前安装app。"
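    ]  # only some of the ten test texts are shown here
    testMail = [
        {'subject': "NO.{} Email".format(i), 'content_text': content}
        for i, content in enumerate(testText)
    ]
    server = zmail.server('*******@*****', '********')  # hidden
    for i, mail in enumerate(testMail):
        pprint('sending Mail: \n{}'.format(mail))
        server.send_mail('*******@*****', mail)  # send the mail to my own address
        pprint('Mail sent at {}'.format(datetime.datetime.now()))
        if i == 7:
            print('Mail sent is SPAM!')
        else:
            print('Mail sent is NORMAL!')
        time.sleep(10)  # wait 10 seconds before sending the next mail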
The result looks like this (the red boxes mark the corresponding entries):
[Figure: an example of a normal email]
[Figure: another example, of a spam email]
This part describes the details of the package.
Namely, the package includes Term Frequency-Inverse Document Frequency (TfIdf) and Bag of Words (BOW) models of Chinese sentences.
The dataset includes three parts:
SpamEmailDetector/Dataset/normal.txt: contains the normal emails, 5000 emails in total;
SpamEmailDetector/Dataset/spam.txt: contains the spam emails, 5000 emails in total;
SpamEmailDetector/Dataset/stopwords_master: a folder containing four of the most frequently used stop-word lists, for both Chinese and English. See SpamEmailDetector/Dataset/stopwords_master/ for more details on the four files.
We apply two methods for transforming text data into vectors, and use a Naive Bayes model for binary classification.
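To make the two vectorization methods concrete, here is a minimal pure-Python sketch on a toy corpus that has already been tokenized. It is an illustration only: the function names are hypothetical, the idf formula is one common smoothed variant, and the package's own vectorizers may differ in weighting and normalization.

```python
import math
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def tf_idf(documents, vocabulary):
    """Weight raw term counts by a smoothed inverse document frequency."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each word.
    df = {w: sum(1 for doc in documents if w in doc) for w in vocabulary}
    # Smoothed idf (assumed variant): ln((1 + n) / (1 + df)) + 1
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}
    return [[Counter(doc)[w] * idf[w] for w in vocabulary] for doc in documents]


# Toy corpus of pre-tokenized documents (made-up example data).
docs = [["免费", "中奖", "免费"], ["会议", "通知"], ["免费", "会议"]]
vocab = sorted({word for doc in docs for word in doc})

bow = [bag_of_words(doc, vocab) for doc in docs]
weights = tf_idf(docs, vocab)
```

The BOW vector keeps raw counts, while TfIdf gives a word that occurs in fewer documents a larger weight per occurrence than a word that occurs in many.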
As for the hyperparameters of the final model, we run a grid search to find the best-fitting values. We split the data into training and testing sets in a 66.66% to 33.33% ratio and vectorize the training set accordingly, after tokenizing with the jieba package and removing stop words. The model is then trained under the pre-specified hyperparameters. We use the test set to evaluate the performance of our model.
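The grid search procedure can be sketched as below. This is an assumption-laden illustration rather than the package's code: `grid_search`, `evaluate`, and the `threshold` hyperparameter are hypothetical, and a toy scorer stands in for training and testing a real Naive Bayes model.

```python
from itertools import product


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy stand-in for "train on the training set, score on the test set".
y_true = [1, 1, 0, 0, 1]
spam_scores = [0.9, 0.6, 0.4, 0.2, 0.3]


def evaluate(params):
    y_pred = [1 if s >= params["threshold"] else 0 for s in spam_scores]
    return f1_score(y_true, y_pred)


best_params, best_score = grid_search({"threshold": [0.25, 0.5, 0.75]}, evaluate)
```

In a real run, `evaluate` would vectorize the training set, fit the model with the given hyperparameters, and return the F1 score on the test set.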
Note that words that appear only in the testing set are ignored when we evaluate the model on the test set. This is because, at training time, the model cannot know which words will appear later. This assumption also matches reality.
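A small sketch of this point, assuming the vocabulary is built from the training set only (the helper names are hypothetical and the documents are made-up, pre-tokenized examples): a word seen only at test time has no column in the vector, so it is silently dropped.

```python
from collections import Counter


def fit_vocabulary(train_docs):
    """Build the vocabulary from training documents only."""
    return sorted({token for doc in train_docs for token in doc})


def transform(doc, vocabulary):
    """Count vocabulary words in doc; out-of-vocabulary words are ignored."""
    counts = Counter(doc)
    return [counts[token] for token in vocabulary]


train_docs = [["免费", "中奖"], ["会议", "通知"]]
test_doc = ["免费", "发票", "免费"]  # "发票" never appears in the training set

vocabulary = fit_vocabulary(train_docs)
vector = transform(test_doc, vocabulary)
```

Here the test document has three tokens, but only the two occurrences of the known word are counted.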
For the TfIdf-NaiveBayes model, the grid search results are as below:
For the BOW-NaiveBayes model, the grid search results are as below:
For TfIdf-NaiveBayes, the model has the biggest F1 score when the
After selecting the best-fitting hyperparameters, we train the final model on the whole dataset. If you run the trainFinalModel method, the vectorizer and the Naive Bayes model are stored in two attributes, self.countVectorizerModel and self.naiveBayesModel, respectively.