
spam-filter's Introduction

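Lab Objectives

  • Understand the double-array implementation of the Aho-Corasick (AC) string-matching algorithm
  • Learn how to extract email attachments and how to call document content extraction routines
  • Understand Chinese character encodings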
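As a small illustration of the encoding objective: Chinese email content commonly arrives as either UTF-8 or a GB-family encoding (GB2312, GBK, GB18030). The sketch below tries a list of candidate encodings in order. The function name `decode_mail_bytes` and the candidate order are assumptions made for this example, not something taken from the project code.

```python
def decode_mail_bytes(data, candidates=("utf-8", "gb18030")):
    """Decode bytes by trying each candidate encoding in order.

    Hypothetical helper: strict UTF-8 is tried first because it rejects
    most GBK byte sequences; GB18030 is a superset of GBK and GB2312.
    Returns (text, encoding_label).
    """
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters instead of failing.
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"


gbk_bytes = "代开发票".encode("gbk")
utf8_bytes = "代开发票".encode("utf-8")
print(decode_mail_bytes(gbk_bytes))
print(decode_mail_bytes(utf8_bytes))
```

When an email declares its charset in the Content-Type header, that declared value should normally take priority over guessing like this.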

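Lab Tasks

  • Read the code of the simple spam-detection program, located at user-qjz/spam-filter.

  • Modify some entries in the email dataset (either normal or spam) so that their content includes the keyword to be detected, "代开安装发票" (issuing installation invoices on someone's behalf).

  • Implement the AC string-matching algorithm. The pattern set consists of phrases advertising fake invoices and forged certificates: {代开安装发票, 代开普通发票, 代开商品发票, 代开国税发票, 代开地税发票, 代开广告发票, 代开运输发票, 代开租赁发票, 代开维修发票, 代开建筑发票, 代开安装发票, 代开餐饮发票, 代开服务发票, 代办警官证, 办理假证件, 代办学位证, 办理毕业证, 办证件文凭, 诚信办文凭, 办证件刻章, 办理身份证, 刻章办证件, 办四六级证, 办上网文凭}.

  • Locate the email content parsing function in the code, extract the email body, and call the AC matching function to check whether the body contains any string from the pattern set. Return the match result, and output the email body together with the match result.

  • [Optional, extra credit] Extract the attachments from the emails. For attachments of type docx, pptx, xlsx, or pdf, call Python file-parsing functions to extract the content, then call the AC matching function to check whether the content contains any string from the pattern set. Return the match result, and output the attachment together with the match result.

  • [Optional, extra credit] Extend the UI to display the match results.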

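Project Structure

get_mail.py: modified from the original project. It fetches emails from the QQ Mail IMAP server, writes the email bodies to email_test_input.txt and the attachments to the /attachments directory, and stores its log in /logs/email_fetch.log.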
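Once the raw bytes of a message have been fetched over IMAP, splitting them into body and attachments can be done with the standard-library `email` package. The following is a minimal sketch of that step only, not the project's get_mail.py; the function name `extract_body_and_attachments` and the demo message are hypothetical.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_body_and_attachments(raw):
    """Split raw RFC 822 bytes into (body_text, [(filename, payload_bytes), ...])."""
    # policy.default makes the parser return EmailMessage objects,
    # which provide get_body() and iter_attachments().
    msg = message_from_bytes(raw, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    attachments = [
        (part.get_filename(), part.get_payload(decode=True))
        for part in msg.iter_attachments()
    ]
    return body, attachments


# Build a small message in memory instead of contacting a mail server.
demo = EmailMessage()
demo.set_content("本公司代开安装发票")
demo.add_attachment(
    b"placeholder bytes",
    maintype="application",
    subtype="octet-stream",
    filename="invoice.docx",
)
body, files = extract_body_and_attachments(demo.as_bytes())
print(body.strip())
print([name for name, _ in files])
```

In a real fetch loop the `raw` argument would come from the IMAP response for each message, and each payload would be written to the attachments directory under its filename.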

changeFormat.py: taken from the original project. It traverses the normal and spam directories and extracts the Chinese text into chineseoutput.txt.
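One common way to keep only the Chinese characters of a text is a regular expression over the CJK Unified Ideographs range. The sketch below shows that idea under the assumption that punctuation, Latin letters, and digits are all discarded; the actual rules in changeFormat.py may differ, and `extract_chinese` is a name chosen for this example.

```python
import re

# Runs of characters in the basic CJK Unified Ideographs block.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def extract_chinese(text):
    """Return the Chinese characters of text joined into one string."""
    return "".join(CJK_RUN.findall(text))


print(extract_chinese("Hi, 代开 invoice 发票!123"))  # 代开发票
```

Writing one such string per email, one per line, would give a file that can later be matched line by line.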

ac.py: implements the AC matching algorithm.
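For readers unfamiliar with the algorithm, here is a compact Aho-Corasick sketch. It deliberately uses plain Python dictionaries for the transitions rather than the double-array layout the lab asks for, so it illustrates the matching logic (trie, failure links, merged outputs) and not the project's ac.py. The class and method names are assumptions.

```python
from collections import deque


class AhoCorasick:
    """Dictionary-based Aho-Corasick automaton (not the double-array variant)."""

    def __init__(self, patterns):
        self.goto = [{}]   # goto[node][char] -> child node
        self.fail = [0]    # failure link of each node
        self.out = [[]]    # patterns that end at each node
        for p in dict.fromkeys(patterns):  # drop duplicates, keep order
            self._insert(p)
        self._build_failure_links()

    def _insert(self, pattern):
        node = 0
        for ch in pattern:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = nxt
            node = nxt
        self.out[node].append(pattern)

    def _build_failure_links(self):
        # Breadth-first: depth-1 nodes keep fail = 0 (the root).
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return a list of (start_index, pattern) for every occurrence."""
        hits = []
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for p in self.out[node]:
                hits.append((i - len(p) + 1, p))
        return hits


ac = AhoCorasick(["代开安装发票", "办理毕业证", "代开安装发票"])
print(ac.search("本公司代开安装发票,也可办理毕业证"))
# [(3, '代开安装发票'), (12, '办理毕业证')]
```

Because Python strings are sequences of Unicode code points, the automaton matches Chinese text character by character once the input has been decoded. A double-array version would replace the per-node dictionaries with `base` and `check` integer arrays while keeping the same failure-link logic.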

attachmentAC.py: extracts the content of the docx, pptx, xlsx, and pdf attachments in the /attachments directory and runs the AC matching algorithm on it. Its log is stored in /logs/attachmentAC.log.
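The source does not say which parsing libraries attachmentAC.py relies on. As a dependency-free illustration for the docx case only, the sketch below reads the text directly from the file: a .docx is a ZIP archive whose main text lives in `word/document.xml`. The function name `docx_text` is hypothetical, and pptx, xlsx, and pdf are not covered here (pdf in particular needs a dedicated parser).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(data):
    """Return the paragraph text of a .docx file given its raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return "\n".join(paragraphs)
```

The returned string can then be handed to the AC matching function in the same way as an email body.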

main.py: matches chineseoutput.txt line by line and creates a GUI that displays the results.

Running the Project

# Create and configure the virtual environment
conda create -n spamFilter python=3.9
conda activate spamFilter

pip install -r requirements.txt

# Run get_mail.py (edit the username and password in the file first); it outputs email_test_input.txt, and the log is written to /logs/email_fetch.log
python get_mail.py

# Run changeFormat.py; it outputs chineseoutput.txt
python changeFormat.py

# Run main.py to analyze chineseoutput.txt and display the results in a graphical interface
python main.py

# Run attachmentAC.py to analyze the attachments; the log is written to /logs/attachmentAC.log
python attachmentAC.py

spam-filter's People

Contributors

user-qjz, windsland52
