stanleyyang1987 / textclassify

This project forked from yanzhongli/textclassify

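This text classification project is aimed mainly at machine learning beginners and at people who want to test how well text classifiers perform. It contains several classification algorithms (naive Bayes, cosine similarity, logistic regression) and the mm and rmm word segmenters, together with more than 6,000 articles in multiple categories crawled from a news site and a Chinese dictionary. The project is easy to extend with new classifiers and segmenters, which can then be combined to test classification performance.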

Java 100.00%

textclassify's Introduction

Implementations of several text classification algorithms, with performance tests

Project overview

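The project is aimed mainly at machine learning beginners and at anyone who wants to test how well different text classification algorithms perform. After downloading it you can run the tests directly and compare the classification algorithms and the Chinese word segmentation; performance is reported as the prediction accuracy in the run output.
To keep each algorithm's implementation easy to see, the project does not pull in any other machine learning jars; all of the algorithm logic can be read in the project's own code.
The project is organized around the factory pattern, so it is easy to add classifiers and segmenters and see how they perform.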
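
To make the factory idea concrete, here is a minimal sketch of a registry-style factory that pairs a segmenter with a classification model. Every name in it (Segmenter, Classifier, WhitespaceSegmenter, OverlapClassifier, SketchFactory) is hypothetical and chosen for illustration; the project's real WordSegmenter, ClassifyModel and ClassifierFactory may have different methods and signatures. The toy model only counts shared words and is not one of the project's algorithms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-ins; the project's real WordSegmenter and ClassifyModel may differ.
interface Segmenter {
    List<String> segment(String text);
}

interface Classifier {
    void train(Map<String, List<String>> docsByCategory);
    String classify(String text);
}

// Splits on whitespace; a placeholder for a real Chinese segmenter.
class WhitespaceSegmenter implements Segmenter {
    @Override
    public List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}

// Toy model: predicts the category whose training vocabulary shares
// the most distinct words with the input text.
class OverlapClassifier implements Classifier {
    private final Segmenter segmenter;
    private final Map<String, Set<String>> vocabByCategory = new HashMap<>();

    OverlapClassifier(Segmenter segmenter) {
        this.segmenter = segmenter;
    }

    @Override
    public void train(Map<String, List<String>> docsByCategory) {
        for (Map.Entry<String, List<String>> e : docsByCategory.entrySet()) {
            Set<String> vocab = vocabByCategory.computeIfAbsent(e.getKey(), k -> new HashSet<>());
            for (String doc : e.getValue()) {
                vocab.addAll(segmenter.segment(doc));
            }
        }
    }

    @Override
    public String classify(String text) {
        Set<String> words = new HashSet<>(segmenter.segment(text));
        String best = null;
        int bestOverlap = -1;
        for (Map.Entry<String, Set<String>> e : vocabByCategory.entrySet()) {
            int overlap = 0;
            for (String w : words) {
                if (e.getValue().contains(w)) {
                    overlap++;
                }
            }
            if (overlap > bestOverlap) {
                bestOverlap = overlap;
                best = e.getKey();
            }
        }
        return best; // null if nothing has been trained
    }
}

// Registry-style factory: components are registered by name, then assembled on request.
class SketchFactory {
    private static final Map<String, Supplier<Segmenter>> SEGMENTERS = new HashMap<>();
    private static final Map<String, Function<Segmenter, Classifier>> MODELS = new HashMap<>();

    static void registerSegmenter(String name, Supplier<Segmenter> supplier) {
        SEGMENTERS.put(name, supplier);
    }

    static void registerModel(String name, Function<Segmenter, Classifier> builder) {
        MODELS.put(name, builder);
    }

    static Classifier create(String modelName, String segmenterName) {
        Supplier<Segmenter> segmenter = SEGMENTERS.get(segmenterName);
        Function<Segmenter, Classifier> model = MODELS.get(modelName);
        if (segmenter == null || model == null) {
            throw new IllegalArgumentException("unknown model or segmenter: " + modelName + ", " + segmenterName);
        }
        return model.apply(segmenter.get());
    }
}
```

With this shape, adding a component is one registration call, for example SketchFactory.registerSegmenter("whitespace", WhitespaceSegmenter::new), and any registered model can then be paired with any registered segmenter through SketchFactory.create.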

Materials

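The ./seeds folder contains training sets for 17 categories. Each category holds several hundred news articles (crawled from a news site; all articles in a category come from the same section of that site), more than 6,000 articles in total. To add or replace a training set, place the files at the same directory level.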

The ./dic folder contains one Chinese dictionary file with more than 450,000 Chinese words.
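
A word list like this is what dictionary-based segmenters consume. As an illustration, the sketch below shows reverse maximum matching (RMM) over a small in-memory dictionary. It is an assumption-laden sketch, not the project's code: the class name RmmSketch, the segment signature and the maxWordLen parameter are hypothetical, and the format of the dictionary file is not described here, so the words are supplied directly as a Set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RmmSketch {
    // Scans the text from right to left. At each position it tries the longest
    // candidate (up to maxWordLen chars) ending there and shrinks it from the
    // left until it is a dictionary word or a single character.
    // Works on UTF-16 chars, so characters outside the BMP are not handled.
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> tokens = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int start = Math.max(0, end - maxWordLen);
            while (start < end - 1 && !dict.contains(text.substring(start, end))) {
                start++;
            }
            tokens.add(text.substring(start, end));
            end = start;
        }
        // Tokens were collected back to front; restore reading order.
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("研究", "研究生", "生命", "起源"));
        System.out.println(segment("研究生命起源", dict, 3)); // prints [研究, 生命, 起源]
    }
}
```

Forward maximum matching on the same input would greedily take 研究生 first and leave 命 stranded, which is why the two directions are worth comparing.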

Usage and code notes

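com.fmyblack.ClassifyTest is the entry class. Its main method randomly splits all documents into a training set and a test set at a given ratio, obtains a classifier from the factory class, trains it, and runs it on the test set to measure classification performance.
com.fmyblack.ClassifierFactory is the factory class. Its getClassifyModel method combines a classification algorithm with a segmentation algorithm and returns a classifier.
The project currently includes several classification models: naive Bayes (com.fmyblack.textClassify.naiveBaye), logistic regression (com.fmyblack.textClassify.lr), and cosine similarity (com.fmyblack.textClassify.cosine). It also includes several segmentation algorithms, such as reverse maximum matching (com.fmyblack.word.rmm); they can be combined freely when testing.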
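
The split-then-score flow of the entry class can be sketched roughly as follows. This is a hedged illustration only: SplitSketch, Doc, randomSplit and accuracy are hypothetical names, the 0.8 ratio and the seed are arbitrary, and the real ClassifyTest may split and score differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

class SplitSketch {
    // A labelled document: its true category plus the raw text.
    static final class Doc {
        final String category;
        final String text;

        Doc(String category, String text) {
            this.category = category;
            this.text = text;
        }
    }

    // Shuffles a copy of the input and cuts it at trainRatio.
    // Element 0 of the result is the training set, element 1 the test set.
    static <T> List<List<T>> randomSplit(List<T> items, double trainRatio, long seed) {
        if (trainRatio < 0.0 || trainRatio > 1.0) {
            throw new IllegalArgumentException("trainRatio must be in [0, 1]");
        }
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainRatio);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    // Fraction of test documents whose predicted category equals the true one.
    static double accuracy(List<Doc> testSet, Function<String, String> predict) {
        if (testSet.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (Doc d : testSet) {
            if (d.category.equals(predict.apply(d.text))) {
                correct++;
            }
        }
        return (double) correct / testSet.size();
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            docs.add(new Doc(i % 2 == 0 ? "sports" : "finance", "doc " + i));
        }
        List<List<Doc>> parts = randomSplit(docs, 0.8, 42L);
        // A deliberately naive predictor that always answers "sports".
        double acc = accuracy(parts.get(1), text -> "sports");
        System.out.println("train=" + parts.get(0).size()
                + " test=" + parts.get(1).size() + " accuracy=" + acc);
    }
}
```

Fixing the seed makes a run repeatable, which helps when comparing two classifier and segmenter combinations on the same split.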

Extending the project

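To add a new classification algorithm, implement the com.fmyblack.textClassify.ClassifyModel interface.
To add a new segmentation algorithm, implement the com.fmyblack.word.WordSegmenter interface.
Register the new algorithm in the factory com.fmyblack.ClassifierFactory.
The com.fmyblack.textClassify.doc package implements some basic operations on the training set, and com.fmyblack.textClassify.IDF implements the IDF algorithm; both are available for reuse.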
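
For readers new to IDF, the sketch below computes inverse document frequency over already segmented documents using the common form idf(t) = ln(N / df(t)), where N is the number of documents and df(t) the number of documents containing term t. This is an assumption: the project's IDF class may use a different formula (for example with smoothing) and a different API. IdfSketch and its idf method are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class IdfSketch {
    // Each document is a list of tokens. A term counts once per document,
    // no matter how often it repeats inside that document.
    static Map<String, Double> idf(List<List<String>> docs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        int n = docs.size();
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            // Unsmoothed variant: a term present in every document gets 0.
            result.put(e.getKey(), Math.log((double) n / e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("a", "c"),
                Arrays.asList("a", "a", "d"));
        Map<String, Double> idf = idf(docs);
        System.out.println("idf(a)=" + idf.get("a") + " idf(b)=" + idf.get("b"));
    }
}
```

Terms that appear in every document carry no discriminating power under this formula, which is the property a classifier relies on when weighting words.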
