
smp2018's Introduction

SMP 2018 (1st prize)

This contest asks participants to distinguish human-written articles from machine-generated ones. We won first place out of 240 teams.

Task description

Given an article, the task is to build an algorithm that identifies the type of author: automatic summarization, machine translation, robot writer, or human writer. For more details, see SMP EUPT 2018.

1.Set up

  • tensorflow >= 1.4.0
  • keras >= 1.2.0
  • gensim
  • scikit-learn
    You may also need keras.utils.vis_utils for model visualization.

2.Data Preprocessing

  • my_utils/: for data preprocessing
    • my_utils/data: convert the original data to CSV files
    • my_utils/data_preprocess: create the data sequences and batches used as input to the deep learning models
    • my_utils/w2v_process: build the vocabularies and load the pre-trained embeddings for words and chars
    • my_utils/metrics: calculate the precision, recall and F1 score for each category of author
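
As a rough illustration of what "data sequences and batches" means here, the sketch below turns tokenized articles into fixed-length integer sequences at both the word and the character level. It is not the code in my_utils/data_preprocess: the helper names (build_vocab, to_sequence, batches), the reserved ids and the sequence lengths are all assumptions made for the example.

```python
from typing import Dict, Iterator, List

PAD, UNK = 0, 1  # reserved ids (an assumption of this sketch)


def build_vocab(token_lists: List[List[str]]) -> Dict[str, int]:
    """Map each distinct token to an integer id, starting after PAD and UNK."""
    vocab: Dict[str, int] = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 2)
    return vocab


def to_sequence(tokens: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert tokens to ids, truncate to max_len, and right-pad with PAD."""
    ids = [vocab.get(tok, UNK) for tok in tokens[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def batches(rows: List[List[int]], batch_size: int) -> Iterator[List[List[int]]]:
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Toy pre-tokenized articles; a real pipeline would segment the text first.
docs = [["the", "cat", "sat"], ["the", "dog"]]
word_vocab = build_vocab(docs)
word_seqs = [to_sequence(d, word_vocab, max_len=4) for d in docs]

# Character-level view of the same articles.
char_docs = [list("".join(d)) for d in docs]
char_vocab = build_vocab(char_docs)
char_seqs = [to_sequence(c, char_vocab, max_len=8) for c in char_docs]
```

Each article thus yields one word sequence and one char sequence, which is the pair of inputs a combined word/char model would consume.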

3.Models

There are 12 models in total, each combining word representations and character representations. The best model we devised, word_rcnn_char_cgru, is inspired by two papers, which are named in the note below the score table.

Here are the scores of the different models:

model                  off-line   on-line
word_char_cnn          0.9888     0.9849
word_char_rnn          0.9894     0.9863
deep_word_char_cnn     0.9887     0.9828
word_rcnn_char_rnn     0.9899     0.9879
word_rnn_char_rcnn     0.9902     0.9872
word_char_cgru         0.9896     0.9861
word_cgru_char_rcnn    0.9904     untested
word_rcnn_char_cgru    0.9910     0.9882
word_cgru_char_rnn     0.9887     untested
word_rnn_char_cgru     0.9899     untested
word_rnn_char_cnn      0.9897     0.9862
word_char_rcnn         0.9894     0.9884
  • Note that rcnn comes from A Hybrid Framework for Text Modeling with Convolutional RNN, while cgru comes from A C-LSTM Neural Network for Text Classification.
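
To make the naming scheme concrete, here is a minimal Keras sketch of a two-branch model in the spirit of word_rcnn_char_cgru: the word branch runs a recurrent layer and then a convolution (rcnn), the char branch runs a convolution and then a GRU (cgru), and the two vectors are concatenated before a 4-way softmax over the author types. This is an assumption-laden illustration, not the architecture in this repository: every layer size, sequence length, vocabulary size and the exact wiring are made up for the example.

```python
from tensorflow.keras import layers, Model

# All sizes below are illustrative assumptions, not the repository's settings.
WORD_LEN, CHAR_LEN = 100, 300
WORD_VOCAB, CHAR_VOCAB = 5000, 1000
EMB_DIM, HIDDEN, FILTERS, NUM_CLASSES = 64, 32, 32, 4


def rcnn_branch(x):
    """RNN first, then convolution over the recurrent states, then max pooling."""
    x = layers.Bidirectional(layers.GRU(HIDDEN, return_sequences=True))(x)
    x = layers.Conv1D(FILTERS, kernel_size=3, activation="relu")(x)
    return layers.GlobalMaxPooling1D()(x)


def cgru_branch(x):
    """Convolution first, then a GRU that reads the convolved features."""
    x = layers.Conv1D(FILTERS, kernel_size=3, activation="relu")(x)
    return layers.GRU(HIDDEN)(x)


def build_word_rcnn_char_cgru() -> Model:
    """Build the hypothetical two-input model (word ids and char ids)."""
    word_in = layers.Input(shape=(WORD_LEN,), dtype="int32", name="word_ids")
    char_in = layers.Input(shape=(CHAR_LEN,), dtype="int32", name="char_ids")
    word_vec = rcnn_branch(layers.Embedding(WORD_VOCAB, EMB_DIM)(word_in))
    char_vec = cgru_branch(layers.Embedding(CHAR_VOCAB, EMB_DIM)(char_in))
    merged = layers.concatenate([word_vec, char_vec])
    out = layers.Dense(NUM_CLASSES, activation="softmax")(merged)
    return Model(inputs=[word_in, char_in], outputs=out)


model = build_word_rcnn_char_cgru()
```

Swapping which branch function is applied to which input gives the other word_*/char_* combinations in the table, under the same assumptions.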


The source code is derived from https://github.com/fuliucansheng/360
We use model to define the model architectures and train to train them.

4.Ensemble


We use LightGBM to ensemble the 12 models together with extra statistical features; the code is in ensemble. For more details, see https://github.com/TFknight/SMP-2018-Ensemble-Guide
On the test dataset, we use only a simple but efficient voting mechanism for ensembling, which is in evaluate/predict.
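
The voting step can be pictured with a plain majority vote over the labels predicted by each model. The sketch below is one possible reading, not the code in evaluate/predict: the function name majority_vote, the label strings and the tie-breaking rule are assumptions of this example.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-model label lists into one label per article.

    predictions[m][i] is the label model m assigns to article i.
    Ties go to the label that appears first in model order.
    """
    n_articles = len(predictions[0])
    if any(len(p) != n_articles for p in predictions):
        raise ValueError("every model must predict the same number of articles")
    voted = []
    for labels in zip(*predictions):
        # most_common keeps first-seen order among labels with equal counts.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical predictions from three models on three articles.
model_outputs = [
    ["human", "robot", "summary"],
    ["human", "translation", "summary"],
    ["robot", "robot", "human"],
]
final_labels = majority_vote(model_outputs)
```

Because ties fall to the earliest model in the list, ordering the models from strongest to weakest is a cheap way to let the best single model decide close calls.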

5.Main files

  • my_utils/: for data preprocessing
    • my_utils/data: convert the original data to CSV files
    • my_utils/data_preprocess: create the data sequences and batches used as input to the deep learning models
    • my_utils/w2v_process: build the vocabularies and load the pre-trained embeddings for words and chars
    • my_utils/metrics: calculate the precision, recall and F1 score for each category of author
  • models/: for creating deep learning models
    • deepzoo: for keeping all models
  • init/config.py: for storing the paths of models, data and so on
  • train: for training models
  • figure: for saving the visualization of models
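
For readers unfamiliar with per-category scoring, the following sketch shows how precision, recall and F1 can be computed for each author type from true and predicted labels. It is written in plain Python for clarity and is not the implementation in my_utils/metrics; the function name per_class_prf and the toy labels are assumptions.

```python
from typing import Dict, List, Tuple


def per_class_prf(y_true: List[str], y_pred: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} for every label in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # Guard against empty denominators by reporting 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy example with two of the four author types.
y_true = ["human", "human", "robot", "robot"]
y_pred = ["human", "robot", "robot", "robot"]
scores = per_class_prf(y_true, y_pred)
```

In this toy case "human" has perfect precision but only half recall, while "robot" has perfect recall and lower precision, which is the kind of imbalance a single accuracy number would hide.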

Acknowledgment


Thanks to my teammates in GDUFS-iiip for all their efforts.
We hope that more people will join our lab: Data Mining Lab in GDUFS(广外数据挖掘实验室)

smp2018's People

Contributors

quincy1994
