
old-cs-433-project-2-2020-roll_king's Introduction

Complete Sentence Detection for Speech Recognition Systems

Project 2 (EPFL Machine Learning Course CS-433)

This repository contains all the code for Project 2.

Members:

  • Bohan Wang (321293)

  • Ke Wang (326760)

  • Siran Li (321825)

Datasets

  1. News reports: 143,000 articles from 15 American publications [1].
  2. Ted 2020 Parallel Sentences Corpus: around 4000 TED Talk transcripts from July 2020 [2].
  3. Wikipedia corpus: over 10 million topics [3].
  4. Topical-Chat: human dialog conversations spanning 8 broad topics [4].

Transformers

All of the pre-trained transformers listed below are from Hugging Face [5].

  1. Generative Pre-trained Transformer 2 (GPT-2) [6].
  2. Bidirectional Encoder Representations from Transformers (BERT) [7].
  3. Big Bird: Transformers for Longer Sequences [8].
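The three transformers above could be loaded for classification roughly as follows. This is a minimal sketch, not the project's own loading code: the checkpoint names in `CHECKPOINTS`, the helper `load_classifier`, and the default of two labels (complete vs. incomplete sentence) are assumptions, since the README does not state which checkpoints or label counts were used.

```python
# Checkpoint names are assumptions; the README does not say which were used.
CHECKPOINTS = {
    "gpt2": "gpt2",
    "bert": "bert-base-uncased",
    "bigbird": "google/bigbird-roberta-base",
}


def load_classifier(model_key, num_labels=2):
    """Load a tokenizer and a sequence-classification model for one checkpoint."""
    # Imported lazily so this module can be read without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = CHECKPOINTS[model_key]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(
        name, num_labels=num_labels
    )

    # GPT-2 ships without a padding token; reuse end-of-sequence for batching.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = tokenizer.pad_token_id

    return tokenizer, model
```

Calling `load_classifier("bert")` downloads the checkpoint on first use, so it needs network access and enough disk space for the weights.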

Notes

The packages used in the project can be installed using:

pip install datasets

pip install transformers

pip install nltk

pip install sentencepiece
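A quick way to confirm that the installs succeeded is to check that each package can be found for import. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical and not part of the repository.

```python
from importlib.util import find_spec

# Import names of the four packages installed above.
REQUIRED = ["datasets", "transformers", "nltk", "sentencepiece"]


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```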

Structure

models.py: contains the model definitions for the BiLSTM, TextCNN and Transformer classes

utils.ipynb: contains the helper functions for pre-processing

Pre-processing.ipynb: contains the code to preprocess raw text

train_models.ipynb: contains the code to:

  • Fine-tune BERT on the standard dataset
  • Fine-tune GPT-2 on the standard dataset
  • Fine-tune Big Bird on the standard dataset
  • BERT word embedding + BiLSTM
  • BERT word embedding + TextCNN
  • Fine-tune BERT on the large dataset
  • Fine-tune BERT with multi-label data

random_forest.ipynb: contains the code to aggregate the five trained models with a random forest.
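Aggregating several trained models with a random forest can be sketched as stacking: each base model's output becomes one feature, and the forest learns how to combine them. The code below is an illustration under assumptions, not the contents of random_forest.ipynb. It assumes each model outputs one probability per example, and the helper names `stack_features` and `fit_aggregator` are hypothetical. It relies on scikit-learn, which is not among the packages listed above.

```python
def stack_features(per_model_probs):
    """Turn one probability list per model into one feature row per example.

    per_model_probs[m][i] is model m's predicted probability for example i.
    """
    lengths = {len(probs) for probs in per_model_probs}
    if len(lengths) != 1:
        raise ValueError("every model must score the same number of examples")
    # zip(*...) transposes model-major lists into example-major rows.
    return [list(row) for row in zip(*per_model_probs)]


def fit_aggregator(per_model_probs, labels, n_estimators=100, seed=0):
    """Fit a random forest on the stacked base-model outputs."""
    # Imported lazily; scikit-learn is an extra dependency for this sketch.
    from sklearn.ensemble import RandomForestClassifier

    features = stack_features(per_model_probs)
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    forest.fit(features, labels)
    return forest
```

With five models, `stack_features` yields rows of five values each, and the fitted forest can then score new examples with `forest.predict(stack_features(test_probs))`. The forest should be fitted on examples the base models were not trained on, otherwise it learns from overconfident outputs.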

Instructions

The preprocessing of raw text can be reproduced in:

Pre-processing.ipynb

However, the raw data is too large to host on GitHub, so it is not included. You can train and test the models directly on the processed datasets, without running Pre-processing.ipynb.

You can reproduce the test performances of different models in:

reproduce.ipynb

You can train the models in:

train_models.ipynb

All the necessary datasets and models for training and reproducing the results using reproduce.ipynb and train_models.ipynb can be downloaded at: https://drive.google.com/drive/folders/1sRMolxsVHLiLphfnS4NpDM0766XFOkZU?usp=sharing

References

[1] A. Thompson, (2017) “All the news: 143,000 articles from 15 American publications,” https://www.kaggle.com/snapcrack/all-the-new

[2] N. Reimers and I. Gurevych, (2020) “Making monolingual sentence embeddings multilingual using knowledge distillation,” arXiv preprint arXiv:2004.09813

[3] Wikimedia Foundation. Wikimedia downloads. [Online]. Available: https://dumps.wikimedia.org

[4] K. Gopalakrishnan, B. Hedayatnia, Q. Chen, A. Gottardi, S. Kwatra, A. Venkatesh, R. Gabriel, D. Hakkani-Tür, and A. A. AI, (2019) “Topical-Chat: Towards knowledge-grounded open-domain conversations,” in INTERSPEECH, pp. 1891–1895

[5] Hugging Face, an AI community that provides open-source NLP software, https://huggingface.co/

[6] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.

[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.

[8] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang et al., “Big Bird: Transformers for longer sequences,” in NeurIPS, 2020.
