text-mining

Unstructured Data Analysis (Graduate) @Korea University

Notice

  • Syllabus (download)
  • Term project groups
    • Group 1: 박성훈, 이수빈(2018021120), 이준걸, 박혜준
    • Group 2: 이정호, 천우진, 유초롱, 조규원
    • Group 3: 백승호, 목충협, 변준형, 이영재
    • Group 4: 박건빈, 이수빈(2018020530), 변윤선, 권순찬
    • Group 5: 최종현, 이정훈, 박중민, 노영빈
    • Group 6: 백인성, 김은비, 신욱수, 강현규
    • Group 7: 전성찬, 박현지, 문관영
    • Group 8: 조용원, 정승섭, 민다빈, 최민서
    • Group 9: 박명현, 장은아, 유건령
  • Term project proposal
  • Term project interim presentation
  • Term project final presentation
  • Final Exam
    • Date and Place: 2019-06-20 15:30~17:30, New Engineering Hall 218/224 (download)
    • A non-programmable calculator is allowed; smartphones must be turned off
    • A handwritten cheat sheet (A4 size, 3 pages, double-sided) is allowed

Recommended courses

Schedule

Topic 1: Introduction to Text Analytics

  • The usefulness of large amounts of text data and the challenges
  • Overview of text analytics methods

Topic 2: From Texts to Data

  • Text data collection: Web scraping

Topic 3: Text Preprocessing

  • Introduction to Natural Language Processing (NLP)
  • Lexical analysis
  • Syntax analysis
  • Other topics in NLP
  • Reading materials
    • Cambria, E., & White, B. (2014). Jumping NLP curves: A review of natural language processing research. IEEE Computational intelligence magazine, 9(2), 48-57. (PDF)
    • Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug), 2493-2537. (PDF)
    • Young, T., Hazarika, D., Poria, S., & Cambria, E. (2017). Recent trends in deep learning based natural language processing. arXiv preprint arXiv:1708.02709. (PDF)

Topic 4: Neural Networks Basics

  • Perceptron, Multi-layer Perceptron
  • Convolutional Neural Networks (CNN)
  • Recurrent Neural Networks (RNN)
  • Practical Techniques

Topic 5-1: Document Representation I: Classic Methods

  • Bag of words
  • Word weighting
  • N-grams
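
As a rough illustration of the three ideas listed for this topic, the sketch below builds bag-of-words counts, one common TF-IDF weighting (raw term frequency times log(N / document frequency)), and word n-grams. It is not taken from the course materials: the function names, the whitespace tokenization, and the toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens):
    """Count how often each term occurs in one document."""
    return Counter(tokens)


def tf_idf(docs):
    """Weight each term by raw count times log(N / document frequency)."""
    n_docs = len(docs)
    bags = [bag_of_words(doc) for doc in docs]
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for bag in bags for term in bag)
    return [
        {term: count * math.log(n_docs / doc_freq[term]) for term, count in bag.items()}
        for bag in bags
    ]


# Hypothetical toy corpus, tokenized by whitespace for illustration only.
docs = [s.split() for s in ["text mining is fun",
                            "text data is large",
                            "mining large text data"]]
weights = tf_idf(docs)
bigrams = ngrams(docs[0], 2)
```

With this weighting, a term that appears in every document (here "text") gets weight zero, while a term unique to one document (here "fun") gets the largest weight. Other weighting variants (smoothed IDF, log-scaled term frequency) follow the same pattern.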

Topic 5-2: Document Representation II: Distributed Representation

  • Word2Vec
  • GloVe
  • FastText
  • Doc2Vec
  • Reading materials
    • Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of machine learning research, 3(Feb), 1137-1155. (PDF)
    • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. (PDF)
    • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119). (PDF)
    • Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543). (PDF)
    • Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606. (PDF)

Topic 6: Dimensionality Reduction

  • Dimensionality Reduction
  • Supervised Feature Selection
  • Unsupervised Feature Extraction: Latent Semantic Analysis (LSA) and t-SNE
  • R Example
  • Reading materials
    • Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American society for information science, 41(6), 391. (PDF)
    • Dumais, S. T. (2004). Latent semantic analysis. Annual review of information science and technology, 38(1), 188-230.
    • Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(Nov), 2579-2605. (PDF) (Homepage)

Topic 7: Document Similarity & Clustering

  • Document similarity metrics
  • Clustering overview
  • K-Means clustering
  • Hierarchical clustering
  • Density-based clustering
  • Reading materials
    • Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM computing surveys (CSUR), 31(3), 264-323. (PDF)
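
Two widely used document similarity metrics, cosine similarity on term-count vectors and Jaccard similarity on term sets, can be sketched as follows. This is a minimal illustration and not the course's own code: the function names and the example sentences are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors (dicts)."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for empty documents.
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # Convention chosen here for two empty documents.
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical example documents, tokenized by whitespace.
doc1 = Counter("text mining is fun".split())
doc2 = Counter("text mining is hard".split())
doc3 = Counter("deep neural networks".split())

sim_close = cosine_similarity(doc1, doc2)
sim_far = cosine_similarity(doc1, doc3)
jac_close = jaccard_similarity(doc1, doc2)
```

Documents that share most of their terms score close to 1 and documents with no terms in common score 0. A clustering method such as K-Means or hierarchical clustering would typically operate on similarities or distances of this kind.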

Topic 8-1: Topic Modeling I

  • Topic modeling overview
  • Probabilistic Latent Semantic Analysis: pLSA
  • LDA: Document Generation Process
  • Reading materials
    • Hofmann, T. (1999, July). Probabilistic latent semantic analysis. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence (pp. 289-296). Morgan Kaufmann Publishers Inc. (PDF)
    • Hofmann, T. (2017, August). Probabilistic latent semantic indexing. In ACM SIGIR Forum (Vol. 51, No. 2, pp. 211-218). ACM.
    • Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84. (PDF)
    • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), 993-1022. (PDF)

Topic 8-2: Topic Modeling II

  • LDA Inference: Gibbs Sampling
  • LDA Evaluation
  • Recommended video lectures

Topic 9: Document Classification

  • Document classification overview
  • Naive Bayes classifier
  • RNN-based document classification
  • CNN-based document classification
  • Reading materials
    • Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. (PDF)
    • Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level convolutional networks for text classification. In Advances in neural information processing systems (pp. 649-657). (PDF)
    • Lee, G., Jeong, J., Seo, S., Kim, C., & Kang, P. (2018). Sentiment classification with word localization based on weakly supervised learning with a convolutional neural network. Knowledge-Based Systems, 152, 70-82. (PDF)
    • Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016). Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1480-1489). (PDF)
    • Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. (PDF)
    • Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. (PDF)

Topic 10: Sentiment Analysis

  • Architecture of sentiment analysis
  • Lexicon-based approach
  • Machine learning-based approach
  • Reading materials
    • Hamilton, W. L., Clark, K., Leskovec, J., & Jurafsky, D. (2016, November). Inducing domain-specific sentiment lexicons from unlabeled corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing (Vol. 2016, p. 595). NIH Public Access. (PDF)
    • Zhang, L., Wang, S., & Liu, B. (2018). Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4), e1253. (PDF)
