word2vecvn's Introduction

I. word2vecVN

Word2Vec models for Vietnamese

Download models:

  • Model trained on Vietnamese Wiki: click here.
  • Model trained on Le et al.'s data (window-size 5, 400 dims): click here.
  • Model trained on Le et al.'s data (window-size 2, 300 dims): click here.
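The traceback quoted in the issues further down loads a model with `binary=True`, which suggests that at least some of the downloads use the standard word2vec binary format. As a rough illustration of that layout (a header line with vocabulary size and dimension, then each word followed by a space and its float32 values), here is a minimal standard-library reader. The function name `read_word2vec_binary`, the toy words, and the little-endian assumption are illustrative and not taken from this repository; for real models a library loader such as gensim is the more practical route.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Read word2vec binary data from a binary stream into {word: tuple_of_floats}."""
    # Header line: "<vocab_size> <dim>\n"
    header = stream.readline().decode("utf-8")
    vocab_size, dim = (int(x) for x in header.split())
    vectors = {}
    for _ in range(vocab_size):
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of data while reading a word")
            if ch == b" ":
                break  # a space terminates the word
            if ch != b"\n":  # some writers emit a newline after each vector
                chars.extend(ch)
        word = chars.decode("utf-8", errors="ignore")
        data = stream.read(4 * dim)
        if len(data) != 4 * dim:
            raise EOFError("unexpected end of data while reading a vector")
        # Assumption: float32 values stored little-endian
        vectors[word] = struct.unpack("<%df" % dim, data)
    return vectors


def _demo_bytes():
    # Tiny synthetic "model" with two hypothetical tokens and 3 dimensions
    rows = [("hà_nội", (0.5, -1.0, 2.0)), ("việt_nam", (1.0, 0.25, -0.5))]
    out = b"2 3\n"
    for word, vec in rows:
        out += word.encode("utf-8") + b" " + struct.pack("<3f", *vec) + b"\n"
    return out


demo = read_word2vec_binary(io.BytesIO(_demo_bytes()))
print(demo["hà_nội"])
```

The reader keeps every vector in a plain dict, so it is only meant for inspecting small files or understanding the format.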

Visualization:

  • word2vec-visualization (using TensorBoard):
    • Download tf_files: TBA
    • Run $ tensorboard --logdir=./ --port=10001
  • word2vec-simple-visualization: working. See the README file inside that folder for instructions on testing the model.
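Since the tf_files are not yet available, one alternative way to inspect embeddings is the TensorBoard Embedding Projector, which can load a tab-separated vectors file plus a metadata file with one label per row. The sketch below writes both files from an in-memory mapping. It is not the repository's own visualization pipeline; the function name `write_projector_tsv`, the file names, and the toy vectors are assumptions made for illustration.

```python
import os
import tempfile


def write_projector_tsv(vectors, out_dir):
    """Write vectors.tsv and metadata.tsv for the Embedding Projector.

    `vectors` is any mapping from word to a sequence of floats.
    """
    vec_path = os.path.join(out_dir, "vectors.tsv")
    meta_path = os.path.join(out_dir, "metadata.tsv")
    with open(vec_path, "w", encoding="utf-8") as vec_file, \
            open(meta_path, "w", encoding="utf-8") as meta_file:
        for word, vec in vectors.items():
            # One row per word: tab-separated components
            vec_file.write("\t".join(repr(float(x)) for x in vec) + "\n")
            # Matching row in the metadata file: the word itself
            meta_file.write(word + "\n")
    return vec_path, meta_path


# Toy data standing in for a real model (hypothetical tokens)
toy = {"hà_nội": [0.5, -1.0], "việt_nam": [1.0, 0.25]}
out_dir = tempfile.mkdtemp()
vec_path, meta_path = write_projector_tsv(toy, out_dir)
print(vec_path, meta_path)
```

For a real model, writing only the most frequent few thousand words usually keeps the projector responsive.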

Note:

  • This model is trained on the data of Le et al.: http://mim.hus.vnu.edu.vn/phuonglh/node/72
    • Data information: 7.1 GB of text with 1,675,819 unique words, from a corpus of 974,393,244 raw words and 97,440 documents. All words are tokenized.

II. How do I cite?

Please cite the arXiv paper whenever ETNLP (or the pre-trained embeddings) is used to produce published results or is incorporated into other software:

@article{vu:2019n,
  title={ETNLP: A Visual-Aided Systematic Approach to Select Pre-Trained Embeddings for a Downstream Task},
  author={Xuan-Son Vu and Thanh Vu and Son N. Tran and Lili Jiang},
  journal={In: Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP)},
  year={2019}
}

@misc{word2vecvn_2016,
  author = {Xuan-Son Vu},
  title = {Pre-trained Word2Vec models for Vietnamese},
  year = {2016},
  howpublished = {\url{https://github.com/sonvx/word2vecVN}},
  note = {commit xxxxxxx}
}

Cited in papers:

  1. Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras and Mark Johnson. 2018. VnCoreNLP: A Vietnamese Natural Language Processing Toolkit. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, NAACL 2018, pages 56-60.

III. Some screenshots:

Screenshot of word2vec

Screenshot of spacy-n-fastext

Attributions/Thanks

word2vecvn's Issues

Please help

I understand that Word2Vec has two architectures, CBOW and Skip-gram.
For the following models:

  1. Model trained on Vietnamese Wiki
  2. Model trained on Le et al.'s data (window-size 5, 400 dims)
  3. Model trained on Le et al.'s data (window-size 2, 300 dims)

which architecture did you use for training?
Thank you!

Dealing with unknown words

In my work, when I tokenize words from news articles, many of them do not appear in your vocabulary. What should I do with those words? I found the word 'unknow' in your vocabulary, but I'm not sure that is what I need.
Hoping for your response soon. Many thanks!
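One common way to handle out-of-vocabulary words, sketched here under assumptions, is to fall back to a designated unknown-word vector when it exists and to a zero vector otherwise. Whether the 'unknow' entry in this vocabulary was actually intended as such a placeholder is not confirmed anywhere in this document, so treating it that way is an assumption; the helper name `lookup` and the toy vectors are also illustrative.

```python
def lookup(vectors, word, dim, unk_token="unknow"):
    """Return the vector for `word`, with a fallback for unseen words.

    `vectors` is any mapping that supports `in` and `[]` (a plain dict here).
    Assumption: `unk_token` marks unknown words in the vocabulary.
    """
    if word in vectors:
        return list(vectors[word])
    if unk_token in vectors:
        # Fall back to the (assumed) unknown-word vector
        return list(vectors[unk_token])
    # Last resort: a zero vector of the right size
    return [0.0] * dim


# Toy vocabulary standing in for a real model (hypothetical values)
toy = {"hà_nội": [0.5, -1.0], "unknow": [0.1, 0.2]}
known = lookup(toy, "hà_nội", dim=2)
unseen = lookup(toy, "từ_lạ", dim=2)
no_unk = lookup({}, "từ_lạ", dim=2)
print(known, unseen, no_unk)
```

Other options include skipping unseen words entirely or averaging the vectors of their sub-tokens; which works best depends on the downstream task.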

Word tokenizer

Could you please tell me which tokenizer you used? Thank you very much!

Hello! Could you guide me through the steps to run this code after downloading it (word2vecVN)?

Could you guide me through the steps to run this code after downloading it (word2vecVN)? I usually run Python code on Windows and Ubuntu. Which versions of Python and gensim does the code require? I am worried that it will not work after downloading if I run it with newer versions of Python or gensim. If other libraries are needed (TensorFlow, etc.), please tell me their versions as well. I would also like to ask: is "word2vec" considered a neural network?
Thank you. I wish you health, happiness, and success in your work and life. (I hope to hear from you as soon as possible.)

Creating data

How can I create my own dictionary?

DeprecationWarning

I followed the instructions and got this error. Can anyone help? Thanks.

Traceback (most recent call last):
  File "Main.py", line 52, in
    word2vec_model = Word2Vec.load_word2vec_format(model, binary=True)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 1420, in load_word2vec_format
    raise DeprecationWarning("Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead.")
DeprecationWarning: Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead.
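The error message itself names the replacement call. A minimal sketch of the change follows, assuming gensim is installed and that the model file is in word2vec binary format, as the original `binary=True` argument implies. The wrapper name `load_vectors` and the path in the usage comment are hypothetical.

```python
def load_vectors(path, binary=True):
    """Load pre-trained vectors through KeyedVectors instead of Word2Vec."""
    # Imported lazily so this snippet can be defined without gensim present
    from gensim.models import KeyedVectors

    # Replacement for the deprecated Word2Vec.load_word2vec_format call
    return KeyedVectors.load_word2vec_format(path, binary=binary)


# Usage (hypothetical file name):
# word2vec_model = load_vectors("path/to/model.bin")
# vector = word2vec_model["some_word"]
```

The returned object holds only the word vectors, which is enough for lookups and similarity queries but not for continuing training.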