word2vec-api

Reworking of https://github.com/3Top/word2vec-api to add:

deployment via ssl on nginx / uwsgi
TODO: meaningful errors when a term is not available
TODO: Cleaned Output

Simple web service providing a word embedding API.

The methods are based on Gensim Word2Vec implementation.

Parameters are set in conf.py for ease of use with uwsgi

Environment Setup and usage notes

This variant of word2vec-api was developed on an AWS Centos x64 6.5 instance using Pycharm 4.5.3
The main requirement was enough ram to load the Google News model, so an m3.xlarge (16Gb ram) instance was used.
A virtualenv with Python 2.7.9 was created, and populated with the provided requirements.txt.
Note that to get Scipy and Numpy working properly for gensim you may have to install yum packages as root, see the specific documentation for installing those packages for your OS.
To run this as non-root user, you may have to create and permission /var/log/uwsgi.log appropriately.
Each time you launch the script it will import the model specified in the conf file - it is significantly faster if you unzip it first.

You can download the Google News Vectors as a test model using the following linux command:

wget https://www.googledrive.com/host/0B7XkCwpI5KDYNlNUTTlSS21pQmM -O GoogleNews-vectors-negative300.bin.gz

Example calls

curl http://127.0.0.1:3031/n_similarity?ws1=Sushi&ws1=Shop&ws2=Japanese&ws2=Restaurant
curl http://127.0.0.1:3031/similarity?w1=Sushi&w2=Japanese
curl http://127.0.0.1:3031/most_similar?positive=indian&positive=food[&negative=][&topn=]
curl http://127.0.0.1:3031/model?word=restaurant
curl http://127.0.0.1:3031/search?q=shop

Note: The "model" method returns a base64 encoding of the Word2Vec vector.

Where to get a pretrained model

In case you do not have domain specific data to train, it can be convenient to use a pretrained model. Please feel free to submit additions to this list through a pull request.

Model file	Number of dimensions	Corpus (size)	Vocabulary size	Author	Architecture	Training Algorithm	Context window - size	Web page
Google News	300	Google News (100B)	3M	Google	word2vec	negative sampling	BoW - ~5	link
Freebase IDs	1000	Gooogle News (100B)	1.4M	Google	word2vec, skip-gram	?	BoW - ~10	link
Freebase names	1000	Gooogle News (100B)	1.4M	Google	word2vec, skip-gram	?	BoW - ~10	link
Wikipedia+Gigaword 5	50	Wikipedia+Gigaword 5 (6B)	400,000	GloVe	GloVe	AdaGrad	10+10	link
Wikipedia+Gigaword 5	100	Wikipedia+Gigaword 5 (6B)	400,000	GloVe	GloVe	AdaGrad	10+10	link
Wikipedia+Gigaword 5	200	Wikipedia+Gigaword 5 (6B)	400,000	GloVe	GloVe	AdaGrad	10+10	link
Wikipedia+Gigaword 5	300	Wikipedia+Gigaword 5 (6B)	400,000	GloVe	GloVe	AdaGrad	10+10	link
Common Crawl 42B	300	Common Crawl (42B)	~2M	GloVe	GloVe	GloVe	AdaGrad	link
Twitter (2B Tweets)	25	Twitter (27B)	?	GloVe	GloVe	GloVe	AdaGrad	link
Twitter (2B Tweets)	50	Twitter (27B)	?	GloVe	GloVe	GloVe	AdaGrad	link
Twitter (2B Tweets)	100	Twitter (27B)	?	GloVe	GloVe	GloVe	AdaGrad	link
Twitter (2B Tweets)	200	Twitter (27B)	?	GloVe	GloVe	GloVe	AdaGrad	link
Wikipedia dependency	300	Wikipedia (?)	174,015	Levy & Goldberg	word2vec modified	word2vec	syntactic dependencies	link
DBPedia vectors	1000	Wikipedia (?)	?	wiki2vec	word2vec	word2vec, skip-gram	BoW, 10	link

chaffelson / word2vec-api Goto Github PK

word2vec-api's Introduction

word2vec-api

Environment Setup and usage notes

Example calls

Where to get a pretrained model

word2vec-api's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent