
sense2vec's People

Contributors

adrianeboyd, ahalterman, anxo06, cerules, chanind, dasheffie, henningpeters, honnibal, ines, init-random, koaning, mukesh-mehta, shademe, svlandeg, syllog1sm, tolomaus


sense2vec's Issues

Error while opening own trained vectors file

I was able to train data using train_word2vec.py after preprocessing the data using merge_text.py.
Below is the outcome of train_word2vec.py:

(screenshot of the train_word2vec.py output omitted)

Then I fed the resulting vectors.bin to the new version 0.2.0 of sense2vec and got an IOError. This is the code I used to load the vectors:

from sense2vec.vectors import VectorMap
vector_map = VectorMap(128)
vector_map.load("/home/noname/Documents/data/vectors")

The error:

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-9-315510f2d9d1> in <module>()
      1 vector_map = VectorMap(128)
----> 2 vector_map.load("/home/noname/Documents/data/vectors")

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorMap.load (sense2vec/vectors.cpp:4870)()
    100 
    101     def load(self, data_dir):
--> 102         self.data.load(path.join(data_dir, 'vectors.bin'))
    103         with open(path.join(data_dir, 'strings.json')) as file_:
    104             self.strings.load(file_)

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorStore.load (sense2vec/vectors.cpp:7049)()
    200         cdef float[:] cv
    201         for i in range(nr_vector):
--> 202             cfile.read_into(&tmp[0], self.nr_dim, sizeof(tmp[0]))
    203             ptr = &tmp[0]
    204             cv = <float[:128]>ptr

/home/noname/.linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/spacy/cfile.pyx in spacy.cfile.CFile.read_into (spacy/cfile.cpp:1147)()
     25         st = fread(dest, elem_size, number, self.fp)
     26         if st != number:
---> 27             raise IOError
     28 
     29     cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1:

IOError:

Also, I wanted to ask how to generate the relevant freqs.json and strings.json for the trained vectors. For strings.json, I have the batch outputs from merge_text.py, so they need to be mapped to the relevant information in freqs.json. If there is already a function that does this and I missed calling it, please let me know.
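As a starting point while waiting for an official answer: the frequency data can plausibly be counted directly from the merge_text.py output. This is a minimal sketch, assuming each output line is a whitespace-separated sequence of word|TAG tokens; the exact JSON layout sense2vec expects may differ, so treat the file format here as a guess.

```python
import json
from collections import Counter

def build_freqs(lines):
    """Count key frequencies in merge_text.py-style output.

    Assumes each line is a space-separated sequence of word|TAG keys.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

# Toy input standing in for the merge_text.py batch files.
lines = ["good_ideas|NOUN fly|VERB", "fly|VERB"]
freqs = build_freqs(lines)
with open("freqs.json", "w") as f:
    json.dump(freqs, f)
```

In practice you would stream the real batch files line by line instead of the toy list.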

Python version: 2.7.11
Spacy version: 0.100.5

Help loading model

I downloaded the trained model from:

https://index.spacy.io/models/reddit_vectors-1.0.1/archive.gz

How can I load this into a VectorMap or a gensim model in order to make similarity queries?

sense2vec and spacy

I get an error when I try to import sense2vec after I import spacy (v2.0.11):

import sense2vec

File "/usr/local/lib/python3.5/dist-packages/sense2vec/__init__.py", line 2, in <module>
from .vectors import VectorMap
File ".env/lib/python2.7/site-packages/spacy/strings.pxd", line 18, in init sense2vec.vectors (sense2vec/vectors.cpp:26598)
ValueError: spacy.strings.StringStore has the wrong size, try recompiling

I am using Python 3; is that the issue?

Installation of sense2vec: 'sense2vec/vectors.cpp'

Hello,

I've been trying to install sense2vec, and although I think I've made some progress, I seem to be stuck with the following error:
fatal error C1083: Cannot open source file: 'sense2vec/vectors.cpp', even though the file exists.

I am installing using pip, i.e. 'pip install -e git+git://github.com/spacy-io/sense2vec.git#egg=sense2vec'

Thank you in advance.

sense2vec (0.6.0) not loading after upgrading spacy to version 1.8.2

sense2vec (0.6.0) is not working with latest spacy version 1.8.2.

I’m running Python (2.7.13) Anaconda version (4.3.22) on Ubuntu 14.04.4 LTS

It was working fine with spaCy version 0.101.0, but after upgrading spaCy and its corresponding model I'm unable to load sense2vec. It's throwing ValueError: spacy.strings.StringStore has the wrong size.

If I try to reinstall sense2vec, the spaCy version gets reverted back to 0.101.0.

We need to upgrade spaCy to the latest version for German and Spanish language support, and we also need to keep using sense2vec in some of our existing functionality.

Any idea how to resolve the current issue and have spacy (1.8.2) + textacy (0.3.4) + sense2vec (0.6.0) together on my system?

Here is the spaCy information installed on my system, and the error I'm getting when trying to import sense2vec (screenshots omitted).

"Can't run sense2vec: document not tagged" when using nlp.pipe()

I just installed sense2vec from pip (v1.0.0a0), and I wanted to use s2v with spaCy's nlp pipeline. However, when documents enter the pipe, the script fails and throws this error:

Traceback (most recent call last):
  File "text_extract.py", line 29, in <module>
    for row, doc in enumerate(nlp.pipe(texts, n_threads=8, batch_size=100)):
  File "/usr/local/lib/python3.5/dist-packages/spacy/language.py", line 578, in pipe
    for doc in docs:
  File "/usr/local/lib/python3.5/dist-packages/spacy/language.py", line 753, in _pipe
    doc = func(doc)
  File "/usr/local/lib/python3.5/dist-packages/sense2vec/__init__.py", line 40, in __call__
    raise ValueError("Can't run sense2vec: document not tagged.")
ValueError: Can't run sense2vec: document not tagged.

I noticed that here you commented out the two lines at lines 23-24:

    #if not doc.is_tagged:
    #    raise ValueError("Can't run sense2vec: document not tagged.")

Once I did the same in my version, I was able to successfully use the pipeline. Perhaps all this issue needs is a readme change? It looks like your current version on github fixes this problem, but the suggested pip install breaks when using nlp.pipe().

Blog post question

For the first code snippet in this link:

https://explosion.ai/blog/sense2vec-with-spacy

def transform_texts(texts):
    # Load the annotation models
    nlp = English()
    # Stream texts through the models. We accumulate a buffer and release
    # the GIL around the parser, for efficient multi-threading.
    for doc in nlp.pipe(texts, n_threads=4):
        # Iterate over base NPs, e.g. "all their good ideas"
        for np in doc.noun_chunks:
            # Only keep adjectives and nouns, e.g. "good ideas"
            while len(np) > 1 and np[0].dep_ not in ('amod', 'compound'):
                np = np[1:]
            if len(np) > 1:
                # Merge the tokens, e.g. good_ideas
                np.merge(np.root.tag_, np.text, np.root.ent_type_)
            # Iterate over named entities
            for ent in doc.ents:
                if len(ent) > 1:
                    # Merge them into single tokens
                    ent.merge(ent.root.tag_, ent.text, ent.label_)
        token_strings = []
        for token in tokens:
            text = token.text.replace(' ', '_')
            tag = token.ent_type_ or token.pos_
            token_strings.append('%s|%s' % (text, tag))
        yield ' '.join(token_strings)

where is the "tokens" variable (used in the for loop) defined?
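It looks like the loop is meant to iterate over the merged doc (i.e. `for token in doc:`), since `tokens` is never bound. The formatting step itself can be sketched standalone with a stand-in token type; `Token` here is a hypothetical substitute for a spaCy token, not spaCy's real class:

```python
from collections import namedtuple

# Stand-in for a spaCy token with the three attributes the snippet uses.
Token = namedtuple("Token", ["text", "ent_type_", "pos_"])

def to_strings(doc_tokens):
    """Format tokens as word|TAG, preferring the entity label over POS."""
    token_strings = []
    for token in doc_tokens:
        text = token.text.replace(" ", "_")
        tag = token.ent_type_ or token.pos_
        token_strings.append("%s|%s" % (text, tag))
    return " ".join(token_strings)

doc = [Token("good ideas", "", "NOUN"), Token("London", "GPE", "PROPN")]
print(to_strings(doc))  # good_ideas|NOUN London|GPE
```

With real spaCy objects the same function would be called on the doc after the merge steps.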

Sense2vec and Spacy: How to choose the "sense" i.e. POS or entity labels

Using sense2vec in conjunction with spaCy, is there a way to choose the part-of-speech tag / entity label for a token when the attribute s2v_most_similar is applied?

e.g. for the token "duck", the default sense/POS is NOUN when the attribute s2v_most_similar is applied.
Using spacy with sense2vec, is there a way to get the s2v_most_similar for "duck" as in the VERB?

Thanks!
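With the VectorMap-style API shown elsewhere in this thread, the sense is part of the key, so querying `model["duck|VERB"]` rather than `model["duck|NOUN"]` selects the verb sense before calling most_similar. A self-contained toy sketch of that idea (the vectors and helper below are illustrative, not the sense2vec API):

```python
import numpy as np

# Toy model: keys carry the sense, exactly as in the reddit vectors.
vectors = {
    "duck|NOUN": np.array([1.0, 0.0, 0.0]),
    "duck|VERB": np.array([0.0, 1.0, 0.0]),
    "dodge|VERB": np.array([0.0, 0.9, 0.1]),
    "goose|NOUN": np.array([0.9, 0.1, 0.0]),
}

def most_similar(key, n=2):
    """Rank other keys by cosine similarity to the chosen sense key."""
    q = vectors[key]
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(((cos(q, v), k) for k, v in vectors.items() if k != key),
                    reverse=True)
    return [k for _, k in ranked[:n]]

print(most_similar("duck|VERB"))  # ['dodge|VERB', 'goose|NOUN']
```

The point is that the two senses of "duck" live under different keys, so the choice of sense happens at lookup time.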

I'm not able to use the '.add_pipe' attribute

>>> nlp.add_pipe(s2v)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'English' object has no attribute 'add_pipe'

Is it called something else now?

Sense2vec Similarity Question

Why do 'flies|VERB' and 'flies|NOUN' have a similarity of 1.0?
I'm running sense2vec on Anaconda, with Python 3.6 on OS X 10.11.6

$ python --version
Python 3.6.3 :: Anaconda custom (64-bit)
$ sputnik --name sense2vec --repository-url http://index.spacy.io install reddit_vectors
Downloading...
Downloaded 560.90MB 100.00% 2.15MB/s eta 0s              
archive.gz checksum/md5 OK
INFO:sputnik.pool:install reddit_vectors-1.1.0
$ conda list spacy
# packages in environment at /Users/davidlaxer/anaconda/envs/spacy:
#
spacy                     2.0.4                    py36_0    conda-forge
spacy                     0.101.0                   <pip>
$ conda list sense2vec
# packages in environment at /Users/davidlaxer/anaconda/envs/spacy:
#
sense2vec                 0.6.0                     <pip>
$ conda list thinc
# packages in environment at /Users/davidlaxer/anaconda/envs/spacy:
#
thinc                     6.10.0                   py36_0    conda-forge
thinc                     5.0.8                     <pip>

Here's my example:

import sense2vec
model = sense2vec.load()
freq, query_vector1 = model["flies|NOUN"]
model.most_similar(query_vector1, n=5)
(['flies|NOUN', 'gnats|NOUN', 'snakes|NOUN', 'birds|NOUN',  'grasshoppers|NOUN'],
 <MemoryView of 'ndarray' at 0x1af394c540>)

freq, query_vector2 = model["flies|VERB"]
model.most_similar(query_vector2, n=5)

(['flies|VERB', 'flys|VERB', 'flying|VERB', 'jumps|VERB', 'swoops|VERB'],
 <MemoryView of 'ndarray' at 0x1af394c6e8>)
In [42]: model.data.similarity(query_vector1, query_vector1)
1.0


From a model I trained:

In [40] new_model = gensim.models.Word2Vec.load('/Users/davidlaxer/LSTM-Sentiment-Analysis/corpus_output_256.txt')
In [41] new_model.similarity('flies|NOUN', 'flies|VERB')
0.9954307438574328
In [43] new_model.wv.vocab["flies|VERB"].index
5895
In [44] new_model.wv.vocab["flies|NOUN"].index
7349
In [45] new_model.wv["flies|VERB"]  
array([ 0.15279259,  0.04471067,  0.0923325 , -0.07349139,  0.04180749,
     -0.71864516,  0.08252977, -0.02405624,  0.28384277,  0.01706951,
     -0.15931296, -0.21216595, -0.0352594 ,  0.13597694,  0.07868216,
     -0.15907238, -0.30132023,  0.01954124,  0.22636545, -0.19983807,
     -0.03842518,  0.49959993, -0.18679027, -0.16045345,  0.05813084,
      0.12905809,  0.1305625 ,  0.42689237,  0.19311258, -0.1002808 ,
      0.07427863, -0.19840011,  0.42542475, -0.32158205,  0.15129171,
     -0.32177079, -0.04034998, -0.05301504,  0.38441092, -0.31020632,
      0.42528978, -0.26249531, -0.25648555,  0.16558036,  0.28656447,
     -0.11909373,  0.09208378, -0.08886475, -0.40061441,  0.02873728,
      0.07275984, -0.05674595, -0.09471942, -0.01308586, -0.2777423 ,
     -0.05253473, -0.00179329, -0.15887854,  0.31784746, -0.00895729,
      0.50658983,  0.09232203,  0.16289137, -0.20241632, -0.01240843,
      0.20972176,  0.065593  ,  0.40676439, -0.16795945,  0.08079262,
      0.27334401,  0.16058736, -0.15362383, -0.13958427,  0.17041191,
     -0.08574789, -0.20200305,  0.16288304,  0.11220794,  0.44721738,
     -0.14058201,  0.13652138, -0.0134679 ,  0.20938247,  0.34156594,
      0.21730828, -0.19907214,  0.02451441,  0.12492239,  0.08635994,
     -0.29003018,  0.01458945,  0.02637799,  0.10671763, -0.17983682,
      0.01115436, -0.02827467,  0.13415532,  0.4656623 , -0.34222263,
      0.44238791, -0.29407004, -0.16681372,  0.04466435, -0.21825369,
     -0.09138768,  0.02407285, -0.57841706, -0.19544049, -0.07518575,
      0.36430466, -0.13164517, -0.01708322,  0.11068137,  0.2811991 ,
      0.02544841,  0.10672008,  0.06147943,  0.09167367, -0.71296901,
      0.04190712, -0.47360554, -0.01762259,  0.0359503 , -0.24351278,
     -0.01718491, -0.04033662,  0.03032484, -0.33736056, -0.13555804,
      0.02156358, -0.50073934, -0.0706998 ,  0.41698509, -0.23886077,
     -0.06120266, -0.0681426 ,  0.15182504,  0.13283113, -0.05899575,
     -0.11477304, -0.18594885, -0.17855589,  0.31381837,  0.25157636,
      0.41943148,  0.05070408, -0.03173119, -0.04240219, -0.25305411,
     -0.36856946,  0.20292452,  0.10858628,  0.17122397,  0.01447193,
     -0.47961271, -0.45739996,  0.17185016, -0.03916142, -0.04544915,
      0.34947339,  0.04178765,  0.37088165,  0.14284173,  0.03443905,
      0.30170318,  0.05259432, -0.22402297,  0.05495254, -0.46103877,
     -0.22059456, -0.27414244,  0.55484813,  0.1569699 ,  0.35761088,
      0.08712664,  0.23313828, -0.25803107, -0.03343969, -0.14713305,
     -0.0611255 ,  0.17435439, -0.01603068,  0.00526717, -0.08379596,
     -0.08644171, -0.12666632,  0.12955435,  0.48045933, -0.17596652,
     -0.29505005,  0.60152525, -0.01975689,  0.02343576,  0.17027852,
     -0.06638149, -0.10826188, -0.41277543, -0.12114278, -0.01596882,
      0.02660148,  0.22383556, -0.030263  , -0.0768819 , -0.32506746,
     -0.15082234, -0.16559191, -0.08502773, -0.01570902, -0.22921689,
      0.19637343, -0.4993245 ,  0.19670881,  0.17284806,  0.10345648,
      0.45276237, -0.12255403,  0.18032061,  0.05677452,  0.09869532,
     -0.23536956, -0.22449525,  0.51938456,  0.24111946,  0.26022053,
     -0.18190917, -0.01768251,  0.00435291,  0.05820792, -0.46525213,
      0.17490779,  0.15250422, -0.1760795 ,  0.14194083,  0.09954269,
     -0.89346975, -0.11642933,  0.0944154 ,  0.2134015 , -0.01955901,
     -0.02899018,  0.07254739, -0.03995875,  0.39499217, -0.05394226,
     -0.07821836, -0.29973337, -0.11607374, -0.01082127,  0.36769736,
      0.04288069, -0.0461933 ,  0.00675509,  0.25210902, -0.21784271,
     -0.18479778], dtype=float32)
In [46]: new_model.wv["flies|NOUN"]
array([ 0.1304135 ,  0.05724983,  0.06886293, -0.03062466,  0.01640639,
     -0.53799176,  0.10968599, -0.02839088,  0.18814373,  0.00147691,
     -0.11227507, -0.14502132, -0.03685957,  0.06422875,  0.07289967,
     -0.10437401, -0.23557086,  0.00153201,  0.17661473, -0.12828164,
     -0.02789859,  0.35942602, -0.1580196 , -0.13264264,  0.03343309,
      0.10922851,  0.1102568 ,  0.29480889,  0.14417146, -0.07892705,
      0.06608826, -0.14885685,  0.32329369, -0.23263605,  0.11967299,
     -0.23964159, -0.02619613,  0.00930338,  0.31111386, -0.22507732,
      0.32475442, -0.19287167, -0.19306417,  0.10722513,  0.2237518 ,
     -0.06828826,  0.07246322, -0.06233693, -0.31375739,  0.01069155,
      0.04457425, -0.00323939, -0.05079295, -0.02164256, -0.22060572,
     -0.03816675,  0.00503534, -0.10069088,  0.24429323,  0.02505454,
      0.38344654,  0.09145252,  0.11439045, -0.10801487, -0.01075712,
      0.16894275,  0.04799445,  0.3149668 , -0.13885498,  0.02068597,
      0.17856079,  0.11587915, -0.11973458, -0.0896498 ,  0.11993878,
     -0.06647626, -0.15219077,  0.10705566,  0.07842658,  0.31101131,
     -0.12788543,  0.09909476,  0.00878725,  0.1618593 ,  0.22566552,
      0.1297064 , -0.14370884,  0.02069237,  0.08489513,  0.0567583 ,
     -0.21860926,  0.01057386,  0.03844477,  0.06213358, -0.12877114,
      0.02327059, -0.00917741,  0.11733869,  0.35853127, -0.25572705,
      0.30879059, -0.20568153, -0.12405248,  0.03546307, -0.18377842,
     -0.06700096,  0.00626029, -0.42848313, -0.13129929, -0.04215423,
      0.26977378, -0.07725398,  0.01177794,  0.05952175,  0.21516307,
      0.01055368,  0.06727242,  0.05038245,  0.06739338, -0.53844106,
      0.02834721, -0.33890292, -0.02644366,  0.03540507, -0.16382404,
     -0.01353777, -0.02502321,  0.00226415, -0.24348356, -0.12502551,
      0.01489578, -0.37660655, -0.05798845,  0.28748602, -0.18512824,
     -0.06250153, -0.06967189,  0.14023623,  0.09628384, -0.09925015,
     -0.07317897, -0.14045765, -0.14597888,  0.24456802,  0.173549  ,
      0.3357946 ,  0.0424754 ,  0.00723427, -0.02120454, -0.14892557,
     -0.26496273,  0.14844348,  0.06555442,  0.11951103,  0.03691757,
     -0.36404395, -0.32292312,  0.09412326, -0.06377046, -0.02561374,
      0.24361259,  0.02616721,  0.29151902,  0.1178301 ,  0.03284379,
      0.20218852,  0.0337379 , -0.14703217,  0.02869225, -0.31447497,
     -0.15038867, -0.23353554,  0.41700551,  0.11959957,  0.26917797,
      0.04590914,  0.16029988, -0.18795538, -0.01343729, -0.10532234,
     -0.02617499,  0.12019841,  0.00673278, -0.0070972 , -0.03176219,
     -0.07582191, -0.07277017,  0.09928112,  0.36159652, -0.14404564,
     -0.21233276,  0.46463615,  0.01645906,  0.01815237,  0.12149289,
     -0.07040837, -0.06278557, -0.29605272, -0.07451538,  0.00487611,
      0.00313085,  0.13640559, -0.02045129, -0.05790693, -0.22582445,
     -0.10382047, -0.13318184, -0.05160375,  0.01498237, -0.15075362,
      0.14116266, -0.36445442,  0.1420894 ,  0.11182524,  0.10055254,
      0.33450282, -0.08930281,  0.15410167,  0.03961684,  0.06431124,
     -0.15608449, -0.1599745 ,  0.3780185 ,  0.18073064,  0.2190931 ,
     -0.16039631, -0.03769958, -0.00069833,  0.06914425, -0.33746576,
      0.11075038,  0.11626988, -0.12498619,  0.07928085,  0.0636186 ,
     -0.6352759 , -0.10650127,  0.03810085,  0.14585988, -0.01552053,
     -0.01488287,  0.04300846, -0.00500007,  0.26444513, -0.03629581,
     -0.04127173, -0.23304868, -0.08911316,  0.0029219 ,  0.27401808,
      0.00279731, -0.04162024,  0.00214672,  0.15316918, -0.14298579,
     -0.15343791], dtype=float32)


Error while using merge_text.py

Hi,

I am getting the following error while pre-processing data with merge_text.py:

Traceback (most recent call last):
  File "C:/Users/Adam/Google Drive/TUe/3 Semester/_thesis/Pycharm/LeadUserIdentification/sense2vec/bin/merge_text.py", line 143, in <module>
    main('a','b', n_workers=1)
  File "C:/Users/Adam/Google Drive/TUe/3 Semester/_thesis/Pycharm/LeadUserIdentification/sense2vec/bin/merge_text.py", line 139, in main
    parallelize(do_work, enumerate(jobs), n_workers, [out_dir])
  File "C:/Users/Adam/Google Drive/TUe/3 Semester/_thesis/Pycharm/LeadUserIdentification/sense2vec/bin/merge_text.py", line 48, in parallelize
    return Parallel(n_jobs=n_jobs)(delayed(func)(*(item + extra)) for item in iterator)
  File "C:\Users\Adam\Anaconda2\lib\site-packages\joblib\parallel.py", line 800, in __call__
    while self.dispatch_one_batch(iterator):
  File "C:\Users\Adam\Anaconda2\lib\site-packages\joblib\parallel.py", line 658, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\Adam\Anaconda2\lib\site-packages\joblib\parallel.py", line 566, in _dispatch
    job = ImmediateComputeBatch(batch)
  File "C:\Users\Adam\Anaconda2\lib\site-packages\joblib\parallel.py", line 180, in __init__
    self.results = batch()
  File "C:\Users\Adam\Anaconda2\lib\site-packages\joblib\parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "C:/Users/Adam/Google Drive/TUe/3 Semester/_thesis/Pycharm/LeadUserIdentification/sense2vec/bin/merge_text.py", line 92, in parse_and_transform
    file_.write(transform_doc(nlp(strip_meta(text))))
  File "C:/Users/Adam/Google Drive/TUe/3 Semester/_thesis/Pycharm/LeadUserIdentification/sense2vec/bin/merge_text.py", line 98, in transform_doc
    for np in doc.noun_chunks:
  File "spacy/tokens/doc.pyx", line 246, in noun_chunks (spacy/tokens/doc.cpp:7745)
    for start, end, label in self.noun_chunks_iterator(self):
  File "spacy/syntax/iterators.pyx", line 11, in english_noun_chunks (spacy/syntax/iterators.cpp:1559)
    word = doc[i]
  File "spacy/tokens/doc.pyx", line 100, in spacy.tokens.doc.Doc.__getitem__ (spacy/tokens/doc.cpp:4853)
    if self._py_tokens[i] is not None:
IndexError: list index out of range

I run merge_text.py with n_workers=1. By the way, I made a minor change to the iter_comments function so it works with plain-text input:

def iter_comments(loc):
    with open(loc) as file_:
        for i, line in enumerate(file_):
            yield line

Do you have an idea why this happens?

Thank you,
Adam

Unable to load model

When I tried to load the model via "model = sense2vec.load()" I get the following error:

RuntimeError("Model not installed. Please run 'python -m "
RuntimeError: Model not installed. Please run 'python -m sense2vec.download' to install latest compatible model.

Then I tried to execute the command "'python -m sense2vec.download" and I got another error:

File "C:\Users\rg\Anaconda2\lib\runpy.py", line 174, in _run_module_as_main
  "__main__", fname, loader, pkg_name)
File "C:\Users\rg\Anaconda2\lib\runpy.py", line 72, in _run_code
  exec code in run_globals
File "c:\users\rg\src\sense2vec\sense2vec\download.py", line 38, in <module>
  plac.call(main)
File "C:\Users\rg\Anaconda2\lib\site-packages\plac_core.py", line 328, in call
  cmd, result = parser.consume(arglist)
File "C:\Users\rg\Anaconda2\lib\site-packages\plac_core.py", line 207, in consume
  return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "c:\users\rg\src\sense2vec\sense2vec\download.py", line 20, in main
  sputnik.package(about.__title__, about.__version__, about.default_model)
AttributeError: 'module' object has no attribute '__title__'

Can you please help me?

Similarity using Sense2Vec along with Spacy

I am not able to adapt the similarity examples from the blog or issue #24 to my case. I just want to get the "sense" of "bear", i.e. homonyms, in an example like "The bear growled. She could not bear the pain."

import spacy
nlp = spacy.load("en_core_web_md")
from sense2vec import Sense2VecComponent
text = "The bear growled. She could not bear the pain."
s2v = Sense2VecComponent('./reddit_vectors-1.1.0')
nlp.add_pipe(s2v)
doc = nlp(text)

After this I can't find any official reference for a similarity API that takes two vectors. I have tried the following, but I get errors:

s2v._.similarity(doc[1]._.s2v_vec,doc[6]._.s2v_vec)
nlp._.similarity(doc[1]._.s2v_vec,doc[6]._.s2v_vec)

sense2vec drags spacy to older version?

I want to use the reddit vectors for analogies. When I install sense2vec, I run into the old issue that lexemes are unhashable. I saw that this was fixed in September 2016, so I updated spaCy, but then sense2vec complains:

spacy.strings.StringStore has the wrong size, try recompiling

Recompiling leads to unhashable lexemes again...

Am I right that the sense2vec installation uses an earlier version of spacy, or am I just clueless? Any ideas for a workaround?

How to calculate the similarity between two words

I installed sense2vec successfully and can run model.most_similar(query_vector) without any problem, i.e., I can get the most similar words.

However, trying to run the two-word similarity example, model.similarity('bacon|NOUN', 'broccoli|NOUN'), yields the error message:
AttributeError: 'sense2vec.vectors.VectorMap' object has no attribute 'similarity'.

So, what's the right method to calculate the score of two words? Thanks!
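Since VectorMap apparently has no `.similarity` method in this version, one workaround is to fetch the two vectors (e.g. via `freq, v = model["bacon|NOUN"]`, as in the examples elsewhere in this thread) and compute cosine similarity yourself. A minimal sketch with toy vectors in place of real model lookups:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two dense vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# With a real model you would do something like:
#   freq1, v1 = model["bacon|NOUN"]
#   freq2, v2 = model["broccoli|NOUN"]
# Toy vectors stand in for those lookups here.
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([1.0, 2.0, 2.9])
print(round(cosine(v1, v1), 6))  # 1.0
```

Identical vectors give 1.0 and orthogonal vectors give 0.0, matching what the built-in similarity methods in gensim report.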

Error when trying to open Reddit Vectors downloaded


The code right now is super-simple:

from sense2vec import Sense2VecComponent
import spacy

nlp = spacy.load('en_core_web_sm')
s2v = Sense2VecComponent('./reddit-vectors-1.1.0')
nlp.add_pipe(s2v)

Fails with "Could not open binary file" - the file is definitely there. Installed into its own venv with this sense2vec dev branch + spacy + newly downloaded en_core_web_sm.

ImportError: No module named vectors

I've installed sense2vec inside a fresh virtualenv on an Ubuntu 14.04 machine python 2.7.6.

pip install -e git+git://github.com/spacy-io/sense2vec.git#egg=sense2vec

The Cython code does not seem to be compiled.

(sense2vec) ~/sense2vec$ python -m sense2vec.download
/home/ubuntu/sense2vec/bin/python: No module named vectors

Here are the installed packages.

(sense2vec) ~/sense2vec$ pip freeze
argparse==1.2.1
cloudpickle==0.2.1
cymem==1.30
murmurhash==0.26.1
numpy==1.10.4
plac==0.9.1
preshed==0.46.2
semver==2.4.1
-e git://github.com/spacy-io/sense2vec.git@e27522f838739f033c048064dfc3077f7a4e956f#egg=sense2vec-master
six==1.10.0
spacy==0.100.6
sputnik==0.9.3
thinc==5.0.6
ujson==1.35
wsgiref==0.1.2

Same error without virtualenv.

thx

Incompatible spaCy model when using sense2vec

Installing sense2vec rolls back spacy version to 0.101.0 as documented here: #25

However, none of the current english spaCy models are compatible with 0.101.0 and raises this error when trying to load:

super(Package, self).__init__(defaults=meta['package'])
KeyError: 'package'

Is there a way to download old spaCy models? I mainly want to use spaCy to get entity tagging (NOUN, GPE etc.), which can then be passed to sense2vec.

Thank you!

Default pos while querying

I've seen in the demo, queries can be made without specifying POS tag and it defaults to an "auto" sense. Is there a way to replicate the same while querying with this model too? Or a way to query phrases like the "fair_game" example in the demo?
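One plausible way the demo's "auto" sense could be replicated locally is to pick the sense key with the highest corpus frequency; phrase queries like "fair_game" would then just be keys with underscores. This is a guess at the behaviour, sketched with a hypothetical frequency table shaped like the data in freqs.json:

```python
# Hypothetical frequency table mapping full keys to corpus counts,
# like the contents of freqs.json.
freqs = {
    "duck|NOUN": 5000,
    "duck|VERB": 1200,
    "fair_game|NOUN": 300,
}

def best_sense(word):
    """Pick the most frequent sense key for a bare word or phrase."""
    candidates = [k for k in freqs if k.split("|")[0] == word]
    if not candidates:
        return None
    return max(candidates, key=freqs.get)

print(best_sense("duck"))       # duck|NOUN
print(best_sense("fair_game"))  # fair_game|NOUN
```

The resolved key can then be passed to the model's normal lookup and most_similar calls.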

Problem installing

when I try to install via pip install -e git+git://github.com/spacy-io/sense2vec.git#egg=sense2vec
or via pip install sense2vec, I get the following error message:

>>> import sense2vec
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alex/anaconda2/lib/python2.7/site-packages/sense2vec/__init__.py", line 2, in <module>
    from .vectors import VectorMap
ImportError: /home/alex/anaconda2/lib/python2.7/site-packages/sense2vec/vectors.so: undefined symbol: _ZTINSt8ios_base7failureB5cxx11E

I am using ubuntu 16.04 and python 2.7 with anaconda 4.2.9. (installed requirements-all.txt)
the problem did not occur on a mac sierra 10.12.1

Error while using most_similar

s2v.most_similar(query_vector, 3)[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "sense2vec/vectors.pyx", line 154, in sense2vec.vectors.VectorMap.most_similar (sense2vec/vectors.cpp:4847)
File "spacy/strings.pyx", line 104, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:2339)
IndexError: 2065123256

Error compiling Cython file: Cython failed

  • pip install -r requirements.txt : All requirements are satisfied

  • pip install -e .

Obtaining file:///opt/sense2vec
Running setup.py (path:/opt/sense2vec/setup.py) egg_info for package from file:///opt/sense2vec


Error compiling Cython file:
    ------------------------------------------------------------
    ...
    from libcpp.vector cimport vector
    from preshed.maps cimport PreshMap
    ^
    ------------------------------------------------------------
    
    vectors.pxd:2:0: 'preshed/maps.pxd' not found
    Processing sense2vec/vectors.pyx
    Traceback (most recent call last):
      File "/opt/sense2vec/bin/cythonize.py", line 199, in <module>
        main()
      File "/opt/sense2vec/bin/cythonize.py", line 195, in main
        find_process_files(root_dir)
      File "/opt/sense2vec/bin/cythonize.py", line 187, in find_process_files
        process(cur_dir, fromfile, tofile, function, hash_db)
      File "/opt/sense2vec/bin/cythonize.py", line 161, in process
        processor_function(fromfile, tofile)
      File "/opt/sense2vec/bin/cythonize.py", line 72, in process_pyx
        raise Exception('Cython failed')
    Exception: Cython failed
    Cythonizing sources
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/opt/sense2vec/setup.py", line 165, in <module>
        setup_package()
      File "/opt/sense2vec/setup.py", line 122, in setup_package
        generate_cython(root, src_path)
      File "/opt/sense2vec/setup.py", line 63, in generate_cython
        raise RuntimeError('Running cythonize failed')
    RuntimeError: Running cythonize failed

fail to install

Environment:
python: 2.7.12
use anaconda to create a clean environment, following the steps:

  1. cloning the repository
  2. run pip install -r requirements.txt
  3. pip install -e .

Then found the following errors:
sense2vec/vectors.cpp:8061:10: error: '__pyx_v_ptr' declared as a pointer to a reference of type 'float &'
float &*__pyx_v_ptr;
^
1 warning and 1 error generated.
error: command 'gcc' failed with exit status 1

Could someone provide the solution?

Thanks,

Support for Non-English languages

Google's Syntaxnet has released pre-trained models for 40 other languages.

May I know if any of these can be used (with Spacy's Sense2Vec) to train word embeddings in languages other than English?

Thanks

Assertion error loading vector map

The model was trained with Gensim 1.0 and then converted into sense2vec format with gensim2sense.py, after adding wv for it to work with the new version of Gensim. After generating the three files freqs.json, vectors.bin and strings.json, loading the model with VectorMap has the following problem:

File "sense2vec/vectors.pyx", line 208, in sense2vec.vectors.VectorMap.load (sense2vec/vectors.cpp:6016)
File "sense2vec/vectors.pyx", line 319, in sense2vec.vectors.VectorStore.load (sense2vec/vectors.cpp:8554)
File "sense2vec/vectors.pyx", line 238, in sense2vec.vectors.VectorStore.add (sense2vec/vectors.cpp:7182)
AssertionError

This worked before, when the model was trained with a much older version of Gensim, which I cannot remember. Help is appreciated!

most_similar method

I would like to do operations of the type trained_model.most_similar(positive=['woman', 'king'], negative=['man']) = [('queen', 0.50882536), ...]; however, sense2vec's most_similar does not accept positive and negative parameters. Has someone already done this implementation? Is this planned to be developed in sense2vec?
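In the meantime, the same analogy arithmetic can be done by hand: sum the positive vectors, subtract the negative ones, and rank the vocabulary by cosine similarity to the result (which is what could then be fed to a `most_similar(query_vector, n)`-style call). A toy numpy sketch, with made-up vectors chosen so the analogy resolves:

```python
import numpy as np

# Toy vocabulary with hand-picked vectors, for illustration only.
vectors = {
    "man|NOUN": np.array([1.0, 0.0]),
    "woman|NOUN": np.array([0.0, 1.0]),
    "king|NOUN": np.array([1.0, 0.5]),
    "queen|NOUN": np.array([0.1, 1.5]),
}

def analogy(positive, negative):
    """woman + king - man style query: return the best non-input key."""
    q = sum(vectors[k] for k in positive) - sum(vectors[k] for k in negative)
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(vectors, key=lambda k: cos(q, vectors[k]), reverse=True)
    return [k for k in ranked if k not in positive + negative][0]

print(analogy(["woman|NOUN", "king|NOUN"], ["man|NOUN"]))  # queen|NOUN
```

Gensim's most_similar does essentially this (with normalized vectors), so the results should be comparable for small queries.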

Segmentation fault with merge_text.py

Trying to train sense2vec on wikipedia.

after python -m spacy.en.download
when running sense2vec/bin/merge_text.py -b ~/input_dir ~/output_dir

where input_dir is a directory of cleaned wikipedia articles in plaintext files.

I get a segmentation fault in the worker threads, and 4 empty text files in output_dir

Any ideas? Would it be possible to get a pre-trained, non-Reddit model?

Error when loading reddit_vectors-1.1.0

Adding to existing Jupyter notebook that is run from a docker container, I do the following:

import sense2vec
s2v = sense2vec.load('reddit_vectors-1.1.0')

which throws the error:

/usr/local/lib/python3.5/dist-packages/sense2vec/util.py in get_package_by_name(name, via)
     18                                name or about.default_model, data_path=via)
     19     except PackageNotFoundException as e:
---> 20         raise RuntimeError("Model not installed. Please run 'python -m "
     21                            "sense2vec.download' to install latest compatible "
     22                            "model.")

Model not installed. Please run 'python -m sense2vec.download' to install latest compatible model.

I then create a local image of the container using a TensorFlow Dockerfile, issuing RUN pip install sense2vec when building. But if I do a RUN ipython -m sense2vec.download, I get this error:

print("Model already installed. Please run '%s --force to reinstall." % sys.argv[0], file=sys.stderr)
AttributeError: module 'sense2vec.about' has no attribute '__title__'

The same happens if I SSH into the Docker container; that is, running ipython -m sense2vec.download from the command line gives the same error.

The iPython version is 3.5. Docker container's OS version is 64-bit: SMP Wed Mar 14 15:12:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I just noticed the requirement of CPython. Since Jupyter uses IPython, I suppose there is no way to get this to work? And even if so, why the discrepancy between the error outside the container and inside it?

Please advise.

BufferError: Object is not writable.

Hello,

I'm getting this error back from Cython after just trying the example in the README. Could there be a version mismatch or something? I believe I installed according to the README as well.

Thanks!

Segmentation Fault with sense2vec loading

I successfully installed sense2vec with python2.7 on Ubuntu.
I installed the model with python -m sense2vec.download.
I successfully loaded the model multiple times with the following code:

>>import sense2vec
>>model = sense2vec.load()
>>

But now loading sense2vec causes a segmentation fault (core dumped):

>>import sense2vec
>>model = sense2vec.load()
Segmentation Fault

I force-downloaded the model again with download.py --force, and I'm still stuck with the segmentation fault. Any idea how to solve it?

Error using the most similar method

Following the successful installation of sense2vec, I got the model loaded as described in the response to issue #3, but I am getting an error when I try to use the most_similar method.

Following is what I entered after loading the model:
print vector_map.most_similar("education", topn=10)

Below is the error I receive.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-7f468f5b06ca> in <module>()
----> 1 print vector_map.most_similar("education", topn=10)

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorMap.most_similar (sense2vec/vectors.cpp:3363)()
     66             yield (string, freq, self.data[i])
     67 
---> 68     def most_similar(self, float[:] vector, int n):
     69         indices, scores = self.data.most_similar(vector, n)
     70         return [self.strings[idx] for idx in indices], scores

TypeError: most_similar() takes exactly 2 positional arguments (1 given)

So I understand that the most_similar method expects a float vector followed by an int, whereas I thought it would accept arguments similar to gensim's word2vec most_similar method.

Could someone please show me how to use the most_similar method in the sense2vec implementation?
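For anyone hitting the same error: in this version most_similar takes a vector plus a count, so you look the key up first and then pass its vector in. The class below is a self-contained toy stand-in for VectorMap (not the real Cython class) that mimics the two-step call pattern; the keys and vectors are made up:

```python
import numpy as np

class TinyVectorMap:
    """Toy stand-in illustrating sense2vec's two-step lookup:
    (freq, vector) = vm[key], then vm.most_similar(vector, n)."""
    def __init__(self):
        self._keys, self._rows = [], []
    def __setitem__(self, key, vec):
        self._keys.append(key)
        self._rows.append(np.asarray(vec, dtype="float32"))
    def __getitem__(self, key):
        i = self._keys.index(key)
        return 1, self._rows[i]  # frequency is faked as 1 here
    def most_similar(self, vector, n):
        mat = np.vstack(self._rows)
        mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
        v = vector / np.linalg.norm(vector)
        scores = mat @ v
        order = np.argsort(-scores)[:n]
        return [self._keys[i] for i in order], scores[order]

vm = TinyVectorMap()
vm["education|NOUN"] = [1.0, 0.0]
vm["school|NOUN"] = [0.9, 0.1]
vm["banana|NOUN"] = [0.0, 1.0]

freq, vec = vm["education|NOUN"]       # step 1: look up the key
words, scores = vm.most_similar(vec, 2)  # step 2: query with the vector
```

Note that the real model's keys also carry a sense tag (e.g. "education|NOUN" rather than bare "education"), which is a second reason the plain string lookup in the question can fail.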

Installation Problem ...

That's the error I get:

C:\Python27>py -2.7 -m pip install -e git+git://github.com/spacy-io/sense2vec.git#egg=sense2vec
Obtaining sense2vec from git+git://github.com/spacy-io/sense2vec.git#egg=sense2vec
  Updating c:\python27\src\sense2vec clone
Complete output from command python setup.py egg_info:

Error compiling Cython file:
------------------------------------------------------------
...
from libcpp.vector cimport vector
from preshed.maps cimport PreshMap
^
------------------------------------------------------------

vectors.pxd:2:0: 'preshed\maps.pxd' not found
Processing sense2vec\vectors.pyx
Traceback (most recent call last):
  File "C:\Python27\src\sense2vec\bin\cythonize.py", line 199, in <module>
    main()
  File "C:\Python27\src\sense2vec\bin\cythonize.py", line 195, in main
    find_process_files(root_dir)
  File "C:\Python27\src\sense2vec\bin\cythonize.py", line 187, in find_process_files
    process(cur_dir, fromfile, tofile, function, hash_db)
  File "C:\Python27\src\sense2vec\bin\cythonize.py", line 161, in process
    processor_function(fromfile, tofile)
  File "C:\Python27\src\sense2vec\bin\cythonize.py", line 81, in process_pyx
    raise Exception('Cython failed')
Exception: Cython failed
Cythonizing sources
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python27\src\sense2vec\setup.py", line 165, in <module>
    setup_package()
  File "C:\Python27\src\sense2vec\setup.py", line 122, in setup_package
    generate_cython(root, src_path)
  File "C:\Python27\src\sense2vec\setup.py", line 63, in generate_cython
    raise RuntimeError('Running cythonize failed')
RuntimeError: Running cythonize failed

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in C:\Python27\src\sense2vec\

I have tried to install it with Python 3.6, but the same errors occur. I have tried many things, but nothing worked ... I'm sorry, I am not a programmer. I don't understand what the problem is.

Can you please help me?

Some errors come up when I install the sense2vec.

I installed the spaCy package and downloaded the sense2vec project. When I unzip sense2vec-master.zip and run 'python setup.py install' to install sense2vec, I get the following errors.

`cadevil@cadevil:~/zrj/sense2vec-master$ python setup.py install
Cythonizing sources
sense2vec/vectors.pyx has not changed
running install
running bdist_egg
running egg_info
writing requirements to sense2vec.egg-info/requires.txt
writing sense2vec.egg-info/PKG-INFO
writing top-level names to sense2vec.egg-info/top_level.txt
writing dependency_links to sense2vec.egg-info/dependency_links.txt
reading manifest file 'sense2vec.egg-info/SOURCES.txt'
writing manifest file 'sense2vec.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'sense2vec.vectors' extension
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/cadevil/anaconda/include/python2.7 -I/home/cadevil/zrj/sense2vec-master/include -I/home/cadevil/anaconda/include/python2.7 -c sense2vec/vectors.cpp -o build/temp.linux-x86_64-2.7/sense2vec/vectors.o -O3 -Wno-unused-function -fopenmp -fno-stack-protector
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /home/cadevil/zrj/sense2vec-master/include/numpy/ndarraytypes.h:1804:0,
from /home/cadevil/zrj/sense2vec-master/include/numpy/ndarrayobject.h:17,
from /home/cadevil/zrj/sense2vec-master/include/numpy/arrayobject.h:4,
from sense2vec/vectors.cpp:267:
/home/cadevil/zrj/sense2vec-master/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it by "
^
sense2vec/vectors.cpp: In function ‘PyObject* __pyx_pf_9sense2vec_7vectors_11VectorStore_12load(__pyx_obj_9sense2vec_7vectors_VectorStore*, PyObject*)’:
sense2vec/vectors.cpp:6769:11: error: cannot declare pointer to ‘float&’
   float &*__pyx_v_ptr;
           ^
sense2vec/vectors.cpp: In function ‘void __pyx_f_9sense2vec_7vectors_linear_similarity(int*, float*, float*, int, const float*, int, const float* const*, const float*, int, __pyx_t_9sense2vec_7vectors_do_similarity_t)’:
sense2vec/vectors.cpp:7323:42: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   __pyx_t_4 = ((__pyx_v_queue.size() > __pyx_v_nr_out) != 0);
                                      ^
error: command 'gcc' failed with exit status 1

`
I updated Cython to the newest version and the problem was not solved. The system is Ubuntu and the Python version is 2.7.

error: '__pyx_v_ptr' declared as a pointer

Running install fails on Mac OS X 10.11 with pip install -e .

Installing collected packages: sense2vec
  Running setup.py develop for sense2vec
    Complete output from command /usr/local/opt/python/bin/python2.7 -c "import setuptools, tokenize;__file__='/demo/python/spacySense2vec/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps:
    sense2vec/vectors.pyx has not changed
    Cythonizing sources
    running develop
    running egg_info
    writing requirements to sense2vec.egg-info/requires.txt
    writing sense2vec.egg-info/PKG-INFO
    writing top-level names to sense2vec.egg-info/top_level.txt
    writing dependency_links to sense2vec.egg-info/dependency_links.txt
    warning: manifest_maker: standard file '-c' not found
    
    reading manifest file 'sense2vec.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'sense2vec.egg-info/SOURCES.txt'
    running build_ext
    building 'sense2vec.vectors' extension
    creating build
    creating build/temp.macosx-10.11-x86_64-2.7
    creating build/temp.macosx-10.11-x86_64-2.7/sense2vec
    clang -fno-strict-aliasing -fno-common -dynamic -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -I/demo/python/spacySense2vec/include -I/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c sense2vec/vectors.cpp -o build/temp.macosx-10.11-x86_64-2.7/sense2vec/vectors.o -O3 -Wno-unused-function -fno-stack-protector
    In file included from sense2vec/vectors.cpp:325:
    In file included from /demo/python/spacySense2vec/include/numpy/arrayobject.h:15:
    In file included from /demo/python/spacySense2vec/include/numpy/ndarrayobject.h:17:
    In file included from /demo/python/spacySense2vec/include/numpy/ndarraytypes.h:1728:
    /demo/python/spacySense2vec/include/numpy/npy_deprecated_api.h:11:2: warning: "Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
    #warning "Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION"
     ^
    sense2vec/vectors.cpp:7423:10: error: '__pyx_v_ptr' declared as a pointer to a reference of type 'float &'
      float &*__pyx_v_ptr;
             ^
    1 warning and 1 error generated.
    error: command 'clang' failed with exit status 1
    
    ----------------------------------------
Command "/usr/local/opt/python/bin/python2.7 -c "import setuptools, tokenize;__file__='/demo/python/spacySense2vec/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps" failed with error code 1 in /demo/python/spacySense2vec/

I'm trying to work with the reddit_vectors. Can't find it after install.

Hello, anybody?
I'm having trouble using the reddit vectors after downloading.
I've downloaded the model, but I can't get the vectors to load with the example code.

>>> s2v = Sense2VecComponent('./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/reddit_vectors-1.1.0/vectors.bin')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/__init__.py", line 32, in __init__
self.s2v = load(vectors_path)
File "./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/__init__.py", line 10, in load
vector_map.load(vectors_path)
File "vectors.pyx", line 208, in sense2vec.vectors.VectorMap.load
File "vectors.pyx", line 306, in sense2vec.vectors.VectorStore.load
File "cfile.pyx", line 13, in sense2vec.cfile.CFile.__init__
OSError: Could not open binary file b'./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/reddit_vectors-1.1.0/vectors.bin/vectors.bin'

If I do a search for reddit vectors this is what I get.

$ locate reddit_vectors-1.1.0
./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/reddit_vectors-1.1.0
./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/__cache__/reddit_vectors-1.1.0
./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/__cache__/reddit_vectors-1.1.0/archive.gz
./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/__cache__/reddit_vectors-1.1.0/meta.json
./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/reddit_vectors-1.1.0/freqs.json
./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/reddit_vectors-1.1.0/meta.json
./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/reddit_vectors-1.1.0/strings.json
./anaconda3/envs/fastai/lib/python3.6/site-packages/sense2vec/data/reddit_vectors-1.1.0/vectors.bin
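The doubled .../vectors.bin/vectors.bin in the traceback suggests the loader appends the filename itself, so the path passed to Sense2VecComponent should be the model directory, not the .bin file. A tiny sketch of that interpretation (resolve_vectors is a hypothetical helper that just mirrors what the traceback implies):

```python
import os

def resolve_vectors(path):
    # The loader appears to append "vectors.bin" to whatever path it is
    # given, so passing the .bin file itself would yield the doubled
    # ".../vectors.bin/vectors.bin" path seen in the OSError above.
    return os.path.join(path, "vectors.bin")

model_dir = "data/reddit_vectors-1.1.0"
good = resolve_vectors(model_dir)                               # directory in
bad = resolve_vectors(os.path.join(model_dir, "vectors.bin"))   # file in
```

If this reading is right, pointing Sense2VecComponent at the reddit_vectors-1.1.0 directory should avoid the error.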

On installation: UnicodeDecodeError.

I tried to install sense2vec from python 3.6.5, 64bit and encountered the following encoding error:

py -3 -m pip install sense2vec==1.0.0a0
Collecting sense2vec==1.0.0a0
Using cached https://files.pythonhosted.org/packages/28/4a/a1d9a28545adc839789c1442e7314cb0c70b8657a885f9e5b287fade7814/sense2vec-1.0.0a0.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\MyUser\AppData\Local\Temp\pip-install-21y0wglc\sense2vec\setup.py", line 169, in <module>
    setup_package()
  File "C:\Users\MyUser\AppData\Local\Temp\pip-install-21y0wglc\sense2vec\setup.py", line 107, in setup_package
    readme = f.read()
  File "C:\Program Files\Python64\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 13089: character maps to <undefined>

Any quick ideas on how to fix that?
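One likely culprit, judging from the cp1252.py frame in the traceback (an assumption, not confirmed from the package source): setup.py reads the README with the platform default encoding, which is cp1252 on Windows and has no character for byte 0x9d. Opening with an explicit encoding avoids that, as this runnable sketch shows:

```python
import io
import os
import tempfile

def read_text(path):
    # Explicit utf8 instead of the platform default (cp1252 on Windows),
    # which cannot decode bytes such as 0x9d.
    with io.open(path, encoding="utf8") as f:
        return f.read()

# Demonstrate on a file containing curly quotes, whose UTF-8 encoding
# includes a 0x9d byte that cp1252 rejects.
path = os.path.join(tempfile.mkdtemp(), "README.md")
with io.open(path, "w", encoding="utf8") as f:
    f.write(u"sense2vec \u201cdemo\u201d")
text = read_text(path)
```

As a workaround you could patch the downloaded setup.py to pass encoding="utf8" when it opens the README.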

Potential Issues with the same Term being tagged with two "senses"

I was doing some casual testing of similarity and found something that seems to be a bug or at least an inconsistency that could be problematic for people.

It looks like the Entity tagging for the original training set behaves slightly differently than the tagging used for on-the-fly similarity comparisons. The issue is primarily focused around names of people. So, for at least a few different people ("Quentin Tarantino", "Dan Harmon", etc) they appear within the vectors table tagged as a PERSON. However, when parsing documents on the fly with the model and then looking for similarity, the keys aren't found. This is because they are being compared using a differently calculated "sense". They both show up within the target doc tagged as PROPN.

So, here you see that Dan_Harmon|PERSON is returned as semantically related to the phrase "writers_room" - which is perfectly legit and makes a ton of sense.

[screenshot: query results showing Dan_Harmon|PERSON among terms similar to "writers_room"]

However, if I try to parse the sentence: "Dan Harmon is one of my favorite writers" or "I really like Dan Harmon as a writer", it yields a Key Error with ...

[screenshot: KeyError raised when looking up the parsed token]

There are probably a few ways to change this around (maybe by pre-emptively swapping PROPN tokens to PERSON? But that feels like it will almost certainly backfire).

Let me know your thoughts.
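One lighter-weight alternative to rewriting the tagger's output: fall back across candidate senses at lookup time. The fallback table and the dict standing in for the model below are illustrative assumptions, not sense2vec internals:

```python
# Hypothetical fallback order: which senses to try when the tagged
# sense is missing from the vector table.
SENSE_FALLBACKS = {"PROPN": ["PERSON", "ORG", "NOUN"]}

def lookup(model, phrase, sense):
    """Try phrase|sense first, then phrase|fallback for each fallback."""
    candidates = [sense] + SENSE_FALLBACKS.get(sense, [])
    for s in candidates:
        key = phrase + "|" + s
        if key in model:
            return key, model[key]
    raise KeyError(phrase + "|" + sense)

model = {"Dan_Harmon|PERSON": [0.1, 0.2]}  # stand-in for the vector table
key, vec = lookup(model, "Dan_Harmon", "PROPN")
```

This keeps the on-the-fly tagging untouched and only relaxes the lookup, so a PROPN token can still match the PERSON entry that exists in the trained vectors.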

Argument Key has the incorrect type

When working through in my python terminal when I enter
freq, query_vector = model["natural_language_processing|NOUN"]
I get this error. I am just trying to simply use the basic model.

[screenshots: interpreter session showing the TypeError raised for the key argument]

pip install sense2vec==1.0.0a0 fails with "Failed to build Wheel" / wrong version of Spacy

I went about trying to install sense2vec through pip with pip install sense2vec==1.0.0a0 but ended up with a lot of output to stdout. In it, I see four errors:

PyThreadState {aka struct _ts}’ has no member named ‘exc_type’; did you mean ‘curexc_type’?
Failed building wheel for spacy
error: command 'gcc' failed with exit status 1
Failed building wheel for thinc

I can't figure out exactly what is causing the problem. In the output there's a line stating that my version of spaCy is out of date, yet pip shows I'm using spaCy 2.0. Package/OS specs are as follows, and a text file of the output is attached:
sense2vec.txt

$ pip show spacy
Name: spacy
Version: 2.0.11
Summary: Industrial-strength Natural Language Processing (NLP) with Python and Cython
Home-page: https://spacy.io
Author: Explosion AI
Author-email: [email protected]
License: MIT
Location: /home/***/anaconda2/envs/ask/lib/python3.7/site-packages
Requires: numpy, murmurhash, cymem, preshed, thinc, plac, pathlib, ujson, dill, regex
Required-by: en-core-web-sm

$ python --version
Python 3.7.0

$ uname -r
4.10.0-42-generic

$ gcc --version
gcc (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Unable to download reddit_vectors model

Hi @honnibal ,

I am getting the following error when I execute:
$ python -m sense2vec.download

  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/boscoraju/src/sense2vec/sense2vec/download.py", line 38, in <module>
    plac.call(main)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/boscoraju/src/sense2vec/sense2vec/download.py", line 26, in main
    package = sputnik.install(about.__title__, about.__version__, about.default_model)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sputnik/__init__.py", line 37, in install
    index.update()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sputnik/index.py", line 84, in update
    index = json.load(session.open(request, 'utf8'))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sputnik/session.py", line 43, in open
    r = self.opener.open(request)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 466, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 484, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 1297, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 1256, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:645)>

Looks like it is unable to make a connection. Could you point me in the right direction?

Thank you.
