
dutch-word-embeddings's People

Contributors

severun

dutch-word-embeddings's Issues

Rename "distance" to "similarity"

The demo labels the score between a given word and each word in the set as "distance". However, it looks like the larger the "distance", the more similar the words are, so "distance" seems to be a misnomer.

It might therefore be better to rename the column to "similarity".
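
To illustrate the difference between the two terms, here is a minimal sketch that assumes the demo reports cosine similarity (the issue does not confirm which measure is used). The vectors are made-up placeholders, not values from the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: higher means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """One minus the similarity: lower means more similar."""
    return 1.0 - cosine_similarity(u, v)


# Hypothetical 2-dimensional vectors, for illustration only.
same_direction = ([1.0, 0.0], [2.0, 0.0])
orthogonal = ([1.0, 0.0], [0.0, 3.0])

for u, v in (same_direction, orthogonal):
    print(cosine_similarity(u, v), cosine_distance(u, v))
```

Under that assumption, a score that grows as words become more alike behaves like the first function, which is why "similarity" would be the more fitting column name.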

Alternatives fused into a single word

This is another problem I noticed in a Semantle list. This time, words that I expect appeared in the original text as alternatives ("hij/zij" etc.) were fused into single words:

  • hijzij
  • hemhaar
  • welniet
  • zijnhaar
  • zijhij

I verified using the demo from the README that these fused words indeed occur in the model; it's not an artifact of Semantle's code.
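
One way such tokens could arise is a preprocessing step that deletes punctuation instead of replacing it with whitespace. This is only a guess at the cause; the actual preprocessing of the training corpus is not described here, and both function names below are hypothetical.

```python
import re


def tokenize_deleting_punctuation(text):
    """Delete punctuation outright, which glues 'hij/zij' into 'hijzij'."""
    return re.sub(r"[^\w\s]", "", text).split()


def tokenize_splitting_on_punctuation(text):
    """Replace punctuation with a space, which keeps the alternatives apart."""
    return re.sub(r"[^\w\s]", " ", text).split()


sentence = "hij/zij komt wel/niet"
print(tokenize_deleting_punctuation(sentence))
print(tokenize_splitting_on_punctuation(sentence))
```

If the corpus was cleaned along the lines of the first function, switching to the second approach before retraining would keep "hij" and "zij" as separate tokens.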

Use of social media as source material

I was playing the Dutch version of Semantle today, a word-guessing game which uses this model to compare words. There was a noticeable difference compared to the English version, which uses a model based on newspaper articles. See the top 1000 similar words for today as an example.

One problem is spelling mistakes: as far as I'm aware (and I'm a native speaker), words 999, 998 and 997 don't actually exist in Dutch. I guess that they're common typos that are used in the same contexts as the real words.

Another problem, at least in my opinion, is in the associations themselves: a lot of them seem to come from a xenophobic background. I guess this may accurately reflect part of the social media sphere, but it does make using this model for unsupervised language processing risky, as it may make associations that could reflect poorly on the person or organization running the software.

To address the typos, words that both score high in similarity and are spelled almost the same as the target word could be checked against a dictionary.
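
A rough sketch of that check, using only the Python standard library. The function name, both thresholds, the neighbour list and the dictionary are hypothetical placeholders; a real run would take the neighbours from the model and the dictionary from a proper Dutch word list.

```python
from difflib import SequenceMatcher


def likely_typos(target, neighbours, dictionary,
                 min_similarity=0.6, min_spelling_ratio=0.75):
    """Flag neighbours that are semantically close to the target, spelled
    almost the same, and missing from the dictionary."""
    flagged = []
    for word, similarity in neighbours:
        spelling_ratio = SequenceMatcher(None, target, word).ratio()
        if (similarity >= min_similarity
                and spelling_ratio >= min_spelling_ratio
                and word not in dictionary):
            flagged.append(word)
    return flagged


# Made-up data: (word, similarity score) pairs and a tiny dictionary.
neighbours = [("fiest", 0.82), ("fietsen", 0.80), ("auto", 0.70)]
dictionary = {"fiets", "fietsen", "auto"}
print(likely_typos("fiets", neighbours, dictionary))
```

Flagged words would still need a manual look, since names and new words can also be missing from a dictionary.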

For the associations themselves, the problem is not so much in the training of the model as it is with the nature of the data it is trained on. Of course it is impossible to be 100% neutral politically, but I think people using a language model would expect something closer to neutral, or at least less controversial. If it cannot be fixed, maybe add a word of warning in the README.

Unpickling error when loading model

I have a newbie problem here. I tried to load the model using
gensim.models.Word2Vec.load("dutch-word-embeddings/model.bin") and got the following error:


UnpicklingError Traceback (most recent call last)
/var/folders/nt/9bk8jw9n28b1p_jlynbvy4lc0000gp/T/ipykernel_46367/2953225409.py in
----> 1 model = gensim.models.Word2Vec.load("dutch-word-embeddings/model.bin")

~/Documents/dutch_venv/lib/python3.9/site-packages/gensim/models/word2vec.py in load(cls, rethrow, *args, **kwargs)
1928 """
1929 try:
-> 1930 model = super(Word2Vec, cls).load(*args, **kwargs)
1931 if not isinstance(model, Word2Vec):
1932 rethrow = True

~/Documents/dutch_venv/lib/python3.9/site-packages/gensim/utils.py in load(cls, fname, mmap)
483 compress, subname = SaveLoad._adapt_by_suffix(fname)
484
--> 485 obj = unpickle(fname)
486 obj._load_specials(fname, mmap, compress, subname)
487 obj.add_lifecycle_event("loaded", fname=fname)

~/Documents/dutch_venv/lib/python3.9/site-packages/gensim/utils.py in unpickle(fname)
1458 """
1459 with open(fname, 'rb') as f:
-> 1460 return _pickle.load(f, encoding='latin1') # needed because loading from S3 doesn't support readline()
1461
1462

UnpicklingError: unpickling stack underflow


I'm not sure if the problem comes from the model itself or from the function I'm using to load it (it's my first time using it).

Thanks in advance!
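
A possible explanation, offered as an assumption rather than a confirmed diagnosis: Word2Vec.load expects a file written by gensim's own save(), whereas a file named model.bin is often stored in the original word2vec binary format, which gensim reads with KeyedVectors.load_word2vec_format(..., binary=True). The sketch below checks whether a file starts with the plain-text "vocabulary size, dimensions" header that the word2vec format uses; the helper names are hypothetical.

```python
def looks_like_word2vec_binary(path):
    """Return True if the file starts with a '<vocab_size> <dimensions>' line,
    which is how the word2vec text and binary formats begin."""
    with open(path, "rb") as f:
        header = f.readline(64)
    parts = header.split()
    return (header.endswith(b"\n")
            and len(parts) == 2
            and all(part.isdigit() for part in parts))


def load_vectors(path):
    """Load word2vec-format vectors; assumes gensim is installed."""
    # Imported here so the header check above works without gensim.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True)
```

If looks_like_word2vec_binary("dutch-word-embeddings/model.bin") returns True, then load_vectors on the same path is worth trying instead of Word2Vec.load.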
