coosto / dutch-word-embeddings
Dutch word embeddings, trained on a large collection of Dutch social media messages and news/blog/forum posts.
License: Other
The demo labels the score between a given word and each word in the set as "distance". However, it looks like the larger the "distance", the more similar the words are, so "distance" seems to be a misnomer.
It might therefore be better to rename the column to "similarity".
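To illustrate the naming point, here is a minimal sketch that assumes the demo's score is a cosine similarity (the issue does not say which measure is used). The vectors are made up for illustration and are not taken from the model. With cosine similarity, larger means more alike; a true distance would be something like one minus that value, where smaller means more alike.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """One common distance derived from the similarity; 0.0 means same direction."""
    return 1.0 - cosine_similarity(u, v)


# Toy vectors, invented for this example only.
parallel = ([1.0, 2.0], [2.0, 4.0])
orthogonal = ([1.0, 0.0], [0.0, 1.0])

for name, (u, v) in (("parallel", parallel), ("orthogonal", orthogonal)):
    print(name, "similarity:", cosine_similarity(u, v), "distance:", cosine_distance(u, v))
```

If the column really holds values of the first kind, "similarity" would be the accurate label.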
Another Semantle list. This time, the problem is that words which I expect appeared in the original text as alternatives ("hij/zij" etc.) were fused into single words:
I verified using the demo from the README that these fused words indeed occur in the model; it's not an artifact of Semantle's code.
I was playing the Dutch version of Semantle today, a word guessing game which uses this model to compare words. There was a noticeable difference compared to the English version, which has a model based on newspaper articles. See the top 1000 similar words for today as an example.
One problem is spelling mistakes: as far as I'm aware (and I'm a native speaker), words 999, 998 and 997 don't actually exist in Dutch. I guess that they're common typos that are used in the same contexts as the real words.
Another problem, at least in my opinion, is in the associations themselves: a lot of them seem to come from a xenophobic background. I guess this may accurately reflect part of the social media sphere, but it does make using this model for unsupervised language processing risky, as it may make associations that could reflect poorly on the person or organization running the software.
For solving the typos, maybe words that both score high in similarity and are very close in letters as well could be checked against a dictionary.
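A rough sketch of that idea, under stated assumptions: `likely_typos` is a hypothetical helper, the thresholds are arbitrary, the neighbour scores are invented for the example, and the "dictionary" is a tiny hand-written set standing in for a real Dutch word list. Spelling closeness is measured here with `difflib.SequenceMatcher` from the standard library; an edit distance would work just as well.

```python
import difflib


def likely_typos(target, neighbours, dictionary, min_similarity=0.6, min_ratio=0.75):
    """Return neighbours that are similar in meaning score and in spelling
    to `target` but are missing from `dictionary`.

    neighbours: iterable of (word, similarity) pairs, e.g. from the model.
    dictionary: set of known-valid words.
    """
    flagged = []
    for word, similarity in neighbours:
        if similarity < min_similarity:
            continue  # not a close neighbour in the embedding space
        spelling = difflib.SequenceMatcher(None, target, word).ratio()
        if spelling < min_ratio:
            continue  # spelled too differently to be a typo of target
        if word not in dictionary:
            flagged.append(word)
    return flagged


# Invented example data, not output of the actual model.
neighbours = [("fietz", 0.82), ("auto", 0.75), ("fiest", 0.70), ("fietsen", 0.90)]
dictionary = {"fiets", "fietsen", "auto"}
print(likely_typos("fiets", neighbours, dictionary))
```

Words flagged this way could then be dropped from the vocabulary or merged with the correctly spelled entry.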
For the associations themselves, the problem is not so much in the training of the model as it is with the nature of the data it is trained on. Of course it is impossible to be 100% neutral politically, but I think people using a language model would expect something closer to neutral, or at least less controversial. If it cannot be fixed, maybe add a word of warning in the README.
I have a newbie problem here: I tried to load the model using
gensim.models.Word2Vec.load("dutch-word-embeddings/model.bin")
and I got the following error:
UnpicklingError Traceback (most recent call last)
/var/folders/nt/9bk8jw9n28b1p_jlynbvy4lc0000gp/T/ipykernel_46367/2953225409.py in
----> 1 model = gensim.models.Word2Vec.load("dutch-word-embeddings/model.bin")
~/Documents/dutch_venv/lib/python3.9/site-packages/gensim/models/word2vec.py in load(cls, rethrow, *args, **kwargs)
1928 """
1929 try:
-> 1930 model = super(Word2Vec, cls).load(*args, **kwargs)
1931 if not isinstance(model, Word2Vec):
1932 rethrow = True
~/Documents/dutch_venv/lib/python3.9/site-packages/gensim/utils.py in load(cls, fname, mmap)
483 compress, subname = SaveLoad._adapt_by_suffix(fname)
484
--> 485 obj = unpickle(fname)
486 obj._load_specials(fname, mmap, compress, subname)
487 obj.add_lifecycle_event("loaded", fname=fname)
~/Documents/dutch_venv/lib/python3.9/site-packages/gensim/utils.py in unpickle(fname)
1458 """
1459 with open(fname, 'rb') as f:
-> 1460 return _pickle.load(f, encoding='latin1') # needed because loading from S3 doesn't support readline()
1461
1462
UnpicklingError: unpickling stack underflow
I'm not sure if the problem comes from the model itself or from the function I'm using to load it (it's my first time using it).
Thanks in advance!
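A possible explanation, offered as an assumption rather than a confirmed diagnosis: `Word2Vec.load` expects a model pickled by gensim's own `save`, whereas a `.bin` file is often in the original word2vec binary format, which gensim reads with `KeyedVectors.load_word2vec_format(..., binary=True)`. The sketch below first checks for the plain-text header (`<vocab_size> <vector_size>`) that word2vec-format files start with, then loads accordingly. `has_word2vec_header` and `load_vectors` are hypothetical helper names, not part of gensim.

```python
def has_word2vec_header(path):
    """Heuristic: word2vec-format files begin with an ASCII line holding
    two integers, the vocabulary size and the vector dimensionality."""
    with open(path, "rb") as f:
        header = f.readline(64)
    parts = header.split()
    return (
        header.endswith(b"\n")
        and len(parts) == 2
        and all(p.isdigit() for p in parts)
    )


def load_vectors(path):
    """Load word vectors, choosing the loader from the file header."""
    # Imported lazily so the header check works without gensim installed.
    from gensim.models import KeyedVectors

    if has_word2vec_header(path):
        # Assumes the binary variant of the word2vec format (a .bin file).
        return KeyedVectors.load_word2vec_format(path, binary=True)
    # Otherwise assume a file written by gensim's own save().
    return KeyedVectors.load(path)
```

With the path from the question, usage would look like `vectors = load_vectors("dutch-word-embeddings/model.bin")`, followed by for example `vectors.most_similar("fiets", topn=5)`.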