
Comments (13)

amueller avatar amueller commented on August 22, 2024

So you want to give the list of names explicitly? That shouldn't be so hard, but it also seems a bit too manual to me. You could try a 2-gram tokenization and see if that works out. The scikit-learn tokenizer that I used earlier could do that; the current one cannot. But you could try to either make the tokenizer an option or add new functionality to the current one.

from word_cloud.

eyanq avatar eyanq commented on August 22, 2024

I agree with @ianhoolihan's point. 2-gram tokenization does not seem able to handle Chinese. Take "我购买了道具和服装", which means "I bought props and clothes" in English. With 2-gram tokenization it may come out meaning "I bought kimonos", which is quite different from the original. An alternative way of inputting content, such as [(word1, frequency1), (word2, frequency2), ...], would give me the flexibility to use other tokenization methods.


amueller avatar amueller commented on August 22, 2024

I don't really understand your example for lack of knowledge of Chinese. Why would 2-gram tokenization change the meaning?

You can manually set the words_ attribute of the wordcloud object and then call fit_words_. Does that work for you?
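A minimal sketch of that workaround (the words, counts, and font path below are placeholders; in recent wordcloud versions the public entry points for this are fit_words / generate_from_frequencies, which take a word-to-weight mapping):

```python
# Frequencies produced by your own tokenizer; these words and counts
# are placeholder examples.
frequencies = {"道具": 5, "服装": 3, "购买": 2}

# With the wordcloud package installed (a CJK-capable font is needed
# for Chinese text):
# from wordcloud import WordCloud
# wc = WordCloud(font_path="/path/to/a/cjk/font.ttf")
# wc.fit_words(frequencies)
# wc.to_file("cloud.png")
```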


eyanq avatar eyanq commented on August 22, 2024

Hi @amueller, the method you gave works fine for me. :)

Let me try to explain Chinese word segmentation. An English sentence can be segmented using whitespace and n-grams, while for Chinese, n-gram methods may not be able to do that.

Like the example I mentioned before

我购买了道具和服装

The right segmentation should be something like

我 / 购买 / 了 / 道具 / 和 / 服装

Which means

I bought / props / and / clothes /

in English.

If you use 2-gram segmentation only, you will get

我购(0) / 购买(1) / 买了(2) / 了道(3) / 道具(4) / 具和(5) / 和服(6) / 服装(7) /

Which means

I bou(0) / bought(1) / ought(2) / nonsense(3) / props(4) / nonsense(5) / kimonos(6) / clothes(7)

in English.

As you can see above, the sentence may be segmented like

I / bought / kimonos /
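For reference, the 2-gram list above is just a sliding window of width two over the characters (pure Python, no tokenizer assumed):

```python
def char_bigrams(text):
    """Return all overlapping two-character substrings of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

char_bigrams("我购买了道具和服装")
# -> ['我购', '购买', '买了', '了道', '道具', '具和', '和服', '服装']
```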

You can use other, more complex methods such as hidden Markov models or the bidirectional maximum matching algorithm instead of n-grams.
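As an illustration of the dictionary-based family of methods, here is a minimal backward maximum matching sketch. The toy lexicon below is an assumption covering only this one sentence; a real segmenter would use a large lexicon plus tie-breaking heuristics:

```python
def backward_max_match(text, dictionary, max_len=4):
    """Segment text right-to-left, greedily taking the longest dictionary word."""
    result = []
    i = len(text)
    while i > 0:
        for length in range(min(max_len, i), 0, -1):
            word = text[i - length:i]
            # Fall back to a single character when nothing matches.
            if word in dictionary or length == 1:
                result.append(word)
                i -= length
                break
    return result[::-1]

# Toy lexicon for the example sentence only.
lexicon = {"购买", "道具", "和服", "服装"}
backward_max_match("我购买了道具和服装", lexicon)
# -> ['我', '购买', '了', '道具', '和', '服装']
```

Scanning from the right resolves the 和服/服装 ambiguity in this sentence correctly; a forward scan would greedily pick 和服 ("kimono") first, which is exactly the mistake described above.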


amueller avatar amueller commented on August 22, 2024

Ah, ok, so it is a problem of word segmentation. Well yeah we are not gonna solve that here. Glad you found a way to make it work for you.


dblclik avatar dblclik commented on August 22, 2024

Hey Andreas--we implemented a way to accept either a custom tokenized list (for ngrams) or use the default long string and it seems to be working well. Currently it uses the WordNetLemmatizer from the NLTK package, but if you want to see the code let me know and I'll share it (if not, no worries, just wanted to offer). Thanks again for the WC package.
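One way such an option can look (a hypothetical sketch, not the actual PR code): accept either a pre-tokenized list or a raw string, falling back to the default whitespace split:

```python
def to_tokens(text_or_tokens):
    """Pass a pre-tokenized list through unchanged; otherwise split on whitespace.

    This lets callers plug in their own tokenizer (NLTK, jieba, an
    n-gram generator, ...) while keeping the simple default behaviour.
    """
    if isinstance(text_or_tokens, list):
        return list(text_or_tokens)
    return text_or_tokens.split()

to_tokens("I bought props and clothes")  # -> ['I', 'bought', 'props', 'and', 'clothes']
to_tokens(["我", "购买", "了", "道具"])  # passed through unchanged
```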


amueller avatar amueller commented on August 22, 2024

I probably don't want to depend on NLTK by default but having it as an option might be nice. If you like you can create a PR and I'll have a look.


az0 avatar az0 commented on August 22, 2024

I had a related issue where I wanted to plot the topics from a gensim LDA model, and it would be handy if wordcloud had a simple function that accepts a list of (word, weight) tuples and then calls fit_words.


amueller avatar amueller commented on August 22, 2024

@az0 Cool, I've been meaning to do that forever. You can use the topics to color the words btw ;)
So did the thing I mentioned above not work for you:

You can manually set the words_ attribute of the wordcloud object and then call fit_words_. Does that work for you?

Or do you mean you'd like to have a single method to do that, so that it is more easily visible?


az0 avatar az0 commented on August 22, 2024

@amueller : Yes, it did work for me, but it would be easier to find as its own function.

(BTW, I made a separate word cloud for each topic, and some words overlap between topics.)


amueller avatar amueller commented on August 22, 2024

I agree. I'll open it as its own issue.
You could use hue to indicate topics... but I guess with many topics that might not be super informative. Do you have links to your plots?


amueller avatar amueller commented on August 22, 2024

fyi @az0 PR welcome (with test + doc ;)


amueller avatar amueller commented on August 22, 2024

There is now generate_from_frequencies.
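So a list of (word, weight) tuples, e.g. from a gensim LDA topic, only needs to be turned into a dict before being handed to generate_from_frequencies. The words and weights below are made-up placeholders:

```python
# Hypothetical topic terms; a gensim LDA model's show_topic() returns
# (word, weight) pairs in this shape.
pairs = [("props", 0.42), ("clothes", 0.31), ("bought", 0.27)]
frequencies = dict(pairs)

# With the wordcloud package installed:
# from wordcloud import WordCloud
# wc = WordCloud().generate_from_frequencies(frequencies)
# wc.to_file("topic.png")
```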

