
Comments (13)

amueller avatar amueller commented on August 22, 2024

So you want to give the list of names explicitly? That shouldn't be so hard, but it also seems a bit too manual to me. You could try a 2-gram tokenization and see if that works out. The scikit-learn tokenizer that I used earlier could do that; the current one cannot. But you could try to either make the tokenizer an option or add new functionality to the current one.

from word_cloud.

eyanq avatar eyanq commented on August 22, 2024

I agree with @ianhoolihan's point. 2-gram tokenization does not seem able to handle Chinese. Take "我购买了道具和服装", which means "I bought props and clothes" in English. With 2-gram tokenization it may come out meaning "I bought kimonos", which is quite different from the original. An alternative way of inputting content, such as [(word1, frequency1), (word2, frequency2), ...], would give me the flexibility to use other tokenization methods.


amueller avatar amueller commented on August 22, 2024

I don't really understand your example for lack of knowledge of Chinese. Why would 2-gram tokenization change the meaning?

You can manually set the words_ attribute of the wordcloud object and then call fit_words_. Does that work for you?
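A minimal sketch of that workaround (the words, counts, and font path below are placeholders; in recent wordcloud versions the public entry points for this are fit_words / generate_from_frequencies, which take a word-to-weight mapping):

```python
# Frequencies produced by your own tokenizer; these words and counts
# are placeholder examples.
frequencies = {"道具": 5, "服装": 3, "购买": 2}

# With the wordcloud package installed (a CJK-capable font is needed
# for Chinese text):
# from wordcloud import WordCloud
# wc = WordCloud(font_path="/path/to/a/cjk/font.ttf")
# wc.fit_words(frequencies)
# wc.to_file("cloud.png")
```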


eyanq avatar eyanq commented on August 22, 2024

Hi @amueller, the method you gave works fine for me. :)

Let me try to explain Chinese word segmentation. An English sentence can be segmented using whitespace and n-grams, while for Chinese, n-gram methods may not be able to do that.

Like the example I mentioned before

我购买了道具和服装

The right segmentation should be something like

我 / 购买 / 了 / 道具 / 和 / 服装

Which means

I bought / props / and / clothes /

in English.

If you use 2-gram segmentation only, you will get

我购(0) / 购买(1) / 买了(2) / 了道(3) / 道具(4) / 具和(5) / 和服(6) / 服装(7) /

Which means

I bou(0) / bought(1) / ought(2) / nonsense(3) / props(4) / nonsense(5) / kimonos(6) / clothes(7)

in English.

As you can see above, the sentence may be segmented like

I / bought / kimonos /
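For reference, the 2-gram list above is just a sliding window of width two over the characters (pure Python, no tokenizer assumed):

```python
def char_bigrams(text):
    """Return all overlapping two-character substrings of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

char_bigrams("我购买了道具和服装")
# -> ['我购', '购买', '买了', '了道', '道具', '具和', '和服', '服装']
```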

You can use other, more complex methods such as hidden Markov models or the bidirectional maximum matching algorithm instead of n-grams.
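As an illustration of the dictionary-based family of methods, here is a minimal backward maximum matching sketch. The toy lexicon below is an assumption covering only this one sentence; a real segmenter would use a large lexicon plus tie-breaking heuristics:

```python
def backward_max_match(text, dictionary, max_len=4):
    """Segment text right-to-left, greedily taking the longest dictionary word."""
    result = []
    i = len(text)
    while i > 0:
        for length in range(min(max_len, i), 0, -1):
            word = text[i - length:i]
            # Fall back to a single character when nothing matches.
            if word in dictionary or length == 1:
                result.append(word)
                i -= length
                break
    return result[::-1]

# Toy lexicon for the example sentence only.
lexicon = {"购买", "道具", "和服", "服装"}
backward_max_match("我购买了道具和服装", lexicon)
# -> ['我', '购买', '了', '道具', '和', '服装']
```

Scanning from the right resolves the 和服/服装 ambiguity in this sentence correctly; a forward scan would greedily pick 和服 ("kimono") first, which is exactly the mistake described above.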


amueller avatar amueller commented on August 22, 2024

Ah, ok, so it is a problem of word segmentation. Well yeah we are not gonna solve that here. Glad you found a way to make it work for you.


dblclik avatar dblclik commented on August 22, 2024

Hey Andreas--we implemented a way to accept either a custom tokenized list (for ngrams) or use the default long string and it seems to be working well. Currently it uses the WordNetLemmatizer from the NLTK package, but if you want to see the code let me know and I'll share it (if not, no worries, just wanted to offer). Thanks again for the WC package.
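One way such an option can look (a hypothetical sketch, not the actual PR code): accept either a pre-tokenized list or a raw string, falling back to the default whitespace split:

```python
def to_tokens(text_or_tokens):
    """Pass a pre-tokenized list through unchanged; otherwise split on whitespace.

    This lets callers plug in their own tokenizer (NLTK, jieba, an
    n-gram generator, ...) while keeping the simple default behaviour.
    """
    if isinstance(text_or_tokens, list):
        return list(text_or_tokens)
    return text_or_tokens.split()

to_tokens("I bought props and clothes")  # -> ['I', 'bought', 'props', 'and', 'clothes']
to_tokens(["我", "购买", "了", "道具"])  # passed through unchanged
```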


amueller avatar amueller commented on August 22, 2024

I probably don't want to depend on NLTK by default but having it as an option might be nice. If you like you can create a PR and I'll have a look.


az0 avatar az0 commented on August 22, 2024

I had a related issue where I wanted to plot the topics from a gensim LDA model, and it would be handy if wordcloud had a simple function that accepts a list of (word, weight) tuples and then calls fit_words.


amueller avatar amueller commented on August 22, 2024

@az0 Cool, I've been meaning to do that forever. You can use the topics to color the words btw ;)
So did the thing I mentioned above not work for you:

You can manually set the words_ attribute of the wordcloud object and then call fit_words_. Does that work for you?

Or do you mean you'd like to have a single method to do that, so that it is more easily visible?


az0 avatar az0 commented on August 22, 2024

@amueller : Yes, it did work for me, but it would be easier to find as its own function.

(BTW, I made a separate word cloud for each topic, and some words overlap between topics.)


amueller avatar amueller commented on August 22, 2024

I agree. I'll open it as its own issue.
You could use hue to indicate topics... but I guess with many topics that might not be super informative. Do you have links to your plots?


amueller avatar amueller commented on August 22, 2024

fyi @az0 PR welcome (with test + doc ;)


amueller avatar amueller commented on August 22, 2024

There is now generate_from_frequencies.
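So a list of (word, weight) tuples, e.g. from a gensim LDA topic, only needs to be turned into a dict before being handed to generate_from_frequencies. The words and weights below are made-up placeholders:

```python
# Hypothetical topic terms; a gensim LDA model's show_topic() returns
# (word, weight) pairs in this shape.
pairs = [("props", 0.42), ("clothes", 0.31), ("bought", 0.27)]
frequencies = dict(pairs)

# With the wordcloud package installed:
# from wordcloud import WordCloud
# wc = WordCloud().generate_from_frequencies(frequencies)
# wc.to_file("topic.png")
```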

