Hi, I am using your wordcloud for one of my projects and I am working on se

Word cloud unicode (multilanguage) problem, about amueller/word_cloud

Comments (36)

bakrianoo commented on July 21, 2024 3

I just solved this problem by following these steps:

1 - In the library's site-packages folder, I replaced the "DroidSansMono.ttf" font file with a new one, keeping the same file name.
The problem is that this font file does not support Arabic. For example, I used this font:
https://brushez.com/sc-ameen-2.html

2 - I used both of these libraries to fix the direction of the Arabic characters:

python-bidi
arabic_reshaper

Then I applied this code:

from bidi.algorithm import get_display
import matplotlib.pyplot as plt
import arabic_reshaper
from wordcloud import WordCloud

text = u"انا احب اللغة العربية و حروفها I love English words"
reshaped_text = arabic_reshaper.reshape(text)
artext = get_display(reshaped_text)

wordcloud = WordCloud().generate(artext)
wordcloud.to_image()

from word_cloud.

iturki commented on July 21, 2024 2

These are two issues wrapped up in one: 1) characters are isolated and not connected; 2) text is printed LTR. This issue is well explained and solved in this blog post.

amueller commented on July 21, 2024

Can you provide a minimal example of what is not working for you?

samemon commented on July 21, 2024

So I have a text (unicode encoded) file which contains :
اخوان قطر هيموتوا علشان يوجدوا اي فتنه لمصر مع اي دولة وخصوصا الخليج بس نسيو ان مصر والخليج كبار لا ينظرون لتفاهات الصغار
امثال اخوان قطر

When I tried using masked.py to get a tag cloud for Arabic words, it gave some random letters. It produced the following image.

samemon commented on July 21, 2024

OK, so the problem was that my text file was saved in the "Unicode" encoding and not UTF-8.
So I converted it to UTF-8, and in the masked.py file I called text = text.decode("utf-8"). I also passed the font "courier" to the wordcloud function.

And it gave me the following output:

Do you see the problem?
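
The decoding step described above can be sketched for Python 3 as follows. This is only a minimal sketch, assuming the input file really is saved as UTF-8; the helper name read_utf8_text and the temporary file are illustrative and not part of wordcloud. Note that text.decode("utf-8") applies to Python 2 byte strings; in Python 3 the decoding happens when the file is opened.

```python
import os
import tempfile


def read_utf8_text(file_path):
    """Read a text file, decoding its bytes as UTF-8 (hypothetical helper)."""
    # In Python 3, passing encoding= makes open() return decoded str objects,
    # so no separate text.decode("utf-8") call is needed.
    with open(file_path, encoding="utf-8") as handle:
        return handle.read()


# Demonstration with a temporary file containing a short Arabic phrase.
sample = "مصر والخليج"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

loaded_text = read_utf8_text(tmp_path)
os.remove(tmp_path)
```

The resulting string can then be passed to WordCloud together with a font that covers the script.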

amueller commented on July 21, 2024

no, sorry.

samemon commented on July 21, 2024

In the previous example (Arabic text), your program breaks Arabic into letters. That's what the problem is. It does not parse it as words.

Also, when I tried the program on Hindi (मुक्त ज्ञानकोश विकिपीडिया से) or Chinese, it gave me the following picture:

amueller commented on July 21, 2024

For the Arabic, it is likely an issue with the regexp. Try another regexp. I'm not sure how word boundaries are usually encoded. It seems the whitespace doesn't work.

One issue with Chinese or Hindi might be that the default font doesn't have the right glyphs, so try a different font.
For Chinese, I am pretty sure that finding word boundaries using the regexp will not work. Word segmentation for Chinese is out of the scope of this project. You need to extract word frequencies some other way and pass them to the generate_from_frequencies function.
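
The suggestion to compute frequencies separately might look like the sketch below. It assumes whitespace-delimited text; the helper count_words is hypothetical, and for Chinese the text.split() call would have to be swapped for a real word segmenter. The font path in the comment is a placeholder.

```python
from collections import Counter


def count_words(text, stopwords=frozenset()):
    """Count whitespace-separated tokens (hypothetical helper)."""
    # str.split() with no arguments splits on any run of whitespace.
    tokens = [token for token in text.split() if token not in stopwords]
    return Counter(tokens)


text = "مصر والخليج كبار مصر"
frequencies = count_words(text)

# With wordcloud installed, the counts can be handed over directly
# (the font path below is a placeholder, not a real file):
#   from wordcloud import WordCloud
#   cloud = WordCloud(font_path="path/to/font.ttf")
#   cloud.generate_from_frequencies(frequencies)
```

Doing the tokenization outside the library bypasses its regexp entirely, which is why this route is independent of how the regexp treats a given script.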

commented on July 21, 2024

@samemon Can you explain in more detail how you dealt with the UTF-8 problem? I got the error:
unknown locale: UTF-8. Thanks a lot!

amueller commented on July 21, 2024

@rylanchiu how did you read your file? And can you give the traceback of the error?
Your issue is somewhat unrelated to this one.

essandess commented on July 21, 2024

I also have this problem with all right-to-left languages. There are two issues:

1. The wordcloud words are written backwards left-to-right rather than right-to-left. For example, the word "دولة" appears in orange at the middle-right backwards as "ة ل و د". The letters must be reversed for right-to-left languages; this should be a really easy fix.
2. Scripted languages like Arabic have word letters broken apart with the initial letter forms appearing, rather than the scripted version. E.g. "دولة", not "د و ل ة". I'm not sure how easy a fix this would be. I'd suggest positioning the word all at once, rather than positioning individual letters that should be tied together.
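
Both issues can be handled before the text reaches the renderer. The sketch below detects right-to-left characters with the standard library and only then applies the arabic_reshaper and python-bidi steps shown in the first comment of this thread. The function names are hypothetical, and the two third-party packages are imported lazily, so they are only needed when right-to-left text is actually present.

```python
import unicodedata


def contains_rtl(text):
    """Return True if any character has a right-to-left bidi class."""
    # "R" is right-to-left (e.g. Hebrew), "AL" is Arabic letter.
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def prepare_for_rendering(text):
    """Reshape and reorder RTL text; return other text unchanged (sketch)."""
    if not contains_rtl(text):
        return text
    # Lazy imports: these packages are only required for RTL input.
    import arabic_reshaper
    from bidi.algorithm import get_display

    # reshape() joins the letters into their connected forms,
    # get_display() reorders them for left-to-right drawing.
    return get_display(arabic_reshaper.reshape(text))
```

The output of prepare_for_rendering would then be passed to WordCloud in place of the raw text.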

amueller commented on July 21, 2024

I have no experience with this, but if you have a fix, please send a PR.

amueller commented on July 21, 2024

@iturki do you want to add that to the documentation? Maybe give an example?

amueller commented on July 21, 2024

@bakrianoo thanks. maybe talk to @caleighm in #315. This font would also work, right? https://en.wikipedia.org/wiki/GNU_FreeFont

amueller commented on July 21, 2024

the text you gave looks a bit short for generating the image you showed. Is that really all?

bakrianoo commented on July 21, 2024

@amueller
I just updated the code above, so the generated image is now the true result for that text; it also combines Arabic and English characters.

Sorry, the "GNU" free font did not work for Arabic, but the popular "arial" font worked well, as I already mentioned in the example above.
http://www5.miele.nl/apps/vg/nl/miele/mielea02.nsf/0e87ea0c369c2704c12568ac005c1831/07583f73269e053ac1257274003344e0?OpenDocument

I think it would be a good idea to allow passing a custom font file as a parameter.

Thanks

amueller commented on July 21, 2024

What's the issue with the gnu free font? The page says it supports arabic. The arial font is not free, and there is already a custom font parameter, right (font_path)? I'm not sure why you didn't use it.
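
Using font_path instead of overwriting the bundled font file could be sketched like this. The helper names and the candidate paths are hypothetical; only the font_path argument of WordCloud comes from this thread. The wordcloud import is done lazily so the font lookup can be tried on its own.

```python
import os


def first_existing_font(candidates):
    """Return the first path in candidates that exists on disk, else None."""
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


def make_wordcloud(text, candidates):
    """Build a word cloud with the first available font (sketch)."""
    font_path = first_existing_font(candidates)
    if font_path is None:
        raise FileNotFoundError("none of the candidate fonts were found")
    from wordcloud import WordCloud  # lazy import; requires wordcloud

    return WordCloud(font_path=font_path).generate(text)


# Hypothetical locations; adjust to wherever the fonts are installed.
font_candidates = [
    "/usr/share/fonts/truetype/freefont/FreeSerif.ttf",
    "fonts/FreeSerif.ttf",
]
```

Keeping the font choice in the calling code avoids editing files inside site-packages, which would be undone by reinstalling the library.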

bakrianoo commented on July 21, 2024

@amueller
The problem, as I mentioned in another ticket, is that it is missing a lot of Arabic characters. You can see these squares [missing chars] in this image:
https://image.ibb.co/m6OZdG/GNU.png

I think that maybe this problem occurs with other languages too.

amueller commented on July 21, 2024

@bakrianoo see my reply #315 (comment) for a fix

Keramatfar commented on July 21, 2024

@amueller that is good. But another problem exists:

You can see the problem in how the words are shown.

amueller commented on July 21, 2024

No, I can not see the problem. Please keep in mind that I can not read Arabic.

Keramatfar commented on July 21, 2024

OK, that is Persian. I solved the problem by changing the font.

amueller commented on July 21, 2024

glad you solved it.

commented on July 21, 2024

I am having an issue with the Nepali language too.

https://stackoverflow.com/questions/50080183/word-cloud-or-visualization-in-foreign-languages

amueller commented on July 21, 2024

@dataneupane this issue and the others I linked to you actually contain a solution...

pranphy commented on July 21, 2024

Error with Devanagari wordcloud.

Code from Here. Laila-Regular.ttf font from here.

top_dict = {'राष्ट्रिय': 500, 'नेपाली': 289, 'नेपाल': 248, 'सरकारले': 209, 'वर्ष': 199,  'व्यवस्था': 180, 'निर्माण': 158, 'स्थानीय': 156, 'अध्यक्ष': 145, 'महिला': 143, 'प्रमुख': 132, 'विकास': 132, 'रुपैयाँ': 127, 'योजना': 125, 'कार्यक्रम': 120, 'प्रतिशत': 120, 'पानी': 117, 'अन्तर्राष्ट्रिय': 116, 'प्रदेश': 109, 'सामाजिक': 109, 'सार्वजनिक': 108, 'रकम': 105, 'माग': 100, 'सडक': 100, 'गीत': 99, 'क्षेत्रमा': 98, 'विषयमा': 98, 'समस्या': 98}
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def show_wordcloud(data):
    WC = WordCloud(
        font_path='Laila-Regular.ttf',
        background_color='white',
        #stopwords=stopwords,
        #max_words=1000,
        max_font_size=500, 
        scale=5,
        random_state=1 # chosen at random by flipping a coin; it was heads
    )
    #wordcloud = WC.generate(str(data))
    wordcloud = WC.generate_from_frequencies(data)

    fig = plt.figure(1, figsize=(12, 12))
    plt.axis('off')
    
    plt.imshow(wordcloud)
    #plt.savefig('Test.png',dpi=150,bbox_inches='tight')
    plt.show()

show_wordcloud(top_dict)

So the problem is that some of the joining glyphs are not joined. This font works fine in other text editors, which show the joined glyphs. Look at the first key of the dict to see how it should be rendered, and at the largest text in the image to see how it is actually rendered.
EDIT
I looked at the implementation of this function here and tried to write text into an image with PIL Image; the problem seems to be there. Any suggestions?

amueller commented on July 21, 2024

@pranphy one option would be to try and debug why it's failing with PIL. Another would be to use a different rendering engine as suggested here #493 and here #58.

raselcse13 commented on July 21, 2024

I am using this text with Python wordcloud, but the word cloud is not rendered properly. Can anyone help me?

মুস্তাফিজ কে সুইংও শিখতে হবে বিশ্বসেরা হতে হলে কাটারই যথেস্ট নয় সেদিন প্রতিপক্ষরা প্রতিমুহুর্তে দেখছিলাম একে অপরের সাথে একের পর এক পরামর্শ করে বলিং করছে আর শুভাগত যদি ননস্ট্রাইকিং প্রান্তের মুস্তাফিজকে আগেই কল করতো তাহলে সে রানআউট হতো না গুলিতো ক্রিকেট সেন্সের ব্যাপার- তাই না আমি মনে করি দল নির্বাচনে আমাদের ধারাবাহিক হতে হবে নিউজিল্যান্ড হচ্ছে সুইং বোলিং এর জন্য সেরা কন্ডিশন দেশের সেরা সুইংবোলার রুবেল কে বসিয়ে রাখছেন হাতুড়ে শুধু লেগস্পিন নিয়ে ভাবছেন লেগস্পিনার ছাড়াও অনেক দল ভাল করছে বাংলাদেশ যদি জেতে তাহলে আপনার কাজ নিয়ে কেউ কোন কথা বলবে না কিন্তু এইভাবে যদি হারেন তাহলে তো চলে না ভারত পাকিদের মতো ঘরে বাঘ বাইরে বিড়াল হতে চাইনা কোচ হাতুরু বাজে টীম সেলেক্শনকে ধামা চাপা দেবার জন্য সাফাই গাইছেন বাংলাদেশ দলে যেখানে মোসাদ্দেক মমিনুল মিরাজ ভালো ফর্মে রয়েছেন সেখানে শুভাগত কোনভাবেই জাতীয় দলে জায়গা পেতে পারেনা মুশফিকের বিকল্প বাংলাদেশ এ এখন ও নাই দলের সেরা ব্যাটসম্যান ইনজুরিতে পড়লে ভাবনাটা অস্বাভাবিক না টেষ্ট সিরিজের আগে মুশফিক না ফিরতে পারলে যে কি হবে তা ভেবে শিউরে উঠছি ওয়ানডে সিরিজে সিনিয়র খেলোয়াড় হিসেবে তাঁর ভূমিকা নিয়ে একটু প্রশ্ন উঠেছে সাকিব কেন ওয়ানডে ম্যাচগুলোয় আরও একটু দায়িত্ব নিয়ে খেললেন না—এমন প্রশ্ন ঘুরপাক খাচ্ছে অনেক ক্রিকেটপ্রেমীর মনেই আইপিএল নিষিদধ করা হোক শুধু আইপিএল না সব ধরনের টি-২০ নিষিদধ হোক সুপ্রিম কোর্ট ভারতীয় ক্রিকেট সংস্থায় গণতন্ত্র প্রতিষ্ঠা করছেন তাহলে তো নিজের জন্য ভাবাই ভালোকারণ এতে দলের লাভ হয়দলের রান বাড়েঅপনেন্টকে কম রানে আটকানো যায় এখন বলেন কোনটা ভালো ৩০-৪০ বল খেলে ২০ রান করে সেন্চুরি করে দলকে পরাজয়ের দিকে ঠেলে দেয়া না দলের প্রয়জনে নিজের সেন্চুরির কথা ভুলে জয়ের দিকে এগিয়ে নেয়া নিজে ভালো খেললে কিন্তু দলের ও উপকার হয় সাকিবের ভাবনাই ঠিক আছে সাকিব তার সাধ্যমত চেষ্টা করেতার ব্যাটিংএর ধরন সবসময়ই ঐরকম আমাদের দেশের সব প্রতিভাধর খেলোয়াড়দের সর্বনাশ করে আমাদের সংবাদমাধ্যম শুরুতেই মাথায় তুলে দিয়ে সবাই অপেক্ষা করছে সৌম্য কবে একটা ফিফটি করে সৌম্যের একজন ভালো মানের ব্যাটসম্যান ওর ব্যাটিং স্কিল ভালো ঘরোয়া-আন্তর্জাতিক সবজায়গায় গত এক বছর ধরে ক্রমাগত ব্যর্থ এই ব্যাটসম্যান এর দলে অন্তর্ভূক্তি নির্বাচকদের অদক্ষতারই প্রমাণ নাসির-নাফিস-বিজয়-রুবেলদের যেখানে 
আন্তর্জাতিক ম্যাচ থেকে বাদ দেয়ার পর ঘরোয়ায় ভালো করার পরও দদলে চান্স হচ্ছে না সেখানে সৌম্যের পারফরমেন্স পরীক্ষা করা হচ্ছে আন্তর্জাতিক ম্যাচে সৌম্য আসলে অনেক ভাগ্যবান বাংলাদেশ ক্রিকেটে মাশরাফি–সাকিব যা বলে সেটাই হয় তাহলে নি্র্বাচক কমিটির কি দরকার তাদের বাত দিয়ে দিন মাশরাফি–সাকিব যা বলে সেটাই হয় কিন্তু দল ঠিক করে পাপন সাব পাপন স্যার আপনি বলতে চাচ্ছেন সাকিব ও মাশরাফির কারনে নাসির বাদ পরেছে তার কথার সত্যমিথ্যা জানি নাকিন্তু যে দলটা দিল তা পছন্দ হয়েছে ম্যাচ হারলে সব দোষ সিনিয়রদের আমাদের স্যার শুভাগত স্যার তানভির এদের কোন দোষ নাইএদের তো নিউজিল্যান্ড ভ্রমনের জন্য পাঠানো হয়েছে বাংলাদেশকে জিততে হবে এটাই মূল কথা মাসরাফি বলেছিল ১ টি ভাল বল ১ জন ভাল বাটসমানের এনাফ পজেটিব হেডলাইন টাইগারেরা টি-টুয়েন্টিতে নিউজিল্যান্ডকে উড়িয়ে দেবে সৌম তার ধারাবাহিকতা বজায় রেখেছে হাথুরা বলল সৌম্যের গড় এখনও ৪০ কিন্ত গত ২০ ম্যাচে সৌম্যের গড় কত ১৫ এরও নীচেএকজন কোচ হয়ে এমন কথা কিভাবে বলে মুমিনুলের গড়ও ৪০তাহলে মুমিনুল কে নেয়া হচ্ছে না কেন লেগ স্পিনার ছাড়াই ভারত ইংল্যান্ডকে ওয়ানডে ও টেস্টে হোয়াইট ওয়াশ করল টেস্ট ক্রিকেটের দ্রুততম সেঞ্চুরি করেছেন ভিভ রিচার্ডস ও মিসবা উল হক যুগ্মভাবে আমরা আমাদের মাঠে ওদের হারিয়েছি এবার ওদের সময়ফিল্ডিং এ ও আমাদের অনেক উন্নতি করতে হবে আমাদের সৌম্যের ব্যটিং দেখে উদ্বেলিত ব্যাট হাতে শুন্য রান করে পরে এক অভারে দিল ১৭ রান যদিও বাংলাদেশ বড় স্কোর করতে পারেনিকিন্তু ১৪ ওভার পর্যন্ত ম্যাচটা বাংলাদেশের দিকেই ছিল একটা ব্যাটসম্যান যখন লাইফ পায় তখন সে যে কোনো কারো দলের জন্য হুমকি হতে পারে এটাই স্বাবাবিক ঠিক মত দল সিলেক্ট করতে পারেন না আবার বড় সংগ্রহ চায় বাংলাদেশ দলের ব্যাটিং দেখে আমি অত্যন্ত বিরক্ত

raselcse13 commented on July 21, 2024

Here is my Python code:
import os
import bangla
from os import path
import unicodedata

from wordcloud import WordCloud

d = path.dirname(__file__) if "__file__" in locals() else os.getcwd()
text = datavalue_2

#text = bangla.datavalue_2

#wordcloud = WordCloud().generate(text)

import matplotlib.pyplot as plt
#plt.imshow(wordcloud, interpolation='bilinear')
#plt.axis("off")

wordcloud = WordCloud(max_font_size=100,width=1200, height=800).generate(text)

plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

pranphy commented on July 21, 2024

I would suggest giving the full path of a font that has these glyphs in your WordCloud call:
wordcloud = WordCloud(max_font_size=100,font_path='full/path/to/font', width=1200, height=800).generate(text)
If it works, great. Otherwise, if you run into incorrect glyph placement, it is most likely because of incorrect rendering by Pillow, in which case you can try the suggestions in #493.

AbdulwahabDev commented on July 21, 2024

FreeSerif.ttf

fixed my Arabic issue. Thanks, @bakrianoo and @amueller.

008karan commented on July 21, 2024

Is there any font which supports multiple languages? I need to build word clouds for multiple Indian languages.

Shanky-21 commented on July 21, 2024

I am working with the Hindi language and have used the Lohit-devanagri.ttf font. As you can see, my word cloud is generated, but it doesn't contain any meaningful words; that is, it's not picking up words to determine frequency. What should I do?

amueller commented on July 21, 2024

Sorry, I can not read Hindi so it's not clear to me from the image why they are not meaningful words. Should splitting by whitespace work for the language?

Shanky-21 commented on July 21, 2024

Sorry, I can not read Hindi so it's not clear to me from the image why they are not meaningful words. Should splitting by whitespace work for the language?

Yes, splitting by whitespace is working. Almost 90% of the words shown in the wordcloud are actually contiguous substrings of the actual words. I can give you an example:
Suppose I passed a text file containing " Hello, I am from India "*100 to generate the wordcloud. When the wordcloud is shown, you will only see words like "ello", "lo", "rom", "In", "am fr", "dia". That means it is picking up substrings of words.

The same problem occurs with Hindi text, but splitting by whitespace works correctly. I don't know what to do.

amueller commented on July 21, 2024

Can you check if splitting by the regexp that's the default is working? If not, can you find a regexp that is working? Check out #562 for some context.
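
One way to run that check is to compare the tokenizations side by side using only the standard library. The pattern below is an assumption about the library default when regexp is not set, so it is worth confirming against the installed version. In Python's re module, \w does not appear to match Devanagari vowel signs and other combining marks, which would explain words being cut into fragments.

```python
import re

# Assumption: this mirrors the pattern wordcloud uses when regexp is not set.
DEFAULT_PATTERN = r"\w[\w']+"

text = "मुक्त ज्ञानकोश विकिपीडिया से"

# Tokens from plain whitespace splitting (the expected whole words).
whitespace_tokens = text.split()
# Tokens from the assumed default pattern.
regexp_tokens = re.findall(DEFAULT_PATTERN, text)
# Tokens from a pattern that simply matches runs of non-whitespace.
non_space_tokens = re.findall(r"\S+", text)

print(whitespace_tokens)
print(regexp_tokens)
print(non_space_tokens)
```

If the second list differs from the first, passing a pattern such as r"\S+" through the regexp argument of WordCloud, or counting the words yourself and calling generate_from_frequencies, may be worth trying. Note that r"\S+" keeps punctuation attached to words, so some cleanup may still be needed.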
