Comments (11)
That is interesting. How large is your text?
from word_cloud.
Hi, I want to perform an analysis on 4 GB of unstructured text in Unix mailbox (mbox) format,
but it fails on a 400gb file of the same mbox format, which I tried first.
My plan is to test it on only the extracted mail subjects, without the header data.
I did some googling and I think the regexp is somehow wrong.
From: Andreas Mueller [email protected]
To: "amueller/word_cloud" [email protected]
Date: 25.02.2015 15:09
Subject: Re: [word_cloud] MemoryError (#44)
That is interesting. How large is your text?
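The plan above, running the analysis on mail subjects only, could look roughly like the following. This is a minimal sketch using Python 3's standard mailbox module (the thread itself is on Python 2.7); the function names iter_subjects and write_subjects and the file paths are hypothetical, not from the thread.

```python
import mailbox


def iter_subjects(mbox_path):
    """Yield the Subject header of every message in an mbox file."""
    # create=False: raise an error instead of creating a missing mailbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            subject = message["Subject"]
            if subject is not None:  # skip messages without a Subject header
                yield str(subject)
    finally:
        box.close()


def write_subjects(mbox_path, out_path):
    """Write one subject per line and return how many were written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for subject in iter_subjects(mbox_path):
            # Collapse folded headers and repeated whitespace onto one line.
            out.write(" ".join(subject.split()) + "\n")
            count += 1
    return count
```

Calling write_subjects("mail.mbox", "subjects.txt") (both paths are placeholders) would produce a much smaller text file that can then be passed to the word cloud step.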
Correction: it fails on a 400 MB file of the same mbox format, which I tried first.
That is what I thought ;) How much RAM do you have? I am quite surprised. I don't think it is a problem with the regexp. Btw, can you maybe try the scikit-learn CountVectorizer to see what that does on your data?
Traceback (most recent call last):
  File "simple.py", line 16, in <module>
    wordcloud = WordCloud(font_path='Verdana.ttf').generate(text)
  File "C:_pythonport\python-2.7.9\lib\site-packages\wordcloud\wordcloud.py", line 312, in generate
    self.process_text(text)
  File "C:_pythonport\python-2.7.9\lib\site-packages\wordcloud\wordcloud.py", line 259, in process_text
    for word in re.findall(r"\w[\w']*", text, flags=flags):
  File "C:_pythonport\python-2.7.9\lib\re.py", line 181, in findall
    return _compile(pattern, flags).findall(string)
MemoryError
Still the same after a RAM upgrade to 8 GB on 64-bit Windows.
Any guidance on how to try the scikit-learn approach you are recommending would be welcome.
Actually, I don't think scikit-learn would help, as it uses the same tokenization.
Could it be that you installed a 32-bit Python? That would be restricted to 2 GB of RAM, IIRC.
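One way to check whether the installed interpreter is a 32-bit build, independent of the operating system, is sketched below. It only uses the standard library; the helper names interpreter_bits and is_64bit_python are made up for illustration.

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer width of the running interpreter in bits."""
    # "P" is the struct format code for a C pointer (void *).
    return struct.calcsize("P") * 8


def is_64bit_python():
    """True when the interpreter itself is 64-bit, regardless of the OS."""
    return sys.maxsize > 2**32


if __name__ == "__main__":
    print("Python build: %d-bit" % interpreter_bits())
```

A 64-bit Windows installation can still run a 32-bit Python build, so this reports on the interpreter rather than on the machine.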
Why not use re.finditer?
Also, there are a few places in the code where iterators could be used instead of lists.
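A rough sketch of what the re.finditer suggestion might look like is given here. It reuses the token pattern from the traceback above but is not the library's actual process_text code; the function names, the default regex flags, and the line-by-line file reading are assumptions made for illustration.

```python
import re
from collections import Counter

# Token pattern copied from the traceback; compiled once, default flags assumed.
TOKEN = re.compile(r"\w[\w']*")


def count_words(text):
    """Count tokens lazily instead of materialising the full match list."""
    counts = Counter()
    for match in TOKEN.finditer(text):
        counts[match.group(0)] += 1
    return counts


def count_words_in_file(path, encoding="utf-8"):
    """Count tokens line by line so the whole file is never held in memory."""
    counts = Counter()
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            counts.update(count_words(line))
    return counts
```

Unlike re.findall, which returns every match in one list, this keeps only one match object and the running counts alive at a time. If the installed wordcloud version provides WordCloud.generate_from_frequencies, the resulting counts could be passed to it instead of the raw text.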
@terrycojones Yeah, but I would be very surprised if there was any real impact on memory usage. Apart from this one, I guess. The current process_text code was not written by me and was not reviewed as carefully as it probably should have been, so it has some rough edges.
Same problem here. It first failed in normalize_plurals. Setting that to False produces the same error when removing the stop words. I have already removed the stop words myself, but as far as I understand I cannot prevent this step from happening.
@paulaceccon Sorry, I don't understand the connection to normalize_plurals; that shouldn't have an effect. How large is your data and how much RAM do you have? Could you share your data?
Thanks!
Andy
Closing, as we need data to reproduce.