
Comments (11)

amueller commented on July 21, 2024

That is interesting. How large is your text?

from word_cloud.

jorge80 commented on July 21, 2024

Hi, I want to perform analysis on 4gb of unstructured text in Unix mailbox (mbox) format,
but it fails on a 400gb file of the same mbox format, which is what I tried first.
My plan is to test it on only the extracted mail subjects, without header data.
I did some googling and I think the regexp is somehow wrong.
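
The plan of feeding in only the mail subjects could be sketched with the standard-library `mailbox` module. This is a minimal sketch under assumptions: it targets Python 3 (the thread uses Python 2.7), and the helper name `iter_subjects` is hypothetical and not part of word_cloud.

```python
import mailbox


def iter_subjects(mbox_path):
    """Yield the Subject header of each message in an mbox file, one at a time.

    Hypothetical helper for illustration; only the subjects are kept,
    message bodies and the other headers are skipped.
    """
    # create=False makes a missing file an error instead of creating an empty one.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            subject = message["Subject"]
            if subject:
                yield str(subject)
    finally:
        box.close()
```

Joining the yielded subjects with newlines gives a much smaller text than the full mailbox, which may be enough to check whether the input size is what triggers the error.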
 


From: Andreas Mueller [email protected]
To: "amueller/word_cloud" [email protected]
Date: 25.02.2015 15:09
Subject: Re: [word_cloud] MemoryError (#44)


jorge80 commented on July 21, 2024

Correction: it fails on a 400MB file of the same mbox format, which is what I tried first.

amueller commented on July 21, 2024

That is what I thought ;) How much RAM do you have? I am quite surprised. I don't think it is a problem with the regexp. Btw, could you try the scikit-learn CountVectorizer to see what that does on your data?

jorge80 commented on July 21, 2024

Traceback (most recent call last):
  File "simple.py", line 16, in <module>
    wordcloud = WordCloud(font_path='Verdana.ttf').generate(text)
  File "C:_pythonport\python-2.7.9\lib\site-packages\wordcloud\wordcloud.py", line 312, in generate
    self.process_text(text)
  File "C:_pythonport\python-2.7.9\lib\site-packages\wordcloud\wordcloud.py", line 259, in process_text
    for word in re.findall(r"\w[\w']*", text, flags=flags):
  File "C:_pythonport\python-2.7.9\lib\re.py", line 181, in findall
    return _compile(pattern, flags).findall(string)
MemoryError

Still the same after a RAM upgrade to 8GB on 64-bit Windows.

  • Any guidance on how to try the scikit-learn approach you recommend would be welcome.

amueller commented on July 21, 2024

Actually, I don't think scikit-learn would help, as it uses the same tokenization.
Could it be that you installed a 32-bit Python? That would be restricted to 2GB of RAM, IIRC.
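
One way to check this is to ask the interpreter for its pointer size. A minimal sketch using only the standard library; the function name `interpreter_bits` is a hypothetical helper, not something from word_cloud:

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer size of the running interpreter in bits (32 or 64)."""
    # "P" is the struct format code for a C pointer (void *).
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    # sys.maxsize exceeds 2**32 only on a 64-bit build.
    print("Python is %d-bit (sys.maxsize > 2**32: %s)"
          % (interpreter_bits(), sys.maxsize > 2 ** 32))
```

A 32-bit result on 64-bit Windows would mean the process cannot use most of the installed RAM, however much is added.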

terrycojones commented on July 21, 2024

Why not use re.finditer?

Also, there are a few places in the code where iterators could be used instead of lists.
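
To illustrate the suggestion: `re.findall` builds a list of every match before the loop starts, while `re.finditer` yields one match at a time. The sketch below is not the actual `process_text` code; it reuses the pattern from the traceback, and the function names `count_words` and `count_words_in_file` are hypothetical. It assumes Python 3 and that only word counts are needed.

```python
import re
from collections import Counter

# Same pattern as in the traceback above.
WORD_RE = re.compile(r"\w[\w']*")


def count_words(text):
    """Count words without materializing the full list of matches."""
    counts = Counter()
    for match in WORD_RE.finditer(text):
        counts[match.group(0)] += 1
    return counts


def count_words_in_file(path, encoding="utf-8"):
    """Count words line by line, so the whole file is never held in memory.

    The pattern cannot match across a newline, so per-line counting
    gives the same totals as scanning the whole text at once.
    """
    counts = Counter()
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            counts.update(m.group(0) for m in WORD_RE.finditer(line))
    return counts
```

With either function, memory use grows with the number of distinct words rather than the total number of words. If the installed wordcloud version provides `generate_from_frequencies`, the resulting counts could probably be passed to it directly; check its documentation for the expected input type.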

amueller commented on July 21, 2024

@terrycojones yeah, but I would be very surprised if there was any real impact on memory usage. Apart from this one, I guess. The current process_text code was not written by me and was not reviewed as carefully as it probably should have been, so it has some rough edges.

paulaceccon commented on July 21, 2024

Same problem here, first with normalize_plurals. Setting it to False produces the same error at the step that removes the stop words. I have already removed the stop words myself, but as far as I understand I cannot prevent this step from running.

amueller commented on July 21, 2024

@paulaceccon Sorry, I don't understand the connection to normalize_plurals; that shouldn't have an effect. How large is your data and how much RAM do you have? Could you share your data?

Thanks!

Andy

amueller commented on July 21, 2024

Closing, as we need data to reproduce.

