
Comments (2)

MaartenGr commented on June 2, 2024

I found that the values remained constant up to the PartOfSpeech representation module, and switching it to another one resolved the issue. The problem appears to originate in the deduplication idiom (list(set())) used at lines 121 and 130. Because the hash function used for generating set keys is seeded at the start of the interpreter (the seed can be overridden using the PYTHONHASHSEED environment variable), the order of the deduplicated output differs between runs. This causes word_indices at line 144 to change with each run, which later becomes a problem when sorting keywords with the same c-TF-IDF score, as the ties are arranged differently.

Amazing, great catch! That's also a nasty habit of mine so I wouldn't be surprised if that happens in other places as well.
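
For illustration, here is a minimal sketch of two deduplication approaches whose output order does not depend on the hash seed. The helper names and the sample words are hypothetical and are not taken from the BERTopic source; which variant fits depends on whether the original order of the items matters.

```python
def dedupe_keep_order(items):
    """Remove duplicates while keeping first-seen order.

    Relies on dicts preserving insertion order (Python 3.7+),
    so the result does not depend on PYTHONHASHSEED.
    """
    return list(dict.fromkeys(items))


def dedupe_sorted(items):
    """Remove duplicates and return the items in sorted order."""
    return sorted(set(items))


# Hypothetical sample data
words = ["topic", "model", "topic", "cluster", "model"]

stable = dedupe_keep_order(words)
alphabetical = dedupe_sorted(words)

print(stable)        # ['topic', 'model', 'cluster']
print(alphabetical)  # ['cluster', 'model', 'topic']
```

By contrast, list(set(words)) may return these three strings in a different order from one interpreter run to the next unless PYTHONHASHSEED is fixed.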

Part 1
Sort the word_indices at line 144 using numpy. This will ensure a consistent ordering of words, should be faster than the built-in sort, and will convert them into a numpy array for further operations.
Remove the numpy array creation at line 145, as it's handled by the previous step.

Sounds good and a minimal change as well, which I prefer!
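
The Part 1 suggestion could look roughly like the sketch below. The lookup contents and keywords are made up for the example, and the variable names only mirror the ones mentioned in the thread; this is not the actual BERTopic code.

```python
import numpy as np

# Hypothetical vocabulary lookup: word -> column index (0-based)
words_lookup = {"apple": 0, "banana": 1, "cherry": 2, "date": 3}

# Deduplicated keywords; iteration order of a set of strings can vary per run
keywords = list({"date", "apple", "cherry", "date"})

# np.sort accepts a plain list and returns a sorted numpy array,
# so no separate np.array(...) conversion is needed afterwards.
word_indices = np.sort([words_lookup[keyword] for keyword in keywords])

print(word_indices)  # [0 2 3]
```

Whatever order the set yields, the sorted indices come out the same, so any later tie-breaking among equal c-TF-IDF scores starts from the same arrangement.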

When looking at the PoS code, I noticed that word_indices at line 144 are generated using the condition "if words_lookup.get(keyword)", which ignores the first word returned by get_feature_names_out. It looks like an error.

I'm not sure if I understand correctly. Why would the first word be ignored?


Greenpp commented on June 2, 2024

I'm not sure if I understand correctly. Why would the first word be ignored?

That's because of how the values are converted to booleans. At line 140 a lookup is created that maps each word to its index (0-based), which is later used at line 144 to extract the indices. The lookup output is filtered using the condition "if words_lookup.get(keyword)" to exclude None values, since the .get method of a dictionary returns None by default if the key isn't found. However, the same condition also evaluates to False for the index 0, causing the first word to be ignored.

Example

[v for v in [1, None, 2, 0, 3] if v]
# [1, 2, 3]
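
One possible way to avoid this is to test for membership instead of truthiness. The snippet below is a sketch with a made-up lookup and keyword list, not the actual BERTopic code.

```python
# Hypothetical vocabulary lookup: word -> column index (0-based)
words_lookup = {"apple": 0, "banana": 1, "cherry": 2}
keywords = ["apple", "cherry", "unknown"]

# Truthiness check: drops missing keys (None) but also the valid index 0
buggy = [words_lookup.get(k) for k in keywords if words_lookup.get(k)]

# Membership check: drops only the keys that are really missing
fixed = [words_lookup[k] for k in keywords if k in words_lookup]

print(buggy)  # [2]
print(fixed)  # [0, 2]
```

Comparing against None explicitly (words_lookup.get(k) is not None) would give the same result as the membership check.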
