Comments (1)
I am not sure why you would get only one cluster. Your dataset may be relatively small, or the minimum cluster size may be too large for it. Your sentences might also be too short, or overcleaned during preprocessing.
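To check those possibilities quickly, a small diagnostic along these lines might help. It is only a sketch: `corpus_summary`, `min_words`, and the example documents are hypothetical and not part of BERTopic, and whitespace splitting is a crude stand-in for real tokenization.

```python
from statistics import median


def corpus_summary(docs, min_cluster_size=10, min_words=3):
    """Summarize a corpus before clustering (hypothetical helper, not a BERTopic API)."""
    # Whitespace word counts are a rough proxy for how much signal each document carries.
    word_counts = [len(doc.split()) for doc in docs]
    return {
        "n_docs": len(docs),
        "median_words": median(word_counts) if word_counts else 0,
        # Documents that are empty or nearly empty, e.g. after aggressive cleaning.
        "n_short_docs": sum(1 for n in word_counts if n < min_words),
        # Upper bound on how many clusters of the minimum size could exist at all.
        "max_possible_clusters": len(docs) // min_cluster_size,
    }


docs = ["topic models need text", "ok", "", "short docs carry little signal"]
summary = corpus_summary(docs, min_cluster_size=2, min_words=3)
print(summary)
```

If `max_possible_clusters` is very low, or a large share of documents counts as short, that would be consistent with the explanations above.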
Another possible cause is UMAP: it is stochastic by nature, so you get different results every time you run it. It might be strange advice, but perhaps try running it again?
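If you would rather have repeatable runs than re-roll the dice, fixing UMAP's seed is one option. The sketch below assumes a BERTopic release whose constructor accepts `umap_model` and `hdbscan_model` (the version discussed in this thread may not); `build_topic_model` and the parameter values are illustrative choices, not recommended settings.

```python
def build_topic_model(random_state=42, min_cluster_size=5):
    """Build a BERTopic model with a seeded UMAP and an explicit minimum cluster size.

    Hypothetical helper. Imports are local so that defining the function
    does not require the libraries to be installed.
    """
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    # A fixed random_state makes the dimensionality reduction repeatable across runs.
    umap_model = UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=random_state,
    )
    # A smaller min_cluster_size lets HDBSCAN form clusters on small datasets.
    hdbscan_model = HDBSCAN(
        min_cluster_size=min_cluster_size,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True,
    )
    return BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```

You would then call `build_topic_model().fit_transform(docs)` as usual. Trying a few different seeds is still worthwhile: if every seed gives a single cluster, the cause is more likely the data or the cluster size than UMAP's randomness.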
For long text files, there are several ways to extract suitable embeddings. Currently, you can only pass documents into BERTopic. However, I am planning to add an option for using your own embeddings. This would allow you to embed individual paragraphs and apply mean pooling to merge them into a single embedding for the long text. It would also let you choose any other transformer model that might be better suited to embedding long texts.
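The paragraph-level mean pooling described above could look roughly like this. It is a minimal sketch in plain Python: `split_paragraphs`, `mean_pool`, and `embed_long_text` are hypothetical helpers, paragraphs are assumed to be separated by blank lines, and `toy_embed` is a stand-in for a real sentence-embedding model.

```python
def split_paragraphs(text):
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def mean_pool(vectors):
    """Average equal-length vectors element by element."""
    if not vectors:
        raise ValueError("cannot pool an empty list of vectors")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def embed_long_text(text, embed_paragraph):
    """Embed each paragraph separately, then merge into one vector for the whole text."""
    paragraphs = split_paragraphs(text)
    return mean_pool([embed_paragraph(p) for p in paragraphs])


def toy_embed(paragraph):
    # Stand-in for a real model: [word count, character count].
    return [float(len(paragraph.split())), float(len(paragraph))]


doc = "first paragraph here\n\nsecond one"
vec = embed_long_text(doc, toy_embed)
print(vec)
```

In practice `embed_paragraph` would wrap the transformer model of your choice, and the pooled vectors (one per document) would be what you hand to the topic model once it accepts precomputed embeddings.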