Comments (1)
Hi! I might be mistaken, but I do not believe there is a commonly used technique for this kind of semantic sentence tokenization, since where the original text should be split depends heavily on the level of abstraction at which you want to separate it semantically. There are small tricks, like using conjunctions and sentence splitters to create candidate splits and then using embeddings to model how different those candidates are.
For instance, you could split the input using a sentence splitter and then further split each sentence wherever it contains a conjunction. The resulting candidate phrases/sentences are then embedded using any embedding technique. Finally, adjacent candidate phrases are merged if their embeddings are similar enough (above a user-specified threshold).
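To make the idea a bit more concrete, here is a rough sketch of those three steps in plain Python. Everything in it is an assumption for illustration: the function names (`split_candidates`, `embed`, `merge_similar`), the short conjunction list, the regex-based sentence splitter, and the default threshold are all made up here, and the bag-of-words `embed` is only a toy stand-in so the snippet runs without extra dependencies. In practice you would pass in a real embedding model through `embed_fn`.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately short conjunction list; extend it for real use.
CONJUNCTIONS = ("and", "but", "or", "so", "because", "while")


def split_candidates(text):
    """Split text into sentences, then split each sentence at conjunctions."""
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Match an optional comma, whitespace, a conjunction, and whitespace.
    pattern = r"\s*,?\s+\b(?:" + "|".join(CONJUNCTIONS) + r")\b\s+"
    candidates = []
    for sentence in sentences:
        for part in re.split(pattern, sentence, flags=re.IGNORECASE):
            part = part.strip()
            if part:
                candidates.append(part)
    return candidates


def embed(text):
    """Toy stand-in embedding (assumption): bag-of-words term counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_similar(candidates, threshold=0.3, embed_fn=embed, sim_fn=cosine):
    """Merge each candidate into the previous segment if similar enough."""
    merged = []
    for candidate in candidates:
        # Compare against the (possibly already merged) previous segment.
        if merged and sim_fn(embed_fn(merged[-1]), embed_fn(candidate)) >= threshold:
            # Note: the conjunction removed during splitting is not restored.
            merged[-1] = merged[-1] + " " + candidate
        else:
            merged.append(candidate)
    return merged


text = "The cat sat on the mat and the cat slept on the mat. Stock prices fell sharply."
segments = merge_similar(split_candidates(text))
print(segments)
```

The threshold is the knob that controls the abstraction level: a lower value merges more aggressively and gives coarser segments, a higher value keeps more of the candidate splits.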
It's not perfect but the general principle (at least in my head) seems like it might actually work.
from bertopic.