I'm working with a corpus that primarily consists of longer documents. I'm seeking rec

Semantic Sentence Tokenization about bertopic HOT 1 OPEN

TheAIMagics commented on June 8, 2024

Semantic Sentence Tokenization

from bertopic.

Comments (1)

MaartenGr commented on June 8, 2024

Hi! I might be mistaken but I do not believe there is a technique commonly used for these kinds of semantic sentence tokenization since the separation of the original highly depends on the abstraction level of the semantic separation. There are small tricks like using conjunctions and sentence splitters to create candidate splits and then using embeddings to model their potential differences.

For instance, you could split the input using a sentence splitter and then further split the sentences based on whether a conjunction exists in these sentences. Then, the resulting candidate phrases/sentences are embedded using any embedding technique. Finally, sequential candidate phrases are merged if they are similar enough (user-specified threshold).

It's not perfect but the general principle (at least in my head) seems like it might actually work.

from bertopic.

Semantic Sentence Tokenization about bertopic HOT 1 OPEN

Comments (1)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent