
semantic-chunkers's People

Contributors

andreped, anush008, ashraq1455, avatsaev, bdqfork, bruvduroiu, cp500, digriffiths, dwmorris11, hananell, hananelll, italianconcerto, jamescalam, jzcruiser, kdcokenny, maxyousif15, mesax1, shaungt1, siddicky, simjak, siraj-aizlewood, smwitkowski, szelesaron, the-anup-das, theanupllm, tolgadevai, zahid-syed


semantic-chunkers's Issues

Statistical Chunker execution time benchmark and discussion

Hello, I am writing because I have been using your Statistical Chunker to perform semantic chunking on a large (but not too large) text dataset. The quality of the produced chunks is rather good, and the algorithm is quite understandable to me.

However, it took many days to chunk my dataset on a consumer-grade CPU. Since my dataset has grown, I looked for alternatives and found the semchunk library. It seems to use a much greedier algorithm, trading chunk quality for speed.

Based on these observations, I set out to benchmark the execution times of both.

I used the first 1000 rows of PORTULAN/parlamento-pt, a dataset of old Portuguese legislative texts. The model used was marquesafonso/albertina-sts, the one I am using in my project, which is rather small.

Here are the results in terms of execution time:

| Library           | Time (seconds) |
|-------------------|----------------|
| semantic-chunkers | 13124.23       |
| semchunk          | 8.81           |

Both chunkers were initialized first and then run as lambda functions over a Polars dataframe:

import polars as pl

semantic_chunker = SemanticChunker(model_name=model_name)
df = df.with_columns(
    pl.col("text")
    .map_elements(lambda x: semantic_chunker.chunk(x), return_dtype=pl.List(pl.Utf8))
    .alias("chunk_text")
)

The SemanticChunker class was set up as follows:

from semantic_chunkers import StatisticalChunker
from semantic_router.encoders import HuggingFaceEncoder

class SemanticChunker:
    def __init__(self, model_name):
        # Similarity window of 2 sentences; at most 300 tokens per split.
        self.chunker = StatisticalChunker(
            encoder=HuggingFaceEncoder(name=model_name), window_size=2, max_split_tokens=300
        )

    def chunk(self, text):
        # The chunker takes a list of docs; rejoin each chunk's splits into one string.
        chunks = self.chunker(docs=[text])
        return [" ".join(chunk.splits) for chunk in chunks[0]]

More important than the comparison with the other library, in my opinion, is that it took approximately 3.65 hours to chunk 1000 legislative documents using the Statistical Chunker.

While the quality is very good in my experience (and in the past I have used multi-threading to speed up inference times), runtimes this long make the method difficult to include in a workflow, to the point where even character-based chunking techniques become the more sensible option.

The purpose of this issue is not to bash the library but to share my feedback and findings with you, and to start a discussion on what could be causing these very long execution times and what the possible fixes might be.
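
One workaround that may be worth testing (I have not benchmarked it) is to pass all documents to the chunker in a single call instead of invoking it once per row, so the encoder has a chance to batch sentence embeddings across documents. The chunker already accepts a list of docs, as the per-document call above shows:

texts = df.get_column("text").to_list()

# One call over the whole corpus; returns one list of chunks per document.
all_chunks = semantic_chunker.chunker(docs=texts)
chunk_texts = [[" ".join(c.splits) for c in doc] for doc in all_chunks]

df = df.with_columns(pl.Series("chunk_text", chunk_texts))

Whether this actually reduces runtime depends on how the chunker batches encoder calls internally, which I have not verified.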

Thanks for sharing this library openly btw!

Large document fix

Large documents chunking memory error

@ashraq1455 encountered an error where, if a document was sufficiently large, the worker would shut down. The suspected cause is that our semantic_chunkers.StatisticalChunker encodes sentence embeddings and stores all of them in memory at the same time for the chunking methodology to run.

The proposed solution here would be to add a “rolling window” of focus that embeds a maximum number of sentences at any one time. This fix should be applied to both ConsecutiveChunker and StatisticalChunker in the semantic_chunkers library.
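
As a rough illustration of the proposed fix (not the library's actual implementation; MAX_WINDOW, encode_fn, and split_fn are hypothetical names), the idea is to embed at most a fixed number of sentences at a time and discard each window's embeddings before moving on:

from typing import Callable, Iterator, List

import numpy as np

MAX_WINDOW = 256  # maximum number of sentences embedded at any one time

def chunk_in_windows(
    sentences: List[str],
    encode_fn: Callable[[List[str]], List[List[float]]],
    split_fn: Callable[[np.ndarray], List[int]],
) -> Iterator[List[str]]:
    """Embed MAX_WINDOW sentences at a time so peak memory stays bounded."""
    for start in range(0, len(sentences), MAX_WINDOW):
        window = sentences[start : start + MAX_WINDOW]
        # Only this window's embeddings are ever held in memory.
        embeddings = np.asarray(encode_fn(window))
        boundaries = [0, *split_fn(embeddings), len(window)]
        for lo, hi in zip(boundaries, boundaries[1:]):
            if hi > lo:
                yield window[lo:hi]
        # `embeddings` goes out of scope here before the next window is encoded.

A production version would need to carry over the tail of each window so that chunks can span window boundaries; this sketch simply cuts at every window edge.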
