Comments (6)
Ah ok, I got it.
It makes sense.
from core.
Correct me if I'm misunderstanding, but I'm not sure that adding it before the parsing happens is the right place.
At that point, you only have the file bytes, and you should parse them to remove the noise.
Hence, you could post-process the text before the rabbit hole splits it, or inside the parser itself.
In the latter case, you could override the BeautifulSoup parser and perform parsing and post-processing all at once.
from core.
Hi @nicola-corbellini, at that point the HTML, for example, is already stripped, so I can't parse it to remove noise based on some tag or attribute.
from core.
Ah my bad, I didn't think about that. Then a custom parser is the best solution, I think.
We are relying on Langchain, which, in turn, uses BeautifulSoup (ref here).
Maybe tuning some params could be enough, but implementing a new one guarantees full control.
from core.
If you agree I could do a proposal on this ;-)
from core.
Mmm ok... in my mind the solution was to just make a plugin with your own parser and replace the default one with the rabbit_hole_instantiates_parsers hook, but if you have a proposal I'm happy to hear it.
from core.
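To make the idea discussed above concrete, here is a minimal sketch of a parser that drops noisy elements while it extracts the text, so parsing and cleanup happen in one pass. It is only an illustration under assumptions: it uses Python's standard-library html.parser as a stand-in for BeautifulSoup, and the names NOISE_TAGS, NoiseStrippingParser and parse_html_bytes are hypothetical, not part of the Cat or Langchain. In a real plugin this logic would have to be wrapped in whatever parser interface the rabbit hole expects and registered through the rabbit_hole_instantiates_parsers hook mentioned in the thread; that wiring is not shown here.

```python
from html.parser import HTMLParser

# Hypothetical set of "noise" tags; adjust it to the pages being ingested.
NOISE_TAGS = frozenset({"script", "style", "nav", "footer", "aside"})


class NoiseStrippingParser(HTMLParser):
    """Collect visible text while skipping everything inside noise tags."""

    def __init__(self, noise_tags=NOISE_TAGS):
        super().__init__(convert_charrefs=True)
        self.noise_tags = set(noise_tags)
        self._skip_depth = 0  # greater than 0 while inside a noise element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.noise_tags:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.noise_tags and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside any noise element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def parse_html_bytes(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw file bytes and return the text with noise elements removed."""
    parser = NoiseStrippingParser()
    parser.feed(raw.decode(encoding, errors="replace"))
    parser.close()
    return parser.text()
```

Because the filtering runs while the tags are still available, it can key on tag names (and could be extended to attributes), which is not possible once the HTML has already been stripped to plain text. Note that an unclosed noise tag would cause the rest of the document to be skipped in this sketch.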