Comments (7)
Re 1:
That sounds awesome! I tried an HNSW index to reduce inference time (at an additional storage cost), but it gave us some performance deterioration, possibly due to bad hyperparameters. Reducing the index size would also be helpful. Do you mind sharing more details?
Re 2:
Section 3.3 and Appendix Section A.3 explain why we did that, but it's essentially because we want to adjust the retrieval frequency for different downstream tasks. As shown in Figure 4, retrieving more helps a lot on PopQA, which is dominated by rare entities, while it doesn't matter much for PubHealth (a claim verification task). The "Hard constraints" baselines in our Table 3 show the performance when we retrieve only when the retrieval tokens are generated.
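The adjustable-frequency variant thresholds the probability of the retrieval reflection token rather than requiring it to be the argmax. A minimal sketch of that decision (token names follow the Self-RAG special tokens; the threshold value is illustrative):

```python
import math

def should_retrieve(logprobs: dict, threshold: float = 0.2) -> bool:
    """Decide whether to trigger retrieval at the current decoding step.

    `logprobs` maps token strings to log-probabilities. The probability of
    [Retrieval] is normalized against [No Retrieval] and compared with a
    task-specific threshold; lowering the threshold retrieves more often
    (e.g. for PopQA), raising it retrieves less (e.g. for PubHealth).
    """
    p_ret = math.exp(logprobs.get("[Retrieval]", -1e9))
    p_no = math.exp(logprobs.get("[No Retrieval]", -1e9))
    if p_ret + p_no == 0.0:
        return False
    return p_ret / (p_ret + p_no) > threshold
```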
from self-rag.
Re 1:
Awesome.
Another thing I did was move away from loading all passages into memory (this plus the original FAISS index blew up my instance with ~80 GB of memory). Further, I saw that inference times were on the order of 20 s with the implementation in this repo; optimizing the FAISS index reduced its size by 90% and brought inference down to ~10 ms. I haven't extensively checked the evaluation performance, but doing some spot checks it all looked reasonable.
The script for generating the compressed index is here. The db implementation is here. I am planning on hosting the index + a self-rag model today that others can access once I have smoothed out a few rough edges around the infra.
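One way to avoid keeping every passage in RAM (the linked db implementation may differ in detail) is an on-disk SQLite store keyed by the FAISS ids, so only the retrieved hits are ever loaded:

```python
import sqlite3

# Demo passage store; in practice this would be a file on disk built once
# from the passage dump, and only the ids returned by index.search() are
# ever fetched into memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passages (id INTEGER PRIMARY KEY, title TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO passages VALUES (?, ?, ?)",
    [(0, "A", "first passage"), (1, "B", "second passage"), (2, "C", "third passage")],
)

def fetch_passages(ids):
    """Look up only the passages whose ids were returned by FAISS search."""
    qmarks = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, title, text FROM passages WHERE id IN ({qmarks})", list(ids)
    ).fetchall()
    return {r[0]: {"title": r[1], "text": r[2]} for r in rows}

hits = fetch_passages([2, 0])
```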
I am very inspired by this work, so I am also building a simplified approach to using models that follow this format. I have built a simple framework that lets one attach a local LLM to a remote vLLM provider plus a RAG db server (a boilerplate implementation is in sciphi-infra), and I have worked out a simple way to do inference.
My goal is to reduce the interface for creating a self-rag LLM to that shown below:
llm = SciPhiLLM(
    SciPhiConfig(
        server_base="http://localhost:8000/v1",
        rag_provider_base="http://localhost:8001/search",
        rag_provider_token="",
    )
)
with the option to just use local vLLM if one doesn't want to run a server on their instance. This is something I have working in my local setup and just need to clean up and push.
Lastly, I am re-running your fine-tune on Mistral to see how it impacts your findings. I have generally found Mistral to be much better than Llama-7B, so I'm excited to see how that shakes out.
Re 2:
Ah, thank you for the thoughtful explanation. I hope it is not a terrible approximation, for now, to simply stop on '' tokens and do the retrieval when the retrieval token precedes the stop. I will look into implementing the more complicated logic at a later date.
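That approximation amounts to decoding in segments and checking the tail of each segment. A hedged sketch (token and tag names follow the Self-RAG format; `generate` and `retrieve` are stand-in callables for the vLLM server and the RAG provider):

```python
def generate_with_retrieval(prompt, generate, retrieve, max_rounds=5):
    """Approximate adaptive retrieval: decode a segment, stop, and if the
    model just emitted the [Retrieval] token, fetch evidence and inject it
    before continuing; otherwise treat the output as final.

    `generate(text) -> str` returns the next decoded segment;
    `retrieve(query) -> str` returns retrieved passage text.
    """
    output = ""
    for _ in range(max_rounds):
        segment = generate(prompt + output)
        output += segment
        if segment.rstrip().endswith("[Retrieval]"):
            # The retrieval token preceded the stop: inject evidence and continue.
            evidence = retrieve(prompt + output)
            output += f"<paragraph>{evidence}</paragraph>"
        else:
            break
    return output
```

This skips the probability-threshold logic from the paper, but matches the stop-and-check behavior described above.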
Great discussion about optimization. I'm hoping there is limited performance regression with the optimized code and when trying it out on Mistral!
FYI, I added an issue about the license of your inference code @emrgnt-cmplxty
Thanks, that's Apache 2.0 - I uploaded the license.
Hi @emrgnt-cmplxty, thank you so much for all of the contributions!! I will try to reduce the inference-time memory requirement for the retrieval part following your suggestions & snippets!
Closing this issue now, but feel free to reopen it if you want! Thank you so much for such an amazing follow-up, @emrgnt-cmplxty!
Hi, I guess the link is out of date; could you please update it? I would like to look into the details of the memory optimization during the wiki-reader and FAISS embedding process. @emrgnt-cmplxty