Comments (8)
Hey @alexkreidler, thanks for the suggestion. Can you show how you would do this using your Mistral example? An example of how to take in the batch of strings and how to return the results would help us implement this interface. Also, feel free to implement it yourself if that is faster.
from deepeval.
Hi @penguine-ip,
just chiming in that I would be interested in this feature as well. I'm working with the official Mistral repository: https://github.com/mistralai/mistral-src and using a pruned version of the original Mistral model. In their main module, they implement a generate() method which can take a list of prompts as input. Therefore, an interface along these lines would be very useful:
```python
class CustomMistral(DeepEvalBaseLLM):
    def __init__(self, model):
        self.model = model
        ...

    def generate(self, prompts: list) -> list:
        ...
        results = self.model.generate(prompts)
        ...  # format results correctly
        return results
```
Being able to generate responses with batched requests in each forward pass significantly reduces compute time (depending on the size of the batch). In my own reference implementation of an MMLU eval, a batch size of 8 reduced compute time by a factor of 7 compared to a batch size of 1.
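The saving comes from the number of forward passes: for N prompts and batch size B, the model runs ceil(N/B) passes instead of N. A quick illustration (the numbers are only an example, not my measured benchmark):

```python
import math

def num_forward_passes(num_prompts: int, batch_size: int) -> int:
    # Each forward pass handles up to batch_size prompts.
    return math.ceil(num_prompts / batch_size)

# 100 prompts: 100 passes at batch size 1 vs. 13 at batch size 8
passes_unbatched = num_forward_passes(100, 1)
passes_batched = num_forward_passes(100, 8)
```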
@kritinv I think there could be a generate_batch() just for the benchmarks?
@Falk358 Are there limits to the batch size for your Mistral example?
@penguine-ip As far as I know, Mistral's generate method doesn't impose any batch size limits on the user. The underlying PyTorch code will throw a CUDA out-of-memory error if GPU memory is full. In my concrete case, I'm running an RTX 3090, which means the maximum batch size I can use is 8 (8 LLM requests per forward pass).
I think a separate generate_batch() sounds like a very good solution. It would definitely be compatible with my use case, provided I can somehow control the batch size passed to it as a parameter.
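A minimal sketch of how such a method could chunk prompts so the batch size never exceeds a GPU memory limit (the generate_batch name, the batch_size parameter, and the generate_fn callback are my assumptions, not deepeval's actual API):

```python
def generate_batch(prompts, generate_fn, batch_size=8):
    """Split prompts into chunks of at most batch_size and call
    generate_fn (e.g. a batched model.generate) on each chunk.
    One forward pass per chunk keeps GPU memory bounded."""
    results = []
    for i in range(0, len(prompts), batch_size):
        chunk = prompts[i:i + batch_size]
        results.extend(generate_fn(chunk))
    return results

# Usage with a stand-in for the real model call:
outputs = generate_batch(
    [f"prompt {i}" for i in range(10)],
    generate_fn=lambda chunk: [p.upper() for p in chunk],
    batch_size=4,
)
```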
@Falk358 and @alexkreidler, this took a lot longer than I thought, but it is out: https://docs.confident-ai.com/docs/benchmarks-introduction#create-a-custom-llm
Can you please check whether it works (latest release v0.21.43) and whether the example in the docs is correct? Thanks!
Hi @penguine-ip,
thanks for the swift implementation!
Unfortunately, there seems to be a problem with the evaluation for my generate() function when using release v0.21.43. I was using v0.21.36 previously. This is my generate method in v0.21.36:
```python
def generate(self, prompt: str) -> str:
    model = self.load_model()
    final_prompt = f"[INST]{prompt}[/INST]"
    result, _ = generate(
        prompts=[final_prompt],
        model=model,
        tokenizer=self.tokenizer,
        max_tokens=self.max_tokens,
        temperature=self.temperature,
    )
    answer = result[0].rsplit(sep="[/INST]", maxsplit=1)[1]
    answer = answer.strip()
    if len(answer) == 0:
        answer = " "  # return non-empty string to avoid crashes during eval
    return answer
```
This code calls the main.generate() function from https://github.com/mistralai/mistral-src, which returns a list of strings (called result in the code above). Each entry in this list has the following format: "[INST]prompt_passed_to_model[/INST]answer_of_model". Therefore, I split each entry after the closing "[/INST]" token and take the rest to obtain the model answer. The answer is also strip()ped to avoid leading whitespace being evaluated further down the pipeline. My model generates further explanation for the answer, which I keep. In v0.21.36 this did not break evaluation on the "high_school_european_history" subset of MMLU (I get an accuracy of 0.6 for my experiment); in v0.21.43 this breaks and I get an accuracy of 0.
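For illustration, the extraction step applied to a made-up result string (the prompt and answer text are placeholders, not real model output):

```python
# Hypothetical raw entry in the shape mistral-src's generate() returns
raw = "[INST]What is 2+2?[/INST] The answer is 4."

# Split on the last closing [/INST] tag and keep everything after it
answer = raw.rsplit(sep="[/INST]", maxsplit=1)[1].strip()
```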
All of this broke without any other changes to the class; I had not started writing batch_generate() yet.
Is my code flawed in some way or is this a bug?
Thanks so much for your help!
Max
Hey @Falk358, it's hard to tell immediately just from looking. Do you have a forked version? I can show you where to add a single print statement to check whether this is the expected behavior, let me know!
Hi @penguine-ip,
https://github.com/Falk358/mistral-src is the forked repo. You can find my implementation in the file mistral_wrapper_lm_eval.py.
It would be helpful to have a more exact specification of what answer format (beyond it being a string) the generate method should return; I could not find anything in deepeval's docs so far. For example, the maximum length of the string, or whether it should have certain formatting.
I believe that the way deepeval evaluates MMLU changed in the new release. In the old version it probably was able to extract "A", "B", "C", or "D" from the return value of generate(), while this doesn't seem to be the case anymore.
I checked the contents of answer in v0.21.43 and it looked somewhat like this: "A. Text from prompt after option A\n\n Some more content", where the second sentence was cut short due to my max_tokens limit. This is expected behaviour (it had the same layout in v0.21.36, where it still evaluated correctly). Hope this helps!
Kind Regards,
Max