Comments (8)
Hey @alexkreidler, thanks for the suggestion. Can you show how you would do this using your Mistral example? An example of how to take in the batch of strings and how to return the results would help us implement this interface. Also, feel free to implement it yourself if that is faster.
from deepeval.
Hi @penguine-ip,
just chiming in that I would be interested in this feature as well. I'm working with the official Mistral repository: https://github.com/mistralai/mistral-src and using a pruned version of the original Mistral model. In their main module, they implement a generate() method which can take a list of prompts as input. Therefore, an interface along these lines would be very useful:
```python
class CustomMistral(DeepEvalBaseLLM):
    def __init__(self, model):
        self.model = model
        ...

    def generate(self, prompts: list) -> list:
        ...
        results = self.model.generate(prompts)
        ...  # format results correctly
        return results
```
Being able to generate responses with batched requests in each forward pass significantly reduces compute time (depending on the size of the batch). In my own reference implementation of an MMLU eval, a batch size of 8 reduced compute time by a factor of 7 compared to a batch size of 1.
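The saving comes from the number of forward passes: for N prompts and batch size B, the model runs ceil(N/B) passes instead of N. A quick illustration (the numbers are only an example, not my measured benchmark):

```python
import math

def num_forward_passes(num_prompts: int, batch_size: int) -> int:
    # Each forward pass handles up to batch_size prompts.
    return math.ceil(num_prompts / batch_size)

# 100 prompts: 100 passes at batch size 1 vs. 13 at batch size 8
passes_unbatched = num_forward_passes(100, 1)
passes_batched = num_forward_passes(100, 8)
```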
@kritinv I think there could be a generate_batch() just for the benchmarks?
@Falk358 Are there limits to the batch size for your Mistral example?
@penguine-ip As far as I know, Mistral's generate method doesn't impose any batch size limits on the user. The underlying PyTorch code will throw a CUDA out-of-memory error if GPU memory is full. In my concrete case, I'm running an RTX 3090, which means the maximum batch size I can use is 8 (8 LLM requests per forward pass).
I think a separate generate_batch() sounds like a very good solution. It would definitely be compatible with my use case, provided I can somehow control the batch size passed to it as a parameter.
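A minimal sketch of how such a method could chunk prompts so the batch size never exceeds a GPU memory limit (the generate_batch name, the batch_size parameter, and the generate_fn callback are my assumptions, not deepeval's actual API):

```python
def generate_batch(prompts, generate_fn, batch_size=8):
    """Split prompts into chunks of at most batch_size and call
    generate_fn (e.g. a batched model.generate) on each chunk.
    One forward pass per chunk keeps GPU memory bounded."""
    results = []
    for i in range(0, len(prompts), batch_size):
        chunk = prompts[i:i + batch_size]
        results.extend(generate_fn(chunk))
    return results

# Usage with a stand-in for the real model call:
outputs = generate_batch(
    [f"prompt {i}" for i in range(10)],
    generate_fn=lambda chunk: [p.upper() for p in chunk],
    batch_size=4,
)
```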
@Falk358 and @alexkreidler, this took a lot longer than I thought, but it is out: https://docs.confident-ai.com/docs/benchmarks-introduction#create-a-custom-llm
Can you please check whether it works (latest release v0.21.43) and whether the example in the docs is correct? Thanks!
Hi @penguine-ip,
thanks for the swift implementation!
Unfortunately, there seems to be a problem with the evaluation for my generate() function when using release v0.21.43. I was using v0.21.36 previously. This is my generate method in v0.21.36:
```python
def generate(self, prompt: str) -> str:
    model = self.load_model()
    final_prompt = f"[INST]{prompt}[/INST]"
    result, _ = generate(
        prompts=[final_prompt],
        model=model,
        tokenizer=self.tokenizer,
        max_tokens=self.max_tokens,
        temperature=self.temperature,
    )
    answer = result[0].rsplit(sep="[/INST]", maxsplit=1)[1]
    answer = answer.strip()
    if len(answer) == 0:
        answer = " "  # return non-empty string to avoid crashes during eval
    return answer
```
This code calls the main.generate() function from https://github.com/mistralai/mistral-src, which returns a list of strings (called result in the code above). Each entry in this list has the following format: "[INST]prompt_passed_to_model[/INST]answer_of_model". Therefore, I split each entry after the closing "[/INST]" token and take the rest to obtain the model answer. The answer is also strip()ped to avoid leading whitespace being evaluated further down the pipeline. My model generates further explanation for the answer, which I keep. In v0.21.36 this did not break evaluation on the "high_school_european_history" subset of MMLU (I get an accuracy of 0.6 for my experiment); in v0.21.43 this breaks and I get an accuracy of 0.
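For illustration, the extraction step applied to a made-up result string (the prompt and answer text are placeholders, not real model output):

```python
# Hypothetical raw entry in the shape mistral-src's generate() returns
raw = "[INST]What is 2+2?[/INST] The answer is 4."

# Split on the last closing [/INST] tag and keep everything after it
answer = raw.rsplit(sep="[/INST]", maxsplit=1)[1].strip()
```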
All of this broke without any other changes to the class; I had not started writing batch_generate() yet.
Is my code flawed in some way or is this a bug?
Thanks so much for your help!
Max
Hey @Falk358, it's hard to tell immediately just from looking. Do you have a forked version? I can show you where to add a single print statement to check whether this is the expected behavior, let me know!
Hi @penguine-ip,
https://github.com/Falk358/mistral-src is the forked repo. You can find my implementation in the file mistral_wrapper_lm_eval.py.
It would be helpful to have a more exact specification of what answer format (beyond it being a string) the generate method should return; I could not find anything in deepeval's docs so far. For example, the maximum length of the string, or whether it should have certain formatting.
I believe that the way deepeval evaluates MMLU changed in the new release. In the old version it probably was able to extract "A", "B", "C", or "D" from the return value of generate(), while this doesn't seem to be the case anymore.
I checked the contents of answer in v0.21.43 and it looked somewhat like this: "A. Text from prompt after option A\n\n Some more content", where the second sentence was cut short due to my max_tokens limit. This is expected behaviour (it had the same layout in v0.21.36, where it still evaluated correctly). Hope this helps!
Kind Regards,
Max