
Comments (8)

penguine-ip commented on July 22, 2024

Hey @alexkreidler, thanks for the suggestion. Can you show how you would do this using your Mistral example? An example of how to take in the batch of strings and how you return it would be helpful for us to implement this interface. Also feel free to implement it yourself if it is faster that way.


Falk358 commented on July 22, 2024

Hi @penguine-ip,

just chiming in that I would be interested in this feature as well. I'm working with the official Mistral repository (https://github.com/mistralai/mistral-src) and using a pruned version of the original Mistral model. In their main.py they implement a generate() function which can take a list of prompts as input. Therefore, an interface along these lines would be very useful:

class CustomMistral(DeepEvalBaseLLM):
    def __init__(self, model):
        self.model = model

    ...  # other required DeepEvalBaseLLM methods

    def generate(self, prompts: list) -> list:
        ...
        results = self.model.generate(prompts)
        ...  # format results correctly
        return results

Being able to generate responses with batched requests in each forward pass significantly reduces compute time (depending on the batch size). In my own reference implementation of the MMLU eval, a batch size of 8 reduced compute time by a factor of 7 compared to a batch size of 1.
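
For illustration, a small timing harness in the spirit of that comparison; it assumes the CustomMistral.generate() sketched above accepts a list of prompts, and the helper itself is not part of deepeval:

import time

def compare_batched_vs_sequential(model, prompts, batch_size=8):
    # One forward pass per batch of `batch_size` prompts.
    start = time.perf_counter()
    batched = []
    for i in range(0, len(prompts), batch_size):
        batched.extend(model.generate(prompts[i : i + batch_size]))
    batched_seconds = time.perf_counter() - start

    # One forward pass per prompt (batch size 1) for comparison.
    start = time.perf_counter()
    sequential = [model.generate([prompt])[0] for prompt in prompts]
    sequential_seconds = time.perf_counter() - start

    print(f"batched: {batched_seconds:.1f}s, sequential: {sequential_seconds:.1f}s")
    return batched, sequential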


penguine-ip commented on July 22, 2024

@kritinv I think there could be a generate_batch() just for the benchmarks?

@Falk358 Are there limits to the batch size for your mistral example?
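
For illustration, one way the benchmark code could prefer a batched method when a custom model provides one, falling back to per-prompt generation otherwise; the name generate_batch and the dispatch below are a sketch of the idea, not deepeval's actual internals:

def run_benchmark_prompts(model, prompts, batch_size=8):
    # Prefer a batched method if the custom LLM implements one; otherwise
    # fall back to calling generate() once per prompt.
    if hasattr(model, "generate_batch"):
        outputs = []
        for i in range(0, len(prompts), batch_size):
            outputs.extend(model.generate_batch(prompts[i : i + batch_size]))
        return outputs
    return [model.generate(prompt) for prompt in prompts]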


Falk358 commented on July 22, 2024

@penguine-ip As far as I know, Mistral's generate method doesn't impose any batch size limits on the user; the underlying PyTorch code will throw a CUDA out-of-memory error if GPU memory is full. In my concrete case, I'm running an RTX 3090, which means the maximum batch size I can use is 8 (8 LLM requests per forward pass).

I think a separate generate_batch() sounds like a very good solution. It would definitely be compatible with my use case, provided I can somehow control the batch size passed to it via a parameter.
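
For illustration, a defensive wrapper that backs off the batch size when the GPU runs out of memory, assuming a generate() that accepts a list of prompts as sketched above; torch.cuda.OutOfMemoryError exists in recent PyTorch releases, but the wrapper itself is only a sketch, not something deepeval provides:

import torch

def generate_with_backoff(model, prompts, batch_size=8):
    # Halve the batch size on CUDA OOM until the batch fits in GPU memory.
    while batch_size >= 1:
        try:
            outputs = []
            for i in range(0, len(prompts), batch_size):
                outputs.extend(model.generate(prompts[i : i + batch_size]))
            return outputs
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            batch_size //= 2
    raise RuntimeError("even a batch size of 1 does not fit in GPU memory")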


penguine-ip commented on July 22, 2024

@Falk358 and @alexkreidler, this took a lot longer than I thought, but it is out: https://docs.confident-ai.com/docs/benchmarks-introduction#create-a-custom-llm

Can you please check if it is working (latest release v0.21.43) and whether the example in the docs is correct? Thanks!
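
For readers landing here later, the linked docs boil down to subclassing DeepEvalBaseLLM and implementing a batched method alongside generate(); the sketch below follows that shape, but the method names and import path should be checked against the current docs rather than taken from this thread:

from deepeval.models import DeepEvalBaseLLM

class CustomMistral(DeepEvalBaseLLM):
    # Sketch following the linked docs; verify method names against the
    # current deepeval release before relying on it.
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        return self.batch_generate([prompt])[0]

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def batch_generate(self, prompts: list) -> list:
        # Delegate to whatever batched generation the wrapped model exposes,
        # e.g. mistral-src's generate() as discussed above.
        return self.model.generate(prompts)

    def get_model_name(self):
        return "Custom Mistral"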


Falk358 commented on July 22, 2024

Hi @penguine-ip,

thanks for the swift implementation!
Unfortunately, there seems to be a problem with the evaluation of my generate() function when using release v0.21.43. I was using v0.21.36 previously. This is my generate() method in v0.21.36:

def generate(self, prompt: str) -> str:
    model = self.load_model()
    final_prompt = f"[INST]{prompt}[/INST]"

    # generate() is the function from mistral-src's main.py; it returns one
    # string per prompt (the prompt followed by the model's answer).
    result, _ = generate(
        prompts=[final_prompt],
        model=model,
        tokenizer=self.tokenizer,
        max_tokens=self.max_tokens,
        temperature=self.temperature,
    )
    answer = result[0].rsplit(sep="[/INST]", maxsplit=1)[1]
    answer = answer.strip()
    if len(answer) == 0:
        answer = " "  # return a non-empty string to avoid crashes during eval
    return answer

This code calls the main.generate() function from https://github.com/mistralai/mistral-src, which returns a list of strings (called result in the code above). Each entry in this list has the following format: "[INST]prompt_passed_to_model[/INST]answer_of_model". Therefore, I split each entry after the closing "[/INST]" token and take the rest to obtain the model answer. The answer is also strip()ped to avoid leading whitespace being evaluated further down the pipeline. My model generates a further explanation alongside the answer, which I keep. In v0.21.36 this did not break evaluation on the "high_school_european_history" subset of MMLU (I get an accuracy of 0.6 for my experiment); in v0.21.43 this breaks and I get an accuracy of 0.
All of this broke without any other changes to the class; I had not even started writing batch_generate().
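
A batched counterpart of the generate() method above, reusing the same [INST] wrapping and [/INST] splitting, could look roughly like the sketch below (an illustration based on this thread, not code from the fork):

def batch_generate(self, prompts: list) -> list:
    model = self.load_model()
    final_prompts = [f"[INST]{prompt}[/INST]" for prompt in prompts]

    # One forward pass over the whole batch via mistral-src's generate().
    results, _ = generate(
        prompts=final_prompts,
        model=model,
        tokenizer=self.tokenizer,
        max_tokens=self.max_tokens,
        temperature=self.temperature,
    )
    answers = []
    for result in results:
        answer = result.rsplit(sep="[/INST]", maxsplit=1)[1].strip()
        answers.append(answer if answer else " ")  # non-empty string to avoid crashes
    return answers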

Is my code flawed in some way or is this a bug?

Thanks so much for your help!

Max


penguine-ip commented on July 22, 2024

Hey @Falk358, hard to tell immediately just from looking. Do you have a forked version? I can show you where to add a single print statement to check whether this is the expected behavior, let me know!


Falk358 commented on July 22, 2024

Hi @penguine-ip,

https://github.com/Falk358/mistral-src is the forked repo. You can find my implementation in the file mistral_wrapper_lm_eval.py.

It would be helpful to have a more exact specification of what answer format (beyond it being a string) the generate method should return; I could not find anything in deepeval's docs so far. For example, the maximum length of the string, or whether it needs a certain formatting.
I believe that the way deepeval evaluates MMLU changed in the new release. In the old version it was probably able to extract "A", "B", "C" or "D" from the return value of generate(), while this doesn't seem to be the case anymore.

I checked the contents of answer in v0.21.43 and it looked somewhat like this: "A. Text from prompt after option A\n\n Some more content", where the second sentence was cut short due to my max_tokens limit. This is expected behaviour (it had the same layout in v0.21.36, where it still evaluated correctly). Hope this helps!
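
If the newer release really does expect a bare option letter, one illustrative workaround (an assumption, not a documented requirement) would be to reduce the answer to its leading letter before returning it from generate():

import re

def extract_option_letter(answer: str) -> str:
    # Turn answers like "A. Text from prompt after option A\n\n..." into "A".
    match = re.match(r"\s*([ABCD])\b", answer)
    return match.group(1) if match else answer.strip() or " "

generate() could then return extract_option_letter(answer) instead of the full answer plus explanation.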

Kind Regards,
Max

