
Comments (3)

iofu728 commented on August 23, 2024

Hi @xvyaward, thanks for your support of LLMLingua-2 and for sharing such detailed results. These results look quite good and are generally similar to ours. Could you confirm which specific metric you are most concerned about, the one that did not meet your expectations?


pzs19 commented on August 23, 2024

Hi @xvyaward, thanks for your interest and the very detailed description.

  1. multifieldqa_zh should be excluded here. As for Chinese, we evaluated the performance of LLMLingua-2 on Chinese in a separate experiment; please refer to Table 9 of our paper for the results.

  2. Could you please share more information on how you run inference with the Mistral model? The sampling parameters and evaluation strategy, such as the temperature and whether to truncate the answer when "\n" appears, can affect overall performance.

As for our experiments, we use the official GitHub repo of Mistral for inference and download the model from mistralcdn.
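To make the two choices in point 2 concrete, here is a minimal sketch of how they would be expressed as vLLM sampling parameters. This is an illustration only: the reproduction described below uses vLLM, while the authors' own setup relies on the official Mistral repo, where the equivalent settings apply at generation time.

    # Illustration only: greedy decoding plus newline truncation.
    from vllm import SamplingParams

    # temperature=0.0 gives greedy decoding; stop=["\n"] cuts the answer at the
    # first newline, which is exactly the truncation choice mentioned above.
    params = SamplingParams(temperature=0.0, max_tokens=128, stop=["\n"])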

Hope these explanations help you.


mzf666 commented on August 23, 2024

Describe the issue

First of all, thank you for your great contributions.

I have a question similar to issue 146: I cannot reproduce the Table 4 results of the LLMLingua-2 paper.

  • Compression model: microsoft/llmlingua-2-xlm-roberta-large-meetingbank (downloaded from HF)

  • LLM: mistralai/Mistral-7B-v0.1 (also downloaded from HF; not an instruction-tuned model)

  • Hardware platform: 1x NVIDIA A100-80GB

Here are some results from the paper and my reproduced scores:

|  | MeetingBank QA | MeetingBank summary | LongBench avg. (2000 tokens) | narrativeqa | multifieldqa_en | multifieldqa_zh | qasper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLMLingua-2 (paper) | 76.22 | 30.18 | 26.8 | - | - | - | - |
| Original prompt (paper) | 66.95 | 26.26 | 24.5 | - | - | - | - |
| LLMLingua-2 (reproduced) | 73.59 | 29.95 | 25.65 | 10.07 | 36.61 | 26.47 | 29.46 |
| Original prompt (reproduced) | 66.05 | 26.89 | 26.47 | 10.05 | 38.70 | 31.46 | 25.67 |

(The last four columns are the LongBench single-doc QA tasks at the 2000-token budget.)
I'm not sure whether I should include multifieldqa_zh when calculating the average of the LongBench single-doc QA scores, but even excluding it gives an average inconsistent with the paper.
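For reference, the reproduced "avg." column in the table above appears to be the mean of all four single-doc QA tasks, multifieldqa_zh included; a quick check:

    # Recompute the reproduced LongBench single-doc QA averages from the table.
    llmlingua2 = [10.07, 36.61, 26.47, 29.46]  # narrativeqa, mf_en, mf_zh, qasper
    original   = [10.05, 38.70, 31.46, 25.67]

    print(round(sum(llmlingua2) / 4, 2))  # 25.65 -> matches the avg. column
    print(round(sum(original) / 4, 2))    # 26.47 -> matches the avg. column

    # Excluding multifieldqa_zh instead:
    print(round((10.07 + 36.61 + 29.46) / 3, 2))  # 25.38
    print(round((10.05 + 38.70 + 25.67) / 3, 2))  # 24.81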

Here is the process I followed for the MeetingBank QA evaluation.

  1. I made meetingbank_test_3qa_pairs_summary_formated.json by modifying format_data.py (a hypothetical record shape is sketched right after this list).
  2. Made compressed_prompt using
python compress.py --load_origin_from ../../../results/meetingbank/origin/meetingbank_test_3qa_pairs_summary_formated.json \
    --model_name microsoft/llmlingua-2-xlm-roberta-large-meetingbank \
    --compression_rate 0.33 \
    --force_tokens "\n,?,!,." \
    --save_path ../../../results/meetingbank/llmlingua2/compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json
  3. Evaluated with
python eval_meetingbank_qa_local_llm.py --load_prompt_from ../../../results/meetingbank/llmlingua2/compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json \
    --load_key compressed_prompt \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --save_path ../../../results/meetingbank/llmlingua2/mistral_7b/answer_ratio33_meetingbank_test_3qa_pairs_summary_formated.json
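As referenced in step 1, here is a purely hypothetical sketch of what one record of meetingbank_test_3qa_pairs_summary_formated.json might look like. The field names are guesses for illustration, not the confirmed schema produced by format_data.py:

    # Hypothetical record shape (field names are guesses, not the repo's schema).
    record = {
        "prompt": "<full MeetingBank transcript>",   # text to be compressed
        "summary": "<reference summary>",            # target for the summary task
        "qa_pairs": [                                # three QA pairs per sample
            {"question": "<q1>", "answer": "<a1>"},
            {"question": "<q2>", "answer": "<a2>"},
            {"question": "<q3>", "answer": "<a3>"},
        ],
    }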

I modified eval_meetingbank_qa.py into eval_meetingbank_qa_local_llm.py so that it uses vLLM with a local HF Mistral-7B model. If there is no problem with my reproduction process, would it be possible to share the code you used for evaluation with Mistral-7B? Thank you for reading.
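A stripped-down sketch of what such a vLLM-based evaluation loop might look like follows. The file names, prompt template, and record fields are assumptions for illustration, not the contents of the actual eval_meetingbank_qa_local_llm.py:

    # Hypothetical vLLM-based eval loop; not the actual evaluation script.
    import json
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-7B-v0.1")
    params = SamplingParams(temperature=0.0, max_tokens=128, stop=["\n"])

    # Assumes records carry a "compressed_prompt" field plus QA pairs as above.
    with open("compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json") as f:
        records = json.load(f)

    results = []
    for rec in records:
        for qa in rec["qa_pairs"]:
            prompt = f'{rec["compressed_prompt"]}\n\nQuestion: {qa["question"]}\nAnswer:'
            pred = llm.generate([prompt], params)[0].outputs[0].text.strip()
            results.append({"question": qa["question"], "prediction": pred, "gold": qa["answer"]})

    with open("answers.json", "w") as f:
        json.dump(results, f, indent=2)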

Thanks for sharing your issue. May I ask how you modified format_data.py to obtain meetingbank_test_3qa_pairs_summary_formated.json? I am not sure how to carry out this step.

