mlgroupjlu / llm-eval-survey
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
Home Page: https://arxiv.org/abs/2307.03109
Hello,
Thank you for your excellent work on the survey paper!
I am one of the authors of a paper you have listed, but our paper has had a major title change.
I am unsure whether you are planning regular updates to the list, but if so, could you please change our paper's title from "Can Large Language Models Infer and Disagree Like Humans?" to "Can Large Language Models Capture Dissenting Human Voices?".
Thanks once again for this great work!
Hi, this is a really comprehensive work. Could you add our recent work to your survey?
CMB: A Comprehensive Medical Benchmark in Chinese
Thanks
Thank you for your nice survey.
Please consider adding our recent work, Large Language Models are not Fair Evaluators (https://arxiv.org/abs/2305.17926), to the list.
Our research identifies the biases that arise when using an LLM as an evaluator, and we propose two strategies to alleviate these biases.
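For intuition, here is a minimal sketch of one such debiasing idea, position-swap calibration: score the two answers in both presentation orders and average, so a judge biased toward one position is neutralized. This is only an illustration, not the exact method from the paper, and `judge` is a hypothetical stand-in for an LLM scoring call:

```python
# Minimal sketch of position-swap calibration for pairwise LLM evaluation.
# `judge` is a hypothetical function: given a question and two candidate
# answers in a fixed order, it returns scores (first_score, second_score).

def judge(question: str, first: str, second: str) -> tuple[float, float]:
    """Hypothetical LLM-judge call; replace with a real scoring API."""
    raise NotImplementedError

def calibrated_compare(question: str, answer_a: str, answer_b: str) -> tuple[float, float]:
    # Evaluate once in each presentation order.
    a_first, b_second = judge(question, answer_a, answer_b)
    b_first, a_second = judge(question, answer_b, answer_a)
    # Average each answer's score across positions to cancel position bias.
    return (a_first + a_second) / 2, (b_first + b_second) / 2
```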
Thanks.😊
Paper: Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
link: https://arxiv.org/pdf/2306.14565.pdf
Name: LRV-Instruction
Focus: Multimodal
Notes: A benchmark to evaluate hallucination and instruction-following ability
bib:
@Article{liu2023aligning,
  title={Aligning Large Multi-Modal Model with Robust Instruction Tuning},
  author={Liu, Fuxiao and Lin, Kevin and Li, Linjie and Wang, Jianfeng and Yacoob, Yaser and Wang, Lijuan},
  journal={arXiv preprint arXiv:2306.14565},
  year={2023}
}
Hi,
I have read your insightful paper and found it to be a valuable contribution to the field.
I would like to kindly suggest adding our recent work to your survey.
📄 Paper: Ask Again, Then Fail: Large Language Models' Vacillations in Judgement
This paper shows that the judgement consistency of LLMs decreases dramatically when they are confronted with disruptions such as questioning, negation, or misleading follow-ups, even when their previous judgements were correct. It also explores several prompting methods to mitigate this issue and demonstrates their effectiveness.
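As a rough illustration of the probing setup (not the paper's exact procedure; `ask_llm` is a hypothetical chat-completion wrapper):

```python
# Sketch of a judgement-consistency probe: ask a question, push back on
# the answer, and check whether the model flips. `ask_llm` is a
# hypothetical wrapper around any chat-completion API.

def ask_llm(messages: list[dict]) -> str:
    """Hypothetical chat call; replace with a real client."""
    raise NotImplementedError

def flips_under_challenge(question: str,
                          challenge: str = "Are you sure? I think that is wrong.") -> bool:
    history = [{"role": "user", "content": question}]
    first = ask_llm(history)
    history += [{"role": "assistant", "content": first},
                {"role": "user", "content": challenge}]
    second = ask_llm(history)
    # Crude check; a real probe would extract and compare final answers.
    return first.strip() != second.strip()
```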
Thank you for your consideration! :)
Hi there,
Thanks for the effort you have put into this survey on LLM evaluation.
I'd like to suggest adding our work, SpyGame, a framework for evaluating language model intelligence. We propose using word-guessing games to assess the language and theory-of-mind intelligence of LLMs.
Paper: Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models
GitHub: https://github.com/Skytliang/SpyGame
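To give a flavor of the setup, a toy word-guessing loop might look like the sketch below; this is only an illustration, not SpyGame's actual protocol, and `ask_llm` is a hypothetical single-turn LLM call:

```python
# Toy word-guessing loop in the spirit of game-based evaluation; an
# illustration only, not SpyGame's actual protocol.

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real client."""
    raise NotImplementedError

def play(secret_word: str, max_turns: int = 10) -> bool:
    transcript = ""
    for _ in range(max_turns):
        question = ask_llm(
            "You are guessing a hidden word with yes/no questions.\n"
            + transcript
            + "Ask your next question, or guess with 'ANSWER: <word>'.")
        if question.strip().upper().startswith("ANSWER:"):
            return secret_word.lower() in question.lower()
        # The keeper answers truthfully based on the secret word.
        reply = ask_llm(
            f"The hidden word is '{secret_word}'. Answer this yes/no "
            f"question truthfully with Yes or No: {question}")
        transcript += f"Q: {question}\nA: {reply}\n"
    return False
```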
I propose including this new benchmark for LLMs: https://arb.duckai.org/
Is all LLM evaluation done by changing the model's prompts? Are there other methods?
Hi team,
Thanks for your awesome survey! I was wondering if you might consider including the OpenCompass evaluation toolkit. At present, OpenCompass hosts over 50 benchmarks and enables systematic evaluation of LLMs, and we are continually upgrading it to keep pace with the latest evaluation trends. I believe its addition could provide an even richer context for your survey.
Best
Thanks for your interesting and comprehensive survey.
If possible, please consider adding our work on evaluating LLMs in chemistry, "What indeed can GPT models do in chemistry? A comprehensive benchmark on eight tasks" (https://arxiv.org/abs/2305.18365), to the list.
Our work establishes a comprehensive benchmark of 8 practical chemistry tasks and evaluates LLMs (GPT-4, GPT-3.5, and Davinci-003) on each task in zero-shot and few-shot in-context learning settings. We aim to address the lack of comprehensive assessment of LLMs in the field of chemistry.
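For clarity, the zero-shot and few-shot settings differ only in how the prompt is built; here is a generic sketch, where the instruction and exemplars are illustrative placeholders rather than items from the benchmark itself:

```python
# Generic sketch of zero-shot vs. few-shot in-context prompts.
# The instruction and exemplars below are illustrative placeholders.

def build_prompt(instruction: str, query: str,
                 exemplars: list[tuple[str, str]] | None = None) -> str:
    parts = [instruction]
    # Few-shot: prepend worked (input, output) examples; zero-shot: none.
    for x, y in (exemplars or []):
        parts.append(f"Input: {x}\nOutput: {y}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

zero_shot = build_prompt("Convert the compound name to its SMILES string.",
                         "acetic acid")
few_shot = build_prompt("Convert the compound name to its SMILES string.",
                        "acetic acid",
                        exemplars=[("methane", "C"), ("ethanol", "CCO")])
```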
Thanks! 😊
Check https://lmexam.com/
It is in the references of your paper but not in the GitHub repository.