mlgroupjlu / llm-eval-survey
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
Home Page: https://arxiv.org/abs/2307.03109
Hello,
Thank you for your excellent work on the survey paper!
I am one of the authors of a paper you have listed, but our paper has had a major title change.
I am unsure whether you are planning regular updates to the list, but if so, could you please change our paper's title from "Can Large Language Models Infer and Disagree Like Humans?" to "Can Large Language Models Capture Dissenting Human Voices?".
Thanks once again for this great work!
Hi, this is a really comprehensive work. Could you add our recent work to your survey?
CMB: A Comprehensive Medical Benchmark in Chinese
Thanks
Thank you for your nice survey.
Please consider adding our recent work, Large Language Models are not Fair Evaluators (https://arxiv.org/abs/2305.17926), to the list.
Our research identifies the biases that arise when using an LLM as an evaluator, and we propose two strategies to alleviate these biases.
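For intuition, here is a minimal sketch of one such debiasing idea, position-swap calibration: score the two answers in both presentation orders and average, so a judge biased toward one position is neutralized. This is only an illustration, not the exact method from the paper, and `judge` is a hypothetical stand-in for an LLM scoring call:

```python
# Minimal sketch of position-swap calibration for pairwise LLM evaluation.
# `judge` is a hypothetical function: given a question and two candidate
# answers in a fixed order, it returns scores (first_score, second_score).

def judge(question: str, first: str, second: str) -> tuple[float, float]:
    """Hypothetical LLM-judge call; replace with a real scoring API."""
    raise NotImplementedError

def calibrated_compare(question: str, answer_a: str, answer_b: str) -> tuple[float, float]:
    # Evaluate once in each presentation order.
    a_first, b_second = judge(question, answer_a, answer_b)
    b_first, a_second = judge(question, answer_b, answer_a)
    # Average each answer's score across positions to cancel position bias.
    return (a_first + a_second) / 2, (b_first + b_second) / 2
```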
Thanks.😊
Paper: Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
link: https://arxiv.org/pdf/2306.14565.pdf
Name: LRV-Instruction
Focus: Multimodal
Notes: A benchmark to evaluate hallucination and instruction-following ability
bib:
@Article{liu2023aligning,
  title={Aligning Large Multi-Modal Model with Robust Instruction Tuning},
  author={Liu, Fuxiao and Lin, Kevin and Li, Linjie and Wang, Jianfeng and Yacoob, Yaser and Wang, Lijuan},
  journal={arXiv preprint arXiv:2306.14565},
  year={2023}
}
Hi,
I have read your insightful paper and found it to be a valuable contribution to the field.
I would like to kindly suggest adding our recent work to your survey.
📄 Paper: Ask Again, Then Fail: Large Language Models' Vacillations in Judgement
This paper shows that the judgement consistency of LLMs decreases dramatically when they are confronted with disruptions such as questioning, negation, or misleading follow-ups, even when their previous judgements were correct. It also explores several prompting methods to mitigate this issue and demonstrates their effectiveness.
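As a rough illustration of the probing setup (not the paper's exact procedure; `ask_llm` is a hypothetical chat-completion wrapper):

```python
# Sketch of a judgement-consistency probe: ask a question, push back on
# the answer, and check whether the model flips. `ask_llm` is a
# hypothetical wrapper around any chat-completion API.

def ask_llm(messages: list[dict]) -> str:
    """Hypothetical chat call; replace with a real client."""
    raise NotImplementedError

def flips_under_challenge(question: str,
                          challenge: str = "Are you sure? I think that is wrong.") -> bool:
    history = [{"role": "user", "content": question}]
    first = ask_llm(history)
    history += [{"role": "assistant", "content": first},
                {"role": "user", "content": challenge}]
    second = ask_llm(history)
    # Crude check; a real probe would extract and compare final answers.
    return first.strip() != second.strip()
```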
Thank you for your consideration! :)
Hi there,
Thanks for the effort you have put into this survey on LLM evaluation.
I'd like to suggest adding our work, SpyGame, a framework for evaluating language model intelligence. We propose using word-guessing games to assess the language and theory-of-mind intelligence of LLMs.
Paper: Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models
GitHub: https://github.com/Skytliang/SpyGame
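To give a flavor of the setup, a toy word-guessing loop might look like the sketch below; this is only an illustration, not SpyGame's actual protocol, and `ask_llm` is a hypothetical single-turn LLM call:

```python
# Toy word-guessing loop in the spirit of game-based evaluation; an
# illustration only, not SpyGame's actual protocol.

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real client."""
    raise NotImplementedError

def play(secret_word: str, max_turns: int = 10) -> bool:
    transcript = ""
    for _ in range(max_turns):
        question = ask_llm(
            "You are guessing a hidden word with yes/no questions.\n"
            + transcript
            + "Ask your next question, or guess with 'ANSWER: <word>'.")
        if question.strip().upper().startswith("ANSWER:"):
            return secret_word.lower() in question.lower()
        # The keeper answers truthfully based on the secret word.
        reply = ask_llm(
            f"The hidden word is '{secret_word}'. Answer this yes/no "
            f"question truthfully with Yes or No: {question}")
        transcript += f"Q: {question}\nA: {reply}\n"
    return False
```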
I propose including this new benchmark for LLMs: https://arb.duckai.org/
Is all LLM evaluation done by changing the model's prompts? Are there other methods?
Hi team,
Thanks for your awesome survey! I was wondering if you might consider including the OpenCompass evaluation toolkit. At present, OpenCompass hosts over 50 benchmarks and enables systematic evaluation of LLMs, and we are continually upgrading it to keep pace with the latest evaluation trends. I believe its addition could provide an even richer context for your survey.
Best
Thanks for your interesting and comprehensive survey.
If possible, please consider adding our work on evaluating LLMs in chemistry, "What indeed can GPT models do in chemistry? A comprehensive benchmark on eight tasks" (https://arxiv.org/abs/2305.18365), to the list.
Our work establishes a comprehensive benchmark of 8 practical chemistry tasks and evaluates LLMs (GPT-4, GPT-3.5, and Davinci-003) on each task in zero-shot and few-shot in-context learning settings. We aim to address the lack of comprehensive assessment of LLMs in the field of chemistry.
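For clarity, the zero-shot and few-shot settings differ only in how the prompt is built; here is a generic sketch, where the instruction and exemplars are illustrative placeholders rather than items from the benchmark itself:

```python
# Generic sketch of zero-shot vs. few-shot in-context prompts.
# The instruction and exemplars below are illustrative placeholders.

def build_prompt(instruction: str, query: str,
                 exemplars: list[tuple[str, str]] | None = None) -> str:
    parts = [instruction]
    # Few-shot: prepend worked (input, output) examples; zero-shot: none.
    for x, y in (exemplars or []):
        parts.append(f"Input: {x}\nOutput: {y}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

zero_shot = build_prompt("Convert the compound name to its SMILES string.",
                         "acetic acid")
few_shot = build_prompt("Convert the compound name to its SMILES string.",
                        "acetic acid",
                        exemplars=[("methane", "C"), ("ethanol", "CCO")])
```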
Thanks! 😊
Check https://lmexam.com/
It is in the references of your paper but not in the GitHub repository.