As seen <a href="https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3a

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Poor Benchmark Results (Needs Addressed) about stablelm HOT 5 OPEN

MarkSchmidty commented on August 30, 2024 4

Poor Benchmark Results (Needs Addressed)

from stablelm.

Comments (5)

jon-tow commented on August 30, 2024 11

We're well aware of this (I was one of the core devs of lm-eval - we perform downstream benchmarking the same way 😄). We believe a few things are contributing to this, and we hope to pin them down in our upcoming write-up.
For the time being, you should find that rewriting the contexts in dialog prompt format (e.g. Question: -> User:) improves scores.
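As a rough illustration of the prompt change suggested above, here is a minimal sketch that rewrites QA-style line prefixes into dialog-style ones before a context is passed to the model. Only the `Question:` -> `User:` substitution comes from the comment; the `Answer:` -> `Assistant:` pair and the names `DIALOG_PREFIXES` and `to_dialog_format` are assumptions made for illustration, and this is not lm-eval's own API.

```python
import re

# Only "Question:" -> "User:" is taken from the thread.
# "Answer:" -> "Assistant:" is an assumed counterpart, not confirmed by the source.
DIALOG_PREFIXES = {"Question:": "User:", "Answer:": "Assistant:"}


def to_dialog_format(context: str, mapping=DIALOG_PREFIXES) -> str:
    """Replace QA-style prefixes at the start of a line with dialog-style ones."""
    alternatives = "|".join(re.escape(prefix) for prefix in mapping)
    # MULTILINE makes ^ match at the start of every line, so prefixes
    # appearing in the middle of a line are left untouched.
    pattern = re.compile(rf"^({alternatives})", flags=re.MULTILINE)
    return pattern.sub(lambda match: mapping[match.group(1)], context)


if __name__ == "__main__":
    example = "Question: What is the capital of France?\nAnswer:"
    print(to_dialog_format(example))
```

Whether this helps a given task depends on how that task builds its context; tasks that do not use a `Question:` prefix would pass through unchanged.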


MarkSchmidty commented on August 30, 2024 8

Okay, I made the issue title less alarming since you've chimed in.

Open communication about the issue and what is being done to address it would be appreciated by many. This thread/issue may be a good place to reach more technical users/devs who are keeping tabs.


lhl commented on August 30, 2024 4

I dropped a line to the lm@stability address mentioned in the announcement to ask whether I'm doing anything wrong w/ the benchmarks. I was curious why evals weren't included w/ the model card even for an alpha release (or at least a note that low benchmark scores were a known issue), but I'll be following w/ interest.

Also curious, since this is a foundational model: what's going on w/ dialog prompt formatting? I grepped through the tasks, and question is used by the QA tasks, so it would impact piqa, but how about hellaswag (completions) or winogrande (its own format)?


MohamedAliRashad commented on August 30, 2024 2

Any updates on this?


mallorbc commented on August 30, 2024 2

@jon-tow Will using that prompt format help with the base model? Or perhaps you are talking about the tuned model?

