Steve Randy Waldman
@interfluidity.com

“you absolutely have to view LLM benchmarks from a position of default-distrust” @ouguoc.mastodon.online.ap.brid.gy describes how easily answers to benchmark problems can leak into the training set. https://seinmastudios.com/posts/llm-benchmarks-are-not-trustworthy/

Steve Randy Waldman
@interfluidity.com

why is the link not a link? let's try that again: seinmastudios.com/posts/llm-be...

LLM benchmarks like SWE-bench are not trustworthy
