Bluesky archive: @interfluidity.com

Jul 5, 2025
8:03 AM

Steve Randy Waldman

@interfluidity.com

“you absolutely have to view LLM benchmarks from a position of default-distrust” @ouguoc.mastodon.online.ap.brid.gy describes how easily answers to benchmark problems can leak into the training set. https://seinmastudios.com/posts/llm-benchmarks-are-not-trustworthy/

Jul 5, 2025
8:26 AM

Steve Randy Waldman

@interfluidity.com

why is the link not a link? let's try that again: seinmastudios.com/posts/llm-be...

LLM benchmarks like SWE-bench are not trustworthy

in reply to self