“you absolutely have to view LLM benchmarks from a position of default-distrust.” @ouguoc describes how easily answers to benchmark problems can leak into the training set. seinmastudios.com/posts/llm-be