Steve Randy Waldman
@interfluidity.com

Out of 10,000, I suspect that you are right. But if the error rate is small, N will have to be pretty large to get that confidence interval. 1/
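(a quick sketch of the arithmetic behind that claim, using the standard Wald normal-approximation interval for a binomial proportion and an assumed 1% error rate — the numbers are illustrative, not from the thread:)

```python
import math

def n_for_margin(p, margin, z=1.96):
    """Sample size so a ~95% Wald CI for proportion p has half-width `margin`.

    n = z^2 * p * (1 - p) / margin^2, rounded up.
    """
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

# assumed 1% error rate, purely illustrative
print(n_for_margin(0.01, 0.005))  # ±0.5 pp -> 1522 checks needed
print(n_for_margin(0.01, 0.001))  # ±0.1 pp -> 38032 checks needed
```

note how the required N blows up as the margin tightens relative to the small rate itself: pinning down a ~1% error rate with any useful relative precision quickly demands tens of thousands of verified samples.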

@mattbruenig.bsky.social can speak for himself (but i think he’s rarely on this site). but i suspect a heuristic may be something like comparing the LLM to a hypothetical research assistant. 2/

if, after extensive experience, it seems to perform as well as, or significantly better than, an RA whose work you’d use without full reverification, use the machine like an RA. 3/

there’s obviously a problem there: an RA, whatever their hit-or-miss rate, can be held accountable ex post, which has some bearing on the quality of errors. a human would know where they must be especially careful, beyond a baseline error rate. 4/

but a supervisor doesn’t typically publish an error rate and confidence interval for research assistants’ work that hasn’t been fully rechecked. 5/

when the error rate is small, computing meaningful values would be a laborious exercise. i don’t think it’s clear that will always be a good tradeoff. obviously it will depend on the expected cost of errors. /fin
