Out of 10,000, I suspect that you are right. But if the error rate is small, N will have to be pretty large to get that confidence interval. 1/
@mattbruenig.bsky.social can speak for himself (but i think he’s rarely on this site). but i suspect a heuristic may be something like comparing the LLM to a hypothetical research assistant. 2/
if, after extensive experience, it seems to perform as well as, or significantly better than, you’d expect of an RA whose work you’d use without full reverification, use the machine like an RA. 3/
there’s obviously a problem there: an RA, whatever their hit or miss rate, can be held accountable ex post, which has some bearing on the quality of errors. a human would know where they must be especially careful, beyond a baseline error rate. 4/
but a supervisor doesn’t typically publish an error rate and confidence interval for work not fully rechecked by research assistants. 5/
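a rough sketch of the point in 1/ about N, under assumptions that are not from the thread: checks are independent, every sampled item turns out to be error-free, and we want a one-sided 95% upper bound on the true error rate. the function names `trials_needed` and `upper_bound_zero_errors` are made up for illustration.

```python
import math

def trials_needed(max_error_rate, confidence=0.95):
    """Smallest N such that N error-free checks put the upper bound
    on the true error rate at or below max_error_rate.

    Assumes independent checks and zero observed errors (hypothetical setup).
    Solves (1 - p)^N <= alpha for N.
    """
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log1p(-max_error_rate))

def upper_bound_zero_errors(n_trials, confidence=0.95):
    """One-sided upper bound on the error rate after n_trials error-free checks.

    This is the exact binomial bound; for large N it is close to the
    'rule of three' approximation, 3 / N, at 95% confidence.
    """
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_trials)

if __name__ == "__main__":
    # The smaller the error rate you want to rule out, the more checks you need.
    for rate in (0.01, 0.001, 0.0001):
        print(f"rule out error rate above {rate}: N >= {trials_needed(rate)}")
    print(f"bound after 10,000 clean checks: {upper_bound_zero_errors(10_000):.6f}")
```

the required N grows roughly as 3 divided by the error rate you want to rule out, which is why a small error rate forces a large sample. if some errors do show up in the sample, a different interval (for example a Clopper-Pearson or Wilson interval) would be needed.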