Can LLMs Actually Judge Web Development Quality? Spoiler: Not Really
I recently came across a fascinating paper at ICLR’26 that tackles a question many of us AI developers have been wrestling with: can we trust LLMs to evaluate complex, interactive tasks? The authors focus on web development, and the short answer is: we've got a