We propose an evaluation methodology for AI systems operating in domains without ground truth. The key idea is to test whether the AI's outputs violate consistency constraints of the problem domain. We show that this approach uncovers clear failures in state-of-the-art models, including GPT-4 on hard-to-evaluate tasks and superhuman chess engines.
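To make the idea concrete, here is a minimal sketch of one such consistency check: a model's probability estimates for an event and its negation should sum to roughly one, and a violation is evidence of failure even when the true probability is unknown. The function names and the specific complement constraint are illustrative assumptions, not the exact protocol used in the paper.

```python
# Sketch of a consistency check that needs no ground truth:
# P(A) + P(not A) should be close to 1.

def ask_probability(model, question: str) -> float:
    """Hypothetical interface: return the model's probability estimate in [0, 1]."""
    raise NotImplementedError("Replace with a query to the model under evaluation.")

def violates_complement_check(model, event: str, negated_event: str,
                              tol: float = 0.05) -> bool:
    """Return True if the model's two answers are mutually inconsistent."""
    p = ask_probability(model, f"What is the probability that {event}?")
    p_neg = ask_probability(model, f"What is the probability that {negated_event}?")
    # Inconsistency flags a failure without knowing which answer (if either) is correct.
    return abs(p + p_neg - 1.0) > tol

# Example usage (hypothetical model object):
# violates_complement_check(model,
#                           "team X wins the match",
#                           "team X does not win the match")
```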