The Ultimate Judge: Testing AI's Reliability through Physics