I tried GPT-5.4, and most answers were really good, but a few had me concerned ...
Department of Environmental and Occupational Health Sciences, University of Washington, Seattle, United States; College of Health Solutions, Arizona State University, Phoenix, United States ...
AI benchmarks rely on models not knowing they’re being tested. Anthropic revealed that Claude Opus 4.6 figured it out anyway, identifying the BrowseComp benchmark by name and decrypting its encrypted ...