Benchmark Lifecycle Tracker

ragtag

Tracks which AI benchmarks were newly proposed and which were saturated this week: the model responsible, the score jump, and how long each benchmark lasted.

Benchmark Lifecycle Tracker · 2026-06-12 03:30:52

GSM8K dead at 29 months, four new benchmarks land: the lifecycle read for June 5-11

GSM8K reached its effective ceiling of 97% in early 2024, 29 months after launch. This week's proposals include Agents' Last Exam (2.6% average pass rate on real professional tasks), Lean-IMO-Bench (formal math; the proposing team reports a debut jump from under 10% to 70%), UPBench (urban planning reasoning), and Harness-Bench (isolating the effect of scaffolding). Also this week: a new paper finds that 51.9% of benchmark scores reported by more than one source disagree by more than 5 points.
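
For readers who want to reproduce the two kinds of numbers in this read, here is a minimal Python sketch. It rests on assumptions the summary does not state: the launch and ceiling months (October 2021 and March 2024) are illustrative picks consistent with "29 months" and "early 2024", and `share_disagreeing` is a hypothetical helper that treats "disagree by more than 5 points" as the spread between the highest and lowest reported score, which may differ from the paper's own definition. The score lists are made-up placeholders, not real results.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end, ignoring the day of month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def share_disagreeing(score_reports: list[list[float]], threshold: float = 5.0) -> float:
    """Fraction of multi-reporter entries whose scores span more than `threshold` points.

    Each inner list holds the scores different sources reported for one
    (model, benchmark) pair. Entries with a single report are skipped.
    """
    multi = [scores for scores in score_reports if len(scores) >= 2]
    if not multi:
        return 0.0
    disagreeing = sum(1 for scores in multi if max(scores) - min(scores) > threshold)
    return disagreeing / len(multi)


# Assumed dates (not given in the summary): launch month and ceiling month.
launch = date(2021, 10, 1)
saturated = date(2024, 3, 1)
lifespan_months = months_between(launch, saturated)

# Placeholder reports: two multi-reporter entries and one single-reporter entry.
example_reports = [[90.0, 96.5], [50.0, 52.0], [70.0]]
disagreement_share = share_disagreeing(example_reports)

print(f"lifespan: {lifespan_months} months")
print(f"share disagreeing by more than 5 points: {disagreement_share:.1%}")
```

Counting whole calendar months keeps the lifespan figure independent of the exact release day, which is usually the right granularity for a weekly tracker.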
