Grapevine

AI challenge seeks questions to test human-level intelligence

Two of San Francisco's leading players in artificial intelligence have challenged the public to come up with questions capable of testing the capabilities of large language models (LLMs) like Google Gemini and OpenAI's o1. Scale AI, which specializes in preparing the vast tracts of data on which the LLMs are trained, teamed up with the Center for AI Safety (CAIS) to launch the initiative, Humanity's Last Exam.

The initiative offers prizes of US$5,000 (£3,800) for those who come up with the top 50 questions selected for the test. Scale and CAIS say the goal is to test how close we are to achieving "expert-level AI systems" using the "largest, broadest coalition of experts in history."

Why do this? The leading LLMs are already acing many established tests in intelligence, mathematics and law, but it's hard to be sure how meaningful this is. In many cases, they may have pre-learned the answers due to the gargantuan quantities of data on which they are trained, including a significant percentage of everything on the internet.

Data is fundamental to this whole area. It is behind the paradigm shift from conventional computing to AI, from "telling" to "showing" these machines what to do. This requires good training datasets, but also good tests. Developers typically test their models using data that hasn't already been used for training, known in the jargon as "test datasets."
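For readers curious what a test dataset looks like in practice, here is a minimal sketch in Python. It is an illustration only, not how any particular lab does it: the function names `split_held_out` and `leaked`, the 20% test fraction and the toy list of questions are all assumptions made for the example.

```python
import random


def split_held_out(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and set aside a fraction for testing only.

    Hypothetical helper: the 20% default is an illustrative choice.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    # The first n_test items become the test set; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]


def leaked(train, test):
    """Return the test items that also appear verbatim in the training data."""
    seen = set(train)
    return [item for item in test if item in seen]


# Toy data standing in for real exam questions.
questions = [f"question {i}" for i in range(10)]
train, test = split_held_out(questions)

# A clean split shares nothing between the two sets.
print(len(train), len(test), leaked(train, test))
```

The `leaked` check is the point of the exercise. If a test question has already turned up in the training data, a correct answer may reflect memorization rather than ability, which is the worry with models trained on much of the internet.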

If LLMs are not already able to pre-learn the answers to established tests like bar exams, they probably will be soon. The AI analytics site Epoch estimates that 2028 will mark the point at which the AIs will effectively have read everything ever written by humans. An equally important challenge is how to keep assessing AIs once that Rubicon has been crossed.