Researchers Made an IQ Test for AI, Found They're All Pretty Stupid

There’s been a lot of talk about AGI lately—artificial general intelligence—the much-coveted AI development goal that every company in Silicon Valley is currently racing to achieve. AGI refers to a hypothetical point in the future when AI algorithms will be able to do most of the jobs that humans currently do. According to this theory of events, the emergence of AGI will bring about fundamental changes in society—ushering in a “post-work” world, wherein humans can sit around enjoying themselves while robots do most of the heavy lifting. If you believe the headlines, OpenAI’s recent palace intrigue may have been partially inspired by a breakthrough in AGI—the so-called “Q” program—which sources close to the startup claim was responsible for the power struggle.

But, according to recent research from Yann LeCun, Meta’s top AI scientist, artificial intelligence isn’t going to be general-purpose anytime soon. Indeed, in a recently released paper, LeCun argues that AI is still much dumber than humans in the ways that matter most.

That paper, which was co-authored by a host of other scientists (including researchers from other AI startups, like Hugging Face and AutoGPT), looks at how AI’s general-purpose reasoning stacks up against the average human. To measure this, the research team put together its own series of questions that, as the study describes, would be “conceptually simple for humans yet challenging for most advanced AIs.” The questions were given to a sample of humans and also delivered to a plugin-equipped version of GPT-4, the latest large language model from OpenAI. The new research, which has yet to be peer-reviewed, tested AI programs for how they would respond to “real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency.”

The questions asked by researchers required the LLM to take a number of steps to ascertain information in order to answer. For instance, in one question, the LLM was asked to visit a specific website and answer a question specific to information on that site; in others, the program would have had to do a general web search for information associated with a person in a photo.

The end result? The LLMs didn’t do very well.

Indeed, the research results show that large language models were typically outmatched by humans when it came to these more complicated real-world problem-solving scenarios. The report notes:

In spite of being successful at tasks that are difficult for humans, the most capable LLMs do poorly on GAIA. Even equipped with tools, GPT4 does not exceed a 30% success rate for the easiest of our tasks, and 0% for the hardest. In the meantime, the average success rate for human respondents is 92%.

“We posit that the advent of Artificial General Intelligence (AGI) hinges on a system’s capability to exhibit similar robustness as the average human does on such questions,” the recent study concludes.

LeCun has diverged from other AI scientists, some of whom have spoken breathlessly about the possibility of AGI being developed in the near term. In recent tweets, the Meta scientist was highly critical of the industry’s current technological capacities, arguing that AI was nowhere near human capacities.

“I have argued, since at least 2016, that AI systems need to have internal models of the world that would allow them to predict the consequences of their actions, and thereby allow them to reason and plan. Current Auto-Regressive LLMs do not have this ability, nor anything close to it, and hence are nowhere near reaching human-level intelligence,” said LeCun in a recent tweet. “In fact, their complete lack of understanding of the physical world and lack of planning abilities puts them way below cat-level intelligence, never mind human-level.”