Can games like Pictionary and Minecraft evaluate the ingenuity of AI models?

Most AI benchmarks tell us little. They pose strings of questions that can be answered through rote memorization, or probe niche issues irrelevant to most users.

As a consequence, some AI advocates look towards games as ways of testing the problem-solving prowess of AIs.

Freelance AI developer Paul Calcraft built an app in which two AI models play a Pictionary-like game against each other: one model doodles something, and the other model guesses what it is.
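The structure of such a two-model round can be sketched as a simple loop. The code below is a toy illustration only; the function names, the SVG format, and the candidate-word setup are assumptions for the sketch, not details from Calcraft's actual app, and both model calls are replaced with deterministic stubs:

```python
# Toy sketch of a two-model Pictionary round. In a real app, each role
# would be an LLM API call; here both are deterministic stubs.

def drawer(word: str) -> str:
    """Stub drawer: a real model would return an actual doodle (e.g. SVG)."""
    return f"<svg><!-- rough doodle of a {word} --></svg>"

def guesser(image: str, candidates: list[str]) -> str:
    """Stub guesser: a real model would interpret the image; this stub
    simply picks the candidate word mentioned in the markup."""
    for word in candidates:
        if word in image:
            return word
    return candidates[0]

def play_round(secret: str, candidates: list[str]) -> bool:
    image = drawer(secret)               # one model doodles the secret word
    guess = guesser(image, candidates)   # the other model tries to name it
    return guess == secret

print(play_round("pelican", ["bicycle", "castle", "pelican"]))  # prints: True
```

The drawer's score in a setup like this depends on how legible its doodle is to the guesser, which is what gives the game its adversarial, communication-driven character.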

"I thought this sounded super fun and potentially interesting from a model capabilities point of view," Calcraft told TechCrunch during an interview. "So I sat indoors on a cloudy Saturday and got it done."

Calcraft drew inspiration from a similar challenge posed by British programmer Simon Willison, who asked models to produce a vector drawing of a pelican riding a bicycle. Like Calcraft, Willison chose a problem he believed would force models to "think" beyond the contents of their training data.

The idea, he explained, is to create an ungameable benchmark: one a model can't pass simply by regurgitating memorized answers to patterns it has already seen in training.

Minecraft may be of the "un-gameable" variety, too, or at least 16-year-old Adonis Singh thinks so. He's built a tool called mc-bench, which gives a model control of a Minecraft character and tests its ability to design structures, much along the lines of Microsoft's Project Malmo.

"It really tests the models on resourcefulness and gives them more agency," he told TechCrunch. "It's not nearly as restricted and saturated as [other] benchmarks."

The idea of using games as a benchmark for AI is not new. The concept has been around for decades: Mathematician Claude Shannon wrote in 1949 that games such as chess presented a suitable challenge for "intelligent" software. More recently, Alphabet's DeepMind created a model that could play Pong and Breakout; OpenAI trained AI to compete in Dota 2 matches; and Meta designed an algorithm that could hold its own against professional Texas hold 'em players.

What's new is that enthusiasts are hooking up large language models, which can process text, images, and more, to games to probe how well they reason.

There are many LLMs out there, from Gemini and Claude to GPT-4o, and each has its own "feel," so to speak. They "feel" different from one interaction to the next, a phenomenon that's difficult to quantify.

"LLMs are known to be sensitive to particular ways questions are asked, and just generally unreliable and hard to predict," Calcraft said.

Compared with text-based benchmarks, games offer a visual and intuitive way to see how a model performs and behaves, says Matthew Guzdial, an AI researcher and professor at the University of Alberta.

"We can think of every benchmark as giving us a different simplification of reality focused on particular types of problems, like reasoning or communication," he said. "Games are just other ways you can do decision-making with AI, so folks are using them like any other approach."

Those who know a bit of generative AI history will notice that Pictionary resembles generative adversarial networks, or GANs, in which a creator model forwards images to a discriminator model for review.
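For readers unfamiliar with the analogy: in a GAN, a generator and a discriminator improve by competing against each other. The toy loop below shows only the adversarial shape of that process; real GANs train neural networks by gradient descent, whereas here each "model" is a single number with invented update rules, purely for illustration:

```python
import random

# Illustrative GAN-style loop: a "generator" (one number, g) tries to
# mimic real data, while a "discriminator" (a threshold, d) tries to
# sit between the real and fake samples. Each update nudges the other.
random.seed(0)
REAL_MEAN = 5.0   # mean of the "real data" distribution
g, d = 0.0, 0.0

for _ in range(2000):
    real = random.gauss(REAL_MEAN, 1.0)   # a sample of real data
    fake = random.gauss(g, 1.0)           # the generator's forgery
    # Discriminator step: move the threshold toward the midpoint
    # of the real and fake samples it just saw.
    d += 0.01 * ((real + fake) / 2 - d)
    # Generator step: drift toward what the discriminator treats as real.
    g += 0.01 * (d - g)

# By now g has drifted close to REAL_MEAN: the generator's forgeries
# have become hard to distinguish from the real data.
```

The Pictionary setup inverts the incentives slightly: the drawer and guesser cooperate within a round, but the drawer is still rewarded for images its counterpart can decode.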

Calcraft believes that Pictionary can capture an LLM's ability to understand concepts like shapes, colors, and prepositions (for instance, the meaning of "in" versus "on"). He wouldn't say the game is a reliable test of reasoning, but he argued that winning requires strategy and the ability to understand clues, neither of which models find easy.

"I really also enjoy the nearly adversarial flavor of game Pictionary, too, in that you've got two roles: someone draws, and somebody has to guess," he added. "The best to draw is not necessarily the most artistic, but whoever can most clearly express that concept to the rest of the LLM crowd-and particularly to the quicker, far less powerful models!).

"Pictionary is a toy problem that's not immediately practical or realistic," Calcraft cautioned. "That said, I do think spatial understanding and multimodality are critical elements for AI advancement, so LLM Pictionary could be a small, early step on that journey."

Singh believes that Minecraft is a useful benchmark, too, one that can measure reasoning in LLMs. "From the models I've tested so far, the results literally perfectly align with how much I trust the model for something reasoning-related," he said.

Others aren't so sure.

Mike Cook, a research fellow at Queen Mary University specializing in AI, doesn't think Minecraft is particularly special as an AI testbed.

"I think some of the fascination with Minecraft comes from people outside of the games sphere who maybe think that, because it looks like 'the real world,' it has a closer connection to real-world reasoning or action," Cook told TechCrunch. "From a problem-solving perspective, it's not so dissimilar to a video game like Fortnite, Stardew Valley, or World of Warcraft. It's just dressed up in a different way that makes it look more like an everyday set of tasks, such as building things or exploring."

As Cook pointed out, even the best game-playing AI systems generally don't adapt well to new environments, and can't easily solve problems they haven't seen before. For example, it's unlikely a model that excels at Minecraft will play Doom with any real skill.

"I think the good qualities Minecraft does have from an AI perspective are extremely weak reward signals and a procedural world, which means unpredictable challenges," Cook continued. "But it's not really that much more representative of the real world than any other video game."

Even so, there's something fascinating about watching LLMs build castles.

Blog | 2024-11-06 03:05:56