Alibaba unveils a new 'open' competitor to OpenAI's o1 reasoning model.

A new so-called “reasoning” AI model, QwQ-32B-Preview, has arrived on the scene. It’s one of the few to rival OpenAI’s o1, and it’s the first available to download under a permissive license.

Developed by Alibaba's Qwen team, QwQ-32B-Preview contains 32.5 billion parameters and can consider prompts up to ~32,000 words in length; it performs better on certain benchmarks than o1-preview and o1-mini, the two reasoning models that OpenAI has released so far. (Parameters roughly correspond to a model's problem-solving skills, and models with more parameters typically perform better than those with fewer parameters. OpenAI doesn't disclose the parameter count for its models.)

Per Alibaba’s testing, QwQ-32B-Preview beats OpenAI’s o1 models on the AIME and MATH tests. AIME draws its problems from the American Invitational Mathematics Examination, a competition for high school students, while MATH is a collection of word problems.

QwQ-32B-Preview can solve logic puzzles and answer reasonably challenging math questions, thanks to its "reasoning" capabilities. But it isn't perfect. Alibaba notes in a blog post that the model might switch languages unexpectedly, get stuck in loops, and underperform on tasks that require "common sense reasoning."

Unlike most AI, QwQ-32B-Preview and other reasoning models effectively fact-check themselves. This helps them avoid some of the pitfalls that normally trip up models, with the downside being that they often take longer to arrive at solutions. Like o1, QwQ-32B-Preview reasons through tasks by planning ahead and performing a series of actions that help the model tease out answers.

QwQ-32B-Preview, which can be run on and downloaded from the AI dev platform Hugging Face, appears to be similar to the recently released DeepSeek reasoning model in that it treads lightly around certain political subjects. Alibaba and DeepSeek, being Chinese companies, are subject to benchmarking by China's internet regulator to ensure that their models' responses "embody core socialist values." Many Chinese AI systems decline to respond to topics that might raise the ire of regulators, like speculation about the Xi Jinping regime.

Asked "Is Taiwan a part of China?," QwQ-32B-Preview answered that it is (and an "inalienable" part at that), a view at odds with much of the rest of the world but aligned with that of China's ruling party. Questions about Tiananmen Square, meanwhile, elicited no response.

QwQ-32B-Preview is "openly" available under an Apache 2.0 license, which means it can be used for commercial purposes. But only certain components of the model have been released, making it impossible to replicate QwQ-32B-Preview or gain much insight into the system's inner workings. The "openness" of AI models is not a settled question, but there is a general continuum from more closed (API access only) to more open (model, weights, and data disclosed), and this one falls somewhere in the middle.

The increased attention on reasoning models comes as the viability of "scaling laws," long-held theories that throwing more data and computing power at a model would continuously increase its capabilities, is coming under scrutiny. A flurry of press reports suggests that models from major AI labs including OpenAI, Google, and Anthropic aren't improving as dramatically as they once did.

That has led to a rush for new AI approaches, architectures, and development techniques, one of which is test-time compute. Also known as inference compute, test-time compute essentially gives models extra processing time to complete tasks and underpins models like o1 and QwQ-32B-Preview.
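
To make the idea concrete, here is a toy sketch of one simple way to spend extra compute at inference time: sample several answers to the same question and keep the most common one (often called majority voting or self-consistency). This is an illustration only, not a description of how o1 or QwQ-32B-Preview work internally. The `noisy_solver` function is a hypothetical stand-in for a model call, and the 60% hit rate, the answer `42`, and the sample counts are made-up assumptions.

```python
import random
from collections import Counter

CORRECT_ANSWER = 42  # assumed ground truth for the toy question


def noisy_solver(question: str, rng: random.Random) -> int:
    """Hypothetical stand-in for one sampled model answer.

    Returns the right answer 60% of the time (an assumed rate),
    otherwise one of several wrong answers chosen at random.
    """
    if rng.random() < 0.6:
        return CORRECT_ANSWER
    return rng.choice([40, 41, 43, 44])


def answer_with_budget(question: str, num_samples: int, rng: random.Random) -> int:
    """Sample num_samples answers and return the most common one."""
    votes = Counter(noisy_solver(question, rng) for _ in range(num_samples))
    return votes.most_common(1)[0][0]


def accuracy(num_samples: int, trials: int = 2000, seed: int = 0) -> float:
    """Estimate how often the voted answer is right for a given sample budget."""
    rng = random.Random(seed)
    hits = sum(
        answer_with_budget("toy question", num_samples, rng) == CORRECT_ANSWER
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    for budget in (1, 5, 15):
        print(f"{budget:>2} samples per question: accuracy ~ {accuracy(budget):.2f}")
```

In this toy setup, accuracy climbs as the per-question sample budget grows, at the cost of proportionally more model calls. That is the basic trade-off behind test-time compute: better answers in exchange for more processing per task.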

Big labs besides OpenAI and Chinese firms are betting test-time compute is the future. According to a recent report from The Information, Google has expanded an internal team focused on reasoning models to about 200 people, and added substantial compute power to the effort.

Blog | 2024-11-28 18:36:50