The common wisdom is that companies like Google, OpenAI, and Anthropic, with bottomless cash reserves and hundreds of top-tier researchers, are the only ones that can make a state-of-the-art foundation model. But as one among them famously noted, they "have no moat" — and Ai2 showed that today with the release of Molmo, a multimodal AI model that matches their best while also being small, free, and truly open source.
To be clear, Molmo is a visual interpretation engine, not a full-service chatbot like ChatGPT. It doesn't have an API, it isn't enterprise-ready, and it doesn't crawl the web, for you or for its own purposes. Think of it as the part of those models that sees an image, understands it, and can describe or answer questions about it.
Molmo, like other multimodal models, can identify and answer questions about nearly any everyday situation or object. How do you operate this coffee machine? How many dogs in this image have their tongues out? Which of the options on this menu are vegan? What are the variables in this diagram? It's the sort of visual understanding task we've seen demonstrated with varying degrees of success and latency for years.
What's different, however, isn't necessarily Molmo's capabilities (which you can see in the demo below, or test here), but how it achieves them.
Visual understanding is a broad domain, of course, running the gamut from counting sheep in a field to guessing how a person is feeling to summarizing a menu. As such it's difficult to describe, let alone test quantitatively, but as Ai2 CEO Ali Farhadi explained at a demo event at the research organization's HQ in Seattle, you can at least show that two models are similar in their capabilities.
"One thing that we're showing today is that open is equal to closed," he said, "and small is now equal to big." (He clarified that he meant ==, meaning equivalency, not identity; a fine distinction some will appreciate.)
One near constant in AI development has been "bigger is better": more training data, more parameters in the resulting model, and more computing power to create and operate it. But at some point you literally can't make a model any bigger: there isn't enough data to do it, or the compute costs and times get so high that the effort is self-defeating. You have to make do with what you have, or, even better, do more with less.
According to Farhadi, Molmo performs on par with GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet while being, by the best estimates, about a tenth of their size. And it approaches their level of capability with a model a tenth of that size again.
"There are a dozen different benchmarks that people evaluate on.". I don't like this game, scientifically… but I had to show people a number," he explained. "Our biggest model is a small model, 72B, it's outperforming GPTs and Claudes and Geminis on those benchmarks. Again, take it with a grain of salt; does this mean that this is really better than them or not? I don't know. But at least to us, it means that this is playing the same game.".
Want to try to stump it? Take the public demo for a spin (it works on mobile, too); if you don't want to log in, just refresh or scroll up and "edit" the original prompt to swap out the image.
The trick is to use less data, but of better quality. Instead of training on a library of billions of images that can't possibly all be quality controlled, described, or deduplicated, Ai2 curated and annotated a set of just 600,000. Obviously that's still a lot, but against six billion it's a drop in the bucket: a hundredth of a percent. While this trims off a bit of the long tail, the team's selection process and unusual annotation method give it very high quality descriptions.
Interested in how? Well, they show people an image and tell them to describe it out loud. It turns out people talk about things differently from how they write about them, and this produces descriptions that are not just accurate but conversational and useful. As a result, the image descriptions Molmo generates are rich and practical.
That is best shown in its new, and for at least a few days unique, ability to "point" at the relevant parts of an image. When asked to count the dogs in a photo (33), it put a dot on each of their faces. When asked to count the tongues, it put a dot on each tongue. This specificity lets it perform all kinds of new zero-shot actions. And importantly, it works on web interfaces as well: without looking at a website's code, the model understands how to navigate a page, submit a form, and so on. (Rabbit recently demoed something similar for its r1, slated for release next week.)
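To make the pointing idea concrete: the model answers in text, with locations embedded as coordinates that a front end can render as dots. The snippet below is a minimal sketch of pulling those coordinates out, assuming a hypothetical XML-style <point x=... y=...> markup modeled on what's visible in Ai2's demo output; the exact format may differ.

```python
import re

# Hypothetical Molmo-style answer to "Point to each dog's tongue."
# The <point> markup is an assumption modeled on Ai2's demo output,
# not a documented format; adjust the pattern to whatever the model emits.
answer = (
    'Two dogs have their tongues out: '
    '<point x="31.4" y="52.0" alt="tongue">tongue</point> '
    '<point x="68.9" y="47.5" alt="tongue">tongue</point>'
)

# In this sketch, coordinates are percentages of image width and height,
# so a UI can scale them to any resolution before drawing the dots.
points = [
    (float(x), float(y))
    for x, y in re.findall(r'<point x="([\d.]+)" y="([\d.]+)"', answer)
]
print(points)  # [(31.4, 52.0), (68.9, 47.5)]
```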
So why does it matter? Models come out practically every day. Google just revealed a few. OpenAI holds a demo day. Perplexity is always showing something off. Meta is hyping up whichever version of Llama is next.
Well, Molmo is fully free and open source, and small enough that it will run on your local machine. No API, no subscription, no water-cooled GPU cluster needed. The intent in making and releasing the model this way is to let developers and creators build AI-powered apps, services, and experiences without having to ask permission from, and pay, one of the world's biggest tech companies.
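For a sense of what "runs on your local machine" means in practice, here's a minimal sketch of loading the released weights with Hugging Face transformers. The repo name and the process/generate_from_batch calls follow the pattern of Ai2's published model card at the time of writing, but treat them as assumptions and check the card for the current interface; even the smaller checkpoint wants a recent GPU or plenty of RAM.

```python
# Minimal local-inference sketch, assuming the weights are on Hugging Face
# under allenai/Molmo-7B-D-0924 with the custom code from Ai2's model card.
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image

processor = AutoProcessor.from_pretrained(
    "allenai/Molmo-7B-D-0924", trust_remote_code=True,
    torch_dtype="auto", device_map="auto",
)
model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo-7B-D-0924", trust_remote_code=True,
    torch_dtype="auto", device_map="auto",
)

# Any local photo works here; dogs.jpg is a placeholder.
inputs = processor.process(
    images=[Image.open("dogs.jpg")],
    text="How many dogs in this image have their tongues out?",
)
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)

# Decode only the newly generated tokens, skipping the prompt.
new_tokens = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(new_tokens, skip_special_tokens=True))
```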
"We're targeting all these researchers, developers, and app developers who don't even know how to deal with these massive models. A key principle in targeting such an enormous range of audience is one that we have been pushing for a while, which is: make it more accessible," Farhadi said. "We are releasing everything that we have done. This includes data, cleaning, annotations, training, code, checkpoints, evaluation; everything about it that we have developed."
He expects people to begin building with this dataset and code right away, including some of his better-off competitors, who vacuum up any "publicly available" data—anything that's not nailed down. ("Whether they mention it or not is a whole different story," he added.)
The world of AI moves at the speed of light, and increasingly the giants find themselves racing to the bottom, pushing prices as low as they can go while raising hundreds of millions to cover the cost. How can a company justify such an astronomical valuation when free, open source alternatives offer comparable capabilities? At the least, Molmo has shown that, while there may be legitimate questions about whether the emperor has clothes, he definitely doesn't have a moat.