AI2 has open-sourced its text-generating AI models, along with the data used to train them.


The nonprofit AI research institute, founded by the late Microsoft co-founder Paul Allen, is releasing several GenAI language models that it claims are more "open" than others and, importantly, licensed in such a way that developers can use them largely unhindered for training, experimentation and even commercialization.

Named OLMo, an acronym for "open language models," the models and the dataset used to train them, Dolma — one of the largest public datasets of its kind — were designed to study the high-level science behind text-generating AI, according to AI2 senior software engineer Dirk Groeneveld.

Open" is a very overloaded word when it comes to text-generating models, Groeneveld said in an email interview with TechCrunch. We expect researchers and practitioners will seize the OLMo framework as an opportunity to analyze a model trained on one of the largest public data sets released to date, including all components necessary for building the models.

Open source text-generating models are coming out of the woodwork these days, from Meta to Mistral, with vendors releasing highly capable models that any developer can take and fine-tune. But Groeneveld contends that many of these models can't really be considered open, because they were trained "behind closed doors" on proprietary, opaque datasets.

The open-source OLMo models, by contrast, which were created with partners including Harvard, AMD and Databricks, ship with the code used to produce their training data as well as training and evaluation metrics and logs.
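For researchers who want to inspect the training data itself, a minimal sketch of sampling Dolma through Hugging Face's datasets library might look like the following. The identifier "allenai/dolma" and the per-record "text" field are assumptions; check AI2's Hugging Face organization for the actual dataset name, configurations and any access requirements.

```python
# Minimal sketch: streaming a few documents from the Dolma dataset.
# Assumptions: the dataset is published as "allenai/dolma" on Hugging Face
# and exposes a "text" field per record; verify both before relying on this.
from datasets import load_dataset

# streaming=True avoids downloading the full multi-terabyte corpus up front
dolma = load_dataset("allenai/dolma", split="train", streaming=True)

for i, record in enumerate(dolma):
    print(record["text"][:200])  # first 200 characters of each document
    if i >= 2:  # stop after a few samples
        break
```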

Groeneveld asserts that the top-performing OLMo model is "a promising and strong" alternative to Meta's Llama 2, depending on the application. On some benchmarks, particularly those involving reading comprehension, OLMo 7B edges out Llama 2; on others, question-answering tests specifically, it falls narrowly behind.

The OLMo models have other limitations, like low-quality outputs in languages other than English (Dolma mostly contains English-language content, after all) and weak code-generating capabilities. But Groeneveld stressed that these are early days.

"OLMo is not built to be multilingual yet," he said. "[And while] at this stage, the primary focus of the OLMo framework wasn't code generation, to give a head start to future code-based fine-turning projects, OLMo's data mix currently contains about 15% code."

I asked Groeneveld whether he worried that the OLMo models, which can be used commercially and are performant enough to run on consumer GPUs like the Nvidia 3090, might be exploited in unintended, perhaps nefarious ways by bad actors. A recent analysis by Democracy Reporting International's Disinfo Radar project, which tracks and responds to disinformation trends and technologies, found that two popular open text-generating models, Hugging Face's Zephyr and Databricks' Dolly, reliably produce toxic content, answering malicious prompts with "imaginative" hateful content.

Groeneveld maintains that, in the long run, the benefits outweigh the harms.

"Building this open platform will actually facilitate more research on how these models can be dangerous and what we can do to fix them," he said. "Yes, it's possible open models may be used inappropriately or for unintended purposes.". This approach, as well, promotes technical advances leading to more ethical models; it is prerequisite for verification and reproducibility, as these can be available only in full-stack openness; and reduce a growing concentration of power, creating equitable access.

In the coming months, AI2 plans to release larger, more capable OLMo models, including multimodal models (models that understand modalities beyond text) and additional datasets for training and fine-tuning. As with the initial OLMo and Dolma release, all resources will be made freely available on GitHub and the AI project hosting platform Hugging Face.
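As a rough illustration of what availability on Hugging Face means in practice, here is a sketch of loading an OLMo checkpoint with the transformers library. The identifier "allenai/OLMo-7B" is an assumption; consult AI2's model cards for the published name and any custom-code requirements.

```python
# Minimal sketch: generating text with an OLMo checkpoint via transformers.
# Assumption: the model is published as "allenai/OLMo-7B"; early releases may
# require trust_remote_code=True to load custom model classes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision fits a 24 GB card like the 3090
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("Language modeling is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```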
