Runware leverages custom hardware and advanced orchestration techniques to deliver fast AI inference.

Sometimes a demo is all you need to understand a product, and that's certainly the case with Runware. Head over to its website, enter a prompt, hit enter to generate an image, and be surprised at just how quickly Runware produces it: in less than a second.

Runware is one of the newest players in the AI inference, or generative AI, arena. The company builds its own servers and optimizes the software layer on them to eliminate bottlenecks and increase inference speed for image generation models. The startup has already raised $3 million from Andreessen Horowitz's Speedrun, LakeStar's Halo II, and Lunar Ventures.

The company doesn't want to reinvent the wheel. It just wants to make it spin faster. Behind the scenes, Runware manufactures its own servers with as many GPUs as possible on the same motherboard. It has its own custom-made cooling system and manages its own data centers.

Runware has optimized the orchestration layer, with BIOS and operating-system tweaks that give its servers quick cold-start times when spinning up AI models. It has also developed its own algorithms for allocating inference workloads across those servers.
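Runware hasn't published how its scheduler works, but the basic idea of routing requests to GPUs that already hold the requested model in memory, and therefore avoid a cold start, can be sketched roughly as follows. Every name and heuristic here is an illustrative assumption, not Runware's implementation:

```python
# Illustrative sketch only: a toy scheduler that prefers GPUs which already
# have the requested model resident in memory, falling back to the least
# loaded GPU otherwise. Names and heuristics are assumptions, not Runware's code.
from dataclasses import dataclass, field


@dataclass
class GPU:
    gpu_id: int
    resident_models: set = field(default_factory=set)  # models currently in VRAM
    queued_requests: int = 0                            # crude load metric


def pick_gpu(gpus: list[GPU], model_name: str) -> GPU:
    """Prefer a GPU that avoids a cold start (model already loaded)."""
    warm = [g for g in gpus if model_name in g.resident_models]
    candidates = warm if warm else gpus
    best = min(candidates, key=lambda g: g.queued_requests)
    if model_name not in best.resident_models:
        best.resident_models.add(model_name)  # would trigger a model load/swap
    best.queued_requests += 1
    return best


if __name__ == "__main__":
    fleet = [GPU(0, {"flux-dev"}), GPU(1, {"sdxl"}), GPU(2)]
    print(pick_gpu(fleet, "sdxl").gpu_id)  # routes to GPU 1, which is already warm
```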

The demo is impressive in itself. Now the company wants to turn all that research and development work into a business.

Unlike most GPU hosting companies, Runware won't rent its GPUs by the hour or charge for GPU time. It believes the right incentive is to speed up workloads. That's why it's launching an image generation API, built on popular AI models from Flux and Stable Diffusion, with the more familiar cost-per-API-call fee structure.
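For readers unfamiliar with the pricing model, a cost-per-call image generation request generally looks something like the sketch below. The endpoint, payload fields, and authentication header are placeholders for illustration, not Runware's documented API:

```python
# Hypothetical example of a cost-per-API-call image generation request.
# The endpoint, payload fields, and auth header are placeholders, not
# Runware's documented API.
import requests

API_URL = "https://api.example.com/v1/image/generate"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                                # placeholder credential

payload = {
    "model": "flux-dev",                 # e.g. a Flux or Stable Diffusion model
    "prompt": "a lighthouse at dusk, oil painting",
    "width": 1024,
    "height": 1024,
    "steps": 28,
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # billed per call, regardless of how long the GPU ran
```

Under this structure, the provider, not the customer, absorbs the cost of slow inference, which is why faster generation directly improves Runware's margins.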

"If you look at Together AI, Replicate, Hugging Face — all of them — they are selling compute based on GPU time," Co-founder and CEO Flaviu Radulescu told TechCrunch. "If you compare the amount of time it takes for us to make an image versus them. And then you compare the pricing, you will see that we are so much cheaper, so much faster."

"It's going to be impossible for them to match this performance," he said. "Especially in a cloud provider, you have to run on a virtualized environment, which adds additional delays."

As Runware looks at the entire inference pipeline and optimizes both the hardware and the software, it hopes to leverage GPUs from multiple vendors in the not-so-distant future. That's an important effort for many startups, given how substantial a lead Nvidia holds in the GPU arena, which keeps Nvidia's GPUs rather pricey.

"Right now, we use just Nvidia GPUs. But this should be an abstraction of the software layer," Radulescu said. "We can switch a model from GPU memory in and out very, very fast, which allow us to put multiple customers on the same GPUs."

"So we are not like our competitors. They just load a model into the GPU and then the GPU does a very specific type of task. In our case, we've developed this software solution, which allow us to switch a model in the GPU memory as we do inference."

If AMD and the other GPU vendors can develop compatibility layers that work for typical AI workloads, Runware would be well positioned to build a hybrid cloud that relies on GPUs from multiple providers. And that would certainly help if it wants to remain cheaper than its peers at AI inference.
