One of the most commonly used techniques to make AI models more efficient, quantization, has limits, and the industry may be fast approaching them.
In the context of AI, quantization means reducing the number of bits the computer uses to represent information. Think of it this way: when someone asks you what time it is, you'd probably say "noon," not "oh twelve hundred, one second, and four milliseconds." Both answers are correct, but one is more precise than the other. The question is how much precision you actually need in a given context.
AI models are made up of many components that can be quantized, in particular parameters, the internal variables models use to make predictions or decisions. That's handy, considering that models perform millions of calculations when they run. Quantized models with fewer bits representing their parameters are less demanding mathematically, and therefore computationally. (To be clear, this is a different process from "distillation," which is a more involved and selective pruning of parameters.)
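For the technically inclined, here's a minimal sketch of the idea, not drawn from any particular model or library: 32-bit floating-point weights get mapped to 8-bit integers with a single scale factor, which cuts memory use roughly fourfold at the cost of a small rounding error.

```python
import numpy as np

# Pretend these are a model's parameters, stored at 32-bit precision.
weights = np.random.randn(1_000_000).astype(np.float32)

# Symmetric 8-bit quantization: map each weight to an integer in [-127, 127].
scale = np.abs(weights).max() / 127
quantized = np.round(weights / scale).astype(np.int8)

# Dequantize to approximate the original values when the model runs.
restored = quantized.astype(np.float32) * scale

print(weights.nbytes // quantized.nbytes)   # 4x less memory
print(np.abs(weights - restored).max())     # small per-weight rounding error
```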
But quantization may have more trade-offs than previously assumed.
Ever-shrinking model
According to a new paper from researchers at Harvard, Stanford, MIT, Databricks, and Carnegie Mellon, quantized models perform worse if the original, unquantized version of the model was trained over a long period on lots of data; the memory savings, quantization's main draw, come at a cost. In other words, at a certain point it may actually be better to just train a smaller model rather than cook down a big one.
This could spell bad news for AI companies that train extremely large models (known to improve answer quality) and then quantize them in an effort to make them less expensive to serve.
The effects are already manifesting. A few months ago, developers and academics reported that quantizing Meta's Llama 3 model tended to be "more harmful" compared to other models, potentially because of the way it was trained.
"In my opinion, the number one cost for everybody in AI is and will continue to be inference, and our work shows one important way to reduce it will not work forever," said Tanishq Kumar, a Harvard mathematics student and the first author of the paper.
Ironically, AI model inference (running a model, as when ChatGPT answers a question) is often more expensive in aggregate than model training. In fact, Google reportedly spent around $191 million to train one of its flagship Gemini models, certainly a princely sum. But if the company were to use a model to generate just 50-word answers to half of all Google Search queries, it would spend roughly $6 billion annually.
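To give a sense of how an estimate like that comes together, here's a back-of-envelope sketch. Every input below, the daily search volume, the tokens-per-word ratio, and the per-token serving price, is an assumption chosen for illustration, not a figure reported by Google or the researchers.

```python
# Back-of-envelope inference cost estimate; all inputs are illustrative assumptions.
queries_per_day = 8.5e9 / 2        # assume ~8.5B daily searches, with the model answering half
words_per_answer = 50              # 50-word answers, per the scenario above
tokens_per_word = 1.33             # rough rule of thumb (~750k words per 1M tokens)
price_per_million_tokens = 60.0    # hypothetical serving cost in dollars per million generated tokens

tokens_per_year = queries_per_day * 365 * words_per_answer * tokens_per_word
annual_cost = tokens_per_year / 1e6 * price_per_million_tokens
print(f"~${annual_cost / 1e9:.1f}B per year")   # highly sensitive to every assumption above
```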
The large AI labs have embraced training models on vast datasets on the assumption that "scaling up," that is, increasing the amount of data and compute used in training, will lead to increasingly more capable AI.
For example, Meta trained Llama 3 on 15 trillion tokens. (Tokens represent bits of raw data; 1 million tokens is roughly equal to 750,000 words.) The previous generation of Llama was trained on just 2 trillion tokens.
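Using that rough conversion, the jump in scale looks like this (the 0.75 words-per-token ratio is itself only an approximation):

```python
# Convert token counts to approximate word counts using the rule of thumb above.
WORDS_PER_TOKEN = 750_000 / 1_000_000   # ~0.75 words per token, a rough approximation

llama3_tokens = 15e12   # Llama 3's reported training set
llama2_tokens = 2e12    # the previous generation's

print(f"Llama 3: ~{llama3_tokens * WORDS_PER_TOKEN / 1e12:.2f} trillion words")
print(f"Llama 2: ~{llama2_tokens * WORDS_PER_TOKEN / 1e12:.2f} trillion words")
```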
Evidence suggests that scaling up eventually offers diminishing returns; Anthropic and Google reportedly trained enormous models recently that fell short of internal benchmark expectations. But there is little sign that the industry is ready to meaningfully move away from these entrenched scaling approaches.
How precise, exactly?
So, if labs are reluctant to train models on smaller datasets, is there a way models could be made less susceptible to degradation? Possibly. Kumar says that he and his co-authors found that training models in "low precision" can make them more robust. Bear with us for a moment as we dive in a bit.
"Precision" in this context means the number of digits a numeric data type can represent exactly. A data type is a set of data values, typically defined by a set of possible values and allowed operations; for example, data type FP8 uses only 8 bits to represent a floating-point number.
Most models today are trained at 16-bit or "half precision" and "post-train quantized" to 8-bit precision. Certain model components (e.g., its parameters) are converted to a lower-precision format at the cost of some accuracy. Think of it as doing the math to a few decimal places but then rounding off to the nearest tenth, often giving you the best of both worlds.
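To make that concrete, here's one common way to post-train quantize a network's weights to 8 bits, sketched with PyTorch's dynamic quantization API on a toy model. The paper's authors and the big labs have their own pipelines, so treat this purely as an illustration.

```python
import torch
import torch.nn as nn

# A toy stand-in for a much larger network, kept at full precision.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Post-training dynamic quantization: Linear weights are stored as 8-bit integers
# and dequantized on the fly at inference time; no retraining is involved.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
with torch.no_grad():
    print(model(x)[0, :3])      # full-precision outputs
    print(quantized(x)[0, :3])  # close, but not identical, outputs from int8 weights
```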
Hardware vendors like Nvidia are pushing for ever-lower precision in quantized model inference. The company's new Blackwell chip is said to support 4-bit precision via a new data type called FP4; Nvidia has pitched this as a boon for memory- and power-constrained data centers.
However, extremely low quantization precision may not be desirable. According to Kumar, unless the original model is incredibly large in terms of its parameter count, models quantized to precisions lower than 7 or 8 bits may see a noticeable step down in quality.
If all this sounds rather technical, don't worry, because it is. But the moral of the story is simply that AI models are not fully understood, and known shortcuts that work for many kinds of computation don't work here. You wouldn't say "noon" if someone asked when they started a 100-meter dash, would you? It's not quite as obvious as that, of course, but the idea is the same:
"The key point of our work is that there are some limitations you can't get around in a naïve way," Kumar concluded. "We hope our work adds nuance to the discussion that often seeks increasingly low precision defaults for training and inference."
Kumar acknowledges that his and his colleagues' study was at a relatively small scale; they plan to test it with more models in the future. Still, he believes that at least one insight will hold: there's no free lunch when it comes to reducing inference costs.
"Bit precision matters, and it's not free," he said. "You cannot reduce it forever without models suffering.". Models are capped in capacity, so instead of attempting to stuff a quadrillion tokens into a tiny model, in my view far more work will go into closely curating and filtering the data so that just the very best is fed into much smaller models. New architectures explicitly aiming to stabilize low-precision training will, I hope, prove important in the near future.