Navigating the Complexities of LLM Quantization: Techniques, Trade-offs, and Real-World Implications
The latest advancements and challenges in LLM compression and quantization techniques to optimize latency, model size, and deployment costs for LLM applications
In our last blog we talked about the challenges of building an LLM product. Two of the challenges we mentioned were that it takes time to get a response from these models, and that their large size often means you have to deploy them across multiple GPUs.
One method people are using to solve these challenges is to make the models smaller with model compression techniques like quantization, pruning, and knowledge distillation. In this article we want to talk about the current state of LLM compression. The key takeaways are as follows:
Compression can reduce the size of your neural network by 4x with very little drop in perplexity and other quality metrics
However, compression may not actually provide the latency savings you expect
LLMs with 10B+ parameters show a decrease in quality when quantizing activations (which is what helps reduce latency)
The effects of LLM compression on bias are not known. If previous work on smaller models holds up for really large LLMs, then compression will have adverse effects
Using a smaller LLM finetuned on your specific task may be better than compressing a really large LLM.
Why do we need to compress LLMs
Getting a response from an LLM is a computationally intensive task, and execution time can be quite long. This is problematic in real-time applications where immediate responses are required. By compressing LLMs, we can significantly reduce their execution time, allowing for faster responses and the ability to handle more load.
LLMs are incredibly large, with billions or even hundreds of billions of parameters. Many of these models require multiple GPUs to run efficiently (some will not even load on 1 GPU). Compressing LLMs can significantly reduce their size, making them more deployable on less powerful devices, even mobile phones!
Serving LLMs requires multiple GPUs, leading to communication bottlenecks between them. By compressing LLMs, we can reduce the amount of memory required to execute these models and alleviate communication bottlenecks.
By optimizing our models so that they can be deployed on less powerful machines (which in this context can still mean machines with a GPU), we can reduce deployment costs, which are a major bottleneck as LLM products grow in complexity, and make it possible to use these models in a wider range of contexts.
So far three main compression techniques have been applied to LLMs: quantization, pruning and knowledge distillation.
Quantization reduces the number of bits used to represent parameters, which can significantly shrink the model's size and memory footprint. Pruning removes unnecessary parameters from the model, reducing its size while trying to maintain accuracy. Distillation trains a smaller model to mimic the behavior of a larger one, again reducing size while trying to maintain accuracy.
Reducing the number of parameters in a model (by any technique) generally comes at a cost to quality metrics like perplexity and accuracy. The goal of all these techniques is to shrink the model while minimizing that drop in quality. In this blog we will focus on quantization, which is the most popular compression technique for LLMs.
In the realm of quantization, there are two primary categories: weight-only quantization, where only the precision of the weights is reduced, and combined weight-and-activation quantization, where the precision of both weights and activations is reduced. With weight-only quantization, the weights must be dequantized during inference; failing to do so causes the activations to grow larger with each successive layer.
On the other hand, if we manage to quantize both the activations and the weights, the entire network can be executed at lower precision. This offers numerous benefits, including reduced model size, decreased memory consumption, and shorter inference time. Conversely, while weight-only quantization does lead to a smaller model, the need to dequantize weights during inference prevents a reduction in memory consumption, and in some cases may even increase latency due to the dequantization overhead.
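To make the weight-only round trip concrete, here is a minimal numpy sketch (an illustration, not any particular paper's scheme) of symmetric per-tensor INT8 quantization, with the weights dequantized back to fp32 before the matmul:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: int8 weights plus one fp32 scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an fp32 approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)   # toy weight matrix
x = rng.normal(size=(1, 4)).astype(np.float32)   # toy activation

q, scale = quantize_int8(w)
y_ref = x @ w                       # full-precision reference
y_wq = x @ dequantize(q, scale)     # weight-only quantization path
wq_err = float(np.abs(y_ref - y_wq).max())
```

The stored model only keeps `q` and `scale` (roughly a 4x size reduction from fp32), but the matmul itself still runs in fp32, which is why weight-only quantization alone does not reduce compute or activation memory.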
LLM.int8() was one of the first papers that outlined the challenges of deploying LLMs and suggested quantization as a way to reduce model size. Their quantization scheme was able to bring models down to INT8 precision with very little change in perplexity scores. However, while they were able to reduce model size, their scheme had an overhead that increased model latency. You can actually use this quantization technique in Hugging Face Transformers, thanks to the integration of bitsandbytes, a project by Tim Dettmers (one of the authors of the paper).
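The core trick in LLM.int8() is mixed-precision decomposition: activation feature dimensions containing outliers stay in higher precision, while everything else goes through an INT8 path. Below is a simplified numpy sketch of that idea, with per-tensor scales standing in for the paper's vector-wise scheme and fp32 standing in for fp16:

```python
import numpy as np

def mixed_precision_matmul(x, w, threshold=6.0):
    """Sketch of LLM.int8()-style mixed-precision decomposition (simplified):
    feature dimensions of x with outliers above `threshold` stay in fp32,
    the rest go through a symmetric INT8 path."""
    outlier_cols = np.abs(x).max(axis=0) > threshold          # find outlier feature dims
    x_out, w_out = x[:, outlier_cols], w[outlier_cols, :]     # high-precision part
    x_reg, w_reg = x[:, ~outlier_cols], w[~outlier_cols, :]   # int8 part

    sx = np.abs(x_reg).max() / 127.0
    sw = np.abs(w_reg).max() / 127.0
    qx = np.round(x_reg / sx).astype(np.int8)
    qw = np.round(w_reg / sw).astype(np.int8)

    # int8 matmul accumulated in int32, then rescaled back to fp32
    y_int8 = (qx.astype(np.int32) @ qw.astype(np.int32)) * (sx * sw)
    return y_int8 + x_out @ w_out

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8)).astype(np.float32)
x[0, 3] = 12.0                      # inject one outlier feature dimension
w = rng.normal(size=(8, 4)).astype(np.float32)
mp_err = float(np.abs(mixed_precision_matmul(x, w) - x @ w).max())
```

Separating out the outlier columns is what keeps the INT8 scales small; quantizing everything with one scale would let a single outlier wash out the precision of all other features.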
The GPTQ paper took it a step further and quantized LLMs to as low as 3-bit precision. They found that as the model size gets larger, the precision loss due to quantization has a negligible effect on the quality of the model. This is an interesting result, and we will discuss its implications more in the next section. However, similar to the LLM.int8() paper, they do not quantize activations, meaning that while we see savings in model size, we don't see any savings in latency or memory consumption during execution.
One of the reasons for a drop in quality with quantization is the loss in precision of the weights. The RPTQ method tries to limit this loss by clustering similar weights and activations before quantizing. This allowed the authors to quantize both activations and weights to 3 bits. To implement this approach, they used a calibration dataset to determine the maximum and minimum values of each activation channel, grouped the channels into clusters with the KMeans algorithm based on those values, and then rearranged the channels so that members of the same cluster sit next to each other and can share the same quantization parameters. This reduced the quantization precision loss. However, it's unclear whether this adds significant overhead to inference, as the authors did not discuss the latency of their quantized models in depth.
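A toy numpy sketch of the reordering idea: collect (min, max) calibration statistics per channel, cluster the channels (a tiny hand-rolled two-cluster k-means here, standing in for the paper's KMeans setup), and derive a permutation that places same-cluster channels next to each other:

```python
import numpy as np

def cluster_channels(acts, iters=10):
    """RPTQ-style channel reordering sketch: group activation channels by
    their calibration (min, max) statistics with a tiny 2-cluster k-means,
    then return a permutation placing same-cluster channels next to each
    other so they can share quantization parameters."""
    feats = np.stack([acts.min(axis=0), acts.max(axis=0)], axis=1)  # (channels, 2)
    ranges = feats[:, 1] - feats[:, 0]
    # deterministic init: one center on the narrowest channel, one on the widest
    centers = feats[np.array([ranges.argmin(), ranges.argmax()])].copy()
    for _ in range(iters):
        dists = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)            # assign channels to clusters
        for j in range(2):
            if (labels == j).any():
                centers[j] = feats[labels == j].mean(axis=0)
    return np.argsort(labels, kind="stable")     # same-cluster channels adjacent

# Calibration activations: 6 channels, two of them with a much wider range.
calib = np.random.default_rng(1).normal(size=(128, 6)).astype(np.float32)
calib[:, [1, 4]] *= 10.0
order = cluster_channels(calib)   # the two wide channels end up adjacent
```

Grouping channels with similar ranges means each shared scale only has to cover values of a similar magnitude, which is where the precision win comes from.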
Another project I have been following closely is a great example of INT4 quantization in practice. Its quantization spec is really simple and was hacked together overnight: groups of 32 consecutive weights share a scaling factor and a zero offset. You can get better performance with a variable zero offset (which it supports). It can shrink the 6B-parameter GPT-J model from 11543 MB down to 3610 MB, reducing inference time on an M1 Pro MacBook from 125ms to 46ms!
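Here is a numpy sketch of that group-wise scheme. This is illustrative rather than the project's actual packed format: real implementations pack two 4-bit values per byte, while this stores one value per uint8 for clarity.

```python
import numpy as np

GROUP = 32  # consecutive weights sharing one scale and zero offset

def quantize_q4(w):
    """Group-wise 4-bit quantization: each group of 32 consecutive weights
    gets its own scale and zero offset (asymmetric quantization)."""
    groups = w.reshape(-1, GROUP)
    lo = groups.min(axis=1, keepdims=True)           # zero offset per group
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0                         # 4 bits -> 16 levels (0..15)
    scale[scale == 0] = 1.0                          # guard constant groups
    q = np.clip(np.round((groups - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_q4(q, scale, lo, shape):
    return (q.astype(np.float32) * scale + lo).reshape(shape)

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q, scale, lo = quantize_q4(w)
w_hat = dequantize_q4(q, scale, lo, w.shape)
q4_err = float(np.abs(w - w_hat).max())   # bounded by half a step per group
```

Small groups are the key design choice: a per-group scale only has to cover 32 values, so one unusually large weight can only hurt the precision of its own group rather than the whole tensor.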
Compressing LLMs with quantization can bring about several benefits. However, there are also some challenges associated with this process.
Large LLMs vs Small LLMs
The biggest challenge is the unintuitive results we get when comparing weight quantization (WQ) with weight-and-activation quantization (WAQ), which have different impacts on model accuracy. INT8 WQ leads to very little loss in accuracy, particularly for large models. INT4 WQ, on the other hand, disproportionately affects smaller models.
However, WAQ can result in a larger degradation in performance for larger models. In some cases, the accuracy degradation may be higher than the advantage of using a larger model! Therefore, it is essential to carefully evaluate the trade-offs between these two techniques when compressing LLMs. You can read more about this here.
Quantization Aware Training
In TinyML, another field where model compression is a key part of deploying applications, Quantization Aware Training (QAT) is a common technique for recovering lost accuracy. QAT works by simulating the loss of precision caused by quantization during the training process. However, this can be difficult, expensive, and time-consuming, especially for really large LLMs.
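A toy numpy sketch of the fake-quantization idea behind QAT: the forward pass rounds weights to a uniform grid, while the gradient is applied straight through to the full-precision weights (the straight-through estimator). Everything here, including the tiny linear-regression "model", is a made-up illustration.

```python
import numpy as np

def fake_quant(w, bits=4):
    """Simulate quantization in the forward pass: round to a uniform grid."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    if scale == 0:
        return w
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w_true = rng.normal(size=(3, 1))
x = rng.normal(size=(256, 3))
y = x @ w_true

# QAT loop: the forward pass sees quantized weights, but the update is
# applied straight through to the full-precision copy.
w = np.zeros((3, 1))
for _ in range(200):
    err = x @ fake_quant(w) - y
    grad = x.T @ err / len(x)   # gradient w.r.t. the quantized weight ...
    w -= 0.1 * grad             # ... applied to the full-precision weight

qat_loss = float(np.mean((x @ fake_quant(w) - y) ** 2))
```

Because training already "sees" the rounding error, the full-precision weights drift to positions whose quantized values fit the data well, which is exactly the accuracy that post-training quantization gives up.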
Compressing LLMs can lead to an increase in bias. When we remove parameters from the model, we risk losing some of the diversity in the data. This can result in a biased model that may not perform well in real-world scenarios. To the best of our knowledge, there has not been a study on the effects of quantization on LLM bias. Previous works have shown that compressed models tend to amplify existing algorithmic bias and disproportionately impact performance on underrepresented features. Another work on BERT-based models has shown similar results. They also suggest methods to mitigate these issues.
For generative LLMs, compression can lead to a loss of vocabulary and richness in the output. As we reduce the number of parameters in the model, we risk losing some of the nuances and details in the data. This can result in less expressive and less accurate output.
Quantization Time and Latency
LLMs are so large that it can take a few hours to quantize some of these models. Even though quantization is a one-time activity, it is still computationally intensive and may need access to GPUs to run quickly. The results below show the time it took to quantize models using GPTQ on an Nvidia A100 GPU. Even the smallest model took nearly 3 minutes to quantize!
Finally, it is essential to note that quantizing a model sometimes yields no latency reduction at all, and can even increase latency. Inference times vary with the model's architecture and your hardware.
Quantizing large LLMs vs finetuning or training a smaller LLM
With these unintuitive compression effects, you might be wondering if it makes sense to compress a 10B+ parameter LLM at all. I think more work needs to be done to resolve the issues surrounding quantization and to get better hardware support. Other compression methods, like knowledge distillation (e.g., Alpaca), show more promising results than quantization.
While recent methods like RPTQ have shown ways to reduce the drop in perplexity of compressed 10B+ LLMs, the effects of compression on bias are still unknown. I think a better approach is to combine finetuning with quantization. A pattern I have seen some people use is to quantize a medium-sized LLM (<10B parameters) and then finetune it with LoRA. This approach should help with the aforementioned issues.
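To illustrate the quantize-then-finetune pattern, here is a toy numpy sketch: a frozen INT8 base weight plus trainable low-rank LoRA factors, fit to a synthetic low-rank "task" shift. In practice you would use a library such as Hugging Face's peft rather than hand-rolled gradients; every name and number below is made up for the illustration.

```python
import numpy as np

def lora_forward(x, w_q, scale, A, B):
    """Forward pass: frozen INT8 base weight plus a trainable low-rank update."""
    return x @ (w_q.astype(np.float32) * scale + A @ B)

rng = np.random.default_rng(0)
d, r = 8, 2                                    # hidden size, LoRA rank
w = rng.normal(size=(d, d)).astype(np.float32)

# Freeze a symmetric INT8 copy of the base weight.
scale = np.abs(w).max() / 127.0
w_q = np.round(w / scale).astype(np.int8)

# The "task" shifts the weight by a low-rank perturbation the adapters can learn.
u = (0.3 * rng.normal(size=(d, r))).astype(np.float32)
v = (0.3 * rng.normal(size=(r, d))).astype(np.float32)
x = rng.normal(size=(512, d)).astype(np.float32)
y = x @ (w + u @ v)

# LoRA adapters: B starts at zero so training begins from the quantized base.
A = (0.1 * rng.normal(size=(d, r))).astype(np.float32)
B = np.zeros((r, d), dtype=np.float32)

loss_before = float(np.mean((lora_forward(x, w_q, scale, A, B) - y) ** 2))
for _ in range(300):
    err = lora_forward(x, w_q, scale, A, B) - y
    g = x.T @ err / len(x)       # gradient w.r.t. the effective weight
    gA, gB = g @ B.T, A.T @ g    # only the adapters receive gradients
    A -= 0.05 * gA
    B -= 0.05 * gB
loss_after = float(np.mean((lora_forward(x, w_q, scale, A, B) - y) ** 2))
```

The base weight stays compressed and untouched throughout; only the small `A` and `B` matrices (2 x 8 values each here) are trained, which is what makes this pattern cheap enough to run after quantization.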