LLM Optimization
PyTorch 2.0
torch.compile
Depending on the model and the GPU, torch.compile() yields up to a 30% speed-up during inference. To use torch.compile(), simply install torch 2.0 or above.
Underpinning torch.compile are new technologies: TorchDynamo, AOTAutograd, PrimTorch and TorchInductor.
- TorchDynamo captures PyTorch programs safely using Python Frame Evaluation Hooks; it is a significant innovation resulting from 5 years of R&D into safe graph capture.
- AOTAutograd overloads PyTorch’s autograd engine as a tracing autodiff for generating ahead-of-time backward traces.
- PrimTorch canonicalizes ~2000+ PyTorch operators down to a closed set of ~250 primitive operators that developers can target to build a complete PyTorch backend. This substantially lowers the barrier of writing a PyTorch feature or backend.
- TorchInductor is a deep learning compiler that generates fast code for multiple accelerators and backends. For NVIDIA and AMD GPUs, it uses OpenAI Triton as a key building block.
Imagine the GPU as this super massive factory with a ridiculous amount of compute available. Then, imagine the CPU as some messenger shuttling instructions back and forth to the GPU. Remember, in large scale deep learning systems, the GPU is responsible for doing 100% of the work! In such systems, the only role of the CPU is to tell the GPU what work it should be doing.
So, the CPU runs over and tells the GPU to do an “add”, but by the time the CPU can give the GPU another chunk of work, the GPU has long finished the previous chunk of work.
Even though the GPU performs thousands of computations while the CPU only does orchestration work, this situation (the GPU sitting idle while it waits for the CPU to hand it the next kernel) is surprisingly common! There are a variety of reasons for this, ranging from the CPU likely running single-threaded Python to the fact that GPUs are just incredibly fast nowadays.
Some Benchmarks
A100
Task/Model | Batch Size | torch 2.0 - no compile (ms) | torch 2.0 - compile (ms) |
---|---|---|---|
Image Classification/ConvNeXT | Unbatched | 11.758 | 7.335 |
Image Classification/ConvNeXT | 4 | 23.171 | 21.490 |
Image Classification/ResNet | Unbatched | 7.435 | 3.801 |
Image Classification/ResNet | 4 | 7.261 | 2.187 |
Object Detection/Conditional-DETR | Unbatched | 32.823 | 11.627 |
Object Detection/Conditional-DETR | 4 | 50.622 | 33.831 |
Image Segmentation/MobileNet | Unbatched | 9.869 | 4.244 |
Image Segmentation/MobileNet | 4 | 14.385 | 7.946 |
T4
Task/Model | Batch Size | torch 2.0 - no compile (ms) | torch 2.0 - compile (ms) |
---|---|---|---|
Image Classification/ConvNeXT | Unbatched | 32.137 | 31.84 |
Image Classification/ConvNeXT | 4 | 120.944 | 110.209 |
Image Classification/ResNet | Unbatched | 9.761 | 7.698 |
Image Classification/ResNet | 4 | 15.215 | 13.871 |
Object Detection/Conditional-DETR | Unbatched | 72.150 | 57.660 |
Object Detection/Conditional-DETR | 4 | 301.494 | 247.543 |
Image Segmentation/MobileNet | Unbatched | 22.266 | 19.339 |
Image Segmentation/MobileNet | 4 | 78.311 | 50.983 |
https://pytorch.org/get-started/pytorch-2.0/#user-experience
Usage
We’re going to test torch.compile on google/vit-large-patch32-384.
Colab Notebook: https://colab.research.google.com/drive/133DghyCIABxvYsQ5LV7TkqDsXKJMgwtP?usp=sharing
We will use the PyTorch profiler to see exactly what is going on during inference.
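A minimal profiling sketch, assuming the google/vit-large-patch32-384 checkpoint and a dummy 384x384 input (the exact notebook code may differ):

```python
# Profile a single forward pass of the ViT model on GPU.
import torch
from torch.profiler import profile, ProfilerActivity
from transformers import ViTForImageClassification

device = "cuda"
model = ViTForImageClassification.from_pretrained("google/vit-large-patch32-384").to(device).eval()
pixel_values = torch.randn(1, 3, 384, 384, device=device)  # dummy 384x384 RGB batch

with torch.no_grad(), profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    model(pixel_values)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("vit_raw_trace.json")  # viewable at ui.perfetto.dev or chrome://tracing
```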
And benchmark the model with a simple timing loop:
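A rough latency sketch, reusing `model` and `pixel_values` from the profiling snippet above (a plain wall-clock loop, not the exact notebook benchmark):

```python
import time
import torch

def benchmark(model, inputs, n_warmup=5, n_iters=30):
    with torch.no_grad():
        for _ in range(n_warmup):        # warm-up: first calls include one-time costs
            model(inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(inputs)
        torch.cuda.synchronize()         # wait until all queued GPU work is done
    return (time.perf_counter() - start) / n_iters * 1000  # ms per iteration

print(f"eager: {benchmark(model, pixel_values):.2f} ms/iter")
```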
Compiling the model is as simple as a single torch.compile() call:
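A sketch continuing the snippets above (the `mode` argument is explained right after):

```python
compiled_model = torch.compile(model, mode="reduce-overhead")

# The first call triggers compilation, so run it once before timing.
with torch.no_grad():
    compiled_model(pixel_values)

print(f"compiled: {benchmark(compiled_model, pixel_values):.2f} ms/iter")
```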
"reduce-overhead" is a mode that reduces the overhead of Python with CUDA graphs, useful for small batches. Reduction of overhead can come at the cost of more memory usage, as we will cache the workspace memory required for the invocation so that we do not have to reallocate it on subsequent runs.
RAW Model
The generated trace files (.json) can be viewed at https://ui.perfetto.dev/ or chrome://tracing/
Compiled Model
Flash Attention
FlashAttention-2 is a faster and more efficient implementation of the standard attention mechanism that can significantly speed up inference by:
- additionally parallelizing the attention computation over sequence length
- partitioning the work between GPU threads to reduce communication and shared memory reads/writes between them
FlashAttention-2 supports inference with Llama, Mistral, and Falcon models.
FlashAttention-2 currently supports:
- Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100). Support for Turing GPUs (T4, RTX 2080) is coming soon, please use FlashAttention 1.x for Turing GPUs for now.
- Datatype fp16 and bf16 (bf16 requires Ampere, Ada, or Hopper GPUs).
- All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800.
Make sure to install the latest flash-attn package, e.g. `pip install flash-attn --no-build-isolation`
Or install one from https://github.com/Dao-AILab/flash-attention/releases/tag/v2.3.6
Usage with transformers
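A minimal sketch of enabling FlashAttention-2 when loading a model with transformers; the checkpoint name is illustrative, and `attn_implementation` requires a recent transformers release (older versions used `use_flash_attention_2=True`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"        # any FlashAttention-2-supported checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,                # FA2 requires fp16 or bf16
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```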
Scaled Dot Product Attention (PyTorch)
https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
- PyTorch's Flash Attention 2 (torch==2.2.0.dev20230915+cu121) runs at 490 ms/iter
- Tri Dao's Flash Attention 2 (flash-attn==2.2.2) runs at 483 ms/iter
GPU: NVIDIA A100-SXM4-40GB
Nvidia driver version: 525.105.17
OR, use PyTorch's native scaled_dot_product_attention. This will use the fused kernel by default; it can also be forced with `with torch.backends.cuda.sdp_kernel(enable_math=False):`
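A small sketch of calling PyTorch's fused SDPA directly (tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# (batch, num_heads, seq_len, head_dim), half precision on GPU
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# By default PyTorch dispatches to the best available backend
# (flash attention / memory-efficient / math fallback).
out = F.scaled_dot_product_attention(q, k, v)

# Force the fused kernels by disabling the math fallback.
with torch.backends.cuda.sdp_kernel(enable_math=False):
    out = F.scaled_dot_product_attention(q, k, v)
```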
Colab Notebook: https://colab.research.google.com/drive/1AD4rdEp1FxF6gmcVnHvp_nB1A55I0YWu?usp=sharing
BitsAndBytes
In machine learning jargon, FP32 is called full precision (4 bytes), while BF16 and FP16 are referred to as half precision (2 bytes). On top of that, the int8 (INT8) data type is an 8-bit representation that can store 2^8 = 256 different values (in [0, 255] for unsigned or [-128, 127] for signed integers).
https://github.com/TimDettmers/bitsandbytes
Source Code in Transformers
https://github.com/huggingface/transformers/blob/main/src/transformers/utils/quantization_config.py
Colab Notebook: https://colab.research.google.com/drive/1AD4rdEp1FxF6gmcVnHvp_nB1A55I0YWu?usp=sharing
FP16
- FP32 (Float32): This is the standard IEEE 32-bit floating point representation. It has 23 bits for the mantissa, 8 bits for the exponent, and 1 sign bit. FP32 offers a wide range of representable values with good precision, making it the default choice for many computations. However, it requires more memory and computational resources compared to lower-precision formats.
- FP16 (Float16): FP16 cuts the number of bits in half compared to FP32, with 10 bits for the mantissa, 5 bits for the exponent, and 1 for the sign. The trade-off is a much smaller range of representable numbers and reduced precision. FP16 can cause numerical issues like overflow and underflow, where very large or small numbers respectively can't be accurately represented and lead to errors such as NaN (Not a Number).
- BF16 (BFloat16): To address the limitations of FP16 while not compromising too much on range, BF16 uses 8 bits for the exponent (like FP32) but only 7 bits for the mantissa. This keeps a wide dynamic range similar to FP32 but with slightly lower precision. BF16 strikes a balance that is suitable for many deep learning tasks where the wide range is more important than extreme precision.
- TF32 (TensorFloat-32): Introduced with NVIDIA's Ampere architecture, TF32 is a 19-bit format: 1 sign bit, 8 exponent bits (like FP32), and 10 mantissa bits (like FP16). TF32 balances range and precision by keeping FP32's exponent width while trimming the mantissa. It's used internally during specific GPU operations (e.g., Tensor Core matrix multiplications) and offers performance close to FP16 with a range close to FP32.
Here we are trying to load a 1.7B model in FLOAT16:
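A minimal sketch, assuming the 1.7B checkpoint is bigscience/bloom-1b7 (the exact model used in the notebook is an assumption here):

```python
import torch
from transformers import AutoModelForCausalLM

model_fp16 = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",        # assumption: any ~1.7B causal LM loads the same way
    torch_dtype=torch.float16,     # load weights directly in half precision
    device_map="auto",
)
print(model_fp16.get_memory_footprint() / 1e9, "GB")
```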
INT8
Paper: https://arxiv.org/abs/2208.07339
Absmax quantization is one type of quantization that scales numerical values to fit within the range of a target data type, such as int8.
Example of Absmax Quantization:
Assume you have a vector: [1.2, -0.5, -4.3, 1.2, -3.1, 0.8, 2.4, 5.4]. Here's how you perform absmax quantization:
- Find the absolute maximum value in the vector, which is 5.4.
- Determine the range of the target quantization format; for int8 this is [-127, 127].
- Calculate the scaling factor by dividing the maximum representable int8 value (127) by the absolute maximum (5.4), getting approximately 23.5.
- Quantize each number in the original vector by multiplying it by the scaling factor and rounding, giving the quantized vector: [28, -12, -101, 28, -73, 19, 56, 127].
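The same steps as a tiny PyTorch sketch:

```python
import torch

x = torch.tensor([1.2, -0.5, -4.3, 1.2, -3.1, 0.8, 2.4, 5.4])

scale = 127 / x.abs().max()             # 127 / 5.4 ≈ 23.5
x_int8 = torch.round(x * scale).to(torch.int8)
print(x_int8)                           # [28, -12, -101, 28, -73, 19, 56, 127]

x_dequant = x_int8.float() / scale      # dequantize; a small rounding error remains
print(x_dequant)
```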
LLM.int8(): zero degradation matrix multiplication for Large Language Models
The LLM.int8() method is designed for large language models (LLMs) to enable efficient inference (making predictions) with less computational resources without significant degradation in performance.
The operations of the LLM.int8() method are as follows:
- Outlier Extraction: Identify and extract outliers (elements exceeding a certain threshold) from the input data.
- Mixed-Precision Matrix Multiplication: Perform matrix multiplication where the outliers are processed using FP16 (for accuracy) and the non-outliers using int8 (for efficiency).
- Combination: Dequantize the results from the int8 computations back to FP16 and add them to the outlier computations, thus producing the final result.
The rationale behind LLM.int8() is to store data in int8 to save memory space while performing computations in FP16 to maintain the accuracy of the results. BLOOM-176B, a large language model, when using LLM.int8(), was found to be only slightly slower (about 15% to 23%) than its FP16 counterpart, which indicates that it is a viable solution for maintaining performance while being more resource-efficient.
This method is particularly useful when aiming to run large models on hardware with limited memory capacity, or when trying to reduce the costs associated with memory consumption and computation time.
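A minimal sketch of loading a model with LLM.int8() through bitsandbytes and transformers (the checkpoint name is illustrative):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,          # store weights in int8, use the LLM.int8() matmul
    llm_int8_threshold=6.0,     # outlier threshold; outlier features are processed in fp16
)

model_int8 = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",     # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
print(model_int8.get_memory_footprint() / 1e9, "GB")
```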
INT4
The quantization technique is the same as for INT8, but now there are only 16 representable values, from -8 to 7.
NF4
Consider a weight in the neural network that is represented by a 32-bit floating-point number, with the value 0.5678.
Now, we want to quantize this weight to a 4-bit integer. In our example, a 4-bit integer can represent 16 evenly spaced levels between -1 and 1. The levels are:
-1.0, -0.8667, -0.7333, -0.6, -0.4667, -0.3333, -0.2, -0.0667, 0.0667, 0.2, 0.3333, 0.4667, 0.6, 0.7333, 0.8667, 1.0
To quantize the weight 0.5678, we find the nearest level in our 4-bit representation, which is 0.6.
Let's say that the level 0.6 is associated with the 4-bit integer value 13. We would then store the integer 13 instead of the original 32-bit floating-point number (0.5678).
In a computation, whenever this weight is needed, we dequantize the 4-bit integer back to the level 0.6 before using it in calculations. This means we are introducing a quantization error, which is the difference between the dequantized value and the original floating-point value:
Dequantization error = Dequantized value - Original value = 0.6 - 0.5678 = 0.0322
In this example, the error is 0.0322, which is roughly one-fourth of the distance between two quantization levels (the spacing between levels is 2/15 ≈ 0.1333, and 0.1333 / 4 ≈ 0.0333).
For enabling nested quantization, you can use the bnb_4bit_use_double_quant argument in BitsAndBytesConfig. This will enable a second quantization after the first one to save an additional 0.4 bits per parameter.
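A minimal sketch of a 4-bit NF4 configuration with nested (double) quantization enabled (the checkpoint name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 instead of plain FP4
    bnb_4bit_use_double_quant=True,          # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,   # computation still runs in bf16
)

model_nf4 = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",                  # illustrative checkpoint
    quantization_config=nf4_config,
    device_map="auto",
)
```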
- Both NF4 and FP4 show comparable performance in terms of inference speed, memory consumption, and the quality of content generation.
- NF4 demonstrates better stability at lower sampling temperatures with the LLaMA2 series of models. This stability is important for maintaining generation quality across different temperature settings.
- FP4, and its variant FP4-DQ (where DQ stands for Double Quantization), is found to be more appropriate for the Falcon series of models.
- Generally, 4-bit quantized models are more sensitive to sampling-temperature variations than unquantized models, with greater sensitivity noted in the temperature range of 0.5 to 0.8.
Premise of NF4:
- NF4 is designed to optimize quantization for neural network weights that typically exhibit a zero-centered normal distribution, eliminating the need for expensive quantile estimates.
- This approach is feasible because input tensors can be transformed to adhere to a fixed distribution up to a quantization constant.
- The method's validity is supported by the Shapiro-Wilk test, indicating that the vast majority (approximately 92.5%) of the LLaMA neural network weights follow a normal distribution.
Key Concepts Explained:
- NF4 (4-bit NormalFloat):
- An optimal quantization data type specifically for data that follows a normal distribution.
- It uses Quantile Quantization and is estimated using a quantile approximation algorithm based on the zero-mean normal distribution of pre-trained neural network weights.
- Double Quantization (DQ):
- A method that further quantizes the quantization constants (used for initial quantization) into a lower-precision format, resulting in memory savings.
- For example, saving approximately 3 GB of memory for a 65 billion parameter model.
- FP4-DQ applies double quantization on top of FP4: the first-level quantization constants are themselves quantized down to 8-bit precision (FP8).
Additional Context for NF4:
- NF4 aims to create a quantization scheme that's tailored to the characteristics of neural networks, where weights and activations often assume a distribution close to the normal distribution.
- The adoption of NF4 and FP4-DQ is part of efforts like QLoRA to optimize the fine-tuning of quantized large language models (LLMs), maintaining efficiency while keeping computational overhead low.
INT8/INT4 with Flash Attention 2
FlashAttention-2 can be combined with other optimization techniques like quantization to further speed up inference. For example, you can combine FlashAttention-2 with 8-bit or 4-bit quantization:
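A minimal sketch of that combination (the checkpoint name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",                            # illustrative FA2-supported model
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,                             # compute dtype for non-quantized parts
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```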
CPU Offloading
CPU and Disk Offloading: https://huggingface.co/docs/accelerate/usage_guides/quantization
https://huggingface.co/docs/transformers/main_classes/quantization#offload-between-cpu-and-gpu
Offload between CPU and GPU
One of the advanced use cases is to load a model and dispatch the weights between the CPU and the GPU. Note that the weights dispatched to the CPU will not be converted to 8-bit, but kept in float32. This feature is intended for users who want to fit a very large model by splitting it between GPU and CPU.
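A sketch following the transformers documentation linked above; the custom device_map below is illustrative (module names match the BLOOM architecture):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,   # modules placed on the CPU stay in float32
)

# Keep most of the model on GPU 0 and offload the lm_head to the CPU.
device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    quantization_config=quant_config,
    device_map=device_map,
)
```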
AWQ
The AWQ method was introduced in the *AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration* paper. With AWQ you can run models in 4-bit precision while preserving their original quality (i.e. no performance degradation), with higher throughput than the other quantization methods presented here, reaching throughput similar to pure float16 inference.
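A minimal sketch of loading a pre-quantized AWQ checkpoint with transformers (requires the autoawq package; the checkpoint name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-v0.1-AWQ"   # illustrative AWQ-quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)
```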
Hands-On
https://colab.research.google.com/drive/1ppBB9f1l7NBEuvjs3BTwn43Kry8rneB4?usp=sharing
GPTQ
https://github.com/PanQiWei/AutoGPTQ
Original Paper Code: https://github.com/IST-DASLab/gptq
Paper: https://arxiv.org/abs/2210.17323
GPTQ adopts a mixed int4/fp16 quantization scheme where weights are quantized as int4 while activations remain in float16. During inference, weights are dequantized on the fly and the actual compute is performed in float16.
The benefits of this scheme are twofold:
- Memory savings close to x4 for int4 quantization, as the dequantization happens close to the compute unit in a fused kernel, and not in the GPU global memory.
- Potential speedups thanks to the time saved on data communication due to the lower bitwidth used for weights.
GPTQ uses a Cholesky decomposition, a numerically stable method for solving certain mathematical problems. It involves precomputing some required information from the matrix using the Cholesky method. This approach, combined with a slight “dampening” (adding a small constant to diagonal elements of the matrix), helps the algorithm to avoid numerical issues.
The full algorithm can be summarized in a few steps:
- The GPTQ algorithm begins with a Cholesky decomposition of the Hessian inverse (a matrix that helps decide how to adjust the weights)
- It then runs in loops, handling batches of columns at a time.
- For each column in a batch, it quantizes the weights, calculates the error, and updates the weights in the block accordingly.
- After processing the batch, it updates all remaining weights based on the block’s errors.
In essence, GPTQ quantizes the weight matrix column by column: after quantizing a column it measures the error this introduces in the layer output (the deviation from the pre-quantization output) and updates the remaining, not-yet-quantized weights to compensate for that error.
Reference: https://mlabonne.github.io/blog/posts/4_bit_Quantization_with_GPTQ.html
GPTQ supports 2-, 3-, and 4-bit quantization.
Hands-On
https://colab.research.google.com/drive/153rJG9dVDk8OnvWRX6gQBtnHkRYjY3-w?usp=sharing
But what’s this group size?
When we take a neural network "parameter" from 32-bit or 16-bit floating point all the way down to int4 or int3, we need scaling factors that translate the 16 possible int4 values (or 8 possible int3 values) back into an essentially unlimited range of floating-point numbers.
We need "scaling weights" that allow us to translate these integers into a large variety of values.
This is done by assigning the scaling parameters to a collection of neural network "parameters", typically a multiplicative scale plus an offset (so that zero can be represented). The natural way of thinking is to assign them to a full row of these parameters (say, 4096 of them).
A group size of 128 (or 32, for that matter) means the scaling parameters are shared not by all members of a row, but only by groups of 128 (or 32) of those parameters.
With fewer parameters sharing each scaling factor, the scaling becomes more accurate, at the cost of storing a few extra scaling factors.
You can use model.config.model_type and compare it with the table below to check whether the model you use is supported by auto_gptq.
For example, the model_type of WizardLM, vicuna and gpt4all is llama, hence they are all supported by auto_gptq.
For quantizing a model using auto-gptq, we need to pass a dataset to the quantizer. This can be achieved either by passing a supported default dataset among ['wikitext2', 'c4', 'c4-new', 'ptb', 'ptb-new'] or a list of strings that will be used as a dataset.
model type | quantization | inference | peft-lora | peft-ada-lora | peft-adaption_prompt |
---|---|---|---|---|---|
bloom | ✅ | ✅ | ✅ | ✅ | |
gpt2 | ✅ | ✅ | ✅ | ✅ | |
gpt_neox | ✅ | ✅ | ✅ | ✅ | ✅ (https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
gptj | ✅ | ✅ | ✅ | ✅ | ✅ (https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
llama | ✅ | ✅ | ✅ | ✅ | ✅ |
moss | ✅ | ✅ | ✅ | ✅ | ✅ (https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
opt | ✅ | ✅ | ✅ | ✅ | |
gpt_bigcode | ✅ | ✅ | ✅ | ✅ | |
codegen | ✅ | ✅ | ✅ | ✅ | |
falcon(RefinedWebModel/RefinedWeb) | ✅ | ✅ | ✅ | ✅ |
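A minimal sketch of quantizing a model with GPTQ through transformers' GPTQConfig (the checkpoint and dataset choices are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"        # small illustrative model
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,                           # 4-bit weights
    dataset="c4",                     # one of the supported default calibration datasets
    tokenizer=tokenizer,
    group_size=128,                   # scaling parameters shared per group of 128 weights
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
```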
PEFT (Parameter Efficient FineTuning)
- LoRA: LoRA: Low-Rank Adaptation of Large Language Models
- Prefix Tuning: Prefix-Tuning: Optimizing Continuous Prompts for Generation, P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
- P-Tuning: GPT Understands, Too
- Prompt Tuning: The Power of Scale for Parameter-Efficient Prompt Tuning
- AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
- (IA)³: Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
- MultiTask Prompt Tuning: Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning
- LoHa: FedPara: Low-Rank Hadamard Product for Communication-Efficient Federated Learning
- LoKr: KronA: Parameter Efficient Tuning with Kronecker Adapter based on Navigating Text-To-Image Customization:From LyCORIS Fine-Tuning to Model Evaluation implementation
- LoftQ: LoftQ: LoRA-Fine-Tuning-aware Quantization for Large Language Models
- OFT: Controlling Text-to-Image Diffusion by Orthogonal Finetuning
Hardware: Single A100 80GB GPU with CPU RAM above 64GB
Model | Full Finetuning | PEFT-LoRA PyTorch | PEFT-LoRA DeepSpeed with CPU Offloading |
---|---|---|---|
bigscience/T0_3B (3B params) | 47.14GB GPU / 2.96GB CPU | 14.4GB GPU / 2.96GB CPU | 9.8GB GPU / 17.8GB CPU |
bigscience/mt0-xxl (12B params) | OOM GPU | 56GB GPU / 3GB CPU | 22GB GPU / 52GB CPU |
bigscience/bloomz-7b1 (7B params) | OOM GPU | 32GB GPU / 3.8GB CPU | 18.1GB GPU / 35GB CPU |
LoRA (Low-Rank Adaptation)
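A minimal sketch of applying LoRA to a causal LM with the peft library (the checkpoint, rank, and target modules are illustrative):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")  # illustrative

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # attention projection name for BLOOM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only a tiny fraction of weights is trainable
```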
DeepSpeed
https://huggingface.co/docs/transformers/main_classes/deepspeed
DeepSpeed implements everything described in the ZeRO paper (Zero Redundancy Optimizer). Currently it provides full support for:
- Optimizer state partitioning (ZeRO stage 1)
- Gradient partitioning (ZeRO stage 2)
- Parameter partitioning (ZeRO stage 3)
- Custom mixed precision training handling
- A range of fast CUDA-extension-based optimizers
- ZeRO-Offload to CPU and NVMe
The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency.
- ZeRO Stage 1: The optimizer states (e.g., for Adam optimizer, 32-bit weights, and the first, and second moment estimates) are partitioned across the processes, so that each process updates only its partition.
- ZeRO Stage 2: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states.
- ZeRO Stage 3: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes.
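A minimal sketch of enabling ZeRO stage 3 with CPU offload through the transformers Trainer (values are illustrative; DeepSpeed must be installed, and the config can also live in a JSON file):

```python
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 3,                                  # partition optimizer states, gradients, params
        "offload_optimizer": {"device": "cpu"},      # ZeRO-Offload of optimizer states to CPU
        "offload_param": {"device": "cpu"},          # offload partitioned parameters as well
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",        # "auto" lets the HF integration fill these in
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    deepspeed=ds_config,                             # dict or path to a DeepSpeed JSON config
)
```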
With PyTorch Lightning
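A minimal sketch with the Lightning Trainer; the strategy string and precision are standard Lightning options, while the training setup itself (your LightningModule and data) is assumed:

```python
import lightning as L

# "deepspeed_stage_3" shards optimizer states, gradients and parameters (ZeRO-3);
# "deepspeed_stage_3_offload" additionally offloads them to the CPU.
trainer = L.Trainer(
    accelerator="cuda",
    devices=4,
    strategy="deepspeed_stage_3",
    precision="16-mixed",
)
# then: trainer.fit(model, datamodule=dm) with your own LightningModule / DataModule
```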
Lightning Fabric
Fabric is the fast and lightweight way to scale PyTorch models without boilerplate. Convert PyTorch code to Lightning Fabric in 5 lines and get access to SOTA distributed training features (DDP, FSDP, DeepSpeed, mixed precision and more) to scale the largest billion-parameter models.
Fabric differentiates itself from a fully-fledged trainer like Lightning’s Trainer in these key aspects:
Fast to implement There is no need to restructure your code: Just change a few lines in the PyTorch script and you’ll be able to leverage Fabric features.
Maximum Flexibility Write your own training and/or inference logic down to the individual optimizer calls. You aren’t forced to conform to a standardized epoch-based training loop like the one in Lightning Trainer. You can do flexible iteration based training, meta-learning, cross-validation and other types of optimization algorithms without digging into framework internals. This also makes it super easy to adopt Fabric in existing PyTorch projects to speed-up and scale your models without the compromise on large refactors. Just remember: With great power comes a great responsibility.
Maximum Control The Lightning Trainer has many built-in features to make research simpler with less boilerplate, but debugging it requires some familiarity with the framework internals. In Fabric, everything is opt-in. Think of it as a toolbox: You take out the tools (Fabric functions) you need and leave the other ones behind. This makes it easier to develop and debug your PyTorch code as you gradually add more features to it. Fabric provides important tools to remove undesired boilerplate code (distributed, hardware, checkpoints, logging, …), but leaves the design and orchestration fully up to you.
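A sketch of the conversion with a toy model (the Fabric calls are the documented API; the model, data, and strategy choice here are illustrative):

```python
import torch
import lightning as L

fabric = L.Fabric(accelerator="cuda", devices=2, strategy="ddp", precision="bf16-mixed")
fabric.launch()

# plain PyTorch objects
model = torch.nn.Linear(32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

model, optimizer = fabric.setup(model, optimizer)     # wraps them for the chosen strategy
dataloader = fabric.setup_dataloaders(dataloader)     # handles distributed sampling, device moves

model.train()
for x, y in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    fabric.backward(loss)                             # replaces loss.backward()
    optimizer.step()
```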
BitsAndBytes: https://lightning.ai/blog/8-bit-quantization-with-lightning-fabric/
https://lightning.ai/docs/fabric/latest/fundamentals/precision.html#quantization-via-bitsandbytes
LLM Optimization with PyTorch 2.0
This came out 17 hours after I finished creating this content 😭
https://pytorch.org/blog/accelerating-generative-ai-2/
Make sure to install the latest nightly release of PyTorch 2!
Step 1: torch.compile and kv-cache
Step 2: INT8 Quantization
Step 3: Speculative Decoding
- Select a small model and a large one of your preference. Make sure they share the same tokenizer, so that we can meaningfully compare the logits of the two models.
- Generate a specific number of candidate new tokens with the small model, say 3. This involves running the forward pass on the small model 3 times.
- Use the larger model to forward pass the prospective new input (combining the original with the 3 new tokens). This returns lists of potential tokens with their corresponding probabilities for all input tokens.
- Decode the last 4 tokens using greedy decoding (3 new tokens plus an additional one from the forward pass of the large model). Compare the decoded tokens from the large model with the candidate new tokens, starting from left to right. If the tokens match, we accept them and append them to the original input. Continue this process until the first mismatch occurs, at which point we append the token from the large model to the input. This updated input is then passed through the small model to generate 3 more tokens, and the entire process is repeated.
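A minimal sketch of this idea using transformers' assisted generation, which implements the same draft-and-verify loop (the checkpoints are illustrative; both share the OPT tokenizer):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id, draft_id = "facebook/opt-1.3b", "facebook/opt-125m"   # large target, small draft
tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("The theory of relativity states that", return_tensors="pt").to(target.device)

# The draft model proposes candidate tokens; the target model verifies them in a single forward pass.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```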
Step 4: INT4 Quantization and GPTQ
Of course, if reducing the weights down from 16 bits to 8 bits allows for speedups by reducing the number of bytes we need to load, reducing the weights down to 4 bits would result in even larger speedups!
Unfortunately, when reducing weights down to 4-bits, the accuracy of the model starts to become a much larger concern. From our preliminary evals, we see that although using int8 weight-only quantization has no perceptible accuracy degradation, using int4 weight-only quantization does.
Vanilla (No Optimizations)
Disable KV Cache by commenting out this: https://github.com/pytorch-labs/gpt-fast/blob/3bcaaaf068d112d534f335ec21a17d7b8b5551bf/generate.py#L154
Torch Compile
KV caching is enabled by default!
https://github.com/pytorch-labs/gpt-fast/blob/3bcaaaf068d112d534f335ec21a17d7b8b5551bf/model.py#L94
With INT8
With INT4
Speculative Sampling
Here the draft model is simply the int8 version of the same model
Here we have the full FP16 model but with INT8 as the draft model; the speed and quality are just ❤️
NOTES
- MMLU Benchmark: https://huggingface.co/blog/evaluating-mmlu-leaderboard
- All options for BitsAndBytes: https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf
- BitsAndBytes in HF: https://huggingface.co/blog/hf-bitsandbytes-integration
- Optimizing SAM: https://pytorch.org/blog/accelerating-generative-ai/
- GPT KV Cache: https://www.dipkumar.dev/becoming-the-unbeatable/posts/gpt-kvcache/
- PyTorch Optimization Guide: https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html