
Finetuning Transformer Models

Pretraining and Finetuning

  • Pretraining is the act of training a model from scratch on training data. The model's weights are randomly initialized, so training starts without any prior knowledge, and the weights are updated throughout the process.
  • Finetuning continues training a pretrained model, usually on a much smaller amount of data, building on the knowledge the model has already acquired rather than starting from scratch.

What is finetuning?

  • Finetuning is the process of adapting a pretrained model to specific tasks using a smaller set of data.
  • The knowledge the pretrained model has acquired is “transferred,” hence the term transfer learning.
  • The finetuning lifecycle is very similar to any other machine learning workflow with minor differences.
  • The steps in the workflow are defining the problem, preprocessing the data, selecting the model, tuning the hyperparameters, training, testing, validating, and serving, as in the sketch below.
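
As a sketch of how these steps can map onto code, here is a minimal finetuning run using the Hugging Face Transformers and Datasets libraries (assumed installed); the model name, dataset, and hyperparameters are illustrative placeholders, not recommendations.

    # Minimal finetuning sketch: preprocess data, select a model,
    # set hyperparameters, train, and evaluate.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    dataset = load_dataset("imdb")                      # data
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length")

    dataset = dataset.map(tokenize, batched=True)       # preprocessing

    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)        # model selection

    args = TrainingArguments(output_dir="out",          # hyperparameters
                             learning_rate=2e-5,
                             per_device_train_batch_size=16,
                             num_train_epochs=1)

    trainer = Trainer(model=model, args=args,
                      train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),
                      eval_dataset=dataset["test"].select(range(500)))
    trainer.train()                                     # training
    print(trainer.evaluate())                           # evaluation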

Transformers and GPUs

  • Pretraining and finetuning transformer models involves manipulating large matrices, a computationally expensive task. GPUs, or Graphics Processing Units, are necessary for this because they excel at parallel computing and dramatically speed up these computations.

  • GPUs use VRAM (Video Random Access Memory), which is optimized for handling the large amounts of data required for rendering graphics. This makes GPUs effective at transformer workloads such as model inference and finetuning.
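
As a quick check, PyTorch (assumed installed) can report whether a GPU is available and how much VRAM it has:

    import torch

    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(props.name)                           # GPU model name
        print(props.total_memory / 1024**3, "GiB")  # total VRAM
    else:
        print("No GPU available; falling back to CPU.")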

What’s low rank adaptation?

  • Any matrix of size (m, n) and rank r can be decomposed into two matrices of sizes (m, r) and (r, n) respectively. This is known as rank decomposition.
  • LoRA uses rank decomposition to decompose the weight update matrix into lower rank matrices, thereby reducing the number of training parameters needed for finetuning.
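
A toy NumPy example makes the savings concrete: for a 1024×1024 update matrix and a chosen rank of 8 (illustrative numbers), the decomposed form needs roughly 64 times fewer parameters.

    import numpy as np

    m, n, r = 1024, 1024, 8        # weight matrix shape and chosen rank
    A = np.random.randn(m, r)
    B = np.random.randn(r, n)
    delta_W = A @ B                # rank-r update, same shape as the weights

    print(delta_W.shape)           # (1024, 1024)
    print(m * n)                   # full update: 1,048,576 parameters
    print(m * r + r * n)           # decomposed: 16,384 parameters (~64x fewer)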

Why LoRA and QLoRA?

  • While less expensive than pretraining, finetuning is still computationally expensive as it involves storing and manipulating huge matrices of pretrained model weights.
  • LoRA (Low Rank Adaptation) and QLoRA (Quantization + Low Rank Adaptation) are two techniques that make finetuning faster and cheaper.

What is quantization?

  • Quantization is a technique that reduces the memory needed to store weight matrices by reducing the numerical precision of their elements.
  • Typically, this involves converting higher precision data types such as float64 or float32 to lower precision data types such as int8 or uint4.
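
As a toy illustration, here is a symmetric int8 quantization of a small float32 weight vector in NumPy; the scaling scheme is a simplified sketch, not the exact method any particular library uses.

    import numpy as np

    w = np.array([0.31, -1.52, 0.04, 0.98], dtype=np.float32)
    scale = np.abs(w).max() / 127            # map the largest magnitude to 127
    q = np.round(w / scale).astype(np.int8)  # 4 bytes per weight -> 1 byte
    w_hat = q.astype(np.float32) * scale     # dequantize: close to w, not exact

    print(q)      # [  26 -127    3   82]
    print(w_hat)  # approximately the original values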

How is LoRA implemented?

Two steps are involved in finetuning LLMs with LoRA (Low Rank Adaptation):

  • Freezing the original pretrained weights
  • Rank-decomposing the update matrix into two matrices, significantly reducing the total number of parameters that need to be updated.
[Image: In a full finetune, the weight matrix W is updated with an update matrix ΔW that is as large as W itself. With LoRA, ΔW is rank-decomposed into two much smaller matrices, A and B, so far fewer parameters need to be trained.]
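
As a sketch, this is how those two steps look with the Hugging Face PEFT library (assumed installed); the model name, rank, and target modules below are illustrative choices.

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    config = LoraConfig(
        r=8,                        # rank of the decomposition
        lora_alpha=16,              # scaling factor for the update
        target_modules=["c_attn"],  # GPT-2's attention projection layer
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)  # freezes base weights, injects A and B
    model.print_trainable_parameters()     # reports a small fraction of parameters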

What is QLoRA?

  • QLoRA combines quantization with LoRA to achieve significantly higher efficiency in finetuning LLMs.
  • The original QLoRA implementation involves a 4-bit quantization that reduces the precision of weights to NF4, a 4-bit "NormalFloat" data type, and double quantization, where the quantization constants themselves are also quantized.
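
Here is a sketch of a QLoRA-style setup using the bitsandbytes integration in Hugging Face Transformers together with PEFT (all assumed installed); the model name is an illustrative placeholder.

    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",        # 4-bit NormalFloat weights
        bnb_4bit_use_double_quant=True,   # also quantize quantization constants
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "gpt2", quantization_config=bnb_config)  # 4-bit quantized base model
    model = get_peft_model(model, LoraConfig(r=8, task_type="CAUSAL_LM"))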
