
Finetuning Transformer Models

Pretraining and Finetuning

  • Pretraining is the act of training a model from scratch on training data. The model's weights are randomly initialized, so training starts without any prior knowledge, and the weights are updated throughout the process.
  • Finetuning continues training a pretrained model, usually on a much smaller amount of data, building on the knowledge the model has already acquired rather than starting from scratch.

What is finetuning?

  • Finetuning is the process of adapting a pretrained model to specific tasks using a smaller set of data.
  • The knowledge the pretrained model has acquired is “transferred,” hence the term transfer learning.
  • The finetuning lifecycle is very similar to any other machine learning workflow with minor differences.
  • The steps in the workflow are defining the problem, preprocessing the data, selecting the model, tuning the hyperparameters, training, testing, validating, and serving, as in the sketch below.
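
As a sketch of how these steps can map onto code, here is a minimal finetuning run using the Hugging Face Transformers and Datasets libraries (assumed installed); the model name, dataset, and hyperparameters are illustrative placeholders, not recommendations.

    # Minimal finetuning sketch: preprocess data, select a model,
    # set hyperparameters, train, and evaluate.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    dataset = load_dataset("imdb")                      # data
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length")

    dataset = dataset.map(tokenize, batched=True)       # preprocessing

    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)        # model selection

    args = TrainingArguments(output_dir="out",          # hyperparameters
                             learning_rate=2e-5,
                             per_device_train_batch_size=16,
                             num_train_epochs=1)

    trainer = Trainer(model=model, args=args,
                      train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),
                      eval_dataset=dataset["test"].select(range(500)))
    trainer.train()                                     # training
    print(trainer.evaluate())                           # evaluation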

Transformers and GPUs

  • Pretraining and finetuning transformer models involves manipulating large matrices, a computationally expensive task. GPUs, or Graphics Processing Units, are necessary for this because they excel at parallel computing and dramatically speed up these computations.

  • GPUs use VRAM (Video Random Access Memory), which is optimized for handling the large amounts of data required for rendering graphics. This makes GPUs effective at transformer workloads such as model inference and finetuning.
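
As a quick check, PyTorch (assumed installed) can report whether a GPU is available and how much VRAM it has:

    import torch

    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(props.name)                           # GPU model name
        print(props.total_memory / 1024**3, "GiB")  # total VRAM
    else:
        print("No GPU available; falling back to CPU.")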

What’s low rank adaptation?

  • Any matrix of size (m, n) and rank r can be decomposed into two matrices of sizes (m, r) and (r, n) respectively. This is known as rank decomposition.
  • LoRA uses rank decomposition to decompose the weight update matrix into lower rank matrices, thereby reducing the number of training parameters needed for finetuning.
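
A toy NumPy example makes the savings concrete: for a 1024×1024 update matrix and a chosen rank of 8 (illustrative numbers), the decomposed form needs roughly 64 times fewer parameters.

    import numpy as np

    m, n, r = 1024, 1024, 8        # weight matrix shape and chosen rank
    A = np.random.randn(m, r)
    B = np.random.randn(r, n)
    delta_W = A @ B                # rank-r update, same shape as the weights

    print(delta_W.shape)           # (1024, 1024)
    print(m * n)                   # full update: 1,048,576 parameters
    print(m * r + r * n)           # decomposed: 16,384 parameters (~64x fewer)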

Why LoRA and QLoRA?

  • While less expensive than pretraining, finetuning is still computationally expensive as it involves storing and manipulating huge matrices of pretrained model weights.
  • LoRA (Low Rank Adaptation) and QLoRA (Quantization + Low Rank Adaptation) are two techniques that make finetuning faster and cheaper.

What is quantization?

  • Quantization is a technique that reduces the memory needed to store weight matrices by reducing the numerical precision of their elements.
  • Typically, this involves converting higher precision data types such as float64 or float32 to lower precision data types such as int8 or uint4.
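
As a toy illustration, here is a symmetric int8 quantization of a small float32 weight vector in NumPy; the scaling scheme is a simplified sketch, not the exact method any particular library uses.

    import numpy as np

    w = np.array([0.31, -1.52, 0.04, 0.98], dtype=np.float32)
    scale = np.abs(w).max() / 127            # map the largest magnitude to 127
    q = np.round(w / scale).astype(np.int8)  # 4 bytes per weight -> 1 byte
    w_hat = q.astype(np.float32) * scale     # dequantize: close to w, not exact

    print(q)      # [  26 -127    3   82]
    print(w_hat)  # approximately the original values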

How is LoRA implemented?

Two steps are involved in finetuning LLMs with LoRA (Low Rank Adaptation):

  • Freezing the original pretrained weights
  • Rank-decomposing the update matrix into two matrices, significantly reducing the total number of parameters that need to be updated.
[Image: In a full finetune, the weight matrix W is updated with an update matrix ΔW that is as large as W itself. With LoRA, ΔW is rank-decomposed into two much smaller matrices, A and B, so far fewer parameters need to be trained.]
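
As a sketch, this is how those two steps look with the Hugging Face PEFT library (assumed installed); the model name, rank, and target modules below are illustrative choices.

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    config = LoraConfig(
        r=8,                        # rank of the decomposition
        lora_alpha=16,              # scaling factor for the update
        target_modules=["c_attn"],  # GPT-2's attention projection layer
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)  # freezes base weights, injects A and B
    model.print_trainable_parameters()     # reports a small fraction of parameters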

What is QLoRA?

  • QLoRA combines quantization with LoRA to achieve significantly higher efficiency in finetuning LLMs.
  • The original QLoRA implementation involves a 4-bit quantization that reduces the precision of weights to NF4, a 4-bit "NormalFloat" data type, and double quantization, where the quantization constants themselves are also quantized.
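
Here is a sketch of a QLoRA-style setup using the bitsandbytes integration in Hugging Face Transformers together with PEFT (all assumed installed); the model name is an illustrative placeholder.

    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",        # 4-bit NormalFloat weights
        bnb_4bit_use_double_quant=True,   # also quantize quantization constants
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "gpt2", quantization_config=bnb_config)  # 4-bit quantized base model
    model = get_peft_model(model, LoraConfig(r=8, task_type="CAUSAL_LM"))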
