Pretraining and finetuning transformer models involves manipulating large matrices, which is computationally expensive. GPUs (Graphics Processing Units) excel at parallel computation and speed this work up dramatically.
GPUs use VRAM (Video Random Access Memory), which is optimized for moving the large amounts of data required for rendering graphics. The same property makes them effective at transformer workloads such as model inference and finetuning.
A matrix of size (m, n) with rank r can be decomposed into two matrices of sizes (m, r) and (r, n) respectively. This is known as rank decomposition.
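As a quick illustration, here is a minimal NumPy sketch of rank decomposition; the sizes m, n, and r are arbitrary choices for the example:

```python
import numpy as np

m, n, r = 8, 6, 2

# Build a rank-r matrix of size (m, n) by multiplying two random factors.
W = np.random.randn(m, r) @ np.random.randn(r, n)

# Recover a rank decomposition via the SVD, keeping the top r components.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]   # shape (m, r)
B = Vt[:r, :]          # shape (r, n)

# The two small factors reproduce the original matrix (up to float error).
assert np.allclose(W, A @ B)
print(A.shape, B.shape)  # (8, 2) (2, 6)
```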
Quantization is the process of converting higher-precision data types such as `float64` or `float32` to lower-precision data types such as `int8` or `uint4`.
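As a toy illustration of the idea, here is a sketch using simple absolute-maximum scaling to map `float32` values onto the `int8` range; the quantization schemes used for LLMs are more sophisticated:

```python
import numpy as np

def absmax_quantize(x: np.ndarray):
    """Map float32 values onto the int8 range [-127, 127]."""
    scale = 127.0 / np.max(np.abs(x))
    q = np.round(x * scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 values from the int8 codes."""
    return q.astype(np.float32) / scale

weights = np.random.randn(4).astype(np.float32)
q, scale = absmax_quantize(weights)
print(weights)               # original float32 values
print(dequantize(q, scale))  # close, but not identical: quantization is lossy
```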
Two steps are involved in finetuning LLMs with LoRA (Low Rank Adaptation):

1. Freeze the weights of the pretrained model so they are not updated during finetuning.
2. Inject a trainable low-rank update into the model's layers: the weight update is represented as the product of two small rank decomposition matrices, and only these are trained (see the sketch below).
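A minimal PyTorch sketch of these two steps follows. This is illustrative only, not the implementation used by libraries such as `peft`; the layer sizes and rank are arbitrary:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update."""

    def __init__(self, in_features: int, out_features: int, r: int = 8):
        super().__init__()
        # Step 1: freeze the pretrained weights.
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():
            p.requires_grad_(False)

        # Step 2: the trainable update is the product of two small matrices
        # of sizes (out_features, r) and (r, in_features). B starts at zero
        # so the update is initially a no-op.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output = frozen base layer + low-rank correction.
        return self.base(x) + x @ self.lora_A.T @ self.lora_B.T

layer = LoRALinear(512, 512, r=8)
out = layer(torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 512])
```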
QLoRA builds on this by quantizing the frozen base model. It introduces `nf4`, a 4-bit "Normal Float" data type, and double quantization, where the quantization constants themselves are also quantized.
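With the Hugging Face `transformers` integration of `bitsandbytes`, both features can be enabled through `BitsAndBytesConfig`. In this sketch, "model-id" is a placeholder for an actual model checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# nf4 storage plus double quantization of the quantization constants.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matmuls
)

# "model-id" is a placeholder for any causal LM on the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    quantization_config=bnb_config,
)
```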