
What it does: Compresses 16-bit weights to 4 bits, reducing weight memory by ~75% (e.g., a 65B-parameter model can be loaded in ~35GB instead of ~130GB).
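The arithmetic behind those figures is straightforward; a minimal sketch (weights only — activations, KV cache, and any quantization-constant overhead are extra, which is why the practical figure is ~35GB rather than exactly 32.5GB):

```python
def model_vram_gb(n_params: float, bits_per_weight: float) -> float:
    # Weight memory only: parameters x bits, converted to gigabytes.
    return n_params * bits_per_weight / 8 / 1e9

fp16_gb = model_vram_gb(65e9, 16)  # 130.0 GB at 16 bits per weight
nf4_gb = model_vram_gb(65e9, 4)    # 32.5 GB at 4 bits per weight
```

The gap between 32.5GB and the quoted ~35GB is the per-block quantization metadata that any block-wise scheme must store alongside the packed weights.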

Why: To shrink the memory footprint of LLMs (like Llama) enough to fit on a single consumer GPU (e.g., a 24GB RTX 3090) while matching full 16-bit fine-tuning performance.
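In practice, this path is exposed through the Hugging Face transformers integration with bitsandbytes. A sketch, assuming those libraries are installed and a GPU is available; the checkpoint name is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 with double quantization, computing in bfloat16 (the QLoRA-style defaults).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",            # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```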

The term "NF4" (4-bit NormalFloat) is central to the QLoRA paper, which revolutionized how large language models (LLMs) are fine-tuned on consumer hardware.

Double Quantization: A process that quantizes the quantization constants themselves to save additional memory.
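The savings are easy to quantify. A sketch using the block sizes reported in the QLoRA paper (one fp32 absmax per 64 weights; those constants re-quantized to 8 bits with a second-level fp32 constant per 256 of them):

```python
def quant_constant_overhead_bits(block_size: int = 64, double: bool = False,
                                 c2_bits: int = 8, c2_block: int = 256) -> float:
    """Per-weight overhead (in bits) of storing block-wise quantization constants."""
    if not double:
        # One fp32 absmax constant per block of `block_size` weights.
        return 32 / block_size
    # Double quantization: the fp32 constants are themselves quantized to
    # `c2_bits`, with second-level fp32 constants per `c2_block` of them.
    return c2_bits / block_size + 32 / (block_size * c2_block)

base = quant_constant_overhead_bits()               # 0.5 bits per parameter
dq = quant_constant_overhead_bits(double=True)      # ~0.127 bits per parameter
saved_gb_65b = (base - dq) * 65e9 / 8 / 1e9         # ~3 GB saved on a 65B model
```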

Why it works: Neural network weights typically follow a roughly normal distribution. NF4 concentrates its 16 quantization levels ("bins") where most weights lie (near zero), minimizing rounding error.

The paper explains why NF4 is superior to standard 4-bit integer (Int4) and 4-bit floating-point (FP4) formats.
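That intuition can be checked numerically. A simplified sketch (not the paper's exact codebook): build 16 levels from evenly spaced quantiles of the standard normal, rescale them to [-1, 1], and compare the rounding error against an Int4-style uniform grid on absmax-normalized, normally distributed data:

```python
import random
from statistics import NormalDist

def levels_nf4_style(k: int = 16) -> list[float]:
    # Evenly spaced quantiles of the standard normal, rescaled to [-1, 1];
    # this clusters levels near zero, where normal weights concentrate.
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / k) for i in range(k)]
    m = max(abs(v) for v in raw)
    return [v / m for v in raw]

def levels_int4_style(k: int = 16) -> list[float]:
    # Evenly spaced levels on [-1, 1], like a symmetric integer grid.
    return [-1 + 2 * i / (k - 1) for i in range(k)]

def mse(data: list[float], levels: list[float]) -> float:
    # Mean squared error from rounding each value to its nearest level.
    return sum((x - min(levels, key=lambda lv: abs(lv - x))) ** 2
               for x in data) / len(data)

random.seed(0)
raw = [random.gauss(0.0, 1.0) for _ in range(10_000)]
absmax = max(abs(x) for x in raw)
data = [x / absmax for x in raw]   # absmax-normalized, as in block-wise quantization

err_nf = mse(data, levels_nf4_style())
err_int = mse(data, levels_int4_style())
# Quantile-based levels give lower error than the uniform grid on this data.
```

The uniform grid wastes levels in the sparsely populated tails, while the quantile-based grid spends them where the density is, which is exactly the argument for NF4 over Int4.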