Local LLM VRAM Guide

From 6GB and 8GB through 12GB and 24GB up to 48GB, this guide explains how parameter count, quantization level, KV cache, and system overhead combine to determine whether a model can be loaded.

The conclusion first: VRAM is not the only bottleneck

To decide whether a local large model can run, you cannot look only at the parameter count, nor only at the VRAM figure. What actually affects loading is the combination of model weights, quantization format, KV cache, inference framework overhead, background system usage, and whether some layers need to be offloaded to CPU memory. A Q4 version of a 7B model may fit comfortably, while a Q8 version of a 14B model will be tight. On the same 24GB of VRAM, stretching the context from 4K to 32K lets the KV cache eat into the space that remains.
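The arithmetic behind that paragraph can be sketched roughly as follows. This is a back-of-the-envelope estimate, not how any particular backend measures memory: the 4.5 bits per weight for Q4, the 16-bit KV cache, and the architecture numbers (32 layers, 8 KV heads, head dimension 128) are all assumptions chosen for illustration.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size for a single sequence.

    The leading 2 accounts for one key tensor and one value tensor per layer.
    bytes_per_elem=2 assumes a 16-bit cache (an assumption, not a given).
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical 7B-class model; the architecture values are illustrative only.
weights = weight_bytes(7e9, 4.5)
kv_4k = kv_cache_bytes(32, 8, 128, 4096)
kv_32k = kv_cache_bytes(32, 8, 128, 32768)

print(f"weights: {weights / GIB:.2f} GiB")
print(f"KV cache at 4K context:  {kv_4k / GIB:.2f} GiB")
print(f"KV cache at 32K context: {kv_32k / GIB:.2f} GiB")
```

Under these assumptions the KV cache grows from 0.5 GiB at 4K context to 4 GiB at 32K, while the weights stay the same size. That is why context length alone can turn a comfortable fit into a tight one.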

The approach Local LLM recommends is to judge runnability first, and then rank by use case and model quality. After the user enters their VRAM, system RAM, operating system, and intended use, the backend estimates the weight footprint, the KV cache, and the remaining headroom. If a model requires partial offload, the page labels it as partially offloaded rather than presenting it as "fully runnable". This matters to the average user, because being loadable and being usable are two different things.
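A runnability check of this kind could look roughly like the sketch below. It is not Local LLM's actual backend logic: the function name, the fixed 1.5 GB overhead, and the mode labels are hypothetical, and the check ignores how much system RAM the operating system itself needs.

```python
def run_mode(weights_gb: float, kv_gb: float, vram_gb: float,
             ram_gb: float, overhead_gb: float = 1.5) -> str:
    """Classify how a model would have to run on the given hardware.

    overhead_gb is an assumed allowance for framework buffers and
    background usage; real overhead varies by backend and system.
    """
    needed = weights_gb + kv_gb + overhead_gb
    if vram_gb >= needed:
        return "full_gpu"
    # Some VRAM is usable, and VRAM plus system RAM covers the total.
    if vram_gb > overhead_gb and vram_gb + ram_gb >= needed:
        return "partial_offload"
    if ram_gb >= needed:
        return "cpu_only"
    return "does_not_fit"


print(run_mode(weights_gb=4.0, kv_gb=0.5, vram_gb=8.0, ram_gb=16.0))
print(run_mode(weights_gb=17.0, kv_gb=1.0, vram_gb=8.0, ram_gb=32.0))
```

Returning the mode as an explicit label, instead of a yes/no answer, is what lets the page show "partially offloaded" honestly.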

6GB to 8GB: Prioritize small models and low-bit quantization

6GB to 8GB of VRAM is best suited to Q4 or Q5 quantizations of 1B, 3B, 4B, and 7B models. This range can handle lightweight Q&A, simple code explanation, summarization, translation, and low-concurrency personal use, but it is not the place to squeeze in every popular large model. Vision models, multimodal models, and long-context tasks hit the ceiling sooner because the image encoder and the KV cache also occupy memory.

If the user has only 8GB of VRAM, the recommendation page should be conservative: it is better to recommend a small model that runs entirely on the GPU than to rank a partially offloaded 30B model at the top. Partial offloading can work in some scenarios, but speed and experience depend on the CPU, memory bandwidth, PCIe, the inference backend, and system load, so it should not be the first-choice answer for ordinary users.

12GB to 16GB: The sweet spot for most desktop users

12GB and 16GB are common configurations for consumer graphics cards, such as the RTX 3060 12GB, RTX 4070 12GB, and RTX 4060 Ti 16GB. This range can usually cover Q4/Q5 quantizations of many 7B to 14B models, leaving real choices for programming, general Q&A, and lightweight RAG. For users, the key is not to chase the largest model, but to find a version that runs stably, is not too slow, and has sufficient context length.

In this range, the choice of quantization directly affects the experience. Q4 is generally the easiest to fit, Q5/Q6 stays closer to the original model but takes more memory for weights, and Q8 is close to full quality but significantly squeezes the headroom. The Local LLM page should show the user "required memory" and "running mode" instead of only a model name. That way users can see why a recommended result is ranked higher.
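To make the trade-off concrete, the sketch below lists which quantization levels leave room on a given card. The bits-per-weight values are rough assumptions (real file sizes vary by format and by model), and the 2 GB reserve for KV cache and overhead is likewise a placeholder.

```python
# Assumed effective bits per weight for each quantization level.
APPROX_BITS_PER_WEIGHT = {"Q4": 4.5, "Q5": 5.5, "Q6": 6.5, "Q8": 8.5}


def weight_gb(n_params: float, quant: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return n_params * APPROX_BITS_PER_WEIGHT[quant] / 8 / 1e9


def quantizations_that_fit(n_params: float, vram_gb: float,
                           reserve_gb: float = 2.0) -> list[str]:
    """Quantization levels whose weights fit after reserving headroom.

    reserve_gb is an assumed allowance for KV cache and runtime overhead.
    """
    budget = vram_gb - reserve_gb
    return [q for q in APPROX_BITS_PER_WEIGHT
            if weight_gb(n_params, q) <= budget]


for vram in (12.0, 16.0):
    print(f"14B on {vram:.0f} GB: {quantizations_that_fit(14e9, vram)}")
```

Under these assumptions a 14B model fits on a 12GB card at Q4 and Q5 only, and adds Q6 on a 16GB card, while Q8 needs roughly 14.9 GB for weights alone.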

24GB to 48GB: Start pursuing higher quality and longer context

24GB of VRAM is an important threshold for local LLMs. It lets users try larger 14B, 27B, 30B, and 32B models, or run 7B/14B models at higher-precision quantization and with longer contexts. 48GB and above is better suited to high-quality quantization, more room for experimentation, multi-model switching, and longer-context tasks.

But more VRAM still does not mean every model will run well. For MoE models, total parameters and active parameters differ: the memory footprint follows the total, while the speed estimate depends on the active parameters and on how much memory must be read per token. Vision models also need room for the image encoder, and long contexts enlarge the KV cache. The recommendation system needs to break these differences down and display them, so that users do not wrongly conclude that "if the VRAM is large enough, it must be fast."
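The MoE distinction can be illustrated with a common rule of thumb: token generation is often limited by memory bandwidth, so a rough ceiling on tokens per second is bandwidth divided by the bytes read per token. The sketch below applies that approximation. The 30B-total / 3B-active model, the 4.5 bits per weight, and the 1000 GB/s bandwidth are hypothetical, and real throughput will be lower than this ceiling.

```python
def footprint_gb(total_params: float, bits_per_weight: float) -> float:
    """Memory needed to hold all weights, driven by TOTAL parameters."""
    return total_params * bits_per_weight / 8 / 1e9


def decode_ceiling_tps(active_params: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Rough upper bound on tokens/s, driven by ACTIVE parameters.

    Assumes each generated token reads every active weight once and
    that memory bandwidth is the only limit (a simplification).
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


BITS = 4.5        # assumed Q4-class quantization
BANDWIDTH = 1000  # hypothetical GPU memory bandwidth in GB/s

dense_tps = decode_ceiling_tps(30e9, BITS, BANDWIDTH)
moe_tps = decode_ceiling_tps(3e9, BITS, BANDWIDTH)

print(f"footprint (dense or MoE, 30B total): {footprint_gb(30e9, BITS):.1f} GB")
print(f"dense 30B ceiling: {dense_tps:.0f} tokens/s")
print(f"MoE 3B-active ceiling: {moe_tps:.0f} tokens/s")
```

Both models need the same memory to load, yet their speed ceilings differ by the ratio of active parameters. This is why the two figures should be reported separately.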

When are system RAM and CPU offload useful?

When VRAM is insufficient but system RAM is plentiful, some backends can place part of the model's layers in CPU memory. This lets the model load, but it tends to run slower, especially when a discrete graphics card has to move data over PCIe. Apple Silicon's unified memory does not have the same PCIe cliff, but it is still limited by memory bandwidth, the Metal/MLX kernels, and background memory usage.
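A layer split of this kind can be estimated roughly as below. The sketch assumes all layers are the same size, which is only approximately true, and the reserved amount for KV cache and overhead is a placeholder. Backends such as llama.cpp let the user set the number of GPU layers explicitly; this function only estimates a starting value.

```python
def gpu_layer_split(weights_gb: float, n_layers: int,
                    vram_gb: float, reserved_gb: float) -> tuple[int, int]:
    """Estimate how many layers fit on the GPU and how many go to the CPU.

    Assumes uniform layer size. reserved_gb is an assumed allowance for
    KV cache, framework buffers, and background VRAM usage.
    """
    per_layer_gb = weights_gb / n_layers
    budget_gb = max(vram_gb - reserved_gb, 0.0)
    on_gpu = min(n_layers, int(budget_gb // per_layer_gb))
    return on_gpu, n_layers - on_gpu


# Hypothetical model: 18 GB of weights across 48 layers, on a 12 GB card.
gpu_layers, cpu_layers = gpu_layer_split(18.0, 48, vram_gb=12.0, reserved_gb=3.0)
print(f"{gpu_layers} layers on GPU, {cpu_layers} layers on CPU")
```

In this example half of the layers end up in CPU memory, so every generated token has to wait on the slower path for those layers.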

So the page should distinguish between full GPU, partial offload, and CPU only. What ordinary users most need to know is this: running fully on the GPU usually gives the best experience; partial offload can serve as a fallback; CPU only is mainly suitable for small models or offline testing, not for chat experiences that expect high throughput.

How to choose a model with Local LLM

After entering your VRAM and system RAM, first check whether the top few results run fully on the GPU, and then look at the quantization level and the speed confidence interval. If the first result is a partial offload, it ranks there because of an advantage in quality or download popularity, but it is not necessarily the best-feeling choice for daily use. Users can switch between "Quality First", "Balanced", and "Long Context" to see how the ranking changes.
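One way such preset-dependent ranking could work is sketched below. The weights, the penalty applied to non-GPU modes, and the candidate scores are all invented for illustration; nothing here reflects the tool's real scoring.

```python
from dataclasses import dataclass

# Hypothetical preset weights and run-mode penalties.
PRESETS = {
    "quality_first": {"quality": 0.7, "speed": 0.2, "context": 0.1},
    "balanced": {"quality": 0.4, "speed": 0.4, "context": 0.2},
    "long_context": {"quality": 0.3, "speed": 0.2, "context": 0.5},
}
MODE_FACTOR = {"full_gpu": 1.0, "partial_offload": 0.8, "cpu_only": 0.5}


@dataclass
class Candidate:
    name: str
    mode: str
    quality: float  # all scores normalized to the range 0..1
    speed: float
    context: float


def rank(candidates: list[Candidate], preset: str) -> list[Candidate]:
    """Sort candidates by weighted score, scaled down for slower run modes."""
    w = PRESETS[preset]

    def score(c: Candidate) -> float:
        base = (w["quality"] * c.quality + w["speed"] * c.speed
                + w["context"] * c.context)
        return base * MODE_FACTOR[c.mode]

    return sorted(candidates, key=score, reverse=True)


candidates = [
    Candidate("small-7b-q4", "full_gpu", quality=0.5, speed=0.9, context=0.5),
    Candidate("big-30b-q4", "partial_offload", quality=0.95, speed=0.3, context=0.5),
    Candidate("mid-7b-longctx", "full_gpu", quality=0.55, speed=0.6, context=0.9),
]

for preset in PRESETS:
    print(preset, "->", [c.name for c in rank(candidates, preset)])
```

With these made-up numbers the partially offloaded 30B model ranks first only under "Quality First", which matches the situation described above: a top-ranked partial offload signals a quality advantage, not the smoothest daily experience.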

For SEO pages, the goal of the article is not to memorize every model on the user's behalf, but to explain the decision-making logic and bring users back to the recommendation tool. Once VRAM, quantization, context, and running mode have been clearly explained, users can enter their own hardware into the tool and trust the results they get.

Which local large models can run on different amounts of VRAM?