How does Apple unified memory affect local LLMs?

This post explains why not all of a Mac's memory can be used as video memory, and how to choose a suitable model for 16GB, 32GB, 64GB, and 128GB machines.

Unified memory does not mean “all of it is available to the model”

Apple Silicon's unified memory is shared by the CPU, GPU, operating system, apps, and background services. Its advantage is that the CPU and GPU use the same high-speed memory pool, so many local model tools are simpler to deploy than on a traditional discrete graphics card. It does not mean, however, that the full 32GB, 64GB, or 128GB is available for model weights.

When actually selecting a model, leave headroom for macOS, browsers, IDEs, the inference service, the KV cache, and temporary tensors. If the model weights on a 32GB Mac come to 28GB, they may look like they just fit. In practice, memory may be compressed frequently or swapped to disk, and speed may drop significantly. For this reason, Local LLM estimates available space more conservatively in Mac mode.
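The headroom reasoning above can be sketched as a small budget check. This is only an illustration: the 8 GiB system reserve and the 10% safety fraction are assumed placeholder values, not the numbers Local LLM actually uses, and the function names are hypothetical.

```python
def usable_model_budget_gib(total_gib: float,
                            system_reserve_gib: float = 8.0,   # assumed: macOS, browser, IDE
                            safety_fraction: float = 0.10) -> float:  # assumed extra margin
    """Memory left for weights, KV cache, and temporary tensors."""
    budget = total_gib - system_reserve_gib - total_gib * safety_fraction
    return max(budget, 0.0)


def fits(total_gib: float, weights_gib: float, kv_cache_gib: float,
         scratch_gib: float = 1.0) -> bool:
    """True if weights + KV cache + temporary tensors stay inside the budget."""
    needed = weights_gib + kv_cache_gib + scratch_gib
    return needed <= usable_model_budget_gib(total_gib)


if __name__ == "__main__":
    # 28 GiB of weights on a 32 GiB machine: fits on paper, not within the budget.
    print(fits(32, weights_gib=28, kv_cache_gib=2))
    print(fits(32, weights_gib=16, kv_cache_gib=2))
```

Under these assumed values a 32GB machine has roughly 21 GiB of model budget, which is why the 28GB case is rejected even though it is smaller than the total memory.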

What are 16GB, 32GB, 64GB and 128GB suitable for?

A 16GB Mac is better suited to small models at low to medium quantization, such as the Q4/Q5 versions of 3B, 4B, and 7B models. It can handle lightweight chat, summarization, translation, and simple code assistance, but is not suitable for long contexts or vision models. A 32GB Mac can cover more 7B/14B models and can also try more stable quantized versions, which makes it a common starting point for ordinary developers.

From 64GB up, users can try larger MoE or 30B models while still leaving room for long contexts and multitasking. 128GB suits a wider range of experiments, such as large models at higher-bit quantization, comparisons across multiple model versions, long contexts, and complex local workflows. But even with 128GB, you still need to check model weights, KV cache, backend, and speed.
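To make "model weights" and "KV cache" concrete, here is a rough sizing sketch. The formulas are the usual back-of-the-envelope estimates; the 4.5 bits per weight for a Q4-style quantization and the layer and head counts in the example are assumptions for a hypothetical 7B model, so real files will differ.

```python
GIB = 1024 ** 3


def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size: parameters * bits per weight / 8 bytes."""
    return n_params * bits_per_weight / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size: keys and values (factor 2) for every layer,
    KV head, and token in the context, at bytes_per_elem each (2 = fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / GIB


if __name__ == "__main__":
    # Hypothetical 7B model: 32 layers, 8 KV heads, head dimension 128.
    print(round(weights_gib(7e9, 4.5), 2))   # Q4-style quantization (assumed 4.5 bits)
    print(kv_cache_gib(32, 8, 128, 8192))    # fp16 cache at an 8192-token context
```

The KV cache grows linearly with context length, so quadrupling the context of the example quadruples that part of the footprint while the weights stay the same.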

Differences between Metal, MLX and llama.cpp

Common backends on the Mac include llama.cpp with Metal, MLX, Ollama, and LM Studio. These tools differ in their low-level optimizations, so the same model may run at different speeds on different backends. MoE models in particular depend on kernel implementation, so their speed cannot be inferred from parameter count alone.

Therefore, the tok/s shown on the recommendation page should be read as a conservative estimate or range, not an absolute promise. What users really need is a direction for screening: which models will very likely load, which models need more memory, and which models run only in theory and give an unstable experience.
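One common way to produce such a range is a memory-bandwidth heuristic: each generated token has to read roughly the active weights once, so bandwidth divided by active weight size gives a ceiling. The sketch below is only that heuristic, not how any particular tool computes its numbers; the efficiency factors (0.4 to 0.7) and the bandwidth figure are assumed placeholders, and real results depend on the backend and its kernels.

```python
def decode_tok_s_range(bandwidth_gb_s: float,
                       active_weights_gb: float,
                       efficiency: tuple[float, float] = (0.4, 0.7)) -> tuple[float, float]:
    """Rough (low, high) tok/s range for token generation.

    active_weights_gb is the weight data read per token. For a dense model
    this is the whole model; for an MoE model it is only the experts that
    are active per token (an approximation).
    """
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    ceiling = bandwidth_gb_s / active_weights_gb   # theoretical upper bound
    low, high = efficiency                         # assumed backend efficiency
    return ceiling * low, ceiling * high


if __name__ == "__main__":
    # Hypothetical machine with 400 GB/s bandwidth, 4 GB of active weights.
    low, high = decode_tok_s_range(400, 4)
    print(f"roughly {low:.0f} to {high:.0f} tok/s")
```

Reporting the pair rather than a single number keeps the estimate honest about how much the backend can move the result.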

Why 128GB should unlock larger models

If 32GB, 64GB, and 128GB all produce exactly the same quality-first recommendation, it usually means the ranking algorithm is not taking proper advantage of the change in capacity. Larger unified memory should let models with higher parameter counts, higher-bit quantization, or longer context enter the candidate set. The quality-first mode in particular should reflect this.

But "bigger" is not the only goal either. A recommender system needs to balance model quality, task match, execution mode, speed confidence, and memory margin when ranking. A 128GB Mac can run larger models, but that does not mean the largest model should be recommended blindly for every scenario; the best choices for programming, general, mathematical, and vision tasks may differ.
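A minimal sketch of such a ranking is shown below. Everything in it is hypothetical: the candidate names, their scores, the weight tuple, and the two budgets standing in for a 32GB and a 128GB machine are made up for illustration and do not describe the tool's real algorithm. Execution mode is left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float           # 0..1, hypothetical quality score
    task_match: float        # 0..1, fit for the requested task
    speed_confidence: float  # 0..1, trust in the tok/s estimate
    required_gib: float      # weights + KV cache + temporary tensors


# Assumed weights for (quality, task match, speed confidence, memory margin).
QUALITY_FIRST = (0.6, 0.2, 0.1, 0.1)


def score(c: Candidate, budget_gib: float, weights=QUALITY_FIRST):
    """Weighted score, or None if the candidate does not fit the budget."""
    margin = (budget_gib - c.required_gib) / budget_gib
    if margin < 0:
        return None
    w_q, w_t, w_s, w_m = weights
    # Cap the margin so tiny models are not rewarded for leaving memory unused.
    return (w_q * c.quality + w_t * c.task_match
            + w_s * c.speed_confidence + w_m * min(margin, 0.5))


def rank(candidates, budget_gib: float, weights=QUALITY_FIRST) -> list[str]:
    """Names of the candidates that fit, best score first."""
    scored = [(score(c, budget_gib, weights), c) for c in candidates]
    fitting = [(s, c) for s, c in scored if s is not None]
    return [c.name for s, c in sorted(fitting, key=lambda sc: sc[0], reverse=True)]


CANDIDATES = [
    Candidate("small", 0.55, 0.8, 0.9, 6.0),
    Candidate("medium", 0.70, 0.8, 0.8, 18.0),
    Candidate("large", 0.85, 0.8, 0.6, 60.0),
]

if __name__ == "__main__":
    print(rank(CANDIDATES, budget_gib=20.8))   # stand-in for a 32GB machine
    print(rank(CANDIDATES, budget_gib=100.0))  # stand-in for a 128GB machine
```

With these made-up numbers the larger budget lets the "large" candidate enter the set and take first place, while a different weight tuple (for example one that favors speed confidence) could still put a smaller model on top.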

What should Mac users choose?

Ordinary users can start with the balanced mode. If every result runs fully on the GPU within unified memory, the configuration is stable; if many results show partial offloading or low-confidence speed estimates, you need to reduce quantization, shorten the context, or choose a smaller model. For programming tasks, the model's coding ability and context length matter; for vision tasks, make sure the model is actually multimodal.
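The three fallbacks (reduce quantization, shorten the context, choose a smaller model) can be sketched as a simple search. The order used here, shortening the context before dropping to a lower quantization, is an assumption, and the quantization labels, sizes, and KV cost per 1,000 tokens are hypothetical values for one imaginary model.

```python
def first_fitting_config(budget_gib: float,
                         weight_options: list[tuple[str, float]],
                         kv_gib_per_1k_tokens: float,
                         contexts_k: list[int],
                         scratch_gib: float = 1.0):
    """Return the first (quantization label, context in thousands of tokens)
    that fits the budget, or None if nothing fits.

    weight_options: (label, weights_gib) pairs, most preferred first.
    contexts_k: context lengths in thousands of tokens, longest first.
    """
    for label, weights_gib in weight_options:
        for ctx_k in contexts_k:
            needed = weights_gib + kv_gib_per_1k_tokens * ctx_k + scratch_gib
            if needed <= budget_gib:
                return label, ctx_k
    return None  # nothing fits: choose a smaller model instead


if __name__ == "__main__":
    options = [("Q8", 15.0), ("Q5", 10.0), ("Q4", 8.5)]  # hypothetical sizes in GiB
    print(first_fitting_config(12.0, options, 0.125, [32, 16, 8]))
    print(first_fitting_config(6.4, options, 0.125, [32, 16, 8]))
```

A `None` result corresponds to the third fallback in the text: no quantization or context length of this model fits, so a smaller model is the remaining choice.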

The value of Local LLM lies in turning these judgments into visual inputs, rather than leaving users to guess model by model on Hugging Face. The blog post explains the principles, and the tool combines the live model list with the user's hardware to give current recommendations.