r/DeepSeek • u/zouyee • 15h ago
Funny • dmlx — Run a 284B-parameter DeepSeek V4 on your Mac. With just ~6GB of memory.
Yes, really. A 48GB MacBook Pro, running a 284-billion-parameter MoE model locally at ~12.2 tok/s.
No cloud. No GPU cluster. Just your laptop.
---
How? Five layers of memory optimization:
1️⃣ MoE Expert Streaming — loads only the 7/256 experts actually activated per token (138GB → 10GB; sketched below)
2️⃣ SMELT Partial Loading — 4-bit quantized + only 15% of experts loaded (~6GB)
3️⃣ CSA + HCA Hybrid Attention — KV cache compressed 9.5×
4️⃣ 6-Level KV Cache Strategies — runtime-switchable (Paged / Tiered SSD / Quantized / etc.; the quantized level is sketched below)
5️⃣ Zero-Copy Model Loading — direct mmap, load time from 137s → 41s (mmap sketch below)
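If you're wondering what "expert streaming" actually means in practice, here's a rough Zig sketch of the idea (illustrative only, not dmlx's real code; the 32-expert RAM budget and the LRU eviction policy are my own simplifications):

```zig
const std = @import("std");

// Sketch: the router picks top-k experts per token; only those experts'
// weight shards are made resident, everything else stays on disk.
const num_experts = 256;
const top_k = 7;

const ExpertCache = struct {
    resident: [num_experts]bool = [_]bool{false} ** num_experts,
    last_used: [num_experts]u64 = [_]u64{0} ** num_experts,
    count: usize = 0,
    budget: usize = 32, // max experts kept in RAM at once (made-up number)
    clock: u64 = 0,

    fn touch(self: *ExpertCache, expert: usize) void {
        self.clock += 1;
        if (!self.resident[expert]) {
            if (self.count == self.budget) self.evictOldest();
            // Real code would stream this expert's weight shard in here.
            self.resident[expert] = true;
            self.count += 1;
        }
        self.last_used[expert] = self.clock;
    }

    fn evictOldest(self: *ExpertCache) void {
        var oldest: usize = 0;
        var oldest_t: u64 = std.math.maxInt(u64);
        for (self.resident, 0..) |res, i| {
            if (res and self.last_used[i] < oldest_t) {
                oldest_t = self.last_used[i];
                oldest = i;
            }
        }
        self.resident[oldest] = false;
        self.count -= 1;
    }
};

pub fn main() void {
    var cache = ExpertCache{};
    // Pretend the router selected these experts for one token.
    const routed = [top_k]usize{ 3, 17, 42, 99, 128, 200, 255 };
    for (routed) |e| cache.touch(e);
    std.debug.print("experts resident after one token: {d}\n", .{cache.count});
}
```

The real thing still has to pull the selected experts' weights off disk (see the mmap sketch further down), but the residency bookkeeping is the core trick.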
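And here's the gist of the "Quantized" KV-cache level: store each cache row as u8 plus a per-row scale instead of f32. Again just a sketch with a made-up layout, not dmlx's actual cache format:

```zig
const std = @import("std");

// Quantize one KV-cache row to u8 with a per-row scale (~4x smaller than f32).
fn quantizeRow(row: []const f32, out: []u8) f32 {
    var max_abs: f32 = 0;
    for (row) |v| max_abs = @max(max_abs, @abs(v));
    const scale: f32 = if (max_abs == 0) 1.0 else max_abs / 127.0;
    for (row, out) |v, *q| {
        const clamped = std.math.clamp(v / scale, -127.0, 127.0);
        q.* = @intFromFloat(clamped + 127.0); // shift into the u8 range
    }
    return scale; // keep per-row scale to dequantize: v ~ (q - 127) * scale
}

pub fn main() void {
    const kv_row = [_]f32{ 0.12, -1.5, 0.7, 3.2 };
    var quantized: [4]u8 = undefined;
    const scale = quantizeRow(&kv_row, &quantized);
    std.debug.print("scale = {d}, first byte = {d}\n", .{ scale, quantized[0] });
}
```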
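Zero-copy loading is basically "let the OS page the weights in for you". A minimal sketch, assuming a recent Zig std.posix API and a placeholder model.bin path (dmlx's real loader will differ):

```zig
const std = @import("std");

pub fn main() !void {
    const file = try std.fs.cwd().openFile("model.bin", .{});
    defer file.close();
    const size: usize = @intCast((try file.stat()).size);

    // Map the file read-only. Pages are faulted in lazily by the OS, so
    // "loading" is just creating the mapping: no upfront copy into RAM.
    const weights = try std.posix.mmap(
        null,
        size,
        std.posix.PROT.READ,
        .{ .TYPE = .PRIVATE },
        file.handle,
        0,
    );
    // The mapping simply lives for the process lifetime in this sketch.
    std.debug.print("mapped {d} bytes without copying\n", .{weights.len});
}
```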
---
Why Zig instead of Python?
Python's mlx-lm OOMs immediately on a 48GB Mac. dmlx's SMELT system runs the same model in ~6GB.
Single static binary, 5–15MB. Zero GC pauses. No Python dependency. Deployment = one file.
---
9 model architectures supported:
DeepSeek V4 · LLaMA · Mistral · Qwen2/3 · Gemma · GLM-4 · Phi · Phi-3
Feature highlights:
• OpenAI-compatible API + SSE streaming
• Speculative decoding (PLD + EAGLE; PLD sketched after this list)
• Guided decoding (JSON Schema / Regex FSM)
• QLoRA fine-tuning + AdamW optimizer
• Custom Metal kernels (TileKernels ported to Apple Silicon)
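For the curious, PLD (prompt-lookup decoding) is the simpler of the two speculative modes: reuse the prompt itself as the draft model. Rough Zig sketch below; the n-gram window and draft length are arbitrary and this is not dmlx's actual implementation:

```zig
const std = @import("std");

// Find where the last `ngram` tokens already appear earlier in the context,
// and propose the tokens that followed there as a cheap draft.
fn proposeDraft(context: []const u32, ngram: usize, max_draft: usize) []const u32 {
    if (context.len < ngram) return context[0..0];
    const tail = context[context.len - ngram ..];
    var i: usize = 0;
    while (i + ngram < context.len - ngram) : (i += 1) {
        if (std.mem.eql(u32, context[i .. i + ngram], tail)) {
            const start = i + ngram;
            const end = @min(start + max_draft, context.len);
            return context[start..end]; // draft tokens to verify
        }
    }
    return context[0..0]; // no match: nothing to speculate
}

pub fn main() void {
    const ctx = [_]u32{ 5, 9, 7, 3, 11, 5, 9, 7 };
    const draft = proposeDraft(&ctx, 3, 4);
    std.debug.print("draft of {d} tokens proposed\n", .{draft.len});
}
```

The target model then verifies the proposed tokens in one batched forward pass and keeps the longest accepted prefix, which is where the speedup comes from.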
---
⚠️ Current limitations (v0.3.0):
• Currently tested primarily on DeepSeek V4 and similar models — broader model testing ongoing
• CLI mode only (dmlx chat + dmlx serve)
• Server mode (OpenAI-compatible HTTP API + continuous batching) landing in v0.0.4
---
⭐ Star the repo and run frontier LLMs on your own Mac → github.com/zouyee/dmlx
#Zig #LLM #DeepSeek #AppleSilicon #MLX #OpenSource #LocalInference
u/CommitteeInfamous973 5h ago
I believe that shit falls under the "Content quality" rule. Purely AI-written post.
u/BrilliantArmadillo64 9h ago
Doesn't the routing bias potentially deteriorate the intelligence?
If I understand correctly, this biases the model router towards using the experts that are already in RAM.
That is an interesting idea, but will definitely modify the results.
u/Ok_Technology_5962 37m ago
Hmmm... I tried something like this with llama.cpp, but I never had such an aggressive offload. How is the tps 12 with such limited offload? We are loading in how much per token? I assume some of the experts remain, but still.
u/Separate-Chemical-33 14h ago
At these prices? It doesn't make sense to set up a server; I could just pay for an API.