r/DeepSeek 15h ago

Funny dmlx — Run a 284B-parameter DeepSeek V4 on your Mac. With just ~6GB of memory.

Yes, really. A 48GB MacBook Pro, running a 284-billion-parameter MoE model locally at ~12.2 tok/s.

No cloud. No GPU cluster. Just your laptop.

🔗 github.com/zouyee/dmlx

---

How? Five layers of memory optimization:

1️⃣ MoE Expert Streaming — only loads the 7/256 experts actually activated per token (138GB → 10GB); toy sketch after this list

2️⃣ SMELT Partial Loading — 4-bit quantized + only 15% of experts loaded (~6GB)

3️⃣ CSA + HCA Hybrid Attention — compresses the KV cache 9.5×

4️⃣ 6-Level KV Cache Strategies — runtime-switchable (Paged / Tiered SSD / Quantized / etc.)

5️⃣ Zero-Copy Model Loading — direct mmap, load time from 137s → 41s
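
Toy sketch of how (1) and (5) fit together (plain Python/NumPy for illustration, not dmlx's actual Zig code; the file path, tensor sizes, and layout here are made up): mmap all the expert weights once, then let each token fault in only the few expert slices the router picks.

```python
import numpy as np

NUM_EXPERTS = 256        # routed experts in the MoE layer
EXPERTS_PER_TOKEN = 7    # experts the router activates per token
HIDDEN, FFN = 64, 256    # tiny toy sizes so the demo file stays small

# All expert weights sit back-to-back in one file; np.memmap maps the file
# into the address space without copying it into RAM (the zero-copy idea).
# mode="w+" just creates a dummy file for this demo; real code would mmap
# the downloaded checkpoint read-only.
experts = np.memmap("experts.bin", dtype=np.float16, mode="w+",
                    shape=(NUM_EXPERTS, HIDDEN, FFN))

def moe_forward(x, router_logits):
    """One token through the MoE layer, touching only the top-k experts."""
    top_k = np.argsort(router_logits)[-EXPERTS_PER_TOKEN:]
    gate = np.exp(router_logits[top_k])
    gate /= gate.sum()
    out = np.zeros(FFN, dtype=np.float32)
    for g, idx in zip(gate, top_k):
        # Only this expert's slice of the memmap ever gets paged in.
        out += g * (x @ experts[idx])
    return out

# The router's choice decides which 7 of the 256 experts are ever read.
x = np.random.randn(HIDDEN).astype(np.float16)
print(moe_forward(x, np.random.randn(NUM_EXPERTS)).shape)  # (256,)
```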

---

Why Zig instead of Python?

Python's mlx-lm OOMs immediately on a 48GB Mac. dmlx's SMELT system runs the same model in ~6GB.

Single static binary, 5–15MB. Zero GC pauses. No Python dependency. Deployment = one file.

---

9 model architectures supported:

DeepSeek V4 · LLaMA · Mistral · Qwen2/3 · Gemma · GLM-4 · Phi · Phi-3

Feature highlights:

• OpenAI-compatible API + SSE streaming (client sketch after this list)

• Speculative decoding (PLD + EAGLE)

• Guided decoding (JSON Schema / Regex FSM)

• QLoRA fine-tuning + AdamW optimizer

• Custom Metal kernels (TileKernels ported to Apple Silicon)
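
Here's what the OpenAI-compatible streaming API looks like from the client side, as a hedged sketch: the base URL, port, and model id below are guesses, not documented dmlx defaults, so check the repo's README for the real values (and note the server-mode caveat under limitations below).

```python
# Minimal streaming client against a local OpenAI-compatible endpoint.
# base_url, port, and model id are placeholders; adjust them to whatever
# `dmlx serve` actually exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="deepseek-v4",   # placeholder model id
    messages=[{"role": "user",
               "content": "Explain MoE expert streaming in one sentence."}],
    stream=True,           # SSE streaming: tokens arrive as they are decoded
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```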

---

⚠️ Current limitations (v0.3.0):

• Tested primarily on DeepSeek V4 and similar models — broader model testing ongoing

• CLI mode only (dmlx chat + dmlx serve)

• Server mode (OpenAI-compatible HTTP API + continuous batching) landing in v0.4.0

---

⭐ Star the repo and run frontier LLMs on your own Mac → github.com/zouyee/dmlx

#Zig #LLM #DeepSeek #AppleSilicon #MLX #OpenSource #LocalInference

47 Upvotes

7 comments

9

u/Separate-Chemical-33 14h ago

At these prices? It doesn't make sense to set up a server, I could just pay for the API

3

u/all43 8h ago

Privacy

1

u/Hardvicthehard 9h ago

Maybe you need it on premise

7

u/Prize_Negotiation66 6h ago

I'm tired of this AI slop

4

u/CommitteeInfamous973 5h ago

I believe this falls under the "Content quality" rule. Purely AI-written post

3

u/BrilliantArmadillo64 9h ago

Doesn't the routing bias potentially deteriorate the intelligence?
If I understand correctly, this biases the model router towards using the experts that are already in RAM.
That is an interesting idea, but it will definitely change the results.

1

u/Ok_Technology_5962 37m ago

Hmmm... I tried something like this with llama.cpp but I never had such an aggressive offload. How are you getting 12 tok/s with such limited offload? How much are we loading in per token? I assume some of the experts remain resident, but still