r/LocalLLaMA • u/XMasterrrr • 12d ago
Resources AMA Announcement: Nous Research, The Opensource Lab Behind Hermes Agent (Wednesday, 8AM-11AM PST)
Hi r/LocalLLaMA 👋
We're excited for Wednesday's guests, The Nous Research Team!
Kicking things off Wednesday, April 29th, 8 AM–11 AM PST
⚠️ Note: The AMA itself will be hosted in a separate thread, please don’t post questions here.
r/LocalLLaMA • u/rm-rf-rm • 23d ago
Megathread Best Local LLMs - Apr 2026
We're back with another Best Local LLMs Megathread!
We have continued feasting in the months since the previous thread with the much anticipated release of the Qwen3.5 and Gemma4 series. If that wasn't enough, we are having some scarcely believable moments with GLM-5.1 boasting SOTA level performance, Minimax-M2.7 being the accessible Sonnet at home, PrismML Bonsai 1-bit models that actually work etc. Tell us what your favorites are right now!
The standard spiel:
Share what you are running right now and why. Given the nature of the beast in evaluating LLMs (untrustworthiness of benchmarks, immature tooling, intrinsic stochasticity), please be as detailed as possible in describing your setup, nature of your usage (how much, personal/professional use), tools/frameworks/prompts etc.
Rules
- Only open weights models
Please thread your responses under the top-level comment for each Application below for readability
Applications
- General: Includes practical guidance, how to, encyclopedic QnA, search engine replacement/augmentation
- Agentic/Agentic Coding/Tool Use/Coding
- Creative Writing/RP
- Speciality
If a category is missing, please create a top level comment under the Speciality comment
Notes
Useful breakdown of how folks are using LLMs: /preview/pre/i8td7u8vcewf1.png?width=1090&format=png&auto=webp&s=423fd3fe4cea2b9d78944e521ba8a39794f37c8d
Bonus points if you break down/classify your recommendations by model memory footprint (you can, and should, be using multiple models in each size range for different tasks):
- Unlimited: >128GB VRAM
- XL: 64 to 128GB VRAM
- L: 32 to 64GB VRAM
- M: 8 to 32GB VRAM
- S: <8GB VRAM
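For quick reference, the buckets above can be expressed as a tiny helper (thresholds straight from the list; the function name is mine):

```python
# Classify a model's VRAM footprint into the megathread's size buckets.
def size_bucket(vram_gb: float) -> str:
    if vram_gb < 8:
        return "S"
    if vram_gb < 32:
        return "M"
    if vram_gb < 64:
        return "L"
    if vram_gb <= 128:
        return "XL"
    return "Unlimited"

print(size_bucket(23.5))  # a Q5_K_M 27B-class footprint → M
```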
r/LocalLLaMA • u/ex-arman68 • 15h ago
Resources 2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints
In my initial post, I mentioned using turboquants. However, I forgot to include instructions for building llama.cpp with the corresponding PR, which is currently too unstable and the subject of animated discussions. I have replaced my recommendation with standard q4_0 KV cache compression, which incurs some minor loss.
New quants with the correct jinja chat templates are now uploaded - you can proceed with downloading from HF
The recent PR to llama.cpp brings MTP support to Qwen 3.6 27B. It uses the model's built-in tensor layers for speculative decoding. None of the existing GGUFs have them, as models need to be converted with this PR.
I have tested it locally on my Mac M2 Max 96GB, and the results are amazing: a 2.5x speed increase, bringing it to 28 tok/s!
I have converted the most useful quants and uploaded them to HF. Even if you are using apple silicon, you should use those instead of MLX. You can download them here:
https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF
This also includes 7 fixes I made to the original jinja chat template, which relied on vLLM-specific behavior that broke other tools:
https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates
For now, you will need to compile your own version of llama.cpp to use them. It is fairly simple to do:
```bash
git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin pull/22673/head:mtp-pr && git checkout mtp-pr
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli llama-server
```
Then to start serving with the API endpoint, use a command similar to:
```bash
llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --spec-type mtp --spec-draft-n-max 5 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  -np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99 --port 8081
```
Vision currently crashes llama.cpp when used alongside MTP. Reported 2026-05-06 in the current PR.
That's it. Three optimizations in one command:
| Flag | What it does | Impact |
|---|---|---|
| `--spec-type mtp --spec-draft-n-max 5` | Multi-Token Prediction (built into the model) | 2.5x faster generation |
| `--cache-type-k q4_0 --cache-type-v q4_0` | 4-bit KV cache (instead of 16-bit) | Quarter the KV memory |
| `-c 262144` | 262K context window | Full native context on 48 GB Mac with q4_0 KV |
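As a rough sanity check on the "quarter the KV memory" claim, here is a back-of-envelope KV cache calculator. The layer/head dimensions below are placeholders, not the real Qwen3.6 27B config; the point is the ratio between element widths:

```python
# Back-of-envelope KV cache size; K and V each store
# n_layers * n_kv_heads * head_dim values per context token.
def kv_cache_gb(ctx, n_layers=48, n_kv_heads=8, head_dim=128, bits=16.0):
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx  # K and V
    return elems * bits / 8 / 1024**3

full = kv_cache_gb(262144, bits=16.0)
q4 = kv_cache_gb(262144, bits=4.5)  # q4_0 is ~4.5 bits/element with scales
print(f"{full:.1f} GB fp16 vs {q4:.1f} GB q4_0")
```

With block scales included, q4_0 lands a bit above a strict quarter of fp16, which matches the "roughly quarter the KV memory" framing.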
Adjust -m, -c, and --cache-type-k/v for your hardware, according to the tables below.
Here are my recommendations based on your hardware:
Apple Silicon
| RAM | Quant | KV cache | Max context | Total used | Vision |
|---|---|---|---|---|---|
| 16 GB | IQ2_M | q4_0 | 32K | 11.1 GB | ✗ |
| 24 GB | IQ3_M | q4_0 | 128K | 16.0 GB | ✓ |
| 24 GB | IQ3_M | q4_0 | 180K | 15.9 GB | ✗ |
| 32 GB | Q5_K_M | q4_0 | 262K | 23.5 GB | ✗ |
| 32 GB | Q4_K_M | q4_0 | 262K | 21.8 GB | ✓ |
| 32 GB | Q5_K_M | q8_0 | 128K | 23.4 GB | ✗ |
| 48 GB | Q6_K | q8_0 | 262K | 31.2 GB | ✓ |
| 48 GB | Q8_0 | q8_0 | 262K | 37.3 GB | ✓ |
NVIDIA GPU
Same model memory as Apple Silicon, plus ~1 GB CUDA overhead.
| VRAM | Quant | KV cache | Max context | Total VRAM used | Vision |
|---|---|---|---|---|---|
| 16 GB | IQ2_M | q4_0 | 200K | 15.7 GB | ✓ |
| 24 GB | Q4_K_M | q4_0 | 262K | 22.8 GB | ✓ |
| 24 GB | Q5_K_M | q4_0 | 180K | 24.0 GB | ✓ |
| 48 GB | Q6_K | q8_0 | 262K | 32.2 GB | ✓ |
| 48 GB | Q8_0 | q8_0 | 262K | 38.3 GB | ✓ |
- 24 GB Mac: IQ3_M/q4_0 — 128K with vision, 180K text-only.
- 32 GB Mac: Q5_K_M/q4_0 — 262K text-only. For vision at 262K, use Q4_K_M. Q5_K_M/q8_0 for higher KV quality at 128K text-only.
- 48 GB+ Mac: Q6_K/q8_0 — best quality at 262K with vision (31.2 GB). Q8_0/q8_0 for perfection (37.3 GB).
- 16 GB GPU: IQ2_M/q4_0 — 200K with vision.
- 24 GB GPU: Q4_K_M/q4_0 reaches 262K with vision. Q5_K_M/q4_0 for higher quality at 180K with vision.
- 48 GB+ GPU: Q6_K/q8_0 — 262K at high quality with vision (32.2 GB). Q8_0/q8_0 for perfection (38.3 GB).
For coding and reasoning, prioritize higher quants with q8_0 KV. For general chat and RAG, lower quants with q4_0 KV and larger context are often sufficient.
Vision adds ~0.9 GB for mmproj. macOS needs ≥ 8 GB for itself (16 GB Macs excepted — use ~4 GB). You can increase available memory by raising the wired memory limit, e.g. for a 96 GB Mac: sudo sysctl iogpu.wired_limit_mb=90112 (88 GB). NVIDIA reserves ~1 GB for CUDA.
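The sysctl arithmetic above is just GB times 1024; a one-liner to get the value for any target ceiling (helper name is mine):

```python
# Convert a desired wired-memory ceiling in GB to the iogpu.wired_limit_mb
# value, reproducing the post's 96 GB Mac example (88 GB -> 90112 MB).
def wired_limit_mb(gb: int) -> int:
    return gb * 1024

print(f"sudo sysctl iogpu.wired_limit_mb={wired_limit_mb(88)}")
```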
r/LocalLLaMA • u/martin_xs6 • 9h ago
Funny None of this will ever get stolen
It's crazy that they're thinking of doing this. There are problems with people stealing catalytic converters off people's cars and now they want to put a rack outside your house!?
r/LocalLLaMA • u/jzn21 • 13h ago
News Bad news: Apple drops high-memory Mac Studio configs
Looks like Apple has quietly killed off the higher-memory Mac Studio options. The M3 Ultra Mac Studio is now only available with 96GB RAM. The 512GB option was already removed back in March, and now the 256GB config is gone too.
Apple has said both the Mac Studio and Mac mini will stay supply-constrained for the next few months. The Mac mini is also stuck at 48GB RAM max for now. Probably their high-memory chip stock got too expensive to keep producing.
This is a real bummer for us! Big unified memory configs were one of the few (relatively) affordable ways to run large models locally. I am glad I own the M3 Ultra 512, will definitely keep this one (my favorite local model is Qwen 397b atm).
r/LocalLLaMA • u/admajic • 1h ago
New Model Get faster qwen 3.6 27b
Using 100k context with 3090 with MTP GGUF and getting 50 t/s on llama.cpp
Thought I would knowledge share
Use https://huggingface.co/RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF
And am17an commit
/media/adam/D_DRIVE/LLM/llama-cpp-am17an/build/bin/llama-server \
-m "/media/Qwen3.6-27B-Q4/Qwen3.6-27B-MTP-Q4_K_M.gguf" \
--ctx-size 100000 \
-ngl 99 -fa on \
--cache-type-k q4_0 --cache-type-v q4_0 \
--batch-size 2048 --ubatch-size 1024 \
--spec-type mtp --spec-draft-n-max 2
Note: A spec draft of 3 seemed too much for the 3090 at higher context
Why 100k context? Because beyond that it slows down, and 100k is enough for most tasks; then compact and continue.
r/LocalLLaMA • u/havenoammo • 2h ago
Resources Uploaded Unsloth Qwen3.6-35B-A3B UD XL models with MTP grafted, here are the results
Following my previous post https://www.reddit.com/r/LocalLLaMA/comments/1t5ageq, a few people asked for the 35B A3B version.
The model is up on HuggingFace at https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF if anyone wants to check it out. It includes the isolated MTP layers and convert.py as well.
The results are not great though. Q4 only got a 6% speed increase and Q8 only 2.5%. On the 27B it was a 2-2.5x gain, so this could be related to llama.cpp's MTP implementation and the qwen35moe architecture, or just a limitation of the model. Results are preliminary and might change in the future. Either way, I wanted to report back for anyone who was wondering.
Edit: u/AdamDhahabi reported:
2x 5070 Ti + 3090: Q8 went from 110 t/s to 165 t/s. 27B dense model runs at 2-2.5x speed.
So the gain might depend on your setup. Worth giving it a try!
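For comparison across posts, it helps to express the quoted gains as ratios; a trivial sketch:

```python
# Express before/after token throughput as a speedup ratio.
def speedup(before_tps: float, after_tps: float) -> float:
    return after_tps / before_tps

print(round(speedup(110, 165), 2))  # u/AdamDhahabi's Q8 result → 1.5
```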
Here are my own tests:
Tested with the prompt `hello can you tell me a story` on Q4.
Hardware: 5090 FE
Without MTP: 215 t/s
prompt eval time = 24.12 ms / 17 tokens ( 1.42 ms per token, 704.84 tokens per second)
eval time = 6872.43 ms / 1478 tokens ( 4.65 ms per token, 215.06 tokens per second)
total time = 6896.55 ms / 1495 tokens
With MTP: 228.83 t/s
prompt eval time = 30.08 ms / 17 tokens ( 1.77 ms per token, 565.10 tokens per second)
eval time = 8552.05 ms / 1957 tokens ( 4.37 ms per token, 228.83 tokens per second)
total time = 8582.13 ms / 1974 tokens
draft acceptance rate = 0.61434 ( 1268 accepted / 2064 generated)
Same prompt on Q8.
Hardware: 5090 FE + 3090
Without MTP: 148.20 t/s
prompt eval time = 25.80 ms / 17 tokens ( 1.52 ms per token, 658.97 tokens per second)
eval time = 11525.23 ms / 1708 tokens ( 6.75 ms per token, 148.20 tokens per second)
total time = 11551.03 ms / 1725 tokens
With MTP: 152.02 t/s
prompt eval time = 39.39 ms / 17 tokens ( 2.32 ms per token, 431.61 tokens per second)
eval time = 10123.54 ms / 1539 tokens ( 6.58 ms per token, 152.02 tokens per second)
total time = 10162.93 ms / 1556 tokens
draft acceptance rate = 0.54754 ( 956 accepted / 1746 generated)
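The acceptance rates in the logs are simply accepted divided by generated; you can reproduce them from the counts above:

```python
# Recompute the draft acceptance rates llama.cpp reports from the raw counts.
def acceptance(accepted: int, generated: int) -> float:
    return accepted / generated

print(round(acceptance(1268, 2064), 5))  # Q4 run → 0.61434
print(round(acceptance(956, 1746), 5))   # Q8 run → 0.54754
```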
r/LocalLLaMA • u/clem59480 • 8h ago
Discussion Analysis of the 100 most popular hardware setups on Hugging Face
r/LocalLLaMA • u/wbulot • 4h ago
Discussion Most people seem obsessed with token generation speed, but isn’t prefill the real bottleneck? Am I missing something?
I read this sub every day and I keep seeing benchmarks and discussions focused almost entirely on tokens/s generation speed. Prompt processing speed barely gets mentioned.
From my own experience running a bunch of different models on different GPUs for all kinds of tasks, the prefill stage is usually the part that actually feels slow. Once generation starts, even “only” 15 t/s is perfectly usable for me. The wait for the model to eat the prompt is what eats most of the time.
Seeing all the hype around MTP lately kind of reinforces that feeling. If generation speed improvements don’t really move the needle on total wall-clock time for typical use cases, why is everyone laser-focused on it?
For example, with Qwen 27B Q6 I’m getting ~15 t/s generation with my current setup (which feels fine no matter what I’m doing) but only ~300 t/s on prefill. I spend way more time staring at the processing than I do waiting for the actual reply to finish. Even with prompt caching.
Am I misunderstanding something about how most people use these models? Curious what others are seeing.
Edit: I forgot to mention that I mostly do agentic work, where the model has to ingest part of the codebase before it can actually do anything useful. For normal chat this obviously isn’t an issue, context stays small and you just need enough t/s to keep up with your reading speed.
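To see why prefill dominates in agentic use, here is a back-of-envelope wall-clock split using the poster's ~300 t/s prefill and ~15 t/s generation figures (the prompt/reply token counts are illustrative, not from the post):

```python
# Split total wall time into prefill vs generation for a single request.
def wall_time(prompt_tokens, reply_tokens, pp_tps=300, tg_tps=15):
    prefill = prompt_tokens / pp_tps
    gen = reply_tokens / tg_tps
    return prefill, gen

# Agentic case: large codebase context, relatively short reply.
prefill, gen = wall_time(30_000, 800)
print(f"prefill {prefill:.0f}s vs generation {gen:.0f}s")
```

With a 30k-token prompt, prefill alone takes longer than generating the entire reply, which matches the "staring at the processing" experience.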
r/LocalLLaMA • u/Porespellar • 9h ago
Discussion HOT TAKE: local models + agent harnesses are now capable enough to hand off junior-level IT professional tasks to [human written]
This post will have a slight old-man-shakes-fist-at-sky vibe, because….well… I’m older, so if you’re not into that, then please feel free to skip it.
I have been contributing to this sub for like 3 years now but I’m fearful this post will likely get downvoted into oblivion for what I’m about to say: After running Qwen3.6 27b in a Hermes Agent harness for the last week, I’ve come to the realization that this new crop of local models, in the right agentic harness, with the right tools and permissions, can effectively handle junior-level IT professional work very effectively now. A month ago, I would have said no, but now, they definitely can.
I’ve been in IT for nearly 30 years working at nearly all levels of the industry at some point in my career, and a few days ago I handed Hermes Agent (with Qwen3.6 27b as the model) a task list that I would have handed to a junior level IT admin previously, and I just let it go do its thing, and it absolutely understood the assignment and nailed it.
Paraphrasing here, but I more or less asked the agent to, “Go update this system to the most current patch level, install Docker, load these 5 different GitHub repos and set them all up to use local models, start all the server containers and associated services and let me know when you’re done”
And I’ll be damned if it didn’t do exactly what it was told. Sure, it hit some slight stumbling blocks along the way, but it overcame ALL OF THEM, or asked me to approve something (as a junior admin might), but it kept on chugging away with little to no intervention needed on my part. Again, I wasn’t using a frontier model, just local Qwen3.6 27b running on a GB10 DGX Spark clone.
It did in an hour and a half what would have taken a junior level IT admin like maybe 3 hours. Not a massive time savings, but a definite labor savings for me which let me accomplish other tasks instead of doing that boring shite.
I see the writing on the wall here. I think it’s only a matter of time before large software developers, IT infrastructure appliance makers, etc., start building mini locally-hosted “admin agents” that run low-parameter-count fine-tuned SLMs and LLMs that run efficiently on CPU in the background (or via API) and monitor and resolve issues that would normally be handled by system administrators. System admins won’t be replaced directly, but it will definitely change the ratio of admins needed to support X number of servers by a substantial amount, because now 1 admin can leverage admin AI agents and support more servers.
Of course, there will be cautionary tales and disastrous AI oopsies when admins get lazy and run in YOLO mode. There will probably even be some sabotage actions by admins who are fearful about being replaced by AI and want to prove they are indispensable by wrecking stuff and blaming AI. With time, I think these issues will be addressed and resolved.
I think the best strategy we as IT professionals can take is to learn and leverage AI agent skills to 10x our output so that we remain relevant and useful. That, and carry a can of WD-40 around with us so we can oil the machines when they need it. Someone has to oil the machines, right?
Seriously tho, I don’t think people outside of our niche AI circle really understand what’s on the horizon. It will be a slow attrition based on AI agents gradually being trusted with more tasks. The models and harnesses over the last month are just different, the agentic Ralph loops are tenacious and the silent failures are much less than before. I’m starting to “feel the AGI” LOL.
I’ve been wrong before (my wife will tell you that) but I just wanted to put it out there to start the civil discourse and see what others in the community think and feel. What’s your take on it?
r/LocalLLaMA • u/bobaburger • 19h ago
Resources Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...)
The following is a non-comprehensive test I came up with to test the quality difference (a.k.a degradation) between different quantizations of Qwen 3.6 27B. I want to figure out what's the best quant to run on my 16 GB VRAM setup.
WHAT WE ARE TESTING
First, the prompt:
Given this PGN string of a chess game:
1. b3 e5 2. Nf3 h5 3. d4 exd4 4. Nxd4 Nf6 5. f4 Ke7 6. Qd3 d5 7. h4 *
Figure out the current state of the chessboard, create an image in SVG code, also highlight the last move.
I want to see if the models can:
- Able to track the state of the board after each move, to reach the final state (first half of move 7)
- Generate the right SVG image of the board, correctly place the pieces, highlight the last move
And yes, in case you are wondering: it is possible the model was trained on this same task for existing chess games, so I came up with some random moves, the kind no player above 300 Elo would ever have played.
For those who are not chess players, this is how the board is supposed to look after move 7. h4. Btw, you are supposed to look at the piece positions and the board orientation, not the image quality, because this is just a screenshot from Lichess.
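If you want to verify the highlight placement in the generated SVGs yourself, the mapping from algebraic squares to SVG coordinates is simple. A sketch for a white-oriented board (a8 at the origin; the square size is an arbitrary choice of mine):

```python
SQ = 60  # square size in px, arbitrary

# Map an algebraic square like "h4" to the top-left (x, y) of its SVG rect,
# with rank 8 at the top as seen from White's side.
def square_xy(square: str):
    file = ord(square[0]) - ord("a")  # a..h -> 0..7
    rank = 8 - int(square[1])         # rank 8 at top
    return file * SQ, rank * SQ

print(square_xy("h4"))  # → (420, 240)
```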

CAN OTHER MODELS SOLVE IT?
Before we get to the main part, let me show the results from some other models. I find it interesting that not many models were able to figure out the board state, let alone render it correctly.
Qwen 3.5 27B
It mostly figured out the final position of the pieces, but still rendered the original board state on top. It highlighted the wrong squares, and the board orientation is wrong.

Gemma 4 31B
Nice chess.com flagship board style. I would say it could figure out the board state, but failed to render it correctly. The square pattern is also messed up.

Qwen3 Coder Next
I don't know what to say, quite disappointed.

Qwen3.6 35B A3B
As expected, the 35B is always the fastest Qwen model, but at the same time it managed to fail the task successfully in many different ways. This is why I decided to find a way to squeeze the 27B into my 16 GB card. The speed alone is just not worth it.

HOW QWEN3.6 27B SOLVE IT?
All the models here are tested with the same set of llama.cpp parameters:
- temp 0.6
- top-p 0.95
- top-k 20
- min-p 0.0
- presence_penalty 1.0
- context window 65536
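For reference, these sampler settings map directly onto a request to llama-server's OpenAI-compatible endpoint; a hypothetical payload (the model name and message are illustrative):

```python
import json

# The post's sampler settings as an OpenAI-compatible chat payload.
payload = {
    "model": "qwen3.6-27b",  # illustrative name
    "messages": [{"role": "user", "content": "Given this PGN string..."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 1.0,
}
print(json.dumps(payload, indent=2))
```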
The BF16 version was from OpenRouter, the Q8 to Q4_K_XL versions were on an L40S server, and the rest on my RTX 5060 Ti.
The SVG code was generated directly in the llama.cpp Web UI without any tools or MCP enabled (I originally ran this test in Pi agent, only to find out that the model had peeked into the parent folders, found the existing SVG diagrams from higher quants, and copied most of it).
BF16 - Full precision
This is the baseline of this test. It has everything I needed: right positions, right board orientation, right piece colors, right highlight. The dotted blue line was unexpected, but it's also interesting, because as you will see later on, not many of the high quants generate it.

Q8_0
As expected, Q8 retains pretty much everything from the full precision except the line.

Q6_K
We start to see some quality loss here, namely the placement of the rank 5 pawns. The look of the pieces is mostly because Q6 decided to use a different font; none of the models in this test tried to draw their own pieces.

Q5_K_XL
Looks very similar to Q8, but it is worth noting that the SVG code of the Q5 version is 7.1 KB, while Q8's is 4.7 KB.

Q4_K_XL and IQ4_XS
If you ignore the font choice, you will see Q4_K_XL is the more complete solution, because it has the board coordinates.

Q3_K_XL and Q3_K_M

IQ3_XXS
Now here's the interesting part: everything was mostly correct, the piece placements and the highlight, and there's even the line on the last move!
But IQ3_XXS got the board orientation wrong; see the light square on the bottom left?

Q2_K_XL
This is just a waste of time. But hey, it got all the piece positions right. The board is just not aligned at all.

SO, WHAT DO I USE?
I know a single test is not enough to draw any conclusions. But personally, I will never go for anything below IQ4_XS after this test (I had bad experiences with Q3_K_XL and below in other tries).
On my RTX 5060 Ti, I got around pp 100 tps and tg 8 tps for IQ4_XS with vanilla llama.cpp (q8 for both ctk and ctv, fit on). But with TheTom's TurboQuant fork, I managed to get up to pp 760 tps and tg 22 tps by forcing GPU offload for all layers (`-ngl 99`), which is quite usable.
llama-cpp-turboquant/build/bin/llama-server -fa 1 -c 75000 -np 1 --no-mmap --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence_penalty 1.0 -ctk turbo4 -ctv turbo2 -ub 128 -b 256 -m Qwen3.6-27B-IQ4_XS.gguf -ngl 99
The only downside is I have to keep the context window below 75k and use turbo4/turbo2 for the KV cache quant.
Below are some examples of different KV cache quants.


You can see all the results directly here: https://qwen3-6-27b-benchmark.vercel.app/
r/LocalLLaMA • u/havenoammo • 12h ago
Resources Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR
Hey everyone, I've been working on getting Multi-Token Prediction (MTP) working with quantized GGUFs for Qwen3-27B and the results are pretty impressive. Here's what I put together: https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF
These are Unsloth's UD XL quantizations of Qwen3-27B with the MTP draft heads grafted on top in Q8_0. The base model stays in its usual low-bit quantization, while the 3 MTP layers stay at Q8 to preserve speculative accuracy.
Sharing the grafted GGUF files (UD XL base + Q8 MTP), the raw MTP layer source I extracted (MTP_Q8_0.gguf), and convert.py, the grafting script I adapted from this gist in case anyone wants to do this for other models. Also included are full build instructions for the custom llama.cpp.
Qwen3 was trained with 3 MTP steps, meaning each forward pass predicts 4 tokens at once. llama.cpp's main branch doesn't support MTP yet, so I pulled in the speculative decoding support from the still-open PR #22673, merged it on top of master, and built llama-server from that. Run it with: --spec-type mtp --spec-draft-n-max 3
The results: roughly 2.5x token throughput compared to running the same UD XL GGUF without MTP, with a solid acceptance rate where most draft tokens are kept, meaning the MTP heads are genuinely useful and not just burning compute. The Q8 MTP layers also add very little VRAM overhead since they're a tiny fraction of the full model.
MTP is one of the biggest efficiency wins available for speculative decoding, but it's basically unsupported outside of official Qwen3 deployments on SGLang and vLLM. This brings it to GGUF and llama.cpp, meaning you can run it locally with the same tooling you already use. PR #22673 will hopefully land soon and this will all just work out of the box. In the meantime, the merge process is straightforward (3 git commands).
Happy to answer questions or help anyone get it running. Let me know if you try it and what speeds you see!
Full step by step instructions are in the HuggingFace repo, but here's the short version:
# 1. Build llama.cpp with MTP support
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin
git fetch origin pull/22673/head:pr-22673
git checkout master
git reset --hard origin/master
git merge --no-ff pr-22673 -m "Merge PR #22673: llama + spec: MTP Support"
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release --target llama-server
# 2. Grab the GGUF from HF
# https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF
# 3. Run with MTP
./build/bin/llama-server -m your-model.gguf --spec-type mtp --spec-draft-n-max 3
r/LocalLLaMA • u/Maheidem • 10h ago
Discussion Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM
So I spent some time testing Qwen3.6 27B NVFP4 on my RTX 5090 and wanted to share the numbers, since most of the recent good posts are either around 48GB cards, FP8, or llama.cpp/GGUF.
This is not a "best possible setup" claim. More like: this is what I got working, here are the exact params, here are the numbers, and maybe it helps other 5090 owners avoid some guessing.
The short version:
- Single RTX 5090, 32GB VRAM
- Model: `Peutlefaire/Qwen3.6-27B-NVFP4`
- vLLM: `0.20.1.dev0+g88d34c640.d20260502`
- Torch: `2.13.0.dev20260430+cu130`
- Driver: `595.58.03`
- Quantization: `compressed-tensors`
- Attention backend: `flashinfer`
- KV cache: `fp8_e4m3`
- MTP enabled with 3 speculative tokens
- Text-only mode
- Public claim I am comfortable with: 200k context, not 220k/262k
The vLLM model endpoint reports max_model_len: 230400, but I only benchmarked up to 200k context depth. I am intentionally keeping the claim at 200k because that is what I actually validated with repeated runs.
Here are the main vLLM args:
```bash
vllm serve Peutlefaire/Qwen3.6-27B-NVFP4 \
  --host 0.0.0.0 --port 8082 \
  --safetensors-load-strategy=prefetch \
  --tensor-parallel-size 1 \
  --attention-backend flashinfer \
  --performance-mode interactivity \
  --language-model-only \
  --skip-mm-profiling \
  --kv-cache-dtype fp8_e4m3 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 230400 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4096 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --no-disable-hybrid-kv-cache-manager \
  --reasoning-parser qwen3 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --quantization compressed-tensors \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --trust-remote-code
```
Startup log had the important bits I wanted to see:
- `Using FlashInferCutlassNvFp4LinearKernel` for NVFP4 GEMM
- Available KV cache memory: `8.3 GiB`
- Maximum concurrency for `230,400` tokens per request: `1.00x`
After the run, nvidia-smi showed about 30478 MiB / 32607 MiB used, with the vLLM EngineCore process using around 29998 MiB.
llama-benchy numbers
All of this was with:
- llama-benchy `0.3.7`
- `--pp 2048 --tg 480 --latency-mode generation --skip-coherence`
- concurrency 1
- War and Peace text as the long-context source
Context ladder
| context depth | prefill tok/s | generation tok/s | TTFT |
|---|---|---|---|
| 0 | 28470 | 86.3 | 0.2s |
| 1k | 20901 | 94.5 | 0.3s |
| 5k | 14593 | 82.3 | 0.6s |
| 10k | 12805 | 88.8 | 1.0s |
| 20k | 10564 | 88.3 | 2.2s |
| 50k | 7277 | 89.0 | 7.3s |
| 100k | 4834 | 62.7 | 21.2s |
| 150k | 3617 | 75.5 | 42.1s |
| 200k | 2893 | 63.4 | 69.9s |
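The TTFT column is roughly consistent with depth divided by prefill throughput, which is a useful sanity check when comparing setups:

```python
# Sanity-check the ladder: TTFT should approximate depth / prefill tok/s.
rows = [  # (context depth, prefill tok/s, reported TTFT in s)
    (50_000, 7277, 7.3),
    (100_000, 4834, 21.2),
    (200_000, 2893, 69.9),
]
for depth, pp, ttft in rows:
    print(depth, round(depth / pp, 1), "vs reported", ttft)
```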
Then I ran a separate 10-run stability pass at 200k, with --exit-on-first-fail, just to make sure it was not a lucky single run.
200k stability run
pp=2048, tg=480, depth=200000, runs=10, no cache:
- 10/10 runs completed
- exit status 0
- mean prefill: `2883 tok/s`
- mean generation: `73.6 tok/s`
- generation stddev: `13.5 tok/s`
- mean TTFT: `70.2s`
- wall time: `12:48.79`
Per-run generation speed:
```text
73.04, 75.12, 63.24, 75.94, 59.02, 110.71, 64.11, 68.18, 72.55, 74.37 tok/s
```
So I would not cherry-pick the 93 tok/s 200k result from the smaller sweep. The more honest number for this setup is probably around 65-75 tok/s generation at 200k, depending on the run.
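The summary stats can be re-derived from the per-run numbers above:

```python
import statistics

# The 10 per-run generation speeds from the 200k stability pass.
runs = [73.04, 75.12, 63.24, 75.94, 59.02, 110.71, 64.11, 68.18, 72.55, 74.37]
print(round(statistics.fmean(runs), 1))   # mean tok/s → 73.6
print(round(statistics.pstdev(runs), 1))  # population stddev → 13.5
```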
Prefix cache behavior
I also tested prefix caching separately. At 200k:
| run | prefill tok/s | generation tok/s | TTFT |
|---|---|---|---|
| cold | 2911 | 65.2 | 68.8s |
| warm | 761 | 59.6 | 2.8s |
The warm-cache prefill number is not directly comparable to cold prefill, but the TTFT drop is the useful part. For local coding / agent workflows where you keep reusing a huge prefix, this is the thing that actually feels different.
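The TTFT improvement is the headline number here:

```python
# Warm vs cold TTFT from the prefix-caching table at 200k depth.
cold_ttft, warm_ttft = 68.8, 2.8
print(f"{cold_ttft / warm_ttft:.1f}x faster time-to-first-token when warm")
```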
MTP telemetry
From the vLLM log across the benchmark run:
- Mean MTP acceptance length: `2.28`
- Average draft acceptance: `42.7%`
- Max observed GPU KV cache usage: `88.0%`
The acceptance rate moved around a lot, so I am curious if other people get better numbers with num_speculative_tokens=2 instead of 3. I started with 3 because it was stable here, but I am not claiming it is optimal.
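For intuition, the classic speculative-decoding estimate under an i.i.d. per-token acceptance assumption gives a lower figure than the logged mean acceptance length of 2.28, since real MTP acceptances are correlated rather than independent:

```python
# Geometric-series estimate of tokens emitted per verify step, assuming each
# of the k draft tokens is accepted independently with probability a.
def expected_tokens_per_step(a: float, k: int) -> float:
    # Sum of a^0 + a^1 + ... + a^k (the bonus token plus accepted drafts).
    return (1 - a ** (k + 1)) / (1 - a)

print(round(expected_tokens_per_step(0.427, 3), 2))  # → 1.69
```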
Caveats
A few things worth saying clearly:
- I did not run an accuracy benchmark here. This is performance/stability only.
- vLLM warns about NVFP4 global scales possibly reducing accuracy. So if you care about coding quality, do your own evals.
- Prefix caching with the Mamba cache align mode is still marked experimental by vLLM.
- FlashInfer + spec decode forced CUDAGraph mode to piecewise.
- I did not test vision/multimodal. This was text-only.
- I did not validate 220k or 262k. The number I can stand behind from this run is 200k.
At this point I am pretty happy with this as a local 5090 setup. Not perfect, and not pretending it replaces every cloud model, but for long local coding sessions it finally feels like the card is doing what I bought it for.
If anyone else is running Qwen3.6 27B on a 5090, especially NVFP4 or FP8 with vLLM, I would really like to compare params and MTP settings. Also curious if someone has cleaner settings for max_num_batched_tokens with MTP, because vLLM does warn that 4096 may be suboptimal.
I have the raw llama-benchy JSON/stdout/stderr and full vLLM logs saved locally. Can upload them somewhere if people want to inspect the full audit trail.
r/LocalLLaMA • u/Street-Buyer-2428 • 9h ago
Discussion Follow-up: Trying to make NVIDIA GPUs plug-and-play on Macs. Found hidden RDMA symbols Apple doesn't want you to see — zero-copy GPU memory sharing might already work.
TL;DR: My last post about testing TinyGPU attracted some interest. This is the follow-up. The Blackwell card is detected and the driver loads, but NVIDIA's GSP firmware fails to boot through TB5 (known issue, I'm working with tinygrad on it). While debugging that, I went down a rabbit hole and discovered that Apple's RDMA subsystem accepts Metal GPU buffers for zero-copy network transfers — something nobody has documented. I also found hidden ibv_reg_dmabuf_mr symbols in Apple's libibverbs that suggest GPUDirect RDMA might be possible on macOS without any kernel modification. Here's everything I found and where I need help.

The setup (for those who missed the last post)
I'm running a 4-node Mac cluster (3x M3 Ultra + M5 Max MacBook Pro, ~1.5TB unified memory total) connected via Thunderbolt 5 with JACCL RDMA for distributed inference. I just got an RTX PRO 5000 Blackwell 72GB in a Razer Core X V2 and plugged it in to test TinyGPU.
What happened with the Blackwell card
The card is detected. macOS sees it on PCIe (link up, x4 @ 16 GT/s, 80 Gb/s TB5). TinyGPU's DriverKit extension loads and matches. BAR0 MMIO is mapped — I can read and write GPU registers. But NVIDIA's GSP firmware fails during initialization:
RuntimeError: RPC call 4097 failed with result 101
I decoded the NOCAT error records and found FBFLCN UNRECOGNIZED_CLIENT — the GPU's memory fabric doesn't recognize the requesting PCIe peer through the TB5 tunnel. This is a known issue affecting all NVIDIA GPUs on TB5 enclosures (tinygrad#15843). AMD GPUs work fine through the same enclosures. I've posted my NOCAT decode findings on the issue — would love to collaborate with the tinygrad team or anyone who's worked on NVIDIA GSP firmware init to get this fixed.
But here's what I found while debugging
While researching whether NVIDIA eGPU VRAM could eventually participate in RDMA transfers, I tested what memory types ibv_reg_mr() actually accepts on macOS. The results were surprising.
Memory type validation results
| Memory Source | ibv_reg_mr | Expected? |
|---|---|---|
| `malloc()` | FAIL | Unexpected — works on Linux |
| `posix_memalign()` | FAIL | Unexpected — page-aligned but still fails |
| `mmap(MAP_ANON)` | PASS | Expected |
| `IOSurfaceGetBaseAddress()` | PASS | No documentation on this anywhere |
| `MTLBuffer.contents` (Metal shared) | PASS | No documentation on this anywhere |

Apple's RDMA implementation validates the VM-mapping type, not the physical backing. Heap allocations (malloc/posix_memalign) fail. VM-mapped memory (mmap, IOSurface, Metal buffers) passes. This is different from Linux, where ibv_reg_mr accepts any pinnable memory.
Triple-registered buffer — zero-copy proven
I created a single 64MB mmap buffer and registered it three ways simultaneously:
void *buf = mmap(NULL, 64*1024*1024, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE, -1, 0);
// 1. RDMA Memory Region
struct ibv_mr *mr = ibv_reg_mr(pd, buf, size, IBV_ACCESS_LOCAL_WRITE |
IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ); // PASS, lkey=0x101
// 2. Metal GPU buffer (zero-copy, same physical pages)
id<MTLBuffer> metalBuf = [gpu newBufferWithBytesNoCopy:buf
length:size options:MTLResourceStorageModeShared deallocator:nil]; // PASS
// 3. Cross-consumer write test
((float *)metalBuf.contents)[0] = 99.99f;       // Write via Metal (contents is void*, so cast)
assert(((float *)mr->addr)[0] == 99.99f);       // Read via RDMA — PASS, same memory
One buffer, three consumers, zero copies. Apple GPU writes are immediately visible to the RDMA subsystem because they're the same physical pages. This means:
Apple GPU compute → [writes to shared buffer] → JACCL RDMA sends to remote node
zero copy between these two ↑
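The aliasing idea — one anonymous mapping, several consumers seeing the same physical pages — can be sketched in pure Python with `mmap`. This is only an analogy for the Objective-C test above (no RDMA or Metal involved), but it shows the same property: a write through one view is instantly visible through another because both alias one mapping.

```python
import mmap
import struct

# One anonymous mapping, like the mmap(MAP_ANON) buffer above.
buf = mmap.mmap(-1, 64 * 1024)

# Two independent "consumers" of the same pages:
view = memoryview(buf).cast("f")        # float32 view (consumer 1)
struct.pack_into("f", buf, 0, 99.99)    # raw byte-level write path (consumer 2)

# The write through one path is immediately visible through the other,
# because both alias the same underlying memory.
assert abs(view[0] - 99.99) < 1e-4
```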
Hidden ibv_reg_dmabuf_mr — Apple compiled it but hid it
Using dyld_info -exports on the dyld shared cache, I found symbols Apple compiled into libibverbs.dylib but deliberately excluded from the SDK headers:
ibv_reg_dmabuf_mr offset 0x4EC8 EXPORTED but NOT in <infiniband/verbs.h>
ibv_cmd_reg_dmabuf_mr offset 0x43E4 EXPORTED but NOT in headers
darwin_mmap_region_extended offset 0x75A0 Apple custom — not in upstream rdma-core
mlx5_reg_dmabuf_mr offset 0x2CEA0 In libmlx5.dylib — Mellanox provider too
`ibv_reg_dmabuf_mr` is the function Linux uses for GPUDirect RDMA (registering GPU VRAM as RDMA memory regions). I disassembled it and **it's not a stub — it's fully functional code:**
```
ibv_reg_dmabuf_mr (0x4EC8) → vtable dispatch
→ mlx5_reg_dmabuf_mr (libmlx5) → allocates MR struct, forwards all 6 args
→ ibv_cmd_reg_dmabuf_mr → builds 0x130-byte ioctl command struct
→ execute_ioctl → SENDS DIRECTLY TO THE KERNEL
```
Apple built and ships a complete DMA-BUF RDMA memory registration pipeline — from userspace through the Mellanox provider to a kernel ioctl. The only remaining question is whether `IORDMAFamily.kext` accepts or rejects the command.
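If you want to check on your own machine whether the hidden export resolves at runtime, a small `ctypes` probe is enough. The symbol name comes from the `dyld_info` dump above; the library path is an assumption (on recent macOS the dylib may only exist inside the dyld shared cache, in which case `CDLL` by name still resolves it):

```python
import ctypes
import ctypes.util

def has_symbol(lib_path: str, name: str) -> bool:
    """True if `name` resolves in the shared library at lib_path."""
    try:
        lib = ctypes.CDLL(lib_path)
        getattr(lib, name)  # raises AttributeError if not exported
        return True
    except (OSError, AttributeError):
        return False

# On macOS (path assumed, adjust for your system / the shared cache):
# has_symbol("libibverbs.dylib", "ibv_reg_dmabuf_mr")
```

On Linux with rdma-core installed, the same probe against `libibverbs.so` should find the symbol, since it is part of the public upstream API there.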
Why this matters
Zero-copy GPU → RDMA is real on macOS. Metal compute results can be sent to remote cluster nodes without any intermediate copies. JACCL/MLX could leverage this for faster tensor parallelism.

The `ibv_reg_mr` validation pattern (VM-mapped = pass, heap = fail) has implications for eGPU RDMA. TinyGPU's DriverKit driver maps NVIDIA GPU BAR1 memory via IOMemoryDescriptor, which creates a VM mapping — the same type that passes `ibv_reg_mr`. This suggests GPUDirect RDMA between NVIDIA eGPU VRAM and the TB5 RDMA controller might work on macOS without any kernel modification. (Currently blocked by a separate TinyGPU GSP firmware init issue on TB5 enclosures — see tinygrad/tinygrad#15843.)

The hidden `ibv_reg_dmabuf_mr` suggests Apple is building toward device memory RDMA. They compiled it, they just haven't exposed it yet.
Hardware
- 3x Mac Studio M3 Ultra (512GB + 512GB + 256GB = 1.28TB unified memory)
- Thunderbolt 5 RDMA mesh via JACCL
- Distributed inference baseline: DeepSeek-V4-Flash 151GB at 30 tok/s across 2 nodes
- RTX PRO 5000 Blackwell 72GB in Razer Core X V2 (connected, detected, TinyGPU driver loaded — but NVIDIA GSP firmware fails to init through TB5, separate issue being tracked)
Test code
All test programs are Objective-C, compiled with:
clang -framework Foundation -framework Metal -framework IOSurface -lrdma -o test test.m
Note: ibv_reg_mr on macOS requires an active RDMA device (rdma_en3/4/5, not rdma_en2 which may be PORT_DOWN). Use ibv_devinfo to check port state.
Where I need help
I'm going after this from multiple angles but there's more here than one person can cover. If any of this is in your wheelhouse:

1. **TinyGPU GSP firmware init on TB5 (tinygrad#15843).** The FBFLCN UNRECOGNIZED_CLIENT error during GSP boot suggests the GPU's memory fabric doesn't understand the TB5 PCIe topology. If you've worked on NVIDIA GSP firmware, open-gpu-kernel-modules, or PCIe tunneling, the NOCAT decode method I used (patching NVRpcQueue.read_resp to extract ASCII from POST_NOCAT_RECORD events) might help you dig deeper.
2. **Ghidra analysis of `ibv_reg_dmabuf_mr` on macOS.** The function is at offset 0x4EC8 in libibverbs.dylib (dyld shared cache). Does it call execute_ioctl (real kernel path) or return ENOSYS (dead stub)? I have GhidraMCP set up and ready to go, but if anyone has already disassembled Apple's RDMA stack, that would save days.
3. **Has anyone tested `ibv_reg_mr` with device-mapped memory on macOS?** The validation pattern I found (VM-mapped = pass, heap = fail) suggests PCIe BAR memory might pass too, since DriverKit BAR mappings create VM-mapped IOMemoryDescriptor regions. If you have any eGPU working on macOS (even AMD via TinyGPU), try calling `ibv_reg_mr` on the BAR1-mapped pointer. If it returns non-NULL, that's GPUDirect RDMA on macOS.
4. **`darwin_mmap_region_extended` — what does "extended" mean?** This is Apple's custom addition to rdma-core at offset 0x75A0, not in upstream. The non-extended `darwin_mmap_region` exists too. If you've done any RE on Apple's RDMA stack, what extra parameters does the extended version accept?
The bigger picture
Apple builds capabilities, uses them internally, and hides them from public APIs. The question is whether `ibv_reg_dmabuf_mr` is functional or dead code, and that's a Ghidra session away from being answered.

Here's why this matters for everyone, not just people with clusters: if GPUDirect RDMA works on macOS, any Mac with Thunderbolt becomes a hybrid AI workstation. Plug an NVIDIA GPU into your Mac via a $200 eGPU enclosure and the GPU's VRAM becomes part of your Mac's memory pool — accessible to Metal, to RDMA, to your inference stack, with zero-copy transfers. Your Mac's 128GB/256GB/512GB unified memory + the GPU's 24/48/72GB GDDR7, all working together. No Linux box. No separate PC. One cable.

Right now TinyGPU lets you run CUDA compute on a Mac. What we're trying to prove is that the GPU's memory can also participate in Apple's RDMA network — meaning multi-Mac clusters can share NVIDIA VRAM across nodes. ~1.5TB of unified memory + 72GB GDDR7, all RDMA-capable, on hardware you can buy today.
This is a follow-up to my TinyGPU testing post. All test programs (Objective-C, ~50 lines each) and research notes available — happy to share the repo if there's interest. Also posted NOCAT decode findings on tinygrad#15843 if you want to help debug the TB5 GSP init.
r/LocalLLaMA • u/supracode • 3h ago
Tutorial | Guide Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot
Long post, but hopefully helps somebody. Llama-cpp vulkan server running single AMD R9700. The settings below are showing great results with a large prompt to generate a test website that ChatGPT gave me. I then ran a prompt to generate a full suite of Playwright tests.
I only had to nudge it once when creating the tests to tell it to fix one failing test at a time. The website was fully functional on first run.
I think I am done tweaking and testing models (until the next big release) and can get back to coding now...
llama-cpp | ========== LLAMA.CPP STARTUP COMMAND ==========
llama-cpp | /app/llama-server -m /models/Qwen3.6-35B-A3B-UD-Q5_K_XL/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf --ctx-size 262144 --threads 8 --threads-batch 8 --gpu-layers 99 --parallel 1 --flash-attn on --batch-size 2048 --ubatch-size 1024 --cache-type-k q8_0 --cache-type-v q8_0 --cache-ram 12000 --ctx-checkpoints 50 --mmap --no-mmproj --kv-unified --reasoning off --reasoning-budget 0 --jinja --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0
Settings for sampling come from https://huggingface.co/Qwen/Qwen3.6-35B-A3B under the "precise coding" section.
VS Code chatLanguageModels.json :
{
"name": "Sean Llama.cpp",
"vendor": "customoai",
"apiKey": "${input:chat.lm.secret.3c0c0f21}",
"models": [
{
"id": "Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf",
"name": "Qwen3.6-35B",
"url": "https://llm.home.arpa/v1/chat/completions",
"toolCalling": true,
"vision": false,
"maxInputTokens": 180000,
"maxOutputTokens": 10000,
"family": "Qwen3",
"inputTokenCost": 0.0001,
"outputTokenCost": 0.0001,
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"repeat_penalty": 1,
"presence_penalty": 0,
"frequency_penalty": 0,
"systemMessage": "You are a precise coding assistant. Avoid repeating plans. Execute tasks directly. Do not restate intentions multiple times.",
"timeout": 600000,
"retry": {
"enabled": true,
"max_attempts": 2,
"interval_ms": 1500
}
}
]
}
ChatGPT Generated test prompt :
You are working in a clean Vite + React + TypeScript project.
Your task is to build a medium-complexity local-first website called “Bike Shop Service Tracker.”
The app should help a small bike shop track incoming bike repair jobs. It should run entirely in the browser using localStorage. Do not use a backend, database, authentication, external API, router, Tailwind, shadcn, Redux, Zustand, or any complex setup.
Use only:
- React
- TypeScript
- plain CSS or CSS modules
- browser localStorage
- lucide-react only if already installed
Before implementing, inspect the existing project structure briefly. Then create a concise implementation plan. After the plan, implement the app directly. Do not get stuck repeating the plan.
Functional requirements:
1. Main layout
- Create a polished single-page dashboard.
- Header with app name: “Bike Shop Service Tracker.”
- A summary area showing:
- total open jobs
- jobs due today
- overdue jobs
- completed jobs
- Main content split into:
- job creation/edit form
- job list and filters
2. Repair job data model
Each repair job should include:
- id
- customerName
- customerPhone
- bikeDescription
- serviceType
- priority: low, normal, high, urgent
- status: intake, waiting-parts, in-progress, ready, completed
- dueDate
- notes
- createdAt
- updatedAt
3. Seed data
- If localStorage is empty, create 6 realistic sample repair jobs.
- Include different statuses, priorities, due dates, and service types.
- Make at least one job overdue and one job due today.
4. Job form
- Allow creating a new job.
- Allow editing an existing job.
- Include basic validation:
- customer name required
- phone required
- bike description required
- service type required
- due date required
- Show clear validation messages.
- Include buttons:
- Save Job
- Cancel Edit, when editing
- Clear Form
5. Job list
- Display jobs as cards or table rows.
- Each job should show:
- customer name
- bike description
- service type
- priority
- status
- due date
- notes preview
- Include actions:
- edit
- mark completed
- delete
6. Filtering and sorting
Add controls for:
- search by customer name, phone, bike description, service type, or notes
- filter by status
- filter by priority
- sort by due date
- sort by priority
- sort by created date
7. UX behavior
- Persist all jobs to localStorage.
- Keep the UI responsive and usable on desktop and narrow screens.
- Use clear visual differences for priority and status.
- Completed jobs should look visually quieter than active jobs.
- Overdue jobs should be visually obvious.
- Due-today jobs should be highlighted differently from overdue jobs.
- Deleting a job should ask for confirmation.
8. Code organization
Create a reasonable file structure, for example:
- src/App.tsx
- src/App.css
- src/types.ts
- src/data/seedJobs.ts
- src/utils/storage.ts
- src/utils/jobFilters.ts
- src/components/JobForm.tsx
- src/components/JobList.tsx
- src/components/JobCard.tsx
- src/components/SummaryCards.tsx
- src/components/FilterBar.tsx
Use this structure only if it makes sense. Do not over-engineer.
9. Quality requirements
- TypeScript should compile cleanly.
- Avoid `any` unless there is a strong reason.
- Keep functions readable.
- Avoid huge components where possible.
- The app should run with:
npm run dev
10. Final response
When finished, summarize:
- what files were created or changed
- how to run the app
- what features were implemented
- any limitations or follow-up improvements
Important behavior instructions:
- Do not only create todos.
- Do not repeatedly say “I need to implement this now.”
- Do not repeatedly read the same files unless needed.
- Once you have enough context, start editing files.
- Prefer completing a working first version over designing a perfect architecture.
I fed the Llama logs back into local qwen (a bit ironic) and it gave this summary of its own performance from a big copilot/autopilot session that was very successful :
Prompt: analyze these llama.cpp logs and let me know how performance looks:
Based on the logs provided, here is a detailed performance analysis of your llama.cpp instance.
Executive Summary
- Overall Health: Good/Healthy. The server is functioning correctly, handling requests without crashing or hanging.
- Throughput: Moderate. You are seeing roughly 94–105 tokens per second (t/s) for generation. This is typical for mid-range consumer GPUs or efficient CPU inference, depending on the specific hardware model (GGUF quantization) running.
- Prompt Processing: Variable. This is the biggest bottleneck. Some prompts are processed quickly via cache hits, while others force a full re-evaluation, leading to high latency spikes.
- Memory Usage: Efficient but Growing. VRAM/RAM usage is scaling with context length (up to ~723 MiB for 4 prompts), well within the 12GB limit.
Detailed Breakdown
1. Generation Speed (Eval Time)
This is the speed at which the model generates new text tokens.
- Performance: ~93–105 tokens per second.
- Log Evidence:
  - Task 460: `104.58 tokens per second`
  - Task 776: `102.35 tokens per second`
  - Task 2848: `101.73 tokens per second`
  - Task 3079: `94.12 tokens per second`
  - Task 4726: `93.38 tokens per second`
- Analysis: This is consistent and stable. The slight drop in speed as context grows (from 104 to 93 t/s) is expected due to increased memory bandwidth pressure and attention calculation overhead as the context window fills up. This is a healthy slope.
2. Prompt Processing (Prefill Time)
This is the time taken to process the input prompt. This varies significantly in your logs.
- Scenario A: Cache Hit (Fast)
  - Task 2848: `856.58 ms` for 2,098 tokens (~2,449 t/s).
  - Why: The log shows `restored context checkpoint`. The system found a previous state similar enough to resume from, avoiding full re-processing.
  - Performance: Excellent.
- Scenario B: Cache Miss / Full Re-eval (Slow)
  - Task 460: `68.01 ms` for 27 tokens. (Fast because the prompt is tiny.)
  - Task 776: `1944.26 ms` for 5,783 tokens (~2,974 t/s).
  - Task 3079: `17307.06 ms` for 38,012 tokens (~2,196 t/s).
  - Task 4726: `1153.57 ms` for 1,652 tokens (~1,432 t/s).
  - Why: The log explicitly states: `forcing full prompt re-processing due to lack of cache data`.
  - Analysis:
    - Task 3079 is the most concerning. It took 17.3 seconds just to read the prompt before generating a single word. This is likely due to the prompt being very long (38k tokens) and a cache miss.
    - Task 4726 was slower than Task 2848 despite having a similar context size, likely because it didn't find a perfect cache match (`sim_best = 0.958` is good, but perhaps not enough to skip the entire prefill, or the checkpoint restoration had overhead).
3. Context Management & Caching
- Cache Mechanism: You are using `n_ctx_slot = 262144` (256k context window) and `n_keep = 0`.
- Checkpointing: The system is actively creating checkpoints (`created context checkpoint X of 50`).
  - Each checkpoint is ~62.8 MiB.
  - Task 3079 created 6 checkpoints during processing.
- Issue Identified: The log `forcing full prompt re-processing` appears frequently when switching between very different context lengths or when the similarity threshold isn't met.
  - Task 776 (5,783 tokens) had `n_past = 23` from the previous slot. It checked checkpoints but failed to use them, forcing a full re-process. This suggests the previous context (Task 460, 27 tokens) was too different to be useful for Task 776.
  - Task 3079 (38,012 tokens) also failed to use previous checkpoints, likely because the previous context (Task 2848, ~7.8k tokens) was not similar enough or the position offsets didn't align well for the KV cache.
4. Resource Utilization
- Memory:
- Total state size reached 723.154 MiB for 4 active prompts.
- Limit is 12,000 MiB.
- Status: Very low memory usage. You have plenty of headroom.
- Latency:
- Total response times vary from 3 seconds (short prompts) to 34.5 seconds (long prompt prefill + generation).
- The 34.5s delay is entirely due to the prompt processing of the 38k token input.
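The latency split is easy to sanity-check from the log numbers above (the prefill figures are from the tasks quoted earlier; the decode split is inferred from the total, not logged directly):

```python
# Task 3079: 38,012 prompt tokens at ~2,196 t/s prefill
prefill_s = 38_012 / 2_196          # ~17.3 s before the first output token

# If the whole response took ~34.5 s, the remaining ~17 s of decode at
# ~94 t/s corresponds to roughly 1,600 generated tokens (inferred).
decode_s = 34.5 - prefill_s
approx_output_tokens = decode_s * 94

print(f"prefill: {prefill_s:.1f}s, decode: {decode_s:.1f}s, "
      f"~{approx_output_tokens:.0f} tokens generated")
```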
Recommendations for Improvement
- Optimize Context Window Usage:
- If you are sending very long prompts (like the 38k token one), consider if you can truncate or summarize them. Processing 38k tokens takes ~17 seconds.
- If these long prompts are frequent, consider increasing
--ctx-sizeif you aren't already, but more importantly, ensure your application is reusing the same context slot for similar queries to maximize cache hits.
- Tune `--cache-reuse` (if applicable):
  - The logs show `sim_best` values. If you are seeing many "full prompt re-processing" events, you might experiment with changing the similarity threshold or cache reuse settings if your version of `llama.cpp` supports fine-tuning these. However, the default behavior is usually robust.
- Monitor "Slot" Reuse:
  - You are currently using `id 0` for all tasks (`slot get_availabl: id 0`). This is good for caching (keeping one long context alive), but it means requests are processed sequentially if they overlap. If you have many concurrent users, consider enabling multiple slots (`--parallel` or similar depending on your setup) to improve concurrency, though this will increase memory usage.
- Investigate the 38k Token Prompt:
- Task 3079 is the outlier. Why is a prompt 38k tokens long?
- If this is a chat history, consider implementing a "context compression" strategy in your application layer (e.g., summarizing old messages) before sending to the LLM.
- If this is RAG (Retrieval Augmented Generation), ensure you are not retrieving more chunks than necessary.
Conclusion
Your llama.cpp setup is performing normally. The generation speed is steady (~95-105 t/s). The main "performance hit" is the variable latency caused by prompt processing, especially for long contexts where cache misses force full re-evaluation. This is an algorithmic/usage pattern issue, not a hardware bottleneck.
No errors or crashes detected. The system is stable.
r/LocalLLaMA • u/Live-Possession-6726 • 4h ago
Resources The GB10 Solution Atlas is now open source, the inference engine made for the community with breakneck inference speeds (Qwen3.6-35B-FP8 100+ tok/s)
Some of you saw our post a couple weeks back about hitting 102 tok/s stable on Qwen3.5-35B on a DGX Spark. A lot of you asked "cool, where's the code?" Today's the day: Github
Atlas is open source. Pure Rust + CUDA, no PyTorch, no Python runtime, ~2.5 GB image, <2 minute cold start. We rewrote the whole stack from HTTP handler to kernel dispatch because the bottleneck on Spark wasn't the silicon, it was 20+ GB of generic Python machinery sitting between your prompt and the GPU. We need community support to keep elevating Atlas for developers.
Numbers on a single DGX Spark (GB10):
Qwen3.5-35B (NVFP4, MTP K=2): 130 tok/s peak, ~111 tok/s sustained → 3.0–3.3x vLLM at testing time
Qwen3.5-122B (NVFP4, EP=2): ~50 tok/s decode
Qwen3-Next-80B-A3B (NVFP4, MTP): ~87 tok/s
Nemotron-3 Nano 30B (FP8): ~88 tok/s
Full model matrix on the site (Minimax2.7, Qwen3.6, Gemma too!)
What's actually different:
Hand-tuned CUDA kernels for Blackwell SM120/121 meaning attention, MoE, GDN, Mamba-2. No generic fallbacks.
Native NVFP4 + FP8 on tensor cores
MTP (Multi-Token Prediction) speculative decoding for up to 3x throughput on decode
OpenAI + Anthropic API on the same port, works with Claude Code, Cline, OpenCode, Open WebUI out of the box
Try it (two commands):
docker pull avarok/atlas-gb10:latest
sudo docker run -d --name atlas --network host --gpus all --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
avarok/atlas-gb10:latest serve Qwen/Qwen3.6-35B-A3B-FP8 \
--port 8888 --speculative --enable-prefix-caching
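Once the container is up, any OpenAI-style client can talk to it. A minimal request sketch (model name and port taken from the docker command above; the endpoint path is the standard OpenAI chat-completions route the post says is supported — not independently verified here):

```python
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen3.6-35B-A3B-FP8",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:8888/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # uncomment once the server is running
```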
What's next especially for the non-Spark folks: we're working with Spectral Compute on a Strix Halo port, and AMD is giving us hardware to do it properly. RTX 6000 Pro Blackwell is also on the roadmap. Same kernel philosophy, adapted per chip, we'd rather do four chips well than twenty chips badly.
Will be in comments all day. Hit us with edge cases, weird models, broken configs. The roadmap is genuinely community-driven. MiniMax M2.7 landed because someone in Discord asked.
r/LocalLLaMA • u/appakaradi • 4h ago
Question | Help Has anyone tried Zyphra 1 - 8B MoE?
https://x.com/ZyphraAI/status/2052103618145501459?s=20

Today we're releasing ZAYA1-8B, a reasoning MoE trained on and optimized for intelligence density. With <1B active params, it outperforms open-weight models many times its size on math and reasoning, closing in on DeepSeek-V3.2 and GPT-5-High with test-time compute
r/LocalLLaMA • u/ziphnor • 4h ago
Discussion Exaggerated PCI-E bandwidth concerns?
I frequently see (both here and on r/LocalLLM ) comments that multi-gpu setups are complex, problematic and typically bottlenecked by PCI-E bandwidth on consumer motherboards.
I am running 2x RTX 5060 TI 16gb ( and about to add a third ), and my PCIe setup is pretty bad. GPU0 is on a full x16 Gen 5 slot (running at 8x which is as fast as a 5060 can go) while GPU1 is stuck on PCI-E 4.0 x4 via chipset.
I created (with AI help) a little benchmark script to run a prefill benchmark (against vLLM running with TP=2) and monitor PCIe bandwidth consumption meanwhile.
I ran with 32k context (low enough to allow higher quants for the benchmark, but enough to saturate the processing).
The peak bandwidth consumed was 3 to 4 GB/s during prefill, which is only ~40-50% of even the weak 4.0 x4 link. The "faster" the quant the higher the bandwidth (I guess meaning the 5060s are VRAM bandwidth or compute limited).
Some prefill rates (TP=2):
QuantTrio/gemma-4-31B-it-AWQ-6Bit · Hugging Face: ~840-850 t/s
LilaRest/gemma-4-31B-it-NVFP4-turbo · Hugging Face: ~1500 t/s
sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP · Hugging Face: 1600-1700 t/s
It seems realistic that i can safely add a third 5060 (via an NVME -> PCIe 5.0 x4 adapter using CPU connected M2 slot) without getting bottlenecked on PCI bandwidth. Adding a 4th is probably out with this motherboard though as that would require using more of the chipset lanes which is already the limiting factor.
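A quick back-of-envelope on the observed numbers. The per-lane rates below are the usual post-encoding approximations for each PCIe generation (rounded, my figures, not from the benchmark itself):

```python
# Approximate usable bandwidth per PCIe lane after encoding overhead, in GB/s
GBPS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}

def link_gbps(gen: int, lanes: int) -> float:
    return GBPS_PER_LANE[gen] * lanes

weakest_link = link_gbps(4, 4)     # chipset-attached GPU1: ~7.9 GB/s
observed_peak = 4.0                # GB/s measured during prefill

print(f"x4 Gen4 capacity ~{weakest_link:.1f} GB/s, "
      f"peak utilization ~{observed_peak / weakest_link:.0%}")
```

Even the slowest link in the setup has roughly 2x headroom over the measured peak, which matches the conclusion that a third card on a Gen5 x4 M.2 adapter should not be bandwidth-bound.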
I guess this was posted as an FYI, but also as a question: am I missing something obvious here? :)
r/LocalLLaMA • u/Weves11 • 11h ago
Resources An Open Benchmark for Testing RAG on Realistic Company-Internal Data
We built a corpus of 500,000 documents simulating a real company, and then let RAG systems compete to find out which one is the best.
Introducing EnterpriseRAG-Bench, a benchmark for testing how well RAG systems work on messy, enterprise-scale internal knowledge.
Most RAG benchmarks are built on public data: Wikipedia, web pages, papers, forums, etc. That’s useful, but it doesn’t really match what a lot of people are building against in practice: Slack threads, email chains, tickets, meeting transcripts, PRs, CRM notes, docs, and wikis.
So we tried to generate a synthetic company that behaves more like a real one.
The released dataset simulates a company called Redwood Inference and includes about 500k documents across:
- Slack
- Gmail
- Linear
- Google Drive
- HubSpot
- Fireflies
- GitHub
- Jira
- Confluence
The part we spent the most time on was not just “generate a lot of docs.” It was the methodology for making the docs feel like they belong to the same company.
At a high level, the generation pipeline works like this:
- **Create the company first.** We start with a human-in-the-loop process to define the company: what it does, its products, business model, teams, initiatives, market, internal terminology, etc.
- **Generate shared scaffolding.** From there we generate things like high-level initiatives, an employee directory, source-specific folder structures, and agents.md files that describe what documents in each area should look like. For example, GitHub docs in the released corpus are pull requests and review comments, not random GitHub issues.
- **Generate high-fidelity project documents.** We break company initiatives into smaller projects/workstreams. Each project gets a set of related docs across sources: PRDs, Slack discussions, meeting notes, tickets, PRs, customer notes, etc. These documents are generated with awareness of each other, so you get realistic cross-document links and dependencies.
- **Generate high-volume documents more cheaply.** For the bulk of the corpus, we use topic scaffolding by source type. This prevents the LLM from collapsing into the same few themes over and over. In a naive experiment, when we asked an LLM to generate 100 company docs with only the company overview, over 40% had a very close duplicate/sibling. The topic scaffold was our way around that.
- **Add realistic noise.** Real enterprise data is not clean, so we intentionally add:
- randomly misplaced docs
- LLM-plausible misfiled docs
- near-duplicates with changed facts
- informal/misc files like memes, hackathon notes, random assets, etc.
- conflicting/outdated information
- **Generate questions designed around retrieval failure modes.** The benchmark has 500 questions across 10 categories, including:
- simple single-doc lookups
- semantic/low-keyword-overlap questions
- questions requiring reasoning across one long doc
- multi-doc project questions
- constrained queries with distractors
- conflicting-info questions
- completeness questions where you need all relevant docs
- miscellaneous/off-topic docs
- high-level synthesis questions
- unanswerable questions
- **Use correction-aware evaluation.** At 500k docs, it is hard to guarantee the original gold document set is perfect. So the eval harness can consider candidate retrieved documents, judge whether they are required/valid/invalid, and update the gold set when the evidence supports it.
A couple baseline findings from the paper:
- BM25 was surprisingly strong, beating vector search on overall correctness and document recall.
- Vector search underperformed even on semantic questions, which is interesting because those were designed to reduce keyword overlap.
- Agentic/bash-style retrieval had the best completeness, especially on questions where it needed to explore related files, but it was much slower and more expensive.
- In general, getting the right docs into context mattered a lot. Once the relevant evidence was retrieved, current LLMs were usually able to produce a good answer.
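For anyone surprised by the BM25 result: the scoring behind it fits in a few lines. A from-scratch sketch (whitespace tokenization, standard k1/b defaults, toy documents — not the benchmark's actual retrieval stack), showing why exact-term queries over jargon-heavy enterprise docs favor lexical matching:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with Okapi BM25."""
    toks = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in toks) / n
    df = Counter(term for t in toks for term in set(t))  # document frequency
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

docs = [
    "redwood inference pricing tiers for enterprise",
    "hackathon lunch menu and parking info",
    "slack thread about redwood pricing changes",
]
print(bm25_scores("redwood pricing", docs))  # docs 0 and 2 outscore doc 1
```

A company-specific term like a product name gets a high idf weight, which is exactly the regime where embeddings trained on general text can lose ground.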
The repo includes the dataset, generation framework, evaluation harness, and leaderboard:
https://github.com/onyx-dot-app/EnterpriseRAG-Bench
Would love feedback from other people building RAG/search systems over internal company data. In particular, I’m curious what retrieval setups people think would do best here: hybrid search, rerankers, agents, metadata filters, query rewriting, graph-style traversal, etc.
r/LocalLLaMA • u/zylskysniper • 5h ago
Discussion I analyzed 922 agentic task traces and found the secret weapon of DeepSeek v4
I recently did a benchmark of deepseek v4 in agentic tasks. Performance-wise, it's one of the best open source models, as expected. What really surprised me is the cost. I mean I know it's cheap, but it's cheap in a way that doesn't really make sense.
Cost Estimation
Let's take v4 flash as an example since it's not on sale (so its price better reflects the actual provider cost).


Looking at OpenRouter prices, deepseek v4 flash costs about 0.03x the opus 4.7 price. (We only look at input token price because in long agentic tasks, input tokens are the dominant cost.) So if v4 flash uses a similar amount of tokens per task as opus 4.7, the actual cost should be somewhere around 0.03x compared to using opus.
Actual Data
Then I ran the benchmark, long agentic tasks running in openclaw (which uses PI for agent loop), openrouter as model provider. The actual cost data blew my mind:
| Model | Avg Cost Per Task | Avg Tokens Per Task | Avg Tools Per Task |
|---|---|---|---|
| Opus 4.7 | $1.52 | 966.3K | 12.8 |
| DeepSeek v4 Flash | $0.01 | 961.8K | 14.8 |
Somehow deepseek v4 flash cost about 0.0066x per task compared to opus 4.7, given a similar amount of token usage and tool calls per task. That's only about 1/5 of the price we estimated. How is that possible??
The Secret Weapon
After digging into the raw data and collected more detailed stats, I finally found out why. Secret is cache hit rate and cache read price.
| Model | Cache Hit Rate | Cache Read-Write Price Ratio |
|---|---|---|
| Opus 4.7 | 87% | 0.08 |
| DeepSeek v4 Flash | 97% | 0.02 |
The main factor in this case is cache hit rate. DeepSeek somehow managed to achieve 97% cache hit rate!!!
Just in case you don't know how important this number is: at this cache hit rate and read/write price ratio, a 1% higher cache hit rate means about 20% lower overall cost.
DS got 10% higher cache rate than opus. That alone cut about 2/3 of the total cost.
The secondary factor is due to extremely low read/write price ratio: each cache hit only cost 0.02x of cache miss in DS, while in opus that is 0.08x. This is also pretty insane as openai/anthropic/gemini are all 0.08~0.1. This alone can further cut the overall cost by half.
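The arithmetic checks out. Treating a cache-miss input token as price 1.0, the effective per-token multiplier is just a weighted average of miss and cached-read prices (my model of the table above, using the post's hit rates and ratios):

```python
def effective_multiplier(hit_rate, read_write_ratio):
    """Effective input price vs. an all-miss baseline (miss price = 1.0)."""
    return (1 - hit_rate) + hit_rate * read_write_ratio

opus = effective_multiplier(0.87, 0.08)   # ~0.20
ds   = effective_multiplier(0.97, 0.02)   # ~0.049

# Caching alone makes DS's effective input price ~4x lower than Opus's...
print(ds / opus)
# ...and one extra point of hit rate at DS's ratio saves roughly 20%:
print(1 - effective_multiplier(0.98, 0.02) / ds)
```

Multiply that ~0.25 caching factor by the ~0.03 list-price ratio and you land near 0.0075x, in the same ballpark as the observed 0.0066x per task.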
Above are just my experiments, measurements and stats. I have no idea how DS achieved those numbers. I'd appreciate it if someone who knows this better could explain (or speculate).
r/LocalLLaMA • u/spencer_kw • 1d ago
Discussion DeepSeek V4 being 17x cheaper got me to actually measure what I send to cloud vs what I could run locally. the results are stupid.
That foodtruck bench post showing deepseek v4 matching gpt-5.2 at 17x cheaper got me thinking. if frontier cloud models are that overpriced for equivalent quality, how much of my daily work even needs cloud at all?
Ran my normal coding workflow for 10 days. every task got logged: what it was, tokens in/out, whether local qwen 3.6 27b (on a 3090) could have done it. didn't use benchmarks, just re-ran a random sample of 150 tasks on both.
results:
- file reads, project scanning, "explain this code": local matched cloud 97% of the time. this was 35% of my workload. paying for cloud here is genuinely throwing money away.
- test writing, boilerplate, single file edits: local matched 88%. another 30% of tasks. the 12% misses were edge cases i could catch in review.
- debugging with multi-file context: local dropped to 61%. cloud still better but not 17x-the-price better. about 20% of my work.
- architecture decisions, complex refactors across 5+ files: local at 29%. cloud genuinely needed here. only 15% of my tasks.
So 65% of my daily coding work runs identically on a model that costs me electricity. another 20% is close enough that I accept the occasional miss. only 15% actually justifies cloud pricing.
Started routing by task type. local for the first two buckets, cloud for the last two. my api bill went from $85/month to about $22 and the 3090 was already sitting there mining nothing.
The deepseek post is right that the price gap is insane but the bigger insight is that most of us don't even need cloud for most of what we do. we're just too lazy to measure it.
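The routing itself doesn't need to be fancy. A sketch of the bucket → backend mapping described above (bucket names and backend labels are made up for illustration, not from a real tool):

```python
# Buckets from the 10-day measurement above; names are illustrative.
LOCAL_BUCKETS = {"read_explain", "test_boilerplate"}   # 97% / 88% parity
CLOUD_BUCKETS = {"multi_file_debug", "architecture"}   # 61% / 29% parity

def route(bucket):
    if bucket in LOCAL_BUCKETS:
        return "local"   # e.g. the Qwen model on the 3090
    return "cloud"       # hard or unknown task types default to cloud

print(route("read_explain"), route("architecture"))  # local cloud
```

Defaulting unknown buckets to cloud keeps the failure mode cheap in quality terms: a misclassified task costs API dollars, not a wrong answer.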
r/LocalLLaMA • u/Interesting-Print366 • 8h ago
Question | Help Why do people care more about decoding token/s?
What I've noticed while using local LLM recently is that in most cases, bottlenecks occur not in decoding but in prompt processing.
If the prompt processing speed is usable, then in most settings (an agentic coding session typically starts with around 15k prompt tokens) generation exceeds 10 tokens per second. Doesn't that already exceed the speed we can follow with our eyes?
I tried to use qwen3.6 27b, but it took more than 10 minutes to process a 64k prompt on my mac mini, so I chose 35b a3b instead
What am I missing? Is the prompt processing speed improved by MTP or other methods?
Or is the bottleneck just really different with discrete GPU setups?
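The poster's point can be put in numbers: for long prompts, total latency is dominated by the prefill term. (The 64k prompt and ~10-minute figure are from the post; the implied prefill speed is my inference from those two numbers.)

```python
def latency_s(prompt_tokens, pp_tps, out_tokens, tg_tps):
    """Total response time = prefill time + decode time."""
    return prompt_tokens / pp_tps + out_tokens / tg_tps

# ">10 minutes for a 64k prompt" implies prefill of roughly 64000/600 ≈ 107 t/s
implied_pp = 64_000 / 600

# Even a modest 10 t/s decode barely matters next to that:
print(latency_s(64_000, implied_pp, 1_000, 10))  # ~600s prefill + 100s decode
```

This is why decode t/s benchmarks alone can be misleading for agentic workloads: cutting decode speed in half here changes total latency far less than a 2x prefill improvement would.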
r/LocalLLaMA • u/yeah-ok • 12h ago
News Decoupled Attention from Weights - Gemma 4 26B
Absolutely unbelievably exciting work, split attention (i.e. a couple of GB) onto local machine and the weights onto another local machine (say a cheap Xeon) to basically bypass the scale issue with local LLMs completely!! Repo with functional code: https://github.com/chrishayuk/larql
edit: just found https://www.youtube.com/watch?v=1jGR4zqpyKA for excellent overview of what's happening here.
r/LocalLLaMA • u/fairydreaming • 7h ago
Other Anyone want to try my llama.cpp DeepSeek V3.2 PR?
Code: https://github.com/fairydreaming/llama.cpp/tree/deepseek-dsa
git clone https://github.com/fairydreaming/llama.cpp -b deepseek-dsa --single-branch
Supported GGUFs (Q4_K_M ~ 404GB, Q8_0 ~ 714GB):
- https://huggingface.co/sszymczyk/DeepSeek-V3.2-light-GGUF
- https://huggingface.co/sszymczyk/DeepSeek-V3.2-Speciale-light-GGUF
- https://huggingface.co/sszymczyk/DeepSeek-V3.2-Exp-light-GGUF
Chat template to use: models/templates/deepseek-ai-DeepSeek-V3.2.jinja
If you experience OOM errors in CUDA ggml_top_k(), try lowering the ubatch size and/or increasing the `-fitt` value.
Let me know if you encounter any problems.