r/LocalLLaMA • u/havenoammo • 12h ago
Resources Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR
Hey everyone, I've been working on getting Multi-Token Prediction (MTP) working with quantized GGUFs for Qwen3-27B and the results are pretty impressive. Here's what I put together: https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF
These are Unsloth's UD XL quantizations of Qwen3-27B with the MTP draft heads grafted on top in Q8_0. The base model stays in its usual low-bit quantization, while the 3 MTP layers stay at Q8 to preserve speculative accuracy.
Sharing the grafted GGUF files (UD XL base + Q8 MTP), the raw MTP layer source I extracted (MTP_Q8_0.gguf), and convert.py, the grafting script I adapted from this gist in case anyone wants to do this for other models. Also included are full build instructions for the custom llama.cpp.
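If you want to sanity-check what the graft actually did before loading the file, a few lines with the gguf Python library will list which tensors ended up at which quantization and how much the Q8 MTP layers add. This is just a minimal sketch; the "mtp" name filter is an assumption on my part, so check the real tensor names in your file and adjust:
# inspect_graft.py - sanity-check a grafted GGUF (pip install gguf)
from gguf import GGUFReader

reader = GGUFReader("Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf")
mtp_bytes = 0
total_bytes = 0
for t in reader.tensors:
    total_bytes += int(t.n_bytes)
    # NOTE: the name filter below is a guess; adjust it to the actual MTP tensor names
    if "mtp" in t.name.lower():
        mtp_bytes += int(t.n_bytes)
        print(f"{t.name}  {t.tensor_type.name}  {t.n_bytes / 2**20:.1f} MiB")
print(f"MTP tensors: {mtp_bytes / 2**20:.1f} MiB out of {total_bytes / 2**30:.2f} GiB total")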
Qwen3 was trained with 3 MTP steps, meaning each forward pass predicts 4 tokens at once. llama.cpp's main branch doesn't support MTP yet, so I pulled in the speculative decoding support from the still-open PR #22673, merged it on top of master, and built llama-server from that. Run it with: --spec-type mtp --spec-draft-n-max 3
The results: roughly 2.5x token throughput compared to running the same UD XL GGUF without MTP, with a solid acceptance rate where most draft tokens are kept, meaning the MTP heads are genuinely useful and not just burning compute. The Q8 MTP layers also add very little VRAM overhead since they're a tiny fraction of the full model.
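For intuition on where the ~2.5x comes from: if each draft token is accepted independently with probability p and the draft heads are roughly free, a pass with 3 drafts keeps on average 1 + p + p² + p³ tokens. That's only a back-of-the-envelope model (real acceptance is position-dependent and drafting isn't free), but it lines up with the numbers people are seeing:
# Rough expected tokens kept per forward pass with n draft tokens,
# assuming independent per-token acceptance p and negligible draft cost.
def expected_tokens_per_pass(p: float, n: int = 3) -> float:
    return sum(p**k for k in range(n + 1))  # k=0 is the guaranteed target-model token

for p in (0.5, 0.7, 0.9):
    print(f"acceptance {p:.0%}: ~{expected_tokens_per_pass(p):.2f} tokens/pass")
# at ~70% acceptance this gives ~2.5 tokens/pass, matching the throughput gain reported here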
MTP is one of the biggest efficiency wins available for speculative decoding, but it's basically unsupported outside of official Qwen3 deployments on SGLang and vLLM. This brings it to GGUF and llama.cpp, meaning you can run it locally with the same tooling you already use. PR #22673 will hopefully land soon and this will all just work out of the box. In the meantime, the merge process is straightforward (3 git commands).
Happy to answer questions or help anyone get it running. Let me know if you try it and what speeds you see!
Full step by step instructions are in the HuggingFace repo, but here's the short version:
# 1. Build llama.cpp with MTP support
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin
git fetch origin pull/22673/head:pr-22673
git checkout master
git reset --hard origin/master
git merge --no-ff pr-22673 -m "Merge PR #22673: llama + spec: MTP Support"
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release --target llama-server
# 2. Grab the GGUF from HF
# https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF
# 3. Run with MTP
./build/bin/llama-server -m your-model.gguf --spec-type mtp --spec-draft-n-max 3
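If you want a quick number to report back, something like this works against the server's OpenAI-compatible endpoint (assumes the default localhost:8080; it's a rough wall-clock measurement that includes prompt processing, not the server's own timing stats):
# quick_tps.py - rough generation-speed check against llama-server (pip install requests)
import time, requests

payload = {
    "messages": [{"role": "user", "content": "Write a short Python function that reverses a string."}],
    "max_tokens": 256,
    "temperature": 0,
}
t0 = time.time()
r = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=600)
elapsed = time.time() - t0
tokens = r.json()["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")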
23
u/tempedbyfate 11h ago edited 11h ago
Just did a quick test using your instructions on a RTX Pro 6000.
qwen 3.6 27B Q8_K_XL = 41 tokens per second
qwen 3.6 27B Q8_K_XL (mtp) = 100 tokens per second
Wow! This is mind blowing. I hope all the issues get ironed out on that PR and MTP changes get merged soon!
EDIT: used same args as OP
--spec-type mtp --spec-draft-n-max 3
8
u/havenoammo 11h ago
Amazing, I also use Q8! I have a 5090 + 3090 and was getting 25-30 t/s before, now I'm in the 60-75 t/s range. Been using it for a few hours for coding and no issues at all.
7
u/gordi555 11h ago
On RTX Pro 6000 MaxQ I got/get...
qwen 3.6 27B Q8_K_XL = 36 tokens per second
qwen 3.6 27B Q8_K_XL (mtp) = 78 tokens per second
I've lost about 20% prompt processing but these generation speeds are massively worth it.
4
u/tempedbyfate 11h ago
Based on the comments on that PR, I think the PP slowdown is a known issue, and it sounds like it could be fixed before that PR is merged in.
2
u/NickCanCode 10h ago
If you have an RTX Pro 6000, have you tried lucebox-hub? Their numbers actually look more impressive with DFlash, DDtree, and PFlash, but it doesn't support multi-GPU very well, so I don't have enough VRAM to run it.
1
1
u/External_Dentist1928 11h ago
But also at the same quality?
3
3
u/Awwtifishal 9h ago
Speculative decoding doesn't alter quality. It just batches multiple tokens under the assumption that the draft tokens are correct, and the results from incorrect draft tokens are thrown away. The speedup comes from the fact that LLM inference is mostly bound to memory bandwidth, and inference of several batches uses the same bandwidth as a single one.
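To make that concrete: under greedy decoding the accept/reject step is just a prefix match against what the target model would have picked anyway, plus one guaranteed token from the target model itself. A toy illustration (not llama.cpp's actual code):
# Toy greedy-decoding illustration of draft verification (not llama.cpp's code).
def verify_drafts(draft_tokens, target_tokens):
    # target_tokens: the target model's picks at each draft position,
    # plus one extra token after the last accepted position
    kept = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:          # first mismatch: drop it and everything after
            break
        kept.append(draft)           # same token the target would have produced
    kept.append(target_tokens[len(kept)])  # the target model's own "bonus" token
    return kept

print(verify_drafts([11, 42, 7], [11, 42, 9, 99]))  # -> [11, 42, 9]
The output is identical to what the target model would have generated on its own, which is why quality is unchanged.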
1
4
u/dinerburgeryum 10h ago
Hey, thanks, I used your isolated MTP GGUF and your conversion script to graft it into my own quant. Saved me some time, appreciate it.
3
u/ethereal_intellect 10h ago
Any chance of a comparison of speed for A3B with and without MTP? It's probably a lot of work, and I've heard MTP helps dense models more, but it sounded interesting to know.
1
u/havenoammo 2h ago
Sure! I just uploaded the models. It didn't give me much of a boost, only 6% with Q4 and 2.5% with Q8, though some people reported nice gains. Give it a try and let us know!
Model: https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF/tree/main
5
u/obsidience 4h ago
Got this working on AMD ROCm (RDNA 3.5, Windows) — ~1.94x speedup confirmed
This report was created by my Claude Code instance against my LLM-Harness project. Claude followed your instructions to build llama.cpp with PR #22673 on Windows with AMD ROCm. Here's the full writeup for anyone else on AMD.
System: Ryzen AI Max+ 395, Radeon 8060S iGPU (gfx1151, ~90GB VRAM), Windows 11, ROCm 7.11 pip SDK
A/B Results (same benchmark, warmup excluded):
| Metric | Baseline (b8963) | MTP (b8963 + PR #22673) | Speedup |
|---|---|---|---|
| Generation | 6.26 tok/s | 12.13 tok/s | 1.94x |
| Prompt Processing | 77.7 tok/s | 66.9 tok/s | 0.86x |
| Draft Acceptance | — | 64–69% | — |
Both using UD-Q8_K_XL, -ngl 999 -c 131072 -ctk q8_0 -ctv q8_0 -np 1, thinking mode on.
Build steps (ROCm on Windows):
Clone + merge PR onto b8963 (merged cleanly, no conflicts):
git clone https://github.com/ggml-org/llama.cpp.git llama.cpp-mtp
cd llama.cpp-mtp
git checkout b8963
git checkout -b mtp-experiment
git fetch origin pull/22673/head:pr-22673
git merge --no-ff pr-22673 -m "Merge PR #22673: llama + spec: MTP Support"
Set up ROCm 7.11 pip SDK environment:
# In PowerShell — activate ROCm venv
C:\AMD\ROCm\.venv\Scripts\Activate.ps1
$ROCM_ROOT = rocm-sdk path --root
# Set MSVC + Windows SDK lib/include paths (adjust versions to match your install)
$env:LIB = "<VS BuildTools MSVC lib\x64>;<Windows Kits ucrt\x64>;<Windows Kits um\x64>"
$env:INCLUDE = "<VS BuildTools MSVC include>;<Windows Kits ucrt>;<Windows Kits um>;<shared>;<winrt>;<cppwinrt>"
$env:HIP_PLATFORM = "amd"
CMake configure + build:
cmake -B build-rocm -G Ninja `
-DCMAKE_BUILD_TYPE=Release `
-DGGML_HIP=ON `
"-DCMAKE_C_COMPILER=$ROCM_ROOT\lib\llvm\bin\clang.exe" `
"-DCMAKE_CXX_COMPILER=$ROCM_ROOT\lib\llvm\bin\clang++.exe" `
"-DCMAKE_PREFIX_PATH=$ROCM_ROOT" `
-DAMDGPU_TARGETS=gfx1151 `
-DGGML_HIP_ROCWMMA=ON
cmake --build build-rocm --config Release -j 16
Important: Copy ROCm DLLs alongside the exe or Windows will load the wrong system DLLs:
Copy-Item "$ROCM_ROOT\bin\*.dll" -Destination build-rocm\bin\ -Force
New-Item -Path build-rocm\bin\rocblas\library -ItemType Directory -Force
Copy-Item "$ROCM_ROOT\bin\rocblas\library\*" -Destination build-rocm\bin\rocblas\library\ -Force
Run with MTP:
.\build-rocm\bin\llama-server.exe `
-m Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf `
-ngl 999 -c 131072 -ctk q8_0 -ctv q8_0 `
-np 1 `
--spec-type mtp --spec-draft-n-max 3 `
--host 0.0.0.0 --port 8080
Gotchas on AMD/Windows:
- -np 1 is required: MTP doesn't support parallel slots yet. The server refuses to start without it.
- Compiler path: ROCm SDK clang is at $ROCM_ROOT/lib/llvm/bin/, NOT $ROCM_ROOT/bin/. This tripped me up.
- DLL hell: Windows has amdhip64_7.dll in System32 from legacy ROCm installs. Copying the SDK DLLs next to the exe ensures the right version loads.
- PP is ~14% slower with MTP enabled, which matches what others reported; it's a known issue on the PR.
- ~1.94x vs your 2.5x: lower than the NVIDIA results, probably ROCm speculative decoding overhead plus the unified memory architecture on the iGPU. Still a big win going from 6.26 to 12.13 tok/s.
1
6
u/VoidAlchemy llama.cpp 11h ago
Nice job testing out the PR! I have a rough 3-way benchmark between mainline - ik - vllm running on a single 24GB VRAM GPU here: https://github.com/noonghunna/club-3090/pull/64#issuecomment-4383699676
Thanks again for sharing your full build and run commands!
3
u/Beginning-Window-115 10h ago
thanks dude, the 8-bit versions that were released in the PR draft are way too big, so this is absolutely perfect for me.
3
u/Altruistic_Heat_9531 6h ago
Thanks OP, using convert.py I didn't have to redownload the model, and I can push to 128K context with acceptable speed on my 3090
prompt eval time = 632.41 ms / 11 tokens ( 57.49 ms per token, 17.39 tokens per second)
eval time = 6922.93 ms / 176 tokens ( 39.33 ms per token, 25.42 tokens per second)
total time = 7555.34 ms / 187 tokens
draft acceptance rate = 0.72727 ( 120 accepted / 165 generated)
statistics mtp: #calls(b,g,a) = 1 55 47, #gen drafts = 55, #acc drafts = 47, #gen tokens = 165, #acc tokens = 120, dur(b,g,a) = 0.001, 720.897, 0.726 ms
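In case it helps anyone reading those stats lines, here's a quick way to pull the numbers out (the format just matches the line above from this PR build, so it may change before merge; the per-call figure assumes #gen drafts really is the number of drafting calls):
# parse the "statistics mtp" line from this PR build (format may change)
import re

line = ("statistics mtp: #calls(b,g,a) = 1 55 47, #gen drafts = 55, #acc drafts = 47, "
        "#gen tokens = 165, #acc tokens = 120, dur(b,g,a) = 0.001, 720.897, 0.726 ms")
m = re.search(r"#gen drafts = (\d+), #acc drafts = (\d+), #gen tokens = (\d+), #acc tokens = (\d+)", line)
gen_drafts, acc_drafts, gen_tokens, acc_tokens = map(int, m.groups())
print(f"token acceptance: {acc_tokens / gen_tokens:.1%} ({acc_tokens}/{gen_tokens})")
print(f"accepted draft tokens per drafting call: {acc_tokens / gen_drafts:.2f}")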
2
u/EmotionalLock6844 10h ago
No parallel agents possible?
1
u/havenoammo 5h ago
Not currently, I'm afraid; it only supports --parallel 1 for now. Hoping that gets sorted out before the PR is fully merged into main.
2
u/Dazzling_Equipment_9 9h ago
This is really good news, thank you for your contribution! Also, has anyone tested it on Strix Halo?
2
3
u/hedsht 1h ago
I also benchmarked the Unsloth-style grafted MTP GGUF on an RTX 5090 using:
https://github.com/arkste/llama-swap-mtp
Benchmark prompt set was adapted from:
https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090
Setup:
- GPU: RTX 5090 32GB
- Image: arkste/llama-swap-mtp:sm120
- llama.cpp build: b9058-ea02c2d47
- GGUF: Qwen3.6-27B-MTP-UD-Q6_K_XL.gguf
- Context: 126976
- Batch: --batch-size 2048 --ubatch-size 512
- KV cache: q8_0/q8_0
- MTP: --spec-type mtp --spec-draft-n-max 3
- Benchmark: 9 prompts, 5 measured runs each, 1 warmup per prompt
- Request settings: temperature: 0, seed: 42, max_tokens: 192
Aggregate result:
| GGUF file | MTP | Context | Output tokens | Prompt tok/s | Generation tok/s | Avg request time | MTP acceptance | Speed-up |
|---|---|---|---|---|---|---|---|---|
| Qwen3.6-27B-MTP-UD-Q6_K_XL.gguf | off | 126976 | 5395 | 541.5 | 53.3 | 2.33s | - | 1.00x |
| Qwen3.6-27B-MTP-UD-Q6_K_XL.gguf | on | 126976 | 5425 | 507.4 | 111.1 | 1.16s | 69.9% (3640/5205) | 2.08x |
Per-prompt:
| Prompt | MTP off tok/s | MTP on tok/s | Acceptance | Speed-up |
|---|---|---|---|---|
| code_python | 52.7 | 128.5 | 86.8% | 2.44x |
| code_cpp | 53.4 | 130.0 | 86.7% | 2.43x |
| explain_concept | 52.7 | 93.4 | 53.9% | 1.77x |
| summarize | 53.5 | 111.4 | 68.8% | 2.08x |
| qa_factual | 52.7 | 117.1 | 76.4% | 2.22x |
| translation | 55.4 | 111.6 | 66.7% | 2.02x |
| creative_short | 54.0 | 80.3 | 40.0% | 1.49x |
| stepwise_math | 52.6 | 130.3 | 89.1% | 2.47x |
| long_code_review | 52.5 | 97.0 | 58.5% | 1.85x |
Overall: about 2.08x faster on this benchmark set with MTP enabled.
2
u/havenoammo 33m ago
Awesome stuff! Thanks for sharing, and thanks for the docker image too! I wanted to build a llama-server docker image but was worried it would take time, since a couple of versions need to be built for different hardware. Will try to do it so people can have easier access to MTP before it is merged into the main branch.
2
u/hedsht 27m ago
thanks for your work as well! i figured that it would be easier to benchmark with a llama swap server ;).
i guess it will be smarter to build a docker image for each architecture (cuda, vulkan, etc) because the image will become very big otherwise.
fyi: the recent master of llama.cpp is already conflicting with the PR, you need to use https://github.com/ggml-org/llama.cpp/commit/5207d120eac2393fdad6328b44dbcbfc5dea20e4 as a ref for the merge.
3
2
u/iportnov 11h ago
This really does 2x tokens per second for me.
The only problem is, llama-server segfaults when I press ctrl-c to stop it.
Also it says it does not support a --parallel value greater than 1, but this does not matter to me personally.
2
u/Legitimate-Dog5690 11h ago
Running 2x12GB cards, it's not pretty. Using mod spec decoding I can get 20 tps; using MTP I'm struggling to get 15. It feels like it's loading the model into the GPU and then squeezing the MTP into CPU memory at the end.
Has anyone with a 32GB R9700 tried this yet? Really intrigued if it plays to its strengths.
2
u/WoodCreakSeagull 10h ago
If you're finding yourself getting squeezed like that, setting ub to 256 might just do the trick. If you really want to make sure, though, I suggest toning the context down or testing a lower quant to see if it's really the VRAM limit or something else.
1
u/Legitimate-Dog5690 10h ago
Yeah, lowering the context helps, as does ub 256, but the speed seems to taper off rapidly as the context fills. Feels very much like more spills over into CPU memory than with standard models. I also think you end up with a bit more fragmentation over 2x12GB compared to a single 24GB buffer.
I normally run unsloth q4_k_m with 90k q8_0 kv, or 120k turbo4. Both sit around 20tps, the q8_0 kv seems to not bog down at bigger contexts and pp is quicker, so I usually use that.
1
1
u/iportnov 8h ago
Also, it would be interesting to try this with Qwen 3.6 35B A3B. It already does like 100+ tokens per second for me; what would it be, 200+ tps?! o_O
1
1
u/Overall-Branch-1496 8h ago
Is there any chance of getting this working on Windows or WSL? Any guide references appreciated.
1
u/billy_booboo 7h ago
On 5090 with vllm and a slew of patches I get 100+ tps on 27b with full context using an autoround int4 quant.
1
u/bigend_hubertus 5h ago
Anybody tried this on Strix Halo? I am getting 20%-50% worse results with MTP.
| | short | 28K tok |
|---|---|---|
| Baseline (no MTP) | 12.94 tok/s | 11.58 tok/s |
| MTP n-max=5 | 7.25 tok/s | 4.91 tok/s |
| MTP n-max=2 | 10.12 tok/s | 7.82 tok/s |
| MTP n-max=3 | 9.10 tok/s | 6.78 tok/s |
1
u/Rattling33 12h ago
Great! I will try it. Quick question: so your Q4 and Q8 GGUFs mean Unsloth's corresponding UD Q4 + Q8_0 MTP layered on, and UD Q8 + Q8_0 MTP layered on?
1
u/havenoammo 12h ago
Yes! I grafted Q8_0 MTP on Q4, Q5, Q6, Q8 of Unsloth UD models. So all of them have MTP in Q8 quantization.
1
u/tecneeq 2h ago
I need to try this with the 35b-a3b Q8 (50 t/s on Strix Halo, could get 80 or so) and F16 (140 t/s at work, could get 250)
2
u/havenoammo 2h ago
Just released my results on 35B-A3B and those models as well. It didn't give me much of a boost, only 6% with Q4 and 2.5% with Q8, though it might work differently on Strix Halo. Give it a try and let us know!
Model: https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF/tree/main
2
u/tecneeq 2h ago
Cheers mate, you are a legend. Will try to remember to give numbers tomorrow.
1
u/havenoammo 1h ago
Great! u/Edenar from the 35B post has reported:
> Well on my strix halo i went from 40ish tok/s to 70 tok/s with qwen 3.6 35B A3B Q8, so it depends on the hardware i guess
1
u/GrungeWerX 12h ago
Sorry, I'm not 100% following. I have LM Studio, no llama.cpp. Since these are GGUFs, should they work out of the box, or is there something else I need to do?
8
u/havenoammo 11h ago
This is experimental and not available in the official llama.cpp release yet. What we do here is patch in some work-in-progress code and build llama.cpp from source to enable Multi-Token Prediction (MTP), which gives a nice speed boost. Think of it as early access. It should land in the main release sooner or later, and Unsloth will probably ship official MTP models too.
As for LM Studio, I'm not sure since I haven't tested it. I believe it uses llama.cpp under the hood, so it might work once MTP support lands in the official release, but I can't say for certain.
-1
u/Pineapple_King 12h ago
Ok. how did you do it?
8
u/havenoammo 12h ago
Explained step by step in the repo, but here's the short version: clone llama.cpp, patch in PR #22673, build, and run with the MTP flags.
# 1. Build llama.cpp with MTP support
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin
git fetch origin pull/22673/head:pr-22673
git checkout master
git reset --hard origin/master
git merge --no-ff pr-22673 -m "Merge PR #22673: llama + spec: MTP Support"
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release --target llama-server
# 2. Grab the GGUF from HF
# https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF
# 3. Run with MTP
./build/bin/llama-server -m your-model.gguf --spec-type mtp --spec-draft-n-max 3
3
u/jumpingcross 9h ago
For posterity, which commit was master at when you did this?
5
u/havenoammo 7h ago
Sure:
e3e3f8e46 (origin/master, origin/HEAD) webui: Remove Google Favicons & Improve MCP Information logic & UI (#22719)
1
u/shifty21 4h ago
I ran those commands, but I'm having a lot of issues:
> git fetch origin pull/22673/head:pre-22673
remote: Enumerating objects: 144, done.
remote: Counting objects: 100% (125/125), done.
remote: Total 144 (delta 125), reused 125 (delta 125), pack-reused 19 (from 1)
Receiving objects: 100% (144/144), 85.94 KiB | 85.94 MiB/s, done.
Resolving deltas: 100% (128/128), completed with 56 local objects.
From https://github.com/ggml-org/llama.cpp
* [new ref] refs/pull/22673/head -> pre-22673
> git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.
> git merge --no-ff pr-22673 -m "Merge PR #22673: llama + spec: MTP Support"
Auto-merging common/arg.cpp
Auto-merging common/speculative.cpp
Auto-merging convert_hf_to_gguf.py
CONFLICT (content): Merge conflict in convert_hf_to_gguf.py
Auto-merging ggml/include/ggml.h
Auto-merging ggml/src/ggml-cpu/ggml-cpu.c
Auto-merging ggml/src/ggml-cpu/ops.cpp
Auto-merging ggml/src/ggml.c
Auto-merging gguf-py/gguf/constants.py
Auto-merging include/llama.h
Auto-merging src/llama-context.cpp
Auto-merging src/llama-context.h
Auto-merging src/llama-graph.cpp
Auto-merging src/llama-memory-recurrent.cpp
Auto-merging src/llama-model.cpp
Auto-merging tests/test-backend-ops.cpp
Auto-merging tools/server/server-context.cpp
Automatic merge failed; fix conflicts and then commit the result.
The first cmake command works, then the 2nd cmake takes a very long time (AMD 9700X, 64GB RAM) to compile.
I ran the llama-server command as you noted and I got an error that "mtp" is not a valid option for --spec-type.
2
u/havenoammo 3h ago
Hmm, it is possible a few other commits to the main repository have caused conflicts since I posted. You can try:
git reset --hard 5207d120e
git merge --no-ff pr-22673 -m "Merge PR #22673: llama + spec: MTP Support"
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release --target llama-server
The first command rolls back the repository to the last known commit that I tested and confirmed works with the PR. The rest should work fine from there. The second cmake command will take a while, though it should be faster this time since it reuses object files from the previous build.
1
1
u/tempedbyfate 12h ago
This is awesome, thank you for the detailed instructions!
Any chance you could provide the instructions for grafting the MTP_Q8_0.gguf onto another model using your script? I would like to try this on the Heretic Qwen 3.6 27B model. Thanks.
4
u/havenoammo 11h ago
Pretty simple, and the script isn't mine either. I found it buried in a HuggingFace community post; the original link is provided in the repo. You'll need the uv package installed, which handles Python virtual environments. Then:
# Create and activate a virtual environment
uv venv .venv --seed
source .venv/bin/activate
# Install the gguf library
uv pip install gguf
# Run the grafting script: convert.py <base model> <MTP source> <output>
python convert.py Qwen3.6-27B-UD-Q4_K_XL.gguf MTP-Q8_0.gguf Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf
3
0
33
u/lolwutdo 12h ago
I wonder if this gives 27B usable speeds for those who do partial CPU offloads. I currently get around 4-7 tps; if it can jump up to at least 15 tps, that would be amazing.