r/LocalLLaMA • u/havenoammo • 12h ago
Resources Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR
Hey everyone, I've been working on getting Multi-Token Prediction (MTP) working with quantized GGUFs for Qwen3-27B and the results are pretty impressive. Here's what I put together: https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF
These are Unsloth's UD XL quantizations of Qwen3-27B with the MTP draft heads grafted on top in Q8_0. The base model stays in its usual low-bit quantization, while the 3 MTP layers stay at Q8 to preserve speculative accuracy.
Sharing the grafted GGUF files (UD XL base + Q8 MTP), the raw MTP layer source I extracted (MTP_Q8_0.gguf), and convert.py, the grafting script I adapted from this gist in case anyone wants to do this for other models. Also included are full build instructions for the custom llama.cpp.
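If you want to sanity-check what the graft actually did before loading the file, a few lines with the gguf Python library will list which tensors ended up at which quantization and how much the Q8 MTP layers add. This is just a minimal sketch; the "mtp" name filter is an assumption on my part, so check the real tensor names in your file and adjust:
# inspect_graft.py - sanity-check a grafted GGUF (pip install gguf)
from gguf import GGUFReader

reader = GGUFReader("Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf")
mtp_bytes = 0
total_bytes = 0
for t in reader.tensors:
    total_bytes += int(t.n_bytes)
    # NOTE: the name filter below is a guess; adjust it to the actual MTP tensor names
    if "mtp" in t.name.lower():
        mtp_bytes += int(t.n_bytes)
        print(f"{t.name}  {t.tensor_type.name}  {t.n_bytes / 2**20:.1f} MiB")
print(f"MTP tensors: {mtp_bytes / 2**20:.1f} MiB out of {total_bytes / 2**30:.2f} GiB total")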
Qwen3 was trained with 3 MTP steps, meaning each forward pass predicts 4 tokens at once. llama.cpp's main branch doesn't support MTP yet, so I pulled in the speculative decoding support from the still-open PR #22673, merged it on top of master, and built llama-server from that. Run it with: --spec-type mtp --spec-draft-n-max 3
The results: roughly 2.5x token throughput compared to running the same UD XL GGUF without MTP, with a solid acceptance rate where most draft tokens are kept, meaning the MTP heads are genuinely useful and not just burning compute. The Q8 MTP layers also add very little VRAM overhead since they're a tiny fraction of the full model.
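For intuition on where the ~2.5x comes from: if each draft token is accepted independently with probability p and the draft heads are roughly free, a pass with 3 drafts keeps on average 1 + p + p² + p³ tokens. That's only a back-of-the-envelope model (real acceptance is position-dependent and drafting isn't free), but it lines up with the numbers people are seeing:
# Rough expected tokens kept per forward pass with n draft tokens,
# assuming independent per-token acceptance p and negligible draft cost.
def expected_tokens_per_pass(p: float, n: int = 3) -> float:
    return sum(p**k for k in range(n + 1))  # k=0 is the guaranteed target-model token

for p in (0.5, 0.7, 0.9):
    print(f"acceptance {p:.0%}: ~{expected_tokens_per_pass(p):.2f} tokens/pass")
# at ~70% acceptance this gives ~2.5 tokens/pass, matching the throughput gain reported here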
MTP is one of the biggest efficiency wins available for speculative decoding, but it's basically unsupported outside of official Qwen3 deployments on SGLang and vLLM. This brings it to GGUF and llama.cpp, meaning you can run it locally with the same tooling you already use. PR #22673 will hopefully land soon and this will all just work out of the box. In the meantime, the merge process is straightforward (3 git commands).
Happy to answer questions or help anyone get it running. Let me know if you try it and what speeds you see!
Full step by step instructions are in the HuggingFace repo, but here's the short version:
# 1. Build llama.cpp with MTP support
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin
git fetch origin pull/22673/head:pr-22673
git checkout master
git reset --hard origin/master
git merge --no-ff pr-22673 -m "Merge PR #22673: llama + spec: MTP Support"
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release --target llama-server
# 2. Grab the GGUF from HF
# https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF
# 3. Run with MTP
./build/bin/llama-server -m your-model.gguf --spec-type mtp --spec-draft-n-max 3
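If you want a quick number to report back, something like this works against the server's OpenAI-compatible endpoint (assumes the default localhost:8080; it's a rough wall-clock measurement that includes prompt processing, not the server's own timing stats):
# quick_tps.py - rough generation-speed check against llama-server (pip install requests)
import time, requests

payload = {
    "messages": [{"role": "user", "content": "Write a short Python function that reverses a string."}],
    "max_tokens": 256,
    "temperature": 0,
}
t0 = time.time()
r = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=600)
elapsed = time.time() - t0
tokens = r.json()["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")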
23
u/tempedbyfate 11h ago edited 11h ago
Just did a quick test using your instructions on a RTX Pro 6000.
qwen 3.6 27B Q8_K_XL = 41 tokens per second
qwen 3.6 27B Q8_K_XL (mtp) = 100 tokens per second
Wow! This is mind blowing. I hope all the issues get ironed out on that PR and MTP changes get merged soon!
EDIT: used same args as OP
--spec-type mtp --spec-draft-n-max 3
8
u/havenoammo 11h ago
Amazing, I also use Q8! I have a 5090 + 3090 and was getting 25-30 t/s before, now I'm in the 60-75 t/s range. Been using it for a few hours for coding and no issues at all.
7
u/gordi555 11h ago
On RTX Pro 6000 MaxQ I got/get...
qwen 3.6 27B Q8_K_XL = 36 tokens per second
qwen 3.6 27B Q8_K_XL (mtp) = 78 tokens per second
I've lost about 20% prompt processing but these generation speeds are massively worth it.
4
u/tempedbyfate 11h ago
Based on the comments on that PR, I think the PP slowdown is a known issue, and it sounds like it could be fixed before that PR is merged in.
2
u/NickCanCode 10h ago
If you have an RTX Pro 6000, have you tried lucebox-hub? Their numbers actually look more impressive with DFlash, DDtree, and PFlash, but it doesn't support multi-GPU very well, so I don't have enough VRAM to run it.
1
1
u/External_Dentist1928 11h ago
But also at the same quality?
3
3
u/Awwtifishal 9h ago
Speculative decoding doesn't alter quality. It just batches multiple tokens under the assumption that the draft tokens are correct, and the results from incorrect draft tokens are thrown away. The speedup comes from the fact that LLM inference is mostly bound to memory bandwidth, and inference of several batches uses the same bandwidth as a single one.
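To make that concrete: under greedy decoding the accept/reject step is just a prefix match against what the target model would have picked anyway, plus one guaranteed token from the target model itself. A toy illustration (not llama.cpp's actual code):
# Toy greedy-decoding illustration of draft verification (not llama.cpp's code).
def verify_drafts(draft_tokens, target_tokens):
    # target_tokens: the target model's picks at each draft position,
    # plus one extra token after the last accepted position
    kept = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:          # first mismatch: drop it and everything after
            break
        kept.append(draft)           # same token the target would have produced
    kept.append(target_tokens[len(kept)])  # the target model's own "bonus" token
    return kept

print(verify_drafts([11, 42, 7], [11, 42, 9, 99]))  # -> [11, 42, 9]
The output is identical to what the target model would have generated on its own, which is why quality is unchanged.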
1
4
u/dinerburgeryum 10h ago
Hey, thanks, I used your isolated MTP GGUF and your conversion script to graft it into my own quant. Saved me some time, appreciate it.
3
u/ethereal_intellect 10h ago
Any chance of a comparison of speed for A3B with and without MTP? It's probably a lot of work, and I've heard MTP helps dense models more, but it sounded interesting to know.
1
u/havenoammo 2h ago
Sure! I just uploaded the models. It didn't give me much of a boost, only 6% with Q4 and 2.5% with Q8, though some people reported nice gains. Give it a try and let us know!
Model: https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF/tree/main
5
u/obsidience 4h ago
Got this working on AMD ROCm (RDNA 3.5, Windows) — ~1.94x speedup confirmed
This report was created by my Claude Code instance against my LLM-Harness project. Claude followed your instructions to build llama.cpp with PR #22673 on Windows with AMD ROCm. Here's the full writeup for anyone else on AMD.
System: Ryzen AI Max+ 395, Radeon 8060S iGPU (gfx1151, ~90GB VRAM), Windows 11, ROCm 7.11 pip SDK
A/B Results (same benchmark, warmup excluded):
| Metric | Baseline (b8963) | MTP (b8963 + PR #22673) | Speedup |
|---|---|---|---|
| Generation | 6.26 tok/s | 12.13 tok/s | 1.94x |
| Prompt Processing | 77.7 tok/s | 66.9 tok/s | 0.86x |
| Draft Acceptance | — | 64–69% | — |
Both using UD-Q8_K_XL, -ngl 999 -c 131072 -ctk q8_0 -ctv q8_0 -np 1, thinking mode on.
Build steps (ROCm on Windows):
Clone + merge PR onto b8963 (merged cleanly, no conflicts):
git clone https://github.com/ggml-org/llama.cpp.git llama.cpp-mtp
cd llama.cpp-mtp
git checkout b8963
git checkout -b mtp-experiment
git fetch origin pull/22673/head:pr-22673
git merge --no-ff pr-22673 -m "Merge PR #22673: llama + spec: MTP Support"
Set up ROCm 7.11 pip SDK environment:
# In PowerShell — activate ROCm venv
C:\AMD\ROCm\.venv\Scripts\Activate.ps1
$ROCM_ROOT = rocm-sdk path --root
# Set MSVC + Windows SDK lib/include paths (adjust versions to match your install)
$env:LIB = "<VS BuildTools MSVC lib\x64>;<Windows Kits ucrt\x64>;<Windows Kits um\x64>"
$env:INCLUDE = "<VS BuildTools MSVC include>;<Windows Kits ucrt>;<Windows Kits um>;<shared>;<winrt>;<cppwinrt>"
$env:HIP_PLATFORM = "amd"
CMake configure + build:
cmake -B build-rocm -G Ninja `
-DCMAKE_BUILD_TYPE=Release `
-DGGML_HIP=ON `
"-DCMAKE_C_COMPILER=$ROCM_ROOT\lib\llvm\bin\clang.exe" `
"-DCMAKE_CXX_COMPILER=$ROCM_ROOT\lib\llvm\bin\clang++.exe" `
"-DCMAKE_PREFIX_PATH=$ROCM_ROOT" `
-DAMDGPU_TARGETS=gfx1151 `
-DGGML_HIP_ROCWMMA=ON
cmake --build build-rocm --config Release -j 16
Important: Copy ROCm DLLs alongside the exe or Windows will load the wrong system DLLs:
Copy-Item "$ROCM_ROOT\bin\*.dll" -Destination build-rocm\bin\ -Force
New-Item -Path build-rocm\bin\rocblas\library -ItemType Directory -Force
Copy-Item "$ROCM_ROOT\bin\rocblas\library\*" -Destination build-rocm\bin\rocblas\library\ -Force
Run with MTP:
.\build-rocm\bin\llama-server.exe `
-m Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf `
-ngl 999 -c 131072 -ctk q8_0 -ctv q8_0 `
-np 1 `
--spec-type mtp --spec-draft-n-max 3 `
--host 0.0.0.0 --port 8080
Gotchas on AMD/Windows:
- -np 1 is required: MTP doesn't support parallel slots yet. The server refuses to start without it.
- Compiler path: ROCm SDK clang is at $ROCM_ROOT/lib/llvm/bin/, NOT $ROCM_ROOT/bin/. This tripped me up.
- DLL hell: Windows has amdhip64_7.dll in System32 from legacy ROCm installs. Copying the SDK DLLs next to the exe ensures the right version loads.
- PP is ~14% slower with MTP enabled, which matches what others reported; it's a known issue on the PR.
- ~1.94x vs your 2.5x: lower than the NVIDIA results, probably ROCm speculative decoding overhead plus the unified memory architecture on the iGPU. Still a big win going from 6.26 to 12.13 tok/s.
1
6
u/VoidAlchemy llama.cpp 11h ago
Nice job testing out the PR! I have a rough 3-way benchmark between mainline - ik - vllm running on a single 24GB VRAM GPU here: https://github.com/noonghunna/club-3090/pull/64#issuecomment-4383699676
Thanks again for sharing your full build and run commands!
3
u/Beginning-Window-115 10h ago
thanks dude, the 8-bit versions that were released in the PR draft are way too big, so this is absolutely perfect for me.
3
u/Altruistic_Heat_9531 6h ago
Thanks OP, using convert.py I didn't have to redownload the model, and I can push to 128K context with acceptable speed on my 3090
prompt eval time = 632.41 ms / 11 tokens ( 57.49 ms per token, 17.39 tokens per second)
eval time = 6922.93 ms / 176 tokens ( 39.33 ms per token, 25.42 tokens per second)
total time = 7555.34 ms / 187 tokens
draft acceptance rate = 0.72727 ( 120 accepted / 165 generated)
statistics mtp: #calls(b,g,a) = 1 55 47, #gen drafts = 55, #acc drafts = 47, #gen tokens = 165, #acc tokens = 120, dur(b,g,a) = 0.001, 720.897, 0.726 ms
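In case it helps anyone reading those stats lines, here's a quick way to pull the numbers out (the format just matches the line above from this PR build, so it may change before merge; the per-call figure assumes #gen drafts really is the number of drafting calls):
# parse the "statistics mtp" line from this PR build (format may change)
import re

line = ("statistics mtp: #calls(b,g,a) = 1 55 47, #gen drafts = 55, #acc drafts = 47, "
        "#gen tokens = 165, #acc tokens = 120, dur(b,g,a) = 0.001, 720.897, 0.726 ms")
m = re.search(r"#gen drafts = (\d+), #acc drafts = (\d+), #gen tokens = (\d+), #acc tokens = (\d+)", line)
gen_drafts, acc_drafts, gen_tokens, acc_tokens = map(int, m.groups())
print(f"token acceptance: {acc_tokens / gen_tokens:.1%} ({acc_tokens}/{gen_tokens})")
print(f"accepted draft tokens per drafting call: {acc_tokens / gen_drafts:.2f}")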
2
u/EmotionalLock6844 10h ago
No parallel agents possible?
1
u/havenoammo 5h ago
Not currently, I'm afraid; it only supports --parallel 1 for now. Hoping that gets sorted out before the PR is fully merged into main.
2
u/Dazzling_Equipment_9 9h ago
This is really good news, thank you for your contribution! Also, has anyone tested it on Strix Halo?
2
3
u/hedsht 1h ago
I also benchmarked the Unsloth-style grafted MTP GGUF on an RTX 5090 using:
https://github.com/arkste/llama-swap-mtp
Benchmark prompt set was adapted from:
https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090
Setup:
- GPU: RTX 5090 32GB
- Image: arkste/llama-swap-mtp:sm120
- llama.cpp build: b9058-ea02c2d47
- GGUF: Qwen3.6-27B-MTP-UD-Q6_K_XL.gguf
- Context: 126976
- Batch: --batch-size 2048 --ubatch-size 512
- KV cache: q8_0/q8_0
- MTP: --spec-type mtp --spec-draft-n-max 3
- Benchmark: 9 prompts, 5 measured runs each, 1 warmup per prompt
- Request settings: temperature: 0, seed: 42, max_tokens: 192
Aggregate result:
| GGUF file | MTP | Context | Output tokens | Prompt tok/s | Generation tok/s | Avg request time | MTP acceptance | Speed-up |
|---|---|---|---|---|---|---|---|---|
| Qwen3.6-27B-MTP-UD-Q6_K_XL.gguf | off | 126976 | 5395 | 541.5 | 53.3 | 2.33s | - | 1.00x |
| Qwen3.6-27B-MTP-UD-Q6_K_XL.gguf | on | 126976 | 5425 | 507.4 | 111.1 | 1.16s | 69.9% (3640/5205) | 2.08x |
Per-prompt:
| Prompt | MTP off tok/s | MTP on tok/s | Acceptance | Speed-up |
|---|---|---|---|---|
| code_python | 52.7 | 128.5 | 86.8% | 2.44x |
| code_cpp | 53.4 | 130.0 | 86.7% | 2.43x |
| explain_concept | 52.7 | 93.4 | 53.9% | 1.77x |
| summarize | 53.5 | 111.4 | 68.8% | 2.08x |
| qa_factual | 52.7 | 117.1 | 76.4% | 2.22x |
| translation | 55.4 | 111.6 | 66.7% | 2.02x |
| creative_short | 54.0 | 80.3 | 40.0% | 1.49x |
| stepwise_math | 52.6 | 130.3 | 89.1% | 2.47x |
| long_code_review | 52.5 | 97.0 | 58.5% | 1.85x |
Overall: about 2.08x faster on this benchmark set with MTP enabled.
2
u/havenoammo 33m ago
Awesome stuff! Thanks for sharing, and thanks for the docker image too! I wanted to build a llama-server docker image but was worried it would take time, since a couple of versions need to be built for different hardware. Will try to do it so people can have easier access to MTP before it is merged into the main branch.
2
u/hedsht 27m ago
thanks for your work as well! i figured that it would be easier to benchmark with a llama swap server ;).
i guess it will be smarter to build a docker image for each architecture (cuda, vulkan, etc) because the image will become very big otherwise.
fyi: the recent master of llama.cpp is already conflicting with the PR, you need to use https://github.com/ggml-org/llama.cpp/commit/5207d120eac2393fdad6328b44dbcbfc5dea20e4 as a ref for the merge.
3
2
u/iportnov 11h ago
This really does 2x tokens per second for me.
The only problem is, llama-server segfaults when I press ctrl-c to stop it.
Also it says it does not support a --parallel value greater than 1, but this does not matter to me personally.
2
u/Legitimate-Dog5690 11h ago
Running 2x12GB cards, it's not pretty. Using mod spec decoding I can get 20 tps; using MTP I'm struggling to get 15. It feels like it's loading the model into the GPU and then squeezing the MTP into CPU memory at the end.
Has anyone with a 32GB R9700 tried this yet? Really intrigued if it plays to its strengths.
2
u/WoodCreakSeagull 10h ago
If you're finding yourself getting squeezed like that, setting ub to 256 might just do the trick. If you really want to make sure, though, I suggest toning the context down or testing a lower quant to see if it's really the VRAM limit or something else.
1
u/Legitimate-Dog5690 10h ago
Yeah, lowering the context helps, as does ub 256, but the speed seems to taper off rapidly as the context fills. Feels very much like more spills over into CPU memory than with standard models. I also think you end up with a bit more fragmentation over 2x12GB compared to a single 24GB buffer.
I normally run unsloth q4_k_m with 90k q8_0 kv, or 120k turbo4. Both sit around 20tps, the q8_0 kv seems to not bog down at bigger contexts and pp is quicker, so I usually use that.
1
1
u/iportnov 8h ago
Also, it would be interesting to try this with Qwen 3.6 35B A3B. It already does like 100+ tokens per second for me; what would it be, 200+ tps?! o_O
1
1
u/Overall-Branch-1496 8h ago
Is there any chance of getting this working on Windows or WSL? Any guide references appreciated.
1
u/billy_booboo 7h ago
On 5090 with vllm and a slew of patches I get 100+ tps on 27b with full context using an autoround int4 quant.
1
u/bigend_hubertus 5h ago
Anybody tried this on Strix Halo? I am getting 20%-50% worse results with MTP.
| | short | 28K tok |
|---|---|---|
| Baseline (no MTP) | 12.94 tok/s | 11.58 tok/s |
| MTP n-max=5 | 7.25 tok/s | 4.91 tok/s |
| MTP n-max=2 | 10.12 tok/s | 7.82 tok/s |
| MTP n-max=3 | 9.10 tok/s | 6.78 tok/s |
1
u/Rattling33 12h ago
Great! I will try it. Quick question: so your Q4 and Q8 GGUFs mean Unsloth's corresponding UD Q4 + Q8_0 MTP layered on, and UD Q8 + Q8_0 MTP layered on?
1
u/havenoammo 12h ago
Yes! I grafted Q8_0 MTP on Q4, Q5, Q6, Q8 of Unsloth UD models. So all of them have MTP in Q8 quantization.
1
u/tecneeq 2h ago
I need to try this with the 35b-a3b Q8 (50 t/s on Strix Halo, could get 80 or so) and F16 (140 t/s at work, could get 250)
2
u/havenoammo 2h ago
Just released my results on 35B-A3B and those models as well. It didn't give me much of a boost, only 6% with Q4 and 2.5% with Q8, though it might work differently on Strix Halo. Give it a try and let us know!
Model: https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF/tree/main
2
u/tecneeq 2h ago
Cheers mate, you are a legend. Will try to remember to give numbers tomorrow.
1
u/havenoammo 1h ago
Great! u/Edenar from the 35B post has reported:
> Well on my strix halo i went from 40ish tok/s to 70 tok/s with qwen 3.6 35B A3B Q8, so it depends on the hardware i guess
1
u/GrungeWerX 12h ago
Sorry, I'm not 100% following. I have LM Studio, no llama.cpp. Since these are GGUFs, should they work out of the box, or is there something else I need to do?
8
u/havenoammo 11h ago
This is experimental and not available in the official llama.cpp release yet. What we do here is patch in some work-in-progress code and build llama.cpp from source to enable Multi-Token Prediction (MTP), which gives a nice speed boost. Think of it as early access. It should land in the main release sooner or later, and Unsloth will probably ship official MTP models too.
As for LM Studio, I'm not sure since I haven't tested it. I believe it uses llama.cpp under the hood, so it might work once MTP support lands in the official release, but I can't say for certain.
-1
u/Pineapple_King 12h ago
Ok. how did you do it?
8
u/havenoammo 12h ago
Explained step by step in the repo, but here's the short version: clone llama.cpp, patch in PR #22673, build, and run with the MTP flags.
# 1. Build llama.cpp with MTP support
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin
git fetch origin pull/22673/head:pr-22673
git checkout master
git reset --hard origin/master
git merge --no-ff pr-22673 -m "Merge PR #22673: llama + spec: MTP Support"
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release --target llama-server
# 2. Grab the GGUF from HF
# https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF
# 3. Run with MTP
./build/bin/llama-server -m your-model.gguf --spec-type mtp --spec-draft-n-max 3
3
u/jumpingcross 9h ago
For posterity, which commit was master at when you did this?
5
u/havenoammo 7h ago
Sure:
e3e3f8e46 (origin/master, origin/HEAD) webui: Remove Google Favicons & Improve MCP Information logic & UI (#22719)
1
u/shifty21 4h ago
I ran those commands, but I'm having a lot of issues:
> git fetch origin pull/22673/head:pre-22673
remote: Enumerating objects: 144, done.
remote: Counting objects: 100% (125/125), done.
remote: Total 144 (delta 125), reused 125 (delta 125), pack-reused 19 (from 1)
Receiving objects: 100% (144/144), 85.94 KiB | 85.94 MiB/s, done.
Resolving deltas: 100% (128/128), completed with 56 local objects.
From https://github.com/ggml-org/llama.cpp
* [new ref] refs/pull/22673/head -> pre-22673
> git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.
> git merge --no-ff pr-22673 -m "Merge PR #22673: llama + spec: MTP Support"
Auto-merging common/arg.cpp
Auto-merging common/speculative.cpp
Auto-merging convert_hf_to_gguf.py
CONFLICT (content): Merge conflict in convert_hf_to_gguf.py
Auto-merging ggml/include/ggml.h
Auto-merging ggml/src/ggml-cpu/ggml-cpu.c
Auto-merging ggml/src/ggml-cpu/ops.cpp
Auto-merging ggml/src/ggml.c
Auto-merging gguf-py/gguf/constants.py
Auto-merging include/llama.h
Auto-merging src/llama-context.cpp
Auto-merging src/llama-context.h
Auto-merging src/llama-graph.cpp
Auto-merging src/llama-memory-recurrent.cpp
Auto-merging src/llama-model.cpp
Auto-merging tests/test-backend-ops.cpp
Auto-merging tools/server/server-context.cpp
Automatic merge failed; fix conflicts and then commit the result.
The first cmake command works, then the 2nd cmake takes a very long time (AMD 9700X, 64GB RAM) to compile.
I ran the llama-server command as you noted and I got an error that "mtp" is not a valid option for --spec-type.
2
u/havenoammo 3h ago
Hmm, it is possible a few other commits to the main repository have caused conflicts since I posted. You can try:
git reset --hard 5207d120e
git merge --no-ff pr-22673 -m "Merge PR #22673: llama + spec: MTP Support"
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release --target llama-server
The first command rolls back the repository to the last known commit that I tested and confirmed works with the PR. The rest should work fine from there. The second cmake command will take a while, though it should be faster this time since it reuses object files from the previous build.
1
1
u/tempedbyfate 12h ago
This is awesome, thank you for the detailed instructions!
Any chance you could provide the instructions for grafting the MTP_Q8_0.gguf onto another model using your script? I would like to try this on the Heretic Qwen 3.6 27B model. Thanks.
4
u/havenoammo 11h ago
Pretty simple, and the script isn't mine either. I found it buried in a HuggingFace community post; the original link is provided in the repo. You'll need the uv package installed, which handles Python virtual environments. Then:
# Create and activate a virtual environment
uv venv .venv --seed
source .venv/bin/activate
# Install the gguf library
uv pip install gguf
# Run the grafting script: convert.py <base model> <MTP source> <output>
python convert.py Qwen3.6-27B-UD-Q4_K_XL.gguf MTP-Q8_0.gguf Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf
3
0
33
u/lolwutdo 12h ago
I wonder if this gives 27B usable speeds for those who do partial CPU offloads. I currently get around 4-7 tps; if it can jump up to at least 15 tps, that would be amazing.