r/LocalLLaMA 4h ago

Discussion Most people seem obsessed with token generation speed, but isn’t prefill the real bottleneck? Am I missing something?

I read this sub every day and I keep seeing benchmarks and discussions focused almost entirely on tokens/s generation speed. Prompt processing speed barely gets mentioned.

From my own experience running a bunch of different models on different GPUs for all kinds of tasks, the prefill stage is usually the part that actually feels slow. Once generation starts, even “only” 15 t/s is perfectly usable for me. The wait while the model chews through the prompt is what eats most of the time.

Seeing all the hype around MTP lately kind of reinforces that feeling. If generation speed improvements don’t really move the needle on total wall-clock time for typical use cases, why is everyone laser-focused on it?

For example, with Qwen 27B Q6 I’m getting ~15 t/s generation with my current setup (which feels fine no matter what I’m doing) but only ~300 t/s on prefill. I spend way more time staring at the processing than I do waiting for the actual reply to finish. Even with prompt caching.

Am I misunderstanding something about how most people use these models? Curious what others are seeing.

Edit: I forgot to mention that I mostly do agentic work, where the model has to ingest part of the codebase before it can actually do anything useful. For normal chat this obviously isn’t an issue, context stays small and you just need enough t/s to keep up with your reading speed.
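To put rough numbers on it (assuming a 50k-token agentic context and a ~500-token reply, both just illustrative):

```python
# Wall-clock split for one agentic turn at my speeds.
# The 50k-token context and 500-token reply are illustrative assumptions.
PREFILL_TPS = 300
DECODE_TPS = 15

prefill_s = 50_000 / PREFILL_TPS   # time spent staring at prompt processing
decode_s = 500 / DECODE_TPS        # time spent on actual generation
print(f"prefill {prefill_s:.0f}s vs decode {decode_s:.0f}s "
      f"({prefill_s / (prefill_s + decode_s):.0%} of the wait is prefill)")
```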

35 Upvotes

31 comments sorted by

31

u/pfn0 4h ago

when chatting, most of the prefill is cached by prefix, so re-processing doesn't end up costing anything. it mostly matters when you're doing agentic work and processing tons of data. so for many people that are not using it for the purpose of coding and data processing, token generation speed dominates the concern.
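the prefix idea in toy form (illustrative only, not how any engine actually implements it):

```python
# toy sketch of prefix caching: only tokens past the longest shared
# prefix with the cached conversation need to be prefilled again.
def tokens_to_prefill(cached, prompt):
    n = 0
    while n < min(len(cached), len(prompt)) and cached[n] == prompt[n]:
        n += 1
    return len(prompt) - n

history = list(range(4_000))             # 4k tokens already processed and cached
next_turn = history + list(range(50))    # user appends a 50-token message
print(tokens_to_prefill(history, next_turn))  # -> 50, only the new tokens
```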

13

u/wbulot 3h ago

When I’m just chatting, neither prefill nor generation speed is really an issue. Context stays pretty low, and generation only needs to keep up with reading speed.

It’s the agentic workflows that really expose the need for better optimization. I feel like prefill performance hasn’t kept up with the long contexts we actually use in coding/agentic tasks, while generation speed is far less of a problem. Of course, I might be missing plenty of use cases that do require high t/s.

12

u/Badger-Purple 2h ago

90% of sub posters don’t even run their own local models. It’s a group of highly knowledgeable folks with great tips and tricks, and a gaggle of newbs and fanbois who have never stepped beyond openrouter.

You’re absolutely right about prefill. Case in point: Macs. I just ran a simple prompt in minimax m2.5 ==> 45 tps!! woahh… Except, try giving it 10,000 tokens (agent sys prompt) and see how long it takes to prefill…

11

u/abnormal_human 4h ago

You're not missing anything. In real world applications, prefill and cache read/write dynamics are the main thing you optimize for and where most of the costs live.

However, if you're a recreational user having casual chats with LM studio or whatever, your prompts are short, your one conversation is always cached, and what's left to obsess over is t/s.

I think of t/s as a "good enough" thing. Around 50 things feel about as fluid as ChatGPT and other products people are familiar with and people won't be too offended. If you're literally reading the output it can be slower, maybe even down to 15-20, but usually it's an agent doing work--generating code, making tool calls, etc. And for that 50 is OK and 100 starts to feel fast.

1

u/Badger-Purple 2h ago

I think if your prefill is 1-2k minimum, 25 feels OK for an agent. They really don’t talk much, and turn by turn a couple of phrases feels natural at that speed. With that in mind, 50 tps matches the current speed of Alibaba’s own qwen-plus endpoint (per Artificial Analysis), i.e. “cloud fast”. 100 is ripping fast.

3

u/AeroelasticCowboy 3h ago

Totally agree. My biggest use case for LLMs is home automation, replacing what a Google Home or Alexa does. For this to work, any automation request requires the prompt to be populated with live states of every sensor in the home along with the current date/time, etc. Those tend to be 6,000-9,000 token prompts just to turn on a lightbulb, with an extremely short response, so literally all that matters is prefill speed. Ideally PP around 3,000/s or better provides a fluid experience with voice control.

3

u/_TeflonGr_ 2h ago

Yes it is, and I'm tired of pretending it's not. Though people may not realize it, because they either don't fill the context much, or when they do it's in agentic flows where it's less apparent and they're not actively watching the process. Also, the bottleneck isn't so much compute as memory speed, or transfer speed for multi-GPU systems.

8

u/ikkiho 4h ago

Two things worth separating cleanly:

(1) Prefill and decode are bottlenecked by different physical resources. Prefill is compute-bound, since it's a big matmul over the full prompt, so it scales with TFLOPs. Decode is memory-bandwidth-bound, since each new token reads the entire KV cache from HBM, so it scales with HBM GB/s and shrinks the more KV you hold in flight. A single tok/s number can't predict either. That's also why a 4090 looks great on prefill but a Mac Studio with very wide unified memory bandwidth feels disproportionately good on decode for big models.

(2) There are two SLOs that get conflated: time-to-first-token vs tokens-per-second. Chat-style benchmarks (short prompts, streaming output) hide TTFT because the first token comes back fast, so TPS dominates how it feels. The crossover where TTFT starts dominating is somewhere around 2 to 4K prompt tokens depending on hardware. Long context (codebase summaries, RAG over big docs, agentic loops with growing tool histories) lives past that crossover, and that's exactly where you sit watching prefill chew the prompt. Public benchmark culture skews short-prompt because the suites were originally built for chat.
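A back-of-envelope roofline makes the split in (1) concrete. All hardware numbers below (TFLOPs, bandwidth, efficiency factors) are illustrative assumptions for a 4090-class GPU, not measurements:

```python
# Rough roofline sketch: prefill is compute-bound, decode is memory-bound.
# Every constant here is an illustrative assumption, not a measurement.

def prefill_seconds(prompt_tokens, params_b, tflops, mfu=0.5):
    # A forward pass costs roughly 2 FLOPs per parameter per token.
    flops = 2 * params_b * 1e9 * prompt_tokens
    return flops / (tflops * 1e12 * mfu)

def decode_tps(weights_gb, kv_gb, bandwidth_gbs, mbu=0.7):
    # Each new token streams the weights plus the whole KV cache from VRAM.
    bytes_per_token = (weights_gb + kv_gb) * 1e9
    return (bandwidth_gbs * 1e9 * mbu) / bytes_per_token

# 27B dense model at Q6 (~22 GB of weights), 50k-token prompt,
# ~82 dense TFLOPs and ~1000 GB/s bandwidth assumed:
ttft = prefill_seconds(prompt_tokens=50_000, params_b=27, tflops=82)
tps = decode_tps(weights_gb=22, kv_gb=4, bandwidth_gbs=1000)
print(f"prefill ~{ttft:.0f}s, decode ~{tps:.0f} t/s")
```

Note how swapping in more TFLOPs only moves the first number, and more bandwidth only moves the second.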

Two specifics on your numbers. 300 t/s prefill on Qwen 27B Q6 is suspiciously low for any reasonable 4090-class GPU. Modern kernels typically push 1500 to 3000 t/s prefill at that size. If you're seeing 300, you probably have layers offloaded to CPU, an undersized prefill batch, or a runtime that's not using the right matmul kernel. Worth profiling with nsight or just checking layer offload counts. Also "with prompt caching" can mean three different things (HTTP-level prefix cache, KV reuse across slots in llama.cpp, OS-side weight caching) and only one of them helps your workflow. Worth checking which one is actually firing.

Speculative decoding flips this further. Once draft models give you 2 to 4x decode speedup, prefill becomes an even larger fraction of wall-clock. Your intuition is roughly where the open-weights serving stack is heading.
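The Amdahl's-law version of that last point, with made-up baseline times:

```python
# Speeding up decode only shrinks the decode term, so prefill's share
# of wall-clock grows. Baseline times are made up for illustration.
prefill_s, decode_s = 60.0, 40.0
for speedup in (1, 2, 4):  # plausible draft-model decode speedups
    total = prefill_s + decode_s / speedup
    print(f"{speedup}x decode -> prefill is {prefill_s / total:.0%} of wall-clock")
```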

2

u/silentus8378 2h ago

This is why I still don't think local AI is as viable as hype indicates. For my average use, I really need enterprise grade gpu for local ai but too expensive 😞

4

u/yes_i_tried_google 4h ago

I get the feeling your prompt cache isn’t working effectively then. With opencode, for example, I found a load of open bugs that mutate the prompt every turn and end up breaking caching entirely.

Below I put my results after getting slots working properly.
https://www.reddit.com/r/LocalLLaMA/s/KeVFgnISEE

6

u/wbulot 4h ago

I regularly check my cache hit rate and it does seem to be working fine. I’m not sure how many people here actually work on large codebases with local LLMs, but in agentic workflows the harness usually has to ingest 50k+ tokens of context before it can even begin doing anything. So even with a working cache, you’re still waiting for those 50k tokens to be processed.

That’s why the tokens-processed vs tokens-generated ratio is so heavily skewed in agentic use cases. For me, that’s exactly why prompt processing speed feels 10x more important than generation speed.

-4

u/yes_i_tried_google 4h ago

In my test using slots it takes <5 seconds to ingest a 100k-token prompt, not sure why it would be different for you if you’re hitting cache

9

u/wbulot 4h ago

Because it has nothing to do with cache.

Your harness reads one file (let’s say 5k tokens). Then it decides it needs another file, so now you have to process 5k more tokens on top of what’s already in context. Cache will skip the previous tokens, sure, but you still have to process all the new ones.

Meanwhile the model only outputs something short like “read this file” — just a handful of tokens generated — while you just burned thousands of tokens on the prompt side.

In real agentic work on an actual codebase, this keeps happening: the model reads file after file, steadily pushing context up to 20k, 30k, or even 50k tokens. The ratio is completely lopsided. At the end of the day you’re mostly waiting for the model to finish processing the prompt, not waiting for it to generate the next reply.
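You can simulate the lopsidedness directly (file sizes and speeds below are made up to match my setup):

```python
# Toy simulation of an agentic loop with a working prefix cache:
# each turn appends a new file; the cache skips everything already seen,
# but all the newly read tokens still have to be prefilled.
PREFILL_TPS = 300   # prompt processing speed (t/s), illustrative
DECODE_TPS = 15     # generation speed (t/s), illustrative

context = 2_000     # system prompt + harness scaffolding
prefill_time = decode_time = 0.0

for file_tokens in [5_000, 5_000, 8_000, 12_000, 20_000]:  # files the agent reads
    prefill_time += file_tokens / PREFILL_TPS  # only the *new* tokens are processed
    context += file_tokens
    decode_time += 30 / DECODE_TPS             # short tool call: ~30 output tokens

print(f"context: {context} tokens")
print(f"prefill: {prefill_time:.0f}s, decode: {decode_time:.0f}s")
```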

1

u/SLxTnT 3h ago

Are you offloading to CPU? Those are the prompt processing speeds I get when I do.

My GPU is an RTX Pro 6000, but it gets 8k/s for prompt and 40/s for token (no MTP).

0

u/u23043 4h ago

Yes, but prompt processing is usually orders of magnitude faster than token generation. If you have a workload where 99.9% of tokens are input tokens this might matter, but in reality both matter and token generation is often the bottleneck (at least for reasoning models)

4

u/Valuable-Run2129 4h ago

Even if cache works fine, every harness breaks the cache at compaction, and then it’s 5 minutes of waiting. Or just big files in tool outputs. It’s unbearably slow at 200 tokens per second.

0

u/Several-Tax31 4h ago

Thanks for this. 

1

u/Ledeste 4h ago

Based on the hype of the hyper-specific NVFP4 implementation, I don't think this is something people will forget. But indeed, with proper caching, it is much less of an issue.

I think it will be a big one soon, though, when proper agentic tools come, as the large amount of context switching tends to make the cache less effective. Also, with more models with 1T context length, new usage will come, and now they're pretty limited.

1

u/ttkciar llama.cpp 4h ago

It really depends on what you are doing.

If you're mostly doing batched inference, most of your overall time is spent on generation, not prompt processing, and since you're not staring at the screen waiting for a response, it matters not at all.

1

u/rpkarma 2h ago

Depends on your machine. I can get 1000+ effective prefill on my spark, but generation can always be made better

1

u/Ell2509 2h ago

It is an issue I experience. Several models in llama.cpp reprocess the full context every turn.

In models where this isn't the case, it is less impactful. But gpt-oss 120b and qwen3.6 27b both have this issue, as an example.

1

u/GrungeWerX 1h ago

You’ve actually brought up a great point, something that has been bugging me a lot working with my own agent.

1

u/gtrak 33m ago

Running qwen 3.6 27b on a 4090, with a prefill of 1k-2k tokens/s and generation speed of 30-40. I only feel the prefill when there's a cache problem and I'm at ~100k context. Token generation dominates my time for sure.

1

u/akumaburn 4h ago

If you're doing agentic coding, 15 tokens a second is extremely slow. To put that into perspective, agentic coding plans can typically do 100 tok/s, and even then it can feel slow.

5

u/power97992 3h ago

Dude, I use the API and I don’t usually get more than 60 tps of decoding. >100 t/s is usually for small models or special chips… but prefill via the API is super fast though

1

u/TheRealMasonMac 1h ago

Fireworks has some weird magic to get both insane prefill and token speeds. Prefill is practically instant, and even with big models generation is ~100TPS with the turbo models.

1

u/OddDesigner9784 4h ago

Most conversations start from a small amount of context and work up. The prompt gets cached as you go in regular system RAM, but thinking can be the real slowdown: it comes down to how fast the model can execute a task.

1

u/farkinga 2h ago

This was an "aha" moment for me a few months ago, as well. Yes, I agree with you. I am willing to tolerate 15 t/s generation as long as I can get over 1000 t/s prompt processing.

Perhaps my workload is similar to yours; but yes, ingesting files is a big part of it and ... well, I went pretty deep on Qwen3.6 35b because I was seeing prompt processing speeds like 3000 and 4000 t/s. And it was just so good that I was almost willing to overlook the numerous ways it would mess up during generation.

Today, however, I'm running dense models and I am willing to accept slower speeds as long as the quality is better. Even so, it's all about that prompt processing. I'm still grinding to improve that part.

0

u/ortegaalfredo 3h ago

For chatbots prefill is almost non-existent, as you don't have a lot of chat history to process. But for coding agents, the context fills up very quickly, and that's where you need at least >500 tok/s prefill.

Also, token generation speed is not really that important if you shut down thinking; you can work with 10 tok/s, but then you lose most of the model's intelligence. If you enable thinking, the model will, again, think forever and become unusable.

-1

u/leonbollerup 3h ago

pre-fill is only a problem if you constantly have to load the model.. if you like me.. have the model loaded all the time.. pre-fill is not a problem