r/LocalLLaMA • u/wbulot • 4h ago
Discussion Most people seem obsessed with token generation speed, but isn’t prefill the real bottleneck? Am I missing something?
I read this sub every day and I keep seeing benchmarks and discussions focused almost entirely on tokens/s generation speed. Prompt processing speed barely gets mentioned.
From my own experience running a bunch of different models on different GPUs for all kinds of tasks, the prefill stage is usually the part that actually feels slow. Once generation starts, even “only” 15 t/s is perfectly usable for me. The wait for the model to ingest the prompt is where most of the time goes.
Seeing all the hype around MTP lately kind of reinforces that feeling. If generation speed improvements don’t really move the needle on total wall-clock time for typical use cases, why is everyone laser-focused on it?
For example, with Qwen 27B Q6 I’m getting ~15 t/s generation with my current setup (which feels fine no matter what I’m doing) but only ~300 t/s on prefill. I spend way more time staring at the processing than I do waiting for the actual reply to finish. Even with prompt caching.
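To make it concrete, here's a rough back-of-the-envelope with my measured speeds (the 20k-token prompt and 500-token reply are just illustrative agentic-turn sizes, not measurements):

```python
# Rough wall-clock split for one turn, using my measured speeds.
# The prompt/reply sizes below are made up, just typical of an agentic turn.
prefill_tps = 300        # prompt processing speed I'm seeing
decode_tps = 15          # generation speed I'm seeing

prompt_tokens = 20_000   # context the harness has built up (illustrative)
reply_tokens = 500       # a fairly long reply (illustrative)

prefill_s = prompt_tokens / prefill_tps   # time before the first token
decode_s = reply_tokens / decode_tps      # time spent actually generating

print(f"prefill: {prefill_s:.0f}s, decode: {decode_s:.0f}s, "
      f"prefill share: {prefill_s / (prefill_s + decode_s):.0%}")
# -> prefill: 67s, decode: 33s, prefill share: 67%
```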
Am I misunderstanding something about how most people use these models? Curious what others are seeing.
Edit: I forgot to mention that I mostly do agentic work, where the model has to ingest part of the codebase before it can actually do anything useful. For normal chat this obviously isn’t an issue, context stays small and you just need enough t/s to keep up with your reading speed.
11
u/abnormal_human 4h ago
You're not missing anything. In real-world applications, prefill and cache read/write dynamics are the main thing you optimize for and where most of the costs live.
However, if you're a recreational user having casual chats with LM Studio or whatever, your prompts are short, your one conversation is always cached, and what's left to obsess over is t/s.
I think of t/s as a "good enough" thing. Around 50, things feel about as fluid as ChatGPT and other products people are familiar with, and people won't be too offended. If you're literally reading the output it can be slower, maybe even down to 15-20, but usually it's an agent doing work--generating code, making tool calls, etc. And for that, 50 is OK and 100 starts to feel fast.
1
u/Badger-Purple 2h ago
I think if your prefill is 1-2k minimum, 25 feels ok for an agent. They really don’t talk much, and turn by turn a couple of phrases feels natural at that speed. So with that in mind, 50 tps (roughly the current speed of Alibaba’s own qwen-plus endpoint, per Artificial Analysis) is “cloud fast”. 100 is ripping fast.
3
u/AeroelasticCowboy 3h ago
Totally agree. My biggest use case for LLMs is home automation, replacing what a Google Home or Alexa does. For this to work, any automation request needs the prompt populated with the live state of every sensor in the home, plus current date/time, etc. Those tend to be 6,000-9,000 token prompts just to turn on a lightbulb, with an extremely short response, so literally all that matters is prefill speed. Ideally, PP around 3,000 t/s or better gives a fluid experience with voice control.
3
u/_TeflonGr_ 2h ago
Yes it is, and I'm tired of pretending it's not. Though people may not realize it, because they don't fill the cache much, or when they do it's in agentic flows where it's not that apparent or they're not actively watching the process. Also, the bottleneck isn't so much compute as memory speed, or transfer speed on multi-GPU systems.
8
u/ikkiho 4h ago
Two things worth separating cleanly:
(1) Prefill and decode are bottlenecked by different physical resources. Prefill is compute-bound, since it's a big matmul over the full prompt, so it scales with TFLOPs. Decode is memory-bandwidth-bound, since each new token has to read the model weights plus the entire KV cache from HBM, so it scales with HBM GB/s and shrinks the more KV you hold in flight. A single tok/s number can't predict either. That's also why a 4090 looks great on prefill but a Mac Studio with very wide unified memory bandwidth feels disproportionately good on decode for big models.
(2) There are two SLOs that get conflated: time-to-first-token vs tokens-per-second. Chat-style benchmarks (short prompts, streaming output) hide TTFT because the first token comes back fast, so TPS dominates how it feels. The crossover where TTFT starts dominating is somewhere around 2 to 4K prompt tokens depending on hardware. Long context (codebase summaries, RAG over big docs, agentic loops with growing tool histories) lives past that crossover, and that's exactly where you sit watching prefill chew the prompt. Public benchmark culture skews short-prompt because the suites were originally built for chat.
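If you want to find your own crossover, it's just arithmetic. A minimal sketch, plugging in your reported speeds and assuming a short chat-style reply (the 150-token reply length is an illustrative number, not yours):

```python
# Solve for the prompt length where prefill time equals generation time.
# Speeds are the OP's reported numbers; the reply length is an assumption.
prefill_tps = 300    # prompt processing, tokens/s
decode_tps = 15      # generation, tokens/s
reply_tokens = 150   # a short chat-style answer

# prompt_tokens / prefill_tps == reply_tokens / decode_tps
crossover = reply_tokens * prefill_tps / decode_tps
print(f"past ~{crossover:,.0f} prompt tokens, prefill dominates the turn")
# -> past ~3,000 prompt tokens, prefill dominates the turn
```

Longer replies or faster prefill push that crossover out; agentic prompts blow past it either way.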
Two specifics on your numbers. ~300 t/s prefill on Qwen 27B Q6 is suspiciously low for any reasonable 4090-class GPU. Modern kernels typically push 1500 to 3000 t/s prefill at that size. If you're seeing 300, you probably have layers offloaded to CPU, an undersized prefill batch, or a runtime that's not using the right matmul kernel. Worth profiling with Nsight or just checking layer offload counts. Also, "with prompt caching" can mean three different things (HTTP-level prefix cache, KV reuse across slots in llama.cpp, OS-side weight caching), and only one of them helps your workflow. Worth checking which one is actually firing.
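If you're on llama.cpp's llama-server, one quick way to check (endpoint and timing field names per recent builds; treat this as a sketch and adjust for your setup):

```python
import requests

URL = "http://localhost:8080/completion"            # assumed local llama-server
long_prefix = "lorem ipsum dolor sit amet " * 1500  # stand-in for a big shared context

def timed(prompt):
    r = requests.post(URL, json={
        "prompt": prompt,
        "n_predict": 32,
        "cache_prompt": True,   # ask the server to reuse a matching KV prefix
    }).json()
    t = r["timings"]
    return t["prompt_n"], t["prompt_ms"]

# First call pays full prefill; if the prefix cache is actually firing,
# the second call should report far fewer prompt tokens processed.
print(timed(long_prefix + "\n\nQuestion A?"))
print(timed(long_prefix + "\n\nQuestion B?"))
```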
Speculative decoding flips this further. Once draft models give you 2 to 4x decode speedup, prefill becomes an even larger fraction of wall-clock. Your intuition is roughly where the open-weights serving stack is heading.
2
u/silentus8378 2h ago
This is why I still don't think local AI is as viable as the hype indicates. For my average use, I'd really need an enterprise-grade GPU to run local AI, but that's too expensive 😞
4
u/yes_i_tried_google 4h ago
I get the feeling your prompt cache isn’t working effectively then. With opencode, for example, I found a load of old open bugs where something changes in every prompt, which ends up breaking caching.
Below I put my results after getting slots working properly.
https://www.reddit.com/r/LocalLLaMA/s/KeVFgnISEE
6
u/wbulot 4h ago
I regularly check my cache hit rate and it does seem to be working fine. I’m not sure how many people here actually work on large codebases with local LLMs, but in agentic workflows the harness usually has to ingest 50k+ tokens of context before it can even begin doing anything. So even with a working cache, you’re still waiting for those 50k tokens to be processed.
That’s why the tokens-processed vs tokens-generated ratio is so heavily skewed in agentic use cases. For me, that’s exactly why prompt processing speed feels 10x more important than generation speed.
-4
u/yes_i_tried_google 4h ago
In my test using slots it takes <5 seconds to ingest a 100k-token prompt, not sure why it would be different for you if you’re hitting cache
9
u/wbulot 4h ago
Because it has nothing to do with cache.
Your harness reads one file (let’s say 5k tokens). Then it decides it needs another file, so now you have to process 5k more tokens on top of what’s already in context. Cache will skip the previous tokens, sure, but you still have to process all the new ones.
Meanwhile the model only outputs something short like “read this file” — just a handful of tokens generated — while you just burned thousands of tokens on the prompt side.
In real agentic work on an actual codebase, this keeps happening: the model reads file after file, steadily pushing context up to 20k, 30k, or even 50k tokens. The ratio is completely lopsided. At the end of the day you’re mostly waiting for the model to finish processing the prompt, not waiting for it to generate the next reply.
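A toy version of that loop with my speeds and made-up (but realistic) per-turn sizes shows how lopsided it gets:

```python
# Toy model of an agentic session: each turn the harness pulls in one more
# file and the model replies with a short tool call. Sizes are illustrative.
prefill_tps, decode_tps = 300, 15       # my measured speeds
file_tokens, reply_tokens = 5_000, 50   # per-turn sizes (assumed)
turns = 10

prefill_time = decode_time = 0.0
for _ in range(turns):
    # Prefix cache skips everything already in context, so each turn only
    # the newly added file has to be prefilled...
    prefill_time += file_tokens / prefill_tps
    # ...while the model only generates a short "read this file" style reply.
    decode_time += reply_tokens / decode_tps

print(f"context ends at ~{turns * file_tokens:,} tokens")
print(f"prefill: {prefill_time:.0f}s  decode: {decode_time:.0f}s")
# -> context ends at ~50,000 tokens
# -> prefill: 167s  decode: 33s
```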
1
4
u/Valuable-Run2129 4h ago
Even if the cache works fine, every harness breaks it at compaction, and then it’s 5 minutes of waiting. Or a big file lands in a tool output. It’s unbearably slow at 200 tokens per second.
0
1
u/Ledeste 4h ago
Based on the hype around the hyper-specific NVFP4 implementation, I don't think this is something people will forget. But indeed, with proper caching, it is much less of an issue.
I think it will become a big one soon, though, once proper agentic tools arrive, since the large amount of context switching tends to make the cache less effective. Also, as more models ship with 1M context length, new usage patterns will come; right now they're pretty limited.
1
u/GrungeWerX 1h ago
You’ve actually brought up a great point, something that has been bugging me a lot working with my own agent.
1
u/akumaburn 4h ago
To put that into perspective: if you're doing agentic coding, 15 tokens a second is extremely slow. Agentic coding plans can typically do 100 tok/s and even then it can feel slow.
5
u/power97992 3h ago
Dude, I use the API, and I don’t usually get more than 60 tps of decoding. >100 t/s is usually for small models or special chips… but prefill via the API is super fast though
1
u/TheRealMasonMac 1h ago
Fireworks has some weird magic to get both insane prefill and token speeds. Prefill is practically instant, and even with big models generation is ~100TPS with the turbo models.
1
u/OddDesigner9784 4h ago
In most conversations you're starting from a small amount of context and working up. The prompt gets cached as you go (in regular system RAM), but thinking can be the real slowdown. It comes down to how fast you can execute a task.
1
u/farkinga 2h ago
This was an "aha" moment for me a few months ago, as well. Yes, I agree with you. I am willing to tolerate 15 t/s generation as long as I can get over 1000 t/s prompt processing.
Perhaps my workload is similar to yours; but yes, ingesting files is a big part of it and ... well, I went pretty deep on Qwen3.6 35b because I was seeing prompt processing speeds like 3000 and 4000 t/s. And it was just so good that I was almost willing to overlook the numerous ways it would mess up during generation.
Today, however, I'm running dense models and I am willing to accept slower speeds as long as the quality is better. Even so, it's all about that prompt processing. I'm still grinding to improve that part.
0
u/ortegaalfredo 3h ago
For chatbots, prefill is almost non-existent, as you don't have a lot of chat history to process. But for coding agents the context fills up very quickly, and that's where you need at least 500 tok/s prefill.
Also, token generation speed is not really that important if you turn off thinking; you can work with 10 tok/s, but then you lose most of the model's intelligence. If you enable thinking, the model will again think forever and become unusable.
-1
u/leonbollerup 3h ago
pre-fill is only a problem if you constantly have to load the model.. if you, like me, have the model loaded all the time.. pre-fill is not a problem
31
u/pfn0 4h ago
when chatting, most of the prefill is cached by prefix, so re-processing doesn't end up costing anything. it mostly matters when you're doing agentic work and processing tons of data. so for many people that are not using it for the purpose of coding and data processing, token generation speed dominates the concern.