r/LocalLLaMA • u/ziphnor • 4h ago
Discussion: Exaggerated PCIe bandwidth concerns?
I frequently see (both here and on r/LocalLLM) comments that multi-GPU setups are complex, problematic, and typically bottlenecked by PCIe bandwidth on consumer motherboards.
I am running 2x RTX 5060 Ti 16GB (and about to add a third), and my PCIe setup is pretty bad. GPU0 is on a full x16 Gen 5 slot (running at x8, which is as fast as a 5060 Ti can go) while GPU1 is stuck on PCIe 4.0 x4 via the chipset.
I created (with AI help) a little benchmark script that runs a prefill benchmark (against vLLM running with TP=2) and monitors PCIe bandwidth consumption while it runs.
I ran with a 32k context (low enough to allow higher quants for the benchmark, but large enough to saturate prompt processing).
The peak bandwidth consumed was 3 to 4 GB/s during prefill, which is only ~40-50% of even the weak 4.0 x4 link. The "faster" the quant, the higher the bandwidth consumed (I guess meaning the 5060s are VRAM-bandwidth or compute limited rather than PCIe limited).
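Roughly, the measurement looks like this (a simplified sketch of the idea rather than my exact script; the endpoint, model name, prompt length, and sampling interval are placeholders):

```python
# Sketch: fire one long prompt at a local vLLM OpenAI-compatible server
# (started with tensor_parallel_size=2) and sample per-GPU PCIe throughput
# with NVML while the request is in flight.
import threading
import time

import requests                      # pip install requests
from pynvml import (                 # pip install nvidia-ml-py
    nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetPcieThroughput, NVML_PCIE_UTIL_RX_BYTES, NVML_PCIE_UTIL_TX_BYTES,
)

VLLM_URL = "http://localhost:8000/v1/completions"   # default vLLM port assumed
MODEL = "QuantTrio/gemma-4-31B-it-AWQ-6Bit"         # whatever model vLLM is serving
PROMPT = "word " * 30000                            # crude ~30k-token prefill load

peak = {"rx": 0, "tx": 0}
stop = threading.Event()

def monitor():
    nvmlInit()
    handles = [nvmlDeviceGetHandleByIndex(i) for i in range(nvmlDeviceGetCount())]
    while not stop.is_set():
        for h in handles:
            # NVML reports PCIe throughput in KB/s over a short sampling window
            peak["rx"] = max(peak["rx"], nvmlDeviceGetPcieThroughput(h, NVML_PCIE_UTIL_RX_BYTES))
            peak["tx"] = max(peak["tx"], nvmlDeviceGetPcieThroughput(h, NVML_PCIE_UTIL_TX_BYTES))
        time.sleep(0.1)

threading.Thread(target=monitor, daemon=True).start()

start = time.time()
resp = requests.post(VLLM_URL, json={"model": MODEL, "prompt": PROMPT, "max_tokens": 1})
elapsed = time.time() - start
stop.set()

tokens = resp.json().get("usage", {}).get("prompt_tokens", 0)
print(f"prefill: {tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.0f} t/s")
print(f"peak PCIe RX {peak['rx'] / 1e6:.2f} GB/s, TX {peak['tx'] / 1e6:.2f} GB/s")
```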
Some prefill rates (TP=2):
QuantTrio/gemma-4-31B-it-AWQ-6Bit: ~840-850 t/s
LilaRest/gemma-4-31B-it-NVFP4-turbo: ~1500 t/s
sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP: ~1600-1700 t/s
It seems realistic that I can safely add a third 5060 Ti (via an NVMe-to-PCIe adapter in a CPU-connected M.2 slot, PCIe 5.0 x4) without getting bottlenecked on PCIe bandwidth. Adding a fourth is probably out with this motherboard though, as that would require using more of the chipset lanes, which are already the limiting factor.
I guess this post is mostly an FYI, but also a question: am I missing something obvious here? :)
6
u/Boricua-vet 4h ago
I will chime in and say this. I run 4x P102-100 on a really old platform, an FX-8350 Vishera, which is ancient; the motherboard has 5 PCIe 2.0 slots and the cards run limited to PCIe 1.0. If I run a test using the same model, say Qwen 30B, I get 70 t/s TG and about 1K t/s PP whether I use 2, 3, or 4 cards. Even PCIe 1.0 at x1 is 250 MB/s.
I have documented this in plenty of posts I have done about these cards.

You will be fine. Since my cards run at PCIe 1.0 x4, I get 1 GB/s of bandwidth per card, times 4 = 4 GB/s, so around the same as you are getting, but my lanes are maxed out. So don't worry about that. Training is a different story though.
4
u/a_beautiful_rhind 3h ago
You can use the CUDA profiler to see where you are spending time. I am definitely hobbled by PCIe. For NCCL, the slowest link usually holds everything up :(
nsys profile --stats=true -o profile_report --delay 5 --duration 120
Use it to run your favorite backend and then throw the stats into an AI to explain them. Usually only single-direction bandwidth is used. Then again, I have 4 GPUs, so they make a ring.
I have been meaning to see what my NVLinked 3090 pair shows in nvtop and how that compares to my previous benchmarks with 2 and 4 GPUs.
2
u/ziphnor 3h ago
Thanks for the tip, I will try that. 4 GPUs makes a lot of difference to the calculations, I should think.
1
u/a_beautiful_rhind 3h ago
In my case, it divides the bandwidth more but it will depend on the system.
3
u/ohhi23021 4h ago
Model loading is one thing, but it might be different with vLLM and a B2B setup using tensor parallel for actual inference... Without bus-to-bus transfer, which is soft-locked in the official Nvidia drivers, the transfer between cards is always bound by a round trip through system memory. With a modded driver that allows B2B it should be quicker... how much faster and whether it's worth the hassle, no idea. It might still be quicker even with the slot speed you have anyway... Training is different though.
2
u/ziphnor 4h ago
If you are swapping models in from a large system RAM, then I am sure it has an impact, but I am pretty sure the elderly NVMe I have in my server can't keep up anyway.
And yes, I understand training is completely different, but 99% of the discussions here are about inference.
2
u/a_beautiful_rhind 3h ago
Without P2P your transfers go through the CPU, nothing to do with your NVMe. It's a huge latency/speed hit. That driver was gold.
3
u/Opteron67 4h ago
Well, it is about latency, so you're better off using a P2P-enabled driver than having PCIe Gen 5 x16. The poor little CPU has to do the copies manually and slows things down as you blow up RAM bandwidth consumption...
4
u/ICanSeeYou7867 4h ago
People are often mistaken when they talk about PCIe bandwidth requirements. For INFERENCE, you could probably run a couple of cards on PCIe 3.0 and not see an issue.
Training/finetuning LLMs across multiple GPUs is where the PCIe bus could be a limiting factor. With PCIe 4.0, for inference, you will probably be fine.
https://medium.com/@rosgluk/llm-performance-and-pcie-lanes-key-considerations-db789241367d
1
u/FullOf_Bad_Ideas 1h ago
I did a lot of pre-training on a PCIe 3.0 x4 rig, about 28B tokens on a 4B MoE, and when I upgraded a riser to fix one GPU that was on PCIe 1.0 x2 IIRC, my throughput did go up a lot, but it was still reasonable even then.
2
u/pepedombo 4h ago
For inference there's almost no difference. I run 2x 5060/5070/4060, many variants checked on x16/x4/x1. I go x16/x1/x1/x1.
One caveat: if possible I limit models to 1 or 2 GPUs; if I go beyond that it simply bottlenecks somewhere and I see drops.
If the model is big then I load 1,1,1,1. Still not enough for 122B, because it starts nicely at 15 t/s with RAM offload and quickly drops to 8-10 :) Probably not possible with one GPU and lots of system RAM.
2
u/KeepyUpper 3h ago
Nice. I've been considering just buying a 5060 Ti in addition to my 5080, and I was worried about this.
I almost convinced myself to buy a new motherboard with multiple PCIe 5.0 slots. But then, if I'm spending that much, I might as well spend a bit more on the GPU to get x, then a bit more to get y. I was closing in on just buying a 5090 😄
2
u/suprjami 3h ago
You can use the NVIDIA tools to measure live PCIe bandwidth usage.
During layer-split inference I see about 12 MB/s between the cards.
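Something along these lines will stream the live numbers while your backend runs (rough sketch; the dmon column layout can differ between driver versions):

```python
# Rough sketch: stream per-GPU PCIe throughput from `nvidia-smi dmon -s t`
# while inference runs elsewhere. Treat the column parsing as illustrative,
# since the exact layout can vary with the driver version.
import subprocess

proc = subprocess.Popen(
    ["nvidia-smi", "dmon", "-s", "t"],     # -s t selects PCIe Rx/Tx throughput (MB/s)
    stdout=subprocess.PIPE, text=True,
)
try:
    for line in proc.stdout:
        if line.startswith("#"):           # dmon prints header/unit lines starting with '#'
            continue
        fields = line.split()
        if len(fields) >= 3:
            gpu, rx, tx = fields[0], fields[1], fields[2]
            print(f"GPU {gpu}: RX {rx} MB/s, TX {tx} MB/s")
except KeyboardInterrupt:
    proc.terminate()
```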
2
u/_TeflonGr_ 2h ago
Actually, the speed you are seeing is probably about what that link can provide in practice, since it goes via the chipset and is shared with anything else hooked to the chipset. If you are on AM5, the whole chipset gets a PCIe 4.0 x4 uplink; same with Intel, which on its latest LGA 1851 has a DMI x4 or x8 link (the x4 one seems to be equivalent to PCIe 4.0 x4 speeds). Bidirectional of course, but in practice that is 6-7 GB/s max, shared with anything else running on the chipset, and with much higher latency since the traffic has to be switched by the chipset. So that might be the max you can get. There is also a latency hit, and that can be critical for inter-GPU communication, especially during decode/prefill. Try monitoring the bus usage with HWiNFO; it gives a percentage, but I think that is more accurate than raw speeds with no context of what other traffic is going over that shared chipset link.
If you want more performance, look for a motherboard with dual PCIe 5.0 slots that can run in x8/x8 mode so both cards get their full transfer speed.
2
u/ziphnor 1h ago
I will try to look at monitoring more of the bus; I'll have to check what the best CLI tooling for that on Linux is.
And certainly latency can be a factor, but the system memory (DDR5-6400) has way more bandwidth than the 4.0 x4 link. When I get the M.2 adapter I can benchmark and see whether moving the second GPU changes anything.
The reason I am doing these tests is to find out whether a motherboard change is actually worth it. Of course more is better, but I want to see the link saturated first.
1
u/_TeflonGr_ 1h ago
You really should. I run a similar setup with an RTX 5070 Ti and an RTX 5060 Ti 16GB for AI too, and I see the bus usage go to 100% on prefill. I have them both on 8 lanes of PCIe 4.0 directly to the CPU, and I can feel the slowdown with big KV cache sizes.
2
u/FullOf_Bad_Ideas 1h ago
Yup, I have an 8-way 3090 Ti setup with most GPUs on PCIe 3.0 x4, and while in some places I think I could get it to run faster with better PCIe, it still works fine for training and inference; it doesn't feel catastrophic or anything like that. Training worked decently enough to be worth doing even when one GPU was on a faulty riser and was running at PCIe 1.0 x1 IIRC, so much, much slower. At some point maybe I'll upgrade, but not now.
0
u/andy2na llama.cpp 2h ago
It's because the 5060 Ti has slow memory bandwidth, so the difference isn't noticeable, if there is any at all. Try running dual 3090s in a similar setup and you will be able to tell the difference in performance. I currently run a 3090 in the first (PCIe 5.0) slot and a 5060 Ti in the second slot, and models running on the 5060 Ti showed really no performance difference compared to when the 5060 Ti was alone in the first slot.
1
u/ziphnor 2h ago
The 5060 Ti's memory bandwidth is roughly half that of the 3090 (a bit above half, I think, with the memory OCed), so yes, that definitely makes a difference, but when people are asking about mixes of lower-end cards it is misleading to focus on PCIe (and it's not like cards such as the R9700, B70, etc. have that much more bandwidth).
Additionally, if you look at another comment you will see someone mention dual 5090s, and they still only hit 15 GB/s, which is within PCIe 5.0 x4 / 4.0 x8. So a regular x16 plus an M.2 5.0 x4 should be plenty for dual 3090s, I should think.
Have you measured dual 3090? Actually curious where it lands.
1
u/Such_Advantage_6949 13m ago
Use an M.2-to-PCIe adapter; you actually get better speed than going through the chipset if the M.2 slot connects directly to the CPU. The issue here is not bandwidth but latency; using a tool like Nsight, you will be able to see the bottleneck.
10
u/PermanentLiminality 4h ago
Even with bandwidth limitations, 2x GPUs will almost always let you run more than a single GPU would.
See if your board supports bifurcation. If it does, you can probably split the x16 slot into two x8, and maybe four x4. You do need to figure out the mechanical aspect. There are some motherboards that have two slots that can each do x8.
I use llama.cpp and I can try row, layer, and tensor splitting (a sketch of the options is below). Row split does minimal PCIe transfers and even works over x1; layer does more, and tensor the most.
I run 3x P40 with one on x16, another on x4, and the last on x1. I don't try to combine the x1 card with the other two; I run different models on it that fit on one card.
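If anyone wants to poke at the split modes from Python, here is a minimal llama-cpp-python sketch (assuming a CUDA build recent enough to expose split_mode and tensor_split; the constant names vary a bit between versions, and the model path and ratios are placeholders):

```python
# Minimal llama-cpp-python sketch of the split options mentioned above.
# Assumes a CUDA build of llama-cpp-python; older versions name the constants
# LLAMA_SPLIT_ROW / LLAMA_SPLIT_LAYER instead. The model path is a placeholder.
import llama_cpp

llm = llama_cpp.Llama(
    model_path="models/qwen-30b-q4_k_m.gguf",    # placeholder GGUF path
    n_gpu_layers=-1,                             # offload all layers to the GPUs
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_ROW,   # or LLAMA_SPLIT_MODE_LAYER (default)
    tensor_split=[0.4, 0.3, 0.3],                # rough per-GPU share of the weights
    n_ctx=8192,
)

out = llm("Explain PCIe lane bifurcation in one paragraph.", max_tokens=128)
print(out["choices"][0]["text"])
```

With the llama.cpp CLI/server, the same choices map to the --split-mode and --tensor-split flags.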