I have:
CPU: AMD Ryzen 7 3700X
Motherboard: ASUS ROG X570 Crosshair VIII Hero
Memory: 48GB DDR4 3600 (RIP dead 16GB stick)
GPUs: 3x RTX 3090 (72GB VRAM)
PSU: Seasonic Prime TX-1600
Case: Lian Li O11 Dynamic Evo XL
I’ve run small LLMs locally on my 3090 to ask dumb Linux questions without Sam Altman knowing how bad at computing I am. Seeing the apparent gap between deepseek-coder-33b-instruct and Claude eventually got me itching for more VRAM, so I added a second 3090. Then I came to the conclusion that jamming more GPUs into my midrange gaming PC is perhaps not the optimal bang-for-buck way to blow money on computing power, and I planned a big fancy AMD Epyc build to replace my existing home servers.
Then September-November 2025 happened and the memory to build such a server suddenly costs like $7000.
So to make myself feel better, I bought another 3090 for $700 instead. Surely more VRAM will make me happier. And hey, now I can run GLM-4.5 Air at like 10t/s instead of 6t/s, I guess.
No, this is a mediocre idea at best. One GPU and lots of system RAM? Absolutely, that’s a good idea! Separate LLM server that isn’t your desktop PC? Brilliant!
This is more “What can I expect if I do jam a GPU into all my PCIe slots?”
This Level1techs thread got me most of the way up and running. You have to build ik_llama.cpp, a llama.cpp fork with better hybrid CPU/GPU inference.
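If you’ve built llama.cpp before, the fork builds the same way. This is roughly the CMake flow I’d expect to work (a sketch; flag names can drift between versions, so check the fork’s README):

git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
# CUDA build; -DGGML_CUDA=ON is the usual llama.cpp flag for Nvidia GPUs
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j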
For the model, I downloaded the IQ1_S quant of DeepSeek-R1-0528 here. Seriously, a 1-bit quant and it’s kind of usable.
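For the download itself, something like huggingface-cli works. The repo id below is a placeholder; point it at whichever Hugging Face repo actually hosts the quant:

# <repo-id> is a placeholder; substitute the repo hosting the IQ1_S GGUF split
huggingface-cli download <repo-id> \
  --include "*IQ1_S*.gguf" \
  --local-dir models/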
This is my llama-server command:
./build/bin/llama-server \
--model models/DeepSeek-R1-0528-IQ1_S-00001-of-00003.gguf \
--alias DeepSeek-R1 \
-mla 3 -amb 512 \
-ctk q8_0 \
-c 32768 \
--n-gpu-layers 99 \
-ts 20,20,21 \
-ot "blk\.([0-9]|1[0-2]|2[0-9]|3[0-2]|4[0-9]|5[0-3])\..*exps.*=CPU" \
--parallel 1 \
--threads 8 \
--host 127.0.0.1 \
--port 8080
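Once it’s up, a quick smoke test (as far as I can tell the fork keeps llama.cpp’s OpenAI-compatible endpoint):

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "DeepSeek-R1", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'

A few of those flags are worth explaining.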
--n-gpu-layers 99 just means “all the GPU layers”, since DeepSeek only has 61.
-ts 20,20,21 is a tensor split that evenly-ish divides layers across the 3 GPUs. There are 61 layers in this model.
-ot is a regex that offloads specific expert tensors to the CPU. Everything else is loaded onto the GPUs. Because of the way tensors are split, blk0-19 end up on CUDA0 (the first GPU), blk20-39 are on CUDA1, and blk40-60 are on CUDA2, or close enough to that anyway.
This mess of a regex offloads blk.0-12, blk.20-32, and blk.40-53 to CPU. That loads 22-23GB onto each GPU. It only matches exps tensors, which are experts that aren’t always used and apparently ideal for offloading to the CPU. You can offload more individual tensors to really fill up the GPUs (e.g. just blk.13.ffn_up_exps.weight will probably fit in CUDA0, even if all of blk.13 doesn’t fit), but that ran so slowly that I didn’t bother.
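If you want to double-check what a regex like that actually matches before waiting on a model load, a quick shell loop does it (it just prints the expert tensor names that would land on the CPU):

for i in $(seq 0 60); do
  echo "blk.$i.ffn_up_exps.weight"
done | grep -E 'blk\.([0-9]|1[0-2]|2[0-9]|3[0-2]|4[0-9]|5[0-3])\.'
# prints blk.0-12, blk.20-32, and blk.40-53: experts from 40 of the 61 layers stay on CPU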
~ $ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:05:00.0 Off | N/A |
| 0% 30C P8 20W / 350W | 22722MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 Off | 00000000:0B:00.0 On | N/A |
| 53% 35C P8 52W / 390W | 22421MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 3090 Off | 00000000:0C:00.0 Off | N/A |
| 0% 43C P8 44W / 390W | 22512MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Not really. I’d consider 6 tokens per second an acceptable reading pace, and for code and long thinking sections you want something much faster than that. I asked it some questions about common CLI tools and got ~2.2 tokens per second. It feels a little amazing that my office PC can hit any number at all on a 671 billion parameter model, but I don’t see myself turning my PC into a space heater to seriously use this.
I also pasted 13,000 tokens of terminal output and asked it to summarize it, and it took an incredible 19 minutes just to process the prompt before replying (that works out to roughly 11 tokens per second of prompt processing), so I think the VRAM I’m wasting on 32k of context is quite excessive.
I use LACT to pseudo-undervolt my RTX 3090s (since the Nvidia driver doesn’t support voltage control).
Maximum GPU Clock: 1875MHz
GPU P-State X Clock Offset: +150MHz
VRAM P-State X Clock Offset: +1700MHz
I just stole this guy’s config and it seems fine. It stops my fans from roaring during inference and probably saves on my electrical bill.
# Install LACT
sudo pacman -S lact
# Enable the lactd system service
sudo systemctl enable --now lactd
Set up the new clocks in the "OC" tab of the LACT GUI (click all the checkboxes to unhide the settings), then click Apply.
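If you’d rather skip the GUI, recent drivers also let you cap the core clock with nvidia-smi (this won’t apply the offsets, so it’s only a rough equivalent):

# Cap GPU 0's core clock at the same 1875MHz ceiling (needs root)
sudo nvidia-smi -i 0 -lgc 0,1875
# Reset to stock clock behavior later
sudo nvidia-smi -i 0 -rgc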
I have a huge PC case with great airflow, but even with just two GPUs mounted directly on the motherboard, the fans get pretty noisy. I keep one GPU on the motherboard and another mounted vertically, which keeps them quiet.
When I added a third GPU, I used a riser cable and keep it outside my case. It feels a little ridiculous to have such an enormous PC case and not put my components inside it, but it keeps the fan noise down.
The second issue is gaming. With 2-3 GPUs plugged in, many games run poorly. I could not find any way to disable the extra GPUs in the BIOS, meaning the only way to get full performance back is to physically unplug them. That is another reason the extra cards aren’t mounted on the motherboard: it’s easy to unplug the riser cables.
In a pinch, I can unbind the driver from my other GPUs with the commands below, which improves performance. However, this doesn’t free up the PCIe lanes, so it’s not a complete fix:
lspci | grep VGA
05:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
0b:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
0c:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
# Unbind the driver for GPU2
echo 0000:0b:00.0 | sudo tee /sys/bus/pci/devices/0000:0b:00.0/driver/unbind
# Bind the driver for GPU2 again (the device's driver/ symlink disappears after unbinding, so rebind through the nvidia driver's bind file)
echo 0000:0b:00.0 | sudo tee /sys/bus/pci/drivers/nvidia/bind
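If you do this a lot, a small wrapper script is handy. This is just a sketch using the bus IDs from lspci above; adjust for your system:

#!/usr/bin/env bash
# Toggle the secondary 3090s off (unbind) or on (rebind) around gaming sessions
GPUS=(0000:0b:00.0 0000:0c:00.0)
case "$1" in
  off) for g in "${GPUS[@]}"; do echo "$g" | sudo tee "/sys/bus/pci/devices/$g/driver/unbind"; done ;;
  on)  for g in "${GPUS[@]}"; do echo "$g" | sudo tee /sys/bus/pci/drivers/nvidia/bind; done ;;
  *)   echo "usage: $0 on|off" >&2; exit 1 ;;
esac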