Skip to main content

Running OpenClaw with Local GPU Inference on LiquidWeb (2026)

· 8 min read
Yassine El Haddad
Software Developer & Automation Specialist

I build production AI agents, web scrapers, and automation pipelines. Most of what I publish here comes from the actual problems they run into: proxies that get banned, anti-bot stacks that fingerprint your client, RAG that drifts when the underlying data moves. Stack: Python, TypeScript, Go, FastAPI, LangChain, Crawlee, Playwright, deployed on AWS, GCP, and Cloudflare.

Self-hosting OpenClaw with a cloud API backend is easy. But cloud APIs have costs that scale with usage, and they receive every message you send. If your team uses OpenClaw heavily, or if data privacy is a concern, local GPU inference solves both problems: your data stays on your hardware, and you pay a flat server rate instead of per-token fees.

This guide covers how to choose a LiquidWeb GPU server, set up Ollama or vLLM, and connect it to your OpenClaw instance.

When Local GPU Inference Makes Sense

Local inference is not always the right choice. Here is how to decide:

Stick with cloud APIs if:

  • Your team sends fewer than 500K tokens/day
  • You want frontier model quality (GPT-4o class) without managing hardware
  • Your budget is under $100/mo

Switch to local GPU inference if:

  • You send 500K+ tokens/day (L4 pays off vs Claude Sonnet)
  • Data privacy requires that messages never leave your servers
  • You want to use open-source models not available via cloud APIs
  • You need to run NemoClaw with a local vLLM provider

LiquidWeb GPU Tier Guide

All LiquidWeb GPU servers are single-tenant bare metal — no shared virtualization, full GPU access, pre-installed Docker with NVIDIA Container Toolkit.

NVIDIA L4 Ada (24 GB VRAM) — $0.80/hr

Best for: Development, 1–3 person teams, models up to 13–24B parameters.

The L4 is NVIDIA's data center inference chip, not a training GPU. It handles:

  • Llama 3.2 8B at full precision (fast, ~85 t/s)
  • Mistral NeMo 12B at Q8 (strong quality, ~55 t/s)
  • Llama 3.3 70B at 2-bit quantization (acceptable quality, 8 t/s)

At $0.80/hr and hourly billing, you can spin it up for a workday and shut it down when not in use. Monthly full-time cost: ~$576.

NVIDIA L40S Ada (48 GB VRAM) — $1.44/hr

Best for: Production teams, 70B models at good quality.

The L40S doubles the VRAM of the L4, which matters a lot for model quality. At 48 GB you can run:

  • Llama 3.3 70B at 4-bit (Q4_K_M): ~40 GB used, ~18–25 t/s
  • Llama 3.3 70B at 8-bit (Q8_0): ~70 GB (does not fit; use H100)
  • Mistral NeMo 12B at full precision: very fast (~120 t/s)

This is the sweet spot for most production OpenClaw deployments: strong model quality, reasonable cost, handles concurrent sessions.

NVIDIA H100 NVL (94 GB VRAM) — $2.98/hr

Best for: 70B at full precision, NemoClaw with Nemotron 120B, fine-tuning.

At 94 GB VRAM:

  • Llama 3.3 70B at Q8 (full quality): fits, runs at ~40–55 t/s
  • Nemotron 120B Super (NemoClaw default): fits at Q4
  • Concurrent vLLM sessions: excellent throughput with PagedAttention

The H100 is the correct tier if you use NemoClaw's NIM or vLLM providers at scale.

NVIDIA H200 NVL (141 GB VRAM) — $3.87/hr

Best for: 120B+ models at full precision, multi-user high-throughput, fine-tuning.

The H200 is NVIDIA's highest-memory inference chip as of March 2026. It handles models that do not fit anywhere else: Llama 405B at Q4, multi-modal 70B vision models, and long-context workloads with very large KV caches.

Provisioning Any Tier

  1. Go to LiquidWeb GPU hosting
  2. Select your tier
  3. Choose Ubuntu 22.04 LTS
  4. Add your SSH key
  5. Check out — provisioning takes ~15 minutes

SSH in and confirm GPU access:

ssh root@YOUR_GPU_IP
nvidia-smi

You should see your GPU, driver version, and VRAM reported.

Ollama is the simplest local inference server. It downloads models on demand, handles quantization automatically, and serves an OpenAI-compatible API.

Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

Verify GPU access:

ollama serve &
sleep 2
nvidia-smi # Should show Ollama consuming GPU memory

Choose and Pull a Model

For L4 (24 GB VRAM):

ollama pull llama3.2:latest # 8B, fast and capable
ollama pull mistral-nemo:12b # 12B, strong for instruction tasks

For L40S (48 GB VRAM):

ollama pull llama3.3:70b # 70B Q4 — near-frontier quality

For H100 (94 GB VRAM):

ollama pull llama3.3:70b-q8_0 # 70B Q8 — excellent quality

Connect OpenClaw to Ollama

Update ~/.openclaw/config.yaml:

llm:
provider: ollama
base_url: http://localhost:11434
model: llama3.3:70b

agent:
max_steps: 50
timeout_seconds: 600

If OpenClaw runs in Docker, use host.docker.internal instead:

llm:
provider: ollama
base_url: http://host.docker.internal:11434
model: llama3.3:70b

Restart the container and test in Telegram (or your connected platform):

cd ~/openclaw && docker compose restart

Benchmark Your Setup

Run a quick token throughput test:

time ollama run llama3.3:70b "Write a 500-word essay on distributed systems"

Expected throughput on L40S: 18–25 tokens/second. On H100: 40–55 t/s.

Option B: vLLM (For NemoClaw or High Concurrency)

vLLM uses PagedAttention to serve many concurrent requests efficiently. It is the correct choice if you use NemoClaw (which has a native vllm-local provider) or if multiple team members use OpenClaw simultaneously.

Get a HuggingFace Token

Most large models (Llama 3.3) require accepting a license on HuggingFace and using an API token. Set it up at huggingface.co/settings/tokens.

Start vLLM

export HF_TOKEN="hf_..."

docker run -d \
--name vllm \
--gpus all \
--restart unless-stopped \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HUGGING_FACE_HUB_TOKEN="${HF_TOKEN}" \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 32768 \
--tensor-parallel-size 1

The first run downloads model weights (~140 GB for Llama 3.3 70B). On LiquidWeb's 10 Gbps network, this takes 10–20 minutes.

Watch the startup:

docker logs -f vllm
# Wait for: INFO: Application startup complete.

Verify it is serving:

curl http://localhost:8000/v1/models

Connect OpenClaw to vLLM

llm:
provider: vllm
base_url: http://localhost:8000/v1
model: meta-llama/Llama-3.3-70B-Instruct

Add NemoClaw on Top of vLLM

If you are using NemoClaw, set the inference provider to vllm-local:

openshell inference set --provider vllm-local \
--model meta-llama/Llama-3.3-70B-Instruct

NemoClaw handles the rest — requests to the agent runtime are sandboxed, and inference goes to your local vLLM server.

Cost Analysis: When Does GPU Pay Off?

At $0.03 per 1,000 output tokens (Claude Sonnet pricing) and 8-hour workday usage:

Tokens/dayMonthly Claude costL4 cost/moL40S cost/mo
100K$90$576$1,037
500K$450$576$1,037
1M$900$576$1,037
2M$1,800$576$1,037

The L4 breaks even at roughly 600K output tokens/month compared to Claude Sonnet. If you are anywhere near that volume, local inference pays for itself. For high-volume teams (2M+ tokens/month), the savings are substantial.

Note: These numbers assume you are doing inference only. Cloud APIs still have an advantage for infrequent use since you pay nothing when idle.

Apify Affiliate Banner 728x90Apify Affiliate Banner 728x90Apify Affiliate Banner 300x50Apify Affiliate Banner 300x50
Hourly billing for bursty workloads

LiquidWeb bills GPU servers hourly. If your team only needs GPU inference during business hours, spin up the server at 9 AM and shut it down at 6 PM. Nine hours/day × 22 days = 198 hours/month × $0.80 = $158/mo for an L4 instead of $576 for 24/7.

Frequently Asked Questions

LiquidWeb's NVIDIA L4 Ada at $0.80/hr (~$576/mo 24/7). It has 24 GB VRAM and handles models up to 13B at full precision or 70B at heavy quantization. Billed hourly, so you can shut it down when not in use.

Ollama is easier to set up and great for 1–3 users. vLLM uses PagedAttention for higher concurrent throughput and is required for NemoClaw's native vllm-local inference provider. For personal use, start with Ollama.

Yes. Both Ollama and vLLM can run on the same GPU server as the OpenClaw gateway. Use localhost as the model endpoint in config.yaml. This is the simplest setup and works well unless the server is heavily loaded.

For most task types, Llama 3.3 70B (Q4 on L40S or Q8 on H100) gives near-frontier quality. For faster responses with slightly lower quality, Mistral NeMo 12B is excellent on the L4. Avoid models below 7B for complex agent tasks — they struggle with multi-step reasoning.

Yes. NemoClaw has a native vllm-local inference provider that routes requests to a local vLLM server. This is the recommended setup for NemoClaw deployments — it gives you sandboxed agent execution AND fully local inference.

Common mistakes and fixes

Ollama is running but OpenClaw cannot reach it from Docker.

When OpenClaw runs in Docker, use http://host.docker.internal:11434 as the Ollama base URL — not localhost. localhost inside a container refers to the container itself, not the host.

Model inference is slower than expected on the L4 tier.

Confirm the model is using the GPU: run nvidia-smi while inference is active — GPU-Util should show >0%. If it shows 0%, Ollama may have fallen back to CPU. Check: ollama ps (shows loaded models and their device). Ensure you have sufficient VRAM for the model — run a smaller quantization if needed.

vLLM container fails to start with CUDA error.

Verify the NVIDIA Container Toolkit is configured: docker run --rm --gpus all nvidia/cuda:12.1-base-ubuntu22.04 nvidia-smi. If that fails, reinstall the toolkit: sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker.