
Deploying Gemma 4 on Cloud Run GPUs: A Trip Report

What the codelab doesn't tell you. Five gotchas from an afternoon of getting Gemma 4 31B running on Cloud Run's new RTX PRO 6000 instances.

8 min read · by Lee Harden


Google quietly launched Cloud Run GPU support for NVIDIA RTX PRO 6000 (Blackwell) earlier this year, and with it a prebuilt vLLM image tuned for the flagship Gemma 4 31B model. The pitch is hard to argue with: a 31-billion-parameter frontier-ish model, scale-to-zero billing, no VM to babysit, no Kubernetes to run.

I spent an afternoon getting it working from scratch. The happy path in Google's official codelab is real, but it left out five gotchas that each cost me a deploy iteration or a puzzled stare at logs. This is the version I wish I'd read first.

Why Cloud Run GPU for this at all?

A 31B model needs serious silicon: ~60GB of weights, FP8 quantization, and enough VRAM headroom for a usable KV cache. The usual answer is a GCE VM with an A100 or L4 attached, billed by the minute whether you're using it or not. For a bursty workload — a Slack bot, an eval harness, a sometimes-on API — that's pure overhead.

Cloud Run GPU gives you the same hardware on a request-driven model. An instance spins up on the first request, serves until idle, then terminates. Your monthly bill tracks actual usage rather than "I forgot to turn the VM off last Tuesday."

The tradeoff is cold-start latency. More on that later.

The architecture

  HuggingFace  --one-time Cloud Build--> GCS bucket
                                          |
                                          | runai_streamer (streaming load)
                                          v
  Client  --HTTPS-->  Cloud Run (RTX PRO 6000, vLLM + Gemma 4 31B)

The weights are pre-staged to a GCS bucket once, then streamed into the Cloud Run container at startup via Run:ai's model streamer. This avoids re-downloading 58GB from HuggingFace every cold start, which would be both slow and bandwidth-expensive.

Step 1: Stage the weights to GCS

The codelab hands you a gcloud builds submit invocation that uses huggingface_hub to pull the model and uploads it to a GCS bucket. Cloud Build is the right tool here because the VM has fast disk and fast egress to GCS, both of which you don't get on a typical workstation.
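The staging step looks roughly like this. Bucket name, region, and the cloudbuild.yaml filename are placeholders of mine, not the codelab's exact values; the substitution variable names match the ones referenced later in this post.

```shell
# Placeholder project/bucket/region -- substitute your own.
export BUCKET=gs://my-project-model-weights

# One-time: create the bucket in the same region as the Cloud Run service,
# so the runai_streamer load at startup stays same-region.
gcloud storage buckets create "$BUCKET" --location=us-central1

# Kick off the Cloud Build job that pulls from HuggingFace and uploads to
# GCS. cloudbuild.yaml here is the codelab's config, minus the machineType
# line (see Gotcha #1).
gcloud builds submit --config=cloudbuild.yaml \
  --substitutions=_MODEL_NAME=google/gemma-4-31B-it,_GCS_MODEL_LOCATION="$BUCKET/model-cache"
```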

Gotcha #1: E2_HIGHCPU_32 has zero quota by default

The codelab sets machineType: 'E2_HIGHCPU_32'. In most projects/regions that quota is zero. You'll see:

FAILED_PRECONDITION: failed precondition: due to quota restrictions,
Cloud Build cannot run builds of this machine type in this region

You can request quota, or just omit the machineType line entirely. Default Cloud Build VMs are slower but always available. For a 58GB one-time download, default took ~12 minutes — fine.

Gotcha #2: gcloud storage cp -r drops the parent directory name

The build step ends with:

gcloud storage cp -r "./model-cache/$_MODEL_NAME" "$_GCS_MODEL_LOCATION"

With MODEL_NAME=google/gemma-4-31B-it, you'd expect the files to end up under gs://bucket/model-cache/google/gemma-4-31B-it/. They don't. cp -r copies only the final path component of the source (gemma-4-31B-it) into the destination; the google/ parent directory in the source path is not reproduced, so the files land at gs://bucket/model-cache/gemma-4-31B-it/ with no google/ prefix.

The codelab's downstream vllm serve command points at the prefixed path, so it 404s. The fix is either a second gcloud storage mv to add the prefix, or — simpler — just point vLLM at the path that actually exists:

vllm serve gs://bucket/model-cache/gemma-4-31B-it \
  --served-model-name google/gemma-4-31B-it ...

--served-model-name is a display alias, decoupled from the on-disk path; clients can still address the model as google/gemma-4-31B-it.
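The path-drop mirrors how plain cp -r behaves, which you can demonstrate locally without touching GCS. The /tmp paths here are throwaway examples for illustration, not anything from the codelab:

```shell
# Recreate the shape of the staged model directory locally.
rm -rf /tmp/cp-demo
mkdir -p /tmp/cp-demo/model-cache/google/gemma-4-31B-it
touch /tmp/cp-demo/model-cache/google/gemma-4-31B-it/config.json
mkdir -p /tmp/cp-demo/dest

# Copy the nested source recursively, as the build step does.
cp -r /tmp/cp-demo/model-cache/google/gemma-4-31B-it /tmp/cp-demo/dest/

# Only the leaf directory name survives; the google/ parent is gone.
ls /tmp/cp-demo/dest    # -> gemma-4-31B-it
```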

Step 2: Deploy to Cloud Run

This is the step where you meet the RTX PRO 6000. Key flags:

--gpu 1 --gpu-type nvidia-rtx-pro-6000 --no-gpu-zonal-redundancy
--cpu 20 --memory 80Gi
--no-cpu-throttling --cpu-boost
--max-instances 1 --concurrency 64
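Put together, the deploy command looks something like this. Service name, region, image reference, network names, and bucket path are placeholders of mine; the egress and probe values already reflect the fixes in Gotchas #3 and #4 below.

```shell
# IMAGE is a placeholder -- use the prebuilt vLLM image from the codelab.
IMAGE=us-docker.pkg.dev/YOUR_PROJECT/vllm/vllm-gemma:latest

gcloud run deploy gemma-31b \
  --image="$IMAGE" \
  --region=us-central1 \
  --gpu 1 --gpu-type nvidia-rtx-pro-6000 --no-gpu-zonal-redundancy \
  --cpu 20 --memory 80Gi \
  --no-cpu-throttling --cpu-boost \
  --max-instances 1 --concurrency 64 \
  --network=default --subnet=default \
  --vpc-egress=private-ranges-only \
  --startup-probe tcpSocket.port=8080,initialDelaySeconds=240,failureThreshold=160,timeoutSeconds=10,periodSeconds=15 \
  --command bash \
  --args='^;^-c;vllm serve gs://YOUR_BUCKET/model-cache/gemma-4-31B-it --served-model-name google/gemma-4-31B-it --port 8080'
```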

Gotcha #3: --vpc-egress all-traffic breaks GCS access

The codelab attaches the service to a VPC with --vpc-egress all-traffic. If you're using the default VPC (as most of us are), it has no Cloud NAT gateway. All outbound traffic — including to storage.googleapis.com — tries to route through the VPC, can't find a path to the public internet, and the Run:ai streamer dies with:

NewConnectionError: [Errno 101] Network is unreachable

Two fixes:

  1. Add a Cloud NAT to the VPC (production-grade, worth doing eventually).
  2. Change egress to --vpc-egress private-ranges-only so only RFC1918 destinations go through the VPC; public endpoints like GCS bypass the VPC and use Cloud Run's default internet egress.

For a first deploy I went with option 2. Worked immediately.

Gotcha #4: The codelab's startup probe is 6 minutes too short

vLLM's HTTP server only starts listening on port 8080 once the model is fully loaded. With 58GB streaming from same-region GCS via runai_streamer, that load takes ~20 minutes on a cold container.

The codelab's probe:

--startup-probe tcpSocket.port=8080,initialDelaySeconds=240,failureThreshold=40,timeoutSeconds=10,periodSeconds=15

That's 240 + (40 × 15) = 840 seconds = 14 minutes of total startup budget. You watch the logs tick up to 99% loaded, hit the 14-minute deadline, and Cloud Run kills the container. Ask me how I know.

Fix: bump failureThreshold to 160, giving 240 + (160 × 15) = 2640 seconds = 44 minutes. That's comfortable headroom. The probe only keeps polling while the container is starting, so over-provisioning it costs nothing.
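The budget arithmetic is worth scripting once so any future probe tweak can be sanity-checked before deploying. This is a throwaway helper of mine, not part of the codelab:

```shell
# Total startup budget = initialDelaySeconds + failureThreshold * periodSeconds
probe_budget() {
  local initial=$1 threshold=$2 period=$3
  echo $(( initial + threshold * period ))
}

probe_budget 240 40 15    # codelab default -> prints 840 (14 min): too short
probe_budget 240 160 15   # bumped threshold -> prints 2640 (44 min): safe
```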

Gotcha #5: bash -c with embedded semicolons needs gcloud's ^;^ syntax

The vLLM command line has a lot of flags, and gcloud run deploy needs them passed to the container as bash -c "<command>". The naive form breaks because gcloud parses commas as argument separators for --args.

The documented-but-obscure fix is gcloud's custom-delimiter syntax:

--command bash --args='^;^-c;vllm serve ... --max-num-seqs 8 ...'

The ^;^ at the start means "use ; as the delimiter for this argument list." So the container gets exactly two arguments: -c and the full vllm command. Without this you'll see gcloud complain about unrecognized flags, or — worse — it'll silently pass only part of the command.

Step 3: Confirm it works

Once deployed, hit /v1/chat/completions with an OpenAI-shaped payload. vLLM exposes the OpenAI API faithfully, so any OpenAI client library works out of the box. Three things worth verifying before you trust it:

  1. Basic chat completion. Send a short prompt, confirm you get coherent output.
  2. Tool calling. Gemma 4 supports tool use, but only if you start vLLM with --enable-auto-tool-choice --tool-call-parser gemma4. Send a request with a fake tool def and check that message.tool_calls comes back populated, not as text.
  3. Multi-turn history. vLLM is stateless — the client must send the full message history each turn. Confirm the model respects it (send "my lucky number is 42," then ask what it is).
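A minimal sketch of the first check, assuming a service URL of the usual Cloud Run shape (the URL below is a placeholder) and identity-token auth:

```shell
# Placeholder -- substitute the URL from `gcloud run services describe`.
SERVICE_URL=https://gemma-31b-xxxxxx-uc.a.run.app

curl -s "$SERVICE_URL/v1/chat/completions" \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31B-it",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }' | jq -r '.choices[0].message.content'
```

The first request after an idle period triggers the full cold start, so give it the probe budget's worth of patience before concluding anything is broken.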

After the cold-start penalty, warm latency is respectable: ~2s for a basic chat completion, sub-second for a tool-call emission, ~0.2s per follow-up turn once the KV cache is populated.

The honest numbers

  • 58GB of weights on disk (fp8 quantization is applied at load time; the on-disk format is bfloat16).
  • ~20 minutes cold-start while runai_streamer pulls from GCS.
  • ~2 seconds per warm chat completion on a short prompt.
  • ~$1.50/month for the GCS bucket.
  • ~$0.001/second for the GPU instance while warm (check current Cloud Run GPU pricing — it moves).
  • Scale-to-zero means the bill stops between requests. The 20-minute cold start is the price you pay for that.

For an interactive Slack bot or a nightly eval harness, the math is great. For a low-latency consumer API, you'd want min-instances above zero (which defeats the whole cost story).

When not to do this

  • If you need first-token latency under 5 seconds from a cold queue, the 20-minute cold start is a dealbreaker.
  • If you're pushing > 50 QPS sustained, you'll hit scaling limits fast — a dedicated GKE cluster or a Vertex endpoint makes more sense.
  • If you want fine control over the inference loop (custom samplers, speculative decoding knobs), wrapping vLLM in Cloud Run hides some of that.

What I'd change next time

The VPC dance is clunky. In a greenfield project I'd skip the VPC attachment entirely (drop --network and --subnet) and let Cloud Run use its default egress. private-ranges-only only mattered because I was using the default VPC for other services and gcloud didn't like detaching.

Also: the google/ prefix drop could be fixed at the Cloud Build step by piping through an mv or by building the GCS path with the prefix explicit. I'm leaving mine unprefixed for now; future deploys can copy the pattern.

tl;dr checklist

  1. Don't use E2_HIGHCPU_32 for the build VM unless you've already got quota.
  2. Stage weights to GCS, confirm the actual path doesn't include the google/ prefix.
  3. Deploy with --vpc-egress=private-ranges-only (or no VPC).
  4. Set --startup-probe failureThreshold=160 or higher for the 31B model.
  5. Wrap the vllm command in --command bash --args='^;^-c;<cmd>'.
  6. Smoke test chat, tool calls, and multi-turn before wiring it into anything real.

Total first-deploy time: about 90 minutes, half of it wall-clock waiting on Cloud Build and startup probes. Every subsequent deploy is ~5 minutes (weights are cached in GCS). Not bad for a 31B model that costs pennies between requests.