Disaggregated Prefill and Decode (DPD) deployment issues
Overview: This section covers common issues that can occur when deploying inference endpoints with Disaggregated Prefill and Decode (DPD). These problems typically involve pod startup, KV cache transfer, routing behavior, or resource allocation.
Pod startup issues
Problem: DPD pods fail to start or remain in a non-ready state.
Symptoms and resolution:
- Pods stuck in `ContainerCreating` for more than 10 minutes. Run `kubectl describe pod <pod-name>` and look for `Failed to pull image` or `MountVolume.SetUp failed`. Verify that the worker image exists and that the Amazon S3 bucket is accessible from the cluster.
- Pods stuck at 2/3 Ready. The vLLM worker is still loading the model. Llama 3.3 70B takes 5–10 minutes from a cold Amazon S3 fetch. Check progress:

    kubectl logs <pod-name> -c <prefill|decode>-<endpoint-name> | grep -i "engine\|loading"

  Wait for the log message indicating the engine is ready.
- Pods restart with `EngineDeadError` or `TimeoutError`. This indicates an operator version older than v3.2. Upgrade the inference operator before continuing.
- All HTTP requests return 503 immediately after deploy. Pods are still loading the model. Wait for the `InferenceEndpointConfig` status to reach `DeploymentComplete`:

    kubectl get inferenceendpointconfig <endpoint-name> -n <namespace> -w
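The readiness checks above can be narrowed to the pods that still need attention. The following is a minimal sketch, not part of the product tooling: `not_ready_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints NAME, READY, and STATUS for every pod whose ready-container
# count is lower than its total container count (for example, 2/3).
not_ready_pods() {
  awk '{
    split($2, ready, "/")              # READY column looks like "2/3"
    if (ready[1] + 0 < ready[2] + 0)   # fewer ready containers than total
      print $1, $2, $3
  }'
}

# Example usage against a live cluster (namespace is a placeholder):
#   kubectl get pods -n <namespace> --no-headers | not_ready_pods
```

Pods listed by the helper are the ones to inspect with `kubectl describe pod` or `kubectl logs` as described above.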
KV cache transfer issues
Problem: KV cache transfer between prefill and decode pods fails or performs poorly.
Symptoms and resolution:
- Decoder logs show `Retrieved 0 out of N required tokens`. KV transfer did not occur and the decoder fell back to local recomputation. Verify that the `pd_role` is correct (prefiller must be `sender`, decoder must be `receiver`), that both pods use the same worker image, and that `PYTHONHASHSEED` is set to `"0"` on both pods.
- Decoder logs show `Failed to allocate memory object, retrying...`. The decoder PD buffer is full under high concurrency. Either increase `PD_BUFFER_SIZE` (try `"17179869184"` for 16 GiB or `"34359738368"` for 32 GiB) or scale `decodingSpec.replicas`.
- KV transfer throughput below 1 GB/s. EFA is not being used and transfers are falling back to TCP. Verify that both pods are scheduled on EFA-capable nodes in the same Availability Zone, that nodes have EFA resources available (`kubectl describe node <node-name> | grep efa`), and that the worker image includes the EFA libfabric provider.
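The `PD_BUFFER_SIZE` values above are byte counts for sizes given in GiB, and the `Retrieved 0 out of N required tokens` message can be counted to see how often the decoder recomputed locally. The following is a minimal sketch of both. `gib_to_bytes` and `count_kv_misses` are hypothetical helper names, and the match pattern relies only on the log text quoted above.

```shell
# Hypothetical helper: convert a whole number of GiB to bytes
# (1 GiB = 1024^3 bytes), suitable as a PD_BUFFER_SIZE value.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Hypothetical helper: count decoder log lines on stdin that report
# zero retrieved tokens, meaning no KV cache was transferred.
count_kv_misses() {
  # grep -c exits non-zero when the count is 0; "|| true" hides that status.
  grep -c 'Retrieved 0 out of' || true
}

# Example usage against a live cluster (names are placeholders):
#   gib_to_bytes 16
#   kubectl logs <pod-name> -c decode-<endpoint-name> | count_kv_misses
```

A non-zero miss count after the `pd_role`, image, and `PYTHONHASHSEED` checks suggests that the transfer path itself needs investigation.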
Routing issues
Problem: Requests are not being routed correctly between prefill and decode pods.
Symptoms and resolution:
- All requests bypass the prefiller (even long prompts). Check the router logs for routing decisions:

    ROUTER_POD=$(kubectl get pods -n hyperpod-inference-system -o name | grep router | head -1)
    kubectl logs $ROUTER_POD -n hyperpod-inference-system -c router-container --tail=50 \
      | grep "Conditional routing"

  Verify that the `estimated_tokens` value exceeds your `routingThreshold`. If the token estimate is lower than expected, the router's tokenizer may be counting differently. In that case, try lowering `routingThreshold`.
- Uneven load distribution across prefillers. If you have multiple prefiller replicas and observe that one is overloaded while others are idle, switch the routing strategy to `roundrobin` for even distribution. Alternatively, use `kvaware` for cache-aware distribution that accounts for the actual state of each prefiller.
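The comparison between `estimated_tokens` and `routingThreshold` can be automated over the router log lines. The following is a rough sketch under a stated assumption: the exact router log format is not documented here, so the helper assumes each line contains `estimated_tokens=<integer>` (or `estimated_tokens: <integer>`). `count_at_or_below_threshold` is a hypothetical name; adjust the pattern to match your actual log output.

```shell
# Hypothetical helper: reads router log lines on stdin and prints how many
# report an estimated token count at or below the given threshold.
# ASSUMPTION: lines contain "estimated_tokens=<int>" or "estimated_tokens: <int>".
count_at_or_below_threshold() {
  sed -n 's/.*estimated_tokens[=:] *\([0-9][0-9]*\).*/\1/p' |
    awk -v t="$1" '$1 + 0 <= t + 0 { n++ } END { print n + 0 }'
}

# Example usage against a live cluster (threshold value is a placeholder):
#   kubectl logs "$ROUTER_POD" -n hyperpod-inference-system -c router-container --tail=50 \
#     | grep "Conditional routing" | count_at_or_below_threshold 512
```

If most sampled requests fall at or below the threshold even for long prompts, that is consistent with the tokenizer-counting explanation above.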