© 2026 NervNow™. All rights reserved.

LLM-D Enters CNCF Ecosystem to Fix Kubernetes Gaps in AI Inference
The distributed inference framework, co-developed with Red Hat, IBM Research, CoreWeave, and NVIDIA, moves under Linux Foundation governance to standardize Kubernetes-based AI serving.

Google Cloud and its founding partners formally donated llm-d to the Cloud Native Computing Foundation on March 24, bringing the distributed AI inference framework under Linux Foundation governance as an official CNCF Sandbox project.
The move places llm-d alongside projects such as Kubernetes and Prometheus in the CNCF ecosystem, giving enterprises a vendor-neutral standard for deploying large language models across hardware from NVIDIA, AMD, and Google.
The framework was first launched in May 2025 as a joint effort between Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA under a single operating principle: any model, any accelerator, any cloud. Since then, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI have joined as partners, along with research backers at the University of California, Berkeley, and the University of Chicago.
Standard Kubernetes routing is stateless. It lacks awareness of KV-cache locality, prompt length, and the computational asymmetry between the prefill and decode phases of LLM inference. That gap produces cache fragmentation, uneven hardware utilization, and latency spikes under variable load.
llm-d addresses this through its Endpoint Picker, which acts as the primary implementation of the Kubernetes Gateway API Inference Extension. The EPP routes each request by evaluating real-time KV-cache hit rates, in-flight request counts, and instance queue depth simultaneously.
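The idea behind that multi-signal routing can be pictured with a minimal sketch. The field names, weights, and scoring formula below are illustrative assumptions for exposition, not the EPP's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class PodMetrics:
    """Per-pod signals an inference-aware router might consider (illustrative names)."""
    kv_cache_hit_rate: float  # fraction of the prompt prefix already cached, 0.0-1.0
    in_flight_requests: int   # requests currently being processed
    queue_depth: int          # requests waiting in the pod's queue

def score(pod: PodMetrics, w_cache: float = 2.0, w_load: float = 1.0) -> float:
    """Higher is better: reward cache locality, penalize load (hypothetical weights)."""
    load = pod.in_flight_requests + pod.queue_depth
    return w_cache * pod.kv_cache_hit_rate - w_load * load

def pick_endpoint(pods: dict[str, PodMetrics]) -> str:
    """Route the request to the best-scoring pod."""
    return max(pods, key=lambda name: score(pods[name]))

pods = {
    "pod-a": PodMetrics(kv_cache_hit_rate=0.9, in_flight_requests=4, queue_depth=2),
    "pod-b": PodMetrics(kv_cache_hit_rate=0.1, in_flight_requests=1, queue_depth=0),
}
print(pick_endpoint(pods))  # pod-b: its light load outweighs pod-a's cache advantage
```

A stateless round-robin balancer would treat both pods identically; here the router trades off pod-a's strong cache locality against its deeper queue, which is the kind of decision a plain Kubernetes Service cannot make.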
The framework also disaggregates prefill and decode into independently scalable pods and supports hierarchical KV-cache offloading across GPU, TPU, CPU, and storage tiers.
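Hierarchical offloading can be thought of as a spillover cache: a KV block evicted from the fastest tier is demoted to the next one rather than discarded, so a later request with the same prefix avoids a full recompute. The sketch below is a simplified LRU model under that assumption; tier sizes and the promotion policy are hypothetical, not llm-d's actual code:

```python
from collections import OrderedDict

class TieredKVCache:
    """LRU cache with spillover between tiers (e.g. GPU -> CPU -> storage).

    Illustrative model only: real KV-cache managers track blocks by token
    prefix and account for transfer cost between tiers.
    """
    def __init__(self, capacities):
        # One OrderedDict per tier, fastest first; capacity counted in blocks.
        self.tiers = [(cap, OrderedDict()) for cap in capacities]

    def put(self, block_id, block):
        self._insert(0, block_id, block)

    def _insert(self, level, block_id, block):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: must be recomputed on next use
        cap, tier = self.tiers[level]
        tier[block_id] = block
        tier.move_to_end(block_id)
        if len(tier) > cap:
            evicted_id, evicted = tier.popitem(last=False)  # evict LRU block
            self._insert(level + 1, evicted_id, evicted)    # demote, don't drop

    def get(self, block_id):
        for level, (_, tier) in enumerate(self.tiers):
            if block_id in tier:
                block = tier.pop(block_id)
                self._insert(0, block_id, block)  # promote on hit
                return level, block
        return None  # full miss: prefill recomputes the block

cache = TieredKVCache([2, 4])       # e.g. 2 GPU blocks, 4 CPU blocks
for bid in ["a", "b", "c"]:
    cache.put(bid, f"kv-{bid}")     # "a" overflows the fast tier, lands in tier 1
print(cache.get("a"))               # hit in the slower tier, promoted back up
```

The point of the tiering is visible in `get`: block "a" no longer fits in the fast tier, yet it is still served from the next tier down instead of triggering a prefill recompute.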
Google Cloud said its Vertex AI team validated the architecture in production before the CNCF submission. Using llm-d’s inference-aware routing for context-heavy coding tasks on Qwen Coder, Time-to-First-Token latency fell by more than 35%. For bursty chat workloads on DeepSeek, P95 tail latency improved 52%. Prefix cache hit rate doubled from 35% to 70%, reducing re-computation overhead and lowering cost-per-token.
In benchmark testing on Qwen3-32B across eight vLLM pods and 16 NVIDIA H100 GPUs, the project’s most recent v0.5 release maintained near-zero Time-to-First-Token under load and reached approximately 120,000 tokens per second, while a baseline Kubernetes service degraded under equivalent query volume.
The CNCF acceptance was announced during KubeCon and CloudNativeCon Europe 2026 in Amsterdam. The timing is deliberate. Enterprise adoption of generative AI has shifted the focus of platform engineering teams from model selection to inference infrastructure, and Kubernetes has emerged as the default orchestration layer for that workload.
Google also leads development of the Kubernetes LeaderWorkerSet API, which llm-d uses to orchestrate multi-node deployments and wide expert parallelism. The company separately extended vLLM natively for Cloud TPUs through a unified PyTorch and JAX backend with Ragged Paged Attention v3, citing up to 5x throughput gains over its initial release.
Mistral AI confirmed participation in the project. Mathis Felardos, inference software engineer at Mistral AI, said in the CNCF announcement that the company is contributing to the llm-d ecosystem, including development of a DisaggregatedSet operator for LeaderWorkerSet, to advance open standards for AI serving.
It is unclear whether hyperscalers outside the founding group will adopt llm-d as a production standard or develop parallel internal solutions. The project’s CNCF status does not mandate adoption.
Disclaimer: This reporting is based on publicly available reports, including CNCF and Google’s official blog post.