- Run Multimodal Reasoning Agents with NVIDIA Nemotron on vLLM
- From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA
- Now Serving NVIDIA Nemotron with vLLM
- No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL
- vLLM TPU: A New Unified Backend Supporting PyTorch and JAX on TPU
- SemiAnalysis InferenceMAX: vLLM and NVIDIA Accelerate Blackwell Inference
- DeepSeek-V3.2-Exp in vLLM: Fine-Grained Sparse Attention in Action
- The First vLLM Meetup in Korea
- vLLM Semantic Router: Next Phase in LLM Inference
- vLLM Now Supports Qwen3-Next: Hybrid Architecture with Extreme Efficiency
- Serving Geospatial, Vision, and Beyond: Enabling Multimodal Output Processing in vLLM
- Inside vLLM: Anatomy of a High-Throughput LLM Inference System
- Introduction to torch.compile and How It Works with vLLM
- GLM-4.5 Meets vLLM: Built for Intelligent Agents
- CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond
- vLLM Now Supports gpt-oss
- MiniMax-M1 Hybrid Architecture Meets vLLM: Long Context, Fast Inference
- Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU
- Accelerating RLHF with vLLM, Best Practice from OpenRLHF
- Transformers Backend Integration in vLLM
- Llama 4 in vLLM
- PTPC-FP8: Boosting vLLM Performance on AMD ROCm
- Introducing AIBrix: A Scalable, Cost-Effective Control Plane for vLLM
- Distributed Inference with vLLM
- vLLM V1: A Major Upgrade to vLLM's Core Architecture
- Introducing vLLM Inference Provider in Llama Stack
- High Performance and Easy Deployment of vLLM in K8S with "vLLM production-stack"
- Structured Decoding in vLLM: A Gentle Introduction
- vLLM 2024 Retrospective and 2025 Vision
- Installing and Developing vLLM with Ease
- Serving LLMs on AMD MI300X: Best Practices
- How Speculative Decoding Boosts vLLM Performance by up to 2.8x
- vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction
- vLLM's Open Governance and Performance Roadmap
- Announcing Llama 3.1 Support in vLLM
- Notes on vLLM v.s. DeepSpeed-FastGen
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention