Conclusion: vAttention for Simplified, High-Performance LLM Inference

17 Jun 2025

This conclusion highlights vAttention's success in providing dynamic KV-cache memory management for LLMs without added software complexity

Related Work: vAttention in LLM Inference Optimization Landscape

17 Jun 2025

Explore how vAttention distinguishes itself from prior LLM memory management (GMLake, PagedAttention) and scheduling systems, highlighting its unique approach

vAttention: Highly Effective in Reducing LLM KV-Cache Fragmentation

17 Jun 2025

This section shows how vAttention reduces KV-cache memory fragmentation and offers a significant portability advantage over vLLM

vAttention: Efficacy of Physical Memory Allocation for LLMs

17 Jun 2025

This section demonstrates vAttention's ability to efficiently allocate physical memory for LLM serving, showcasing high allocation bandwidth and hidden CUDA API latency

Boosting LLM Decode Throughput: vAttention vs. PagedAttention

13 Jun 2025

Discover how vAttention's use of FlashAttention's vanilla kernel for a contiguous KV-cache delivers superior decode performance over paged kernels

vAttention Performance & Portability for LLM Prefill Phase

13 Jun 2025

This section highlights vAttention's ability to add dynamic memory allocation support to unmodified FlashAttention and FlashInfer prefill kernels

vAttention System Design: Dynamic KV-Cache with Contiguous Virtual Memory

13 Jun 2025

Explore the detailed system design of vAttention, which leverages separate virtual and physical memory allocation to enable dynamic KV-cache management

Hiding Memory Allocation Latency in LLM Serving With vAttention

13 Jun 2025

Explore how vAttention optimizes LLM serving by leveraging predictable memory demand to overlap physical memory allocation with compute

Evaluation of vAttention for LLM Inference: Prefill and Decode Performance

13 Jun 2025

This section details the evaluation methodology for vAttention, assessing its performance, portability, and memory efficiency in both prefill and decode phases