LMCache is an open-source KV cache layer that speeds up LLM inference by storing and reusing KV caches across GPU, CPU, disk, and S3.
LMCache is an open-source KV cache management layer for LLM inference. It treats the KV cache as reusable data rather than temporary per-request state. It stores the KV caches of reusable text across the datacenter (GPU, CPU, local disk, and S3) and moves them with acceleration techniques such as zero CPU copy, NIXL, and GDS, so that a cached prefix does not have to be recomputed across requests or serving engines. It is vendor-neutral and plugs into mainstream open-source serving engines, inference frameworks, hardware vendors, and storage systems. Combined with vLLM, LMCache delivers 3-10x reductions in delay and GPU cycles for workloads such as multi-round QA and RAG, cutting time-to-first-token and improving throughput. A flexible SERDE (serialization/deserialization) interface lets researchers add compression, token dropping, and custom serialization.
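To make the idea of prefix reuse across storage tiers concrete, here is a minimal toy sketch in Python. It is not LMCache's API or implementation: the names `chunk_keys`, `TieredKVStore`, and `longest_cached_prefix`, the chunk size, and the promote-on-hit policy are all assumptions made for illustration, and plain byte strings stand in for real KV tensors. The sketch hashes the token sequence chunk by chunk so that each key identifies an entire prefix, then checks the tiers from fastest to slowest.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

CHUNK_SIZE = 4  # tokens per chunk; hypothetical value chosen for the example


def chunk_keys(tokens: List[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Return one key per full chunk; each key hashes the whole prefix so far."""
    keys: List[str] = []
    running = hashlib.sha256()
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        running.update(",".join(map(str, chunk)).encode() + b";")
        # copy() snapshots the running hash without finalizing it
        keys.append(running.copy().hexdigest())
    return keys


class TieredKVStore:
    """Toy store: an ordered list of tiers, fastest first (e.g. gpu, cpu, disk)."""

    def __init__(self, tier_names: List[str]) -> None:
        self.tiers: List[Tuple[str, Dict[str, bytes]]] = [
            (name, {}) for name in tier_names
        ]

    def put(self, key: str, value: bytes, tier: int = 0) -> None:
        self.tiers[tier][1][key] = value

    def get(self, key: str) -> Optional[Tuple[str, bytes]]:
        """Return (tier_name, value) from the fastest tier holding the key."""
        for index, (name, store) in enumerate(self.tiers):
            if key in store:
                value = store[key]
                if index > 0:
                    # Assumed policy: copy a hit from a slower tier into the fastest one
                    self.tiers[0][1][key] = value
                return name, value
        return None


def longest_cached_prefix(store: TieredKVStore, tokens: List[int]) -> int:
    """Number of leading tokens whose KV chunks are already cached."""
    hits = 0
    for key in chunk_keys(tokens):
        if store.get(key) is None:
            break  # the first miss ends the reusable prefix
        hits += 1
    return hits * CHUNK_SIZE


if __name__ == "__main__":
    store = TieredKVStore(["gpu", "cpu", "disk"])
    document = list(range(8))  # stand-in token ids for a shared document
    first, second = chunk_keys(document)
    store.put(first, b"kv-chunk-0", tier=2)   # placeholder bytes, not real tensors
    store.put(second, b"kv-chunk-1", tier=1)
    reused = longest_cached_prefix(store, document + [99, 100])
    print(f"tokens that can skip prefill: {reused}")
```

Because each key covers the full prefix, a request that shares only a later portion of the text with a cached request gets no hits in this sketch; only tokens beyond the reused prefix would need prefill computation.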