
🤖 AI Snack 🍿 : Managing Memory Like an Operating System to Improve LLM Performance 🧱

Explore how PagedAttention and clever memory management in vLLM are boosting the power and efficiency of LLMs, making them supercharged for chat and beyond!

Managing memory in language models is like an elephant organizing its knowledge, employing a clever strategy to maximize its memory.

Classic computer science concepts like virtual memory and paging are finding new applications in AI to improve the performance of large language models (LLMs). In LLMs, every word (more precisely, every token) of the prompt and response consumes memory - for example, around 2.5MB per word in Meta Llama 2. The memory that stores these word representations is called the "key-value (KV) cache," and managing it efficiently is critical given the limited memory available on GPUs.
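To get a feel for why this matters, here is a rough back-of-the-envelope sketch of per-word KV cache size. The model dimensions below are illustrative only (roughly what a 7B-class model looks like in fp16), not the exact figures behind the 2.5MB number above:

```python
# Rough estimate of KV-cache memory per token (illustrative dimensions,
# not the exact configuration of any particular Llama 2 model).

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Each token stores one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Example: 32 layers, 32 KV heads, head dimension 128, fp16 (2 bytes)
per_token = kv_cache_bytes_per_token(32, 32, 128, 2)
print(f"{per_token / 1024**2:.2f} MiB per token")            # ~0.5 MiB

# A full 2048-token context then needs on the order of a gigabyte
print(f"{per_token * 2048 / 1024**3:.2f} GiB for 2048 tokens")
```

Multiply that by dozens of concurrent requests and the KV cache quickly dominates GPU memory.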

Researchers at UC Berkeley proposed PagedAttention, which stores words in non-contiguous blocks of memory, similar to how operating systems use paging to manage physical memory. Previously, if a model had a context length of 2048 words, contiguous memory for all 2048 word slots would be reserved per request, even for short sequences, causing fragmentation from unused capacity. With PagedAttention, words are instead stored in small blocks (for example, 4 word slots each) placed wherever free memory is available, which significantly improves requests per second. It also enables sharing the blocks of a common prompt across requests, further reducing memory usage.
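Here is a minimal sketch of the bookkeeping behind this idea - not vLLM's actual code, just an illustration under the assumptions above: a per-request block table maps logical blocks to whichever physical blocks happen to be free, so memory grows only as words are generated.

```python
# Minimal sketch of paged KV-cache bookkeeping (illustrative, not vLLM's code).
# Physical blocks hold a fixed number of word slots (4, as in the text) and
# are handed out on demand, so no contiguous 2048-slot region is ever needed.

BLOCK_SIZE = 4

class BlockAllocator:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()           # any free block will do

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)

class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []        # logical index -> physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:   # current block is full (or first token)
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_physical_blocks=1024)
seq = Sequence(allocator)
for _ in range(10):                             # generate 10 tokens
    seq.append_token()
print(seq.block_table)                          # only 3 blocks used, not 2048 slots
```

Sharing a prompt across requests then amounts to pointing several block tables at the same physical blocks.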

Figure from the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention".

When GPU memory fills up, PagedAttention borrows another virtual memory technique: it evicts some KV cache blocks to CPU RAM, which acts as swap space. The evicted blocks stay there until GPU memory is freed up, then they are copied back so text generation can continue. An alternative is to recompute the evicted blocks rather than storing them; which option is faster depends on the GPU's compute power and the GPU-CPU bandwidth.
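A toy cost model makes the trade-off concrete. All numbers below are placeholders, not measurements: swapping pays the GPU-CPU transfer time twice (out and back), while recomputation pays the time to re-run the forward pass over the evicted words.

```python
# Toy swap-vs-recompute cost model (placeholder numbers, not measurements
# from any specific GPU or serving system).

def swap_cost_s(kv_bytes: float, gpu_cpu_bandwidth_gbps: float) -> float:
    """Time to move the evicted KV cache out to CPU RAM and later back."""
    return 2 * kv_bytes / (gpu_cpu_bandwidth_gbps * 1e9)

def recompute_cost_s(num_tokens: int, prefill_tokens_per_s: float) -> float:
    """Time to re-run the forward pass over the evicted tokens."""
    return num_tokens / prefill_tokens_per_s

kv_bytes = 2048 * 0.5e6   # 2048 tokens at ~0.5 MB of KV cache each (assumed)
swap = swap_cost_s(kv_bytes, gpu_cpu_bandwidth_gbps=16)       # ~16 GB/s link
recompute = recompute_cost_s(2048, prefill_tokens_per_s=10_000)

print(f"swap: {swap*1000:.0f} ms, recompute: {recompute*1000:.0f} ms")
print("prefer", "recompute" if recompute < swap else "swap")
```

With a fast interconnect swapping tends to win; with a fast GPU and a slow link, recomputation can be cheaper.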

UC Berkeley implemented PagedAttention in vLLM, an open-source LLM serving engine. With PagedAttention, vLLM achieves 2-4x higher throughput than previous systems, and it adds further optimizations such as continuous batching and kernel-level improvements for fast LLM serving. We have chosen vLLM as the inference engine for ChatFAQ because it makes LLM deployment easy and efficient through our Docker container, which you can find in our GitHub repo.
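For a sense of what this looks like in practice, here is a minimal offline-inference sketch using vLLM's Python API. The model name and sampling settings are just examples; this is not ChatFAQ's deployment code.

```python
# Minimal vLLM offline-inference sketch (model name and sampling settings
# are illustrative; any supported Hugging Face causal LM can be used).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")   # PagedAttention runs under the hood
sampling = SamplingParams(temperature=0.7, max_tokens=128)

prompts = ["Explain PagedAttention in one sentence."]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```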

Now, you are ready to run open-source models at full speed!

To explore further, consider delving into the original paper—it's where the real adventure begins!