These notes give an overview of how vLLM works from the perspective of processes, their responsibilities, and how they communicate. This is quite surface level. We don’t get down to hardware details or kernels here, and there’s also a lot we could go into relating to how different types of parallelism are used. But by the end we should be able to describe what happens to a request as it makes its way through the system, and what comes out at the end.
When working on performance optimisation in vLLM, the number of possible execution paths and configurations can be quite striking. What kind of device are we targeting: a CPU, a GPU, or something else? Each one exercises very different pathways in the codebase and its dependencies, with completely different kernel implementations. Above that, how are we dividing the model across our compute resources, and are we doing it efficiently? How does the model even receive jobs?
vLLM is a pretty big project. Including tests it has over 1 million lines of code. In order to figure out the most relevant areas for optimisation, you will most likely need to apply some kind of profiling and figure out what the bottlenecks are. But is this enough? While profiling does surface the most relevant code, it can also lead to a kind of tunnel vision that prevents you from deeply understanding the behaviour of the system and what governs its performance. To make the best progress, you’ll need a bit of both.
How is vLLM structured?
A default vLLM deployment often consists of two or three processes:
- The API server which handles requests via HTTP,
- An EngineCore which manages a schedule of tasks for the model,
- A Worker which picks up scheduled tasks and runs the model. On a single accelerator this runs inside the EngineCore process; under tensor or pipeline parallelism each Worker becomes its own separate process.
In practice there could be multiple instances of any of these processes when we are using different types of parallelism to help manage the load.
- For example under heavy load we might have multiple API servers handling requests, but they could send them on to the same EngineCore for scheduling.
- We could also replicate the model across different nodes or accelerators. This would be an example of data parallelism and we would have separate EngineCore processes to manage the schedules of each model replica.
- Finally, each EngineCore could split its model across multiple Worker processes, each running on a different node or accelerator. This can be done via either pipeline parallelism, in which different layers of the model are handled by different workers, or by tensor parallelism, in which a single layer can be divided across workers and the outputs can be collected via a reduction or gather operation.
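To make the tensor parallel case concrete, here is a minimal sketch in plain Python. It splits a single linear layer across two pretend workers in the two ways mentioned above: by columns, where the outputs are gathered, and by rows, where the partial outputs are reduced by summation. The matrix, the input, and the `matvec` helper are made up for illustration; this is not vLLM code, and real workers would run on separate devices and exchange results with collective operations.

```python
# Toy tensor parallelism: one linear layer y = x @ W split across two "workers".
# Plain Python lists stand in for device tensors; nothing here is vLLM code.

def matvec(x, W):
    """Compute x @ W for a vector x (length n) and a matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

x = [1.0, 2.0, 3.0, 4.0]
W = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [2.0, 1.0, 0.0, 1.0],
    [1.0, 2.0, 1.0, 0.0],
]
full = matvec(x, W)  # what a single worker holding all of W would compute

# Column split: each worker holds half the columns and sees the whole input.
# Each produces a slice of the output, and a gather concatenates the slices.
W_cols = [[row[:2] for row in W], [row[2:] for row in W]]
gathered = matvec(x, W_cols[0]) + matvec(x, W_cols[1])

# Row split: each worker holds half the rows and sees half the input.
# Each produces a full-size partial sum, and a reduction adds them together.
partials = [matvec(x[:2], W[:2]), matvec(x[2:], W[2:])]
reduced = [a + b for a, b in zip(*partials)]

assert gathered == full
assert reduced == full
```

Either split gives the same answer as the unsplit layer. The difference is in what has to be communicated afterwards: slices to concatenate in the first case, partial sums to add in the second.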
Why do we need separate processes?
Typically you don’t want to start spawning multiple processes if you can avoid it because it introduces a lot of extra complexity. How do your processes talk to each other? How do they exchange data efficiently while keeping track of who owns what and cleaning up when appropriate? The extra difficulty of dealing with interprocess communication and lifecycle management is certainly a cost. So why have separate processes?
First let’s consider the API server and the EngineCore. These components need to be separated because they work at very different timescales. The server could be receiving a huge number of requests per second from many concurrent users, and it must not block on any of them. Work has to be balanced, batched, and scheduled, and no single user request can be allowed to halt the program while we respond to it. For this reason the API server does only light work, validating and tokenising incoming requests, and then sends them onwards to the EngineCore where the heavier work is done.
Now let’s consider how to break down the responsibilities of an EngineCore. An EngineCore is responsible for listening for work from the API server, deciding on priorities (i.e. scheduling), and then carrying out a piece of that work before checking for new work. Whereas separation of the server from the EngineCore was mainly to deal with the differences in the rate at which they need to respond to their callers, subdivision of the EngineCore into worker processes has a different purpose. It is mainly to parallelise and spread the inference workload over cores or accelerators in order to reduce the time it takes to generate tokens.
How does the server communicate with EngineCore?
Since the API server and the EngineCore are separate processes, there needs to be some way for them to communicate. Messages have to flow in both directions: the API server receives the prompts and forwards them to the EngineCore, and the EngineCore’s responses must travel back to the server for return to the user.
How do they do it? The API server communicates with the EngineCore using ZeroMQ (ZMQ), a messaging library which provides a socket-like interface for the processes to send and receive messages, while hiding the details of exactly how those messages are transported. The transport could be Unix domain sockets if the processes are on the same host, or, if they are on separate hosts, TCP over a network.
ZMQ also solves the problem of communicating across an async/sync boundary. The server runs an event loop to receive, validate, and forward requests; it must never block. The EngineCore runs a synchronous loop that processes work one step at a time, and blocks while it computes. ZMQ sits between them as a non-blocking, buffered pipe: the server drops a message onto the socket and returns to its loop without waiting for the engine, while the EngineCore picks each one up when it’s ready and processes it synchronously. The buffer decouples their two timing models, so the fast async frontend never has to wait on the slow synchronous engine.
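The shape of this boundary can be sketched with the Python standard library alone. In the sketch below, a `queue.Queue` stands in for each ZMQ socket and a thread stands in for the EngineCore process, so it shows the pattern rather than vLLM’s actual transport. All of the names (`to_engine`, `from_engine`, `engine_loop`, `handle_request`, `serve`) are hypothetical, and upper-casing the prompt is a placeholder for running the model.

```python
import asyncio
import queue
import threading

# queue.Queue stands in for the ZMQ sockets; every name here is hypothetical.
to_engine = queue.Queue()     # server -> engine
from_engine = queue.Queue()   # engine -> server

def engine_loop():
    """Synchronous loop: block until work arrives, process it, repeat."""
    while True:
        request = to_engine.get()          # blocks this thread only
        if request is None:                # sentinel value: shut down
            break
        from_engine.put(request.upper())   # placeholder for a model step

async def handle_request(prompt):
    """Async handler: enqueue the work without waiting for the engine."""
    to_engine.put_nowait(prompt)           # returns immediately

async def serve(prompts):
    await asyncio.gather(*(handle_request(p) for p in prompts))
    # Wait for replies in a helper thread so the event loop itself never blocks.
    loop = asyncio.get_running_loop()
    return [await loop.run_in_executor(None, from_engine.get) for _ in prompts]

engine = threading.Thread(target=engine_loop)
engine.start()
replies = asyncio.run(serve(["hello", "world"]))
to_engine.put(None)                        # tell the engine loop to stop
engine.join()
print(replies)
```

The point to notice is that `handle_request` never waits on `engine_loop`: the only thing the two sides share is the buffer between them.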
How does an EngineCore keep up?
So far we’ve emphasised that the EngineCore is much slower than the server. The server may be receiving a firehose of work from users and its main job is to send that as quickly as possible over to the EngineCore. Is there any hope that the engine will be able to manage this?
The main way this is resolved is via batching. The engine doesn’t have to process a single request at a time. It is able to group them together into a batch that can be processed simultaneously. You might wonder why this works. If a single request took X milliseconds to process and we batch B of them together, won’t it take B*X milliseconds since it is B times the original work? It certainly is more work for the compute units to handle, but it turns out they are often not fully utilised anyway, because most of the time is spent moving model weights from memory into the compute units rather than doing arithmetic. By batching work together, we reuse each weight across every request in the batch once it has been loaded, so the cost of moving it is shared rather than paid once per request.
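A back-of-envelope model makes the argument easier to see. The sketch below assumes one step takes as long as the slower of two things: loading the weights once, or doing the arithmetic for every request in the batch. Every number in it is an illustrative assumption rather than a measurement of any real model or accelerator, and it ignores other memory traffic such as reading cached state for each request.

```python
# Back-of-envelope model of one decode step. All numbers are illustrative
# assumptions, not measurements of any particular model or accelerator.
WEIGHT_BYTES = 16e9        # e.g. 8 billion parameters at 2 bytes each
BANDWIDTH = 2e12           # bytes per second the memory system can deliver
PEAK_FLOPS = 4e14          # floating-point operations per second
FLOPS_PER_TOKEN = 16e9     # roughly 2 operations per parameter

def step_time(batch_size):
    """Time for one step: whichever of memory traffic or arithmetic is slower."""
    load_weights = WEIGHT_BYTES / BANDWIDTH                 # paid once per step
    arithmetic = batch_size * FLOPS_PER_TOKEN / PEAK_FLOPS  # grows with the batch
    return max(load_weights, arithmetic)

for b in (1, 8, 64, 512):
    t = step_time(b)
    print(f"batch={b:4d}  step={t * 1e3:6.2f} ms  tokens/s={b / t:10.0f}")
```

Under these assumptions the step time stays flat as the batch grows from 1 to 64, so throughput rises almost for free, and only the largest batch is limited by arithmetic.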
Memory turns out to be central to vLLM’s performance in another way too. As each request generates tokens, the model builds up a growing cache of intermediate state, known as the KV cache, that has to be stored and managed efficiently alongside the weights. How vLLM does this is one of the things it is best known for, but it is a large enough topic that we will set it aside here and return to it in a separate post.
What happens to a request?
We now have all the pieces, so let’s follow a single request from start to finish.
A user sends a prompt to the API server over HTTP. The server’s job is light: it validates the request, turns the prompt into tokens, and packages it up as a piece of work. It then drops that work onto the ZMQ socket and immediately returns to its event loop, ready for the next caller. At this point the request is out of the server’s hands and sitting in the buffer, waiting for the engine.
The EngineCore picks it up on its next pass around its loop. Rather than handling it in isolation, the scheduler folds the new request in with all the other in-flight requests and decides what should run on this step. That decision becomes a batch, and the batch is handed to the Worker, or split across several Workers when we are using tensor or pipeline parallelism.
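The scheduling decision can be sketched as a function that hands out a fixed per-step token budget. The policy below is hypothetical and much simpler than vLLM’s real scheduler: in-flight requests are served first, new arrivals are admitted while budget remains, a decoding request costs one token, and a request still in prefill takes as much of its prompt as the budget allows. The `Request` class, `schedule_step`, and the budget value are all invented for this example.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    prompt_len: int
    computed: int = 0   # prompt tokens already processed

    @property
    def in_prefill(self):
        return self.computed < self.prompt_len

def schedule_step(running, waiting, token_budget):
    """Decide how many tokens each request processes on this step.

    Returns a dict mapping request id to token count. It only makes the
    decision; updating `computed` after the step runs is left to the caller.
    """
    batch = {}
    for req in running:                       # in-flight requests go first
        if token_budget == 0:
            break
        want = req.prompt_len - req.computed if req.in_prefill else 1
        batch[req.request_id] = min(want, token_budget)
        token_budget -= batch[req.request_id]
    while waiting and token_budget > 0:       # then admit new arrivals
        req = waiting.popleft()
        running.append(req)
        batch[req.request_id] = min(req.prompt_len, token_budget)
        token_budget -= batch[req.request_id]
    return batch

running = [
    Request("a", prompt_len=5, computed=5),   # already decoding
    Request("b", prompt_len=12, computed=4),  # part-way through its prefill
]
waiting = deque([Request("c", prompt_len=20)])
batch = schedule_step(running, waiting, token_budget=16)
print(batch)
```

Here the decoding request gets its single token, the half-finished prefill gets the rest of its prompt, and the new arrival is admitted with whatever budget is left over.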
The Worker is where the model actually runs. It begins with the prefill, processing the prompt to produce a single new token; for a long prompt this prefill may itself be split across several steps. From then on the request is in the decode phase: each step produces exactly one more token, which is appended to the sequence and fed back in so the next step can produce the token after it. This is the autoregressive loop, and it is why generation happens one token at a time rather than all at once.
Each token the Worker produces is passed back up to the EngineCore, and from there back over ZMQ to the API server. The server turns the token back into text and streams it out to the user, which is why you see a response appear word by word rather than in a single block. The request stays in the engine’s schedule, taking one decode step per pass, until the model emits an end-of-sequence token or hits the requested length limit. At that point the request is finished, it is dropped from the schedule, and the resources it was holding are freed for the next caller.
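The prefill, the decode loop, and the two stopping conditions can be sketched as a Python generator, which also mirrors the way tokens are streamed out one at a time. The `toy_model` function is a stand-in that simply counts down so that the example terminates; a real model would return a token sampled from its output distribution. The `EOS` value and both function names are assumptions made for this example.

```python
EOS = 0  # hypothetical end-of-sequence token id

def toy_model(tokens):
    """Stand-in for a forward pass: returns the next token id.

    It counts down from the last token so the example terminates on its own.
    """
    return max(tokens[-1] - 1, EOS)

def generate(prompt_tokens, max_new_tokens):
    """Yield new tokens one at a time until EOS or the length limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # The first pass sees only the prompt (prefill); later passes see the
        # prompt plus everything generated so far (decode).
        next_token = toy_model(tokens)
        yield next_token
        if next_token == EOS:
            return                    # the model chose to stop
        tokens.append(next_token)     # feed the new token back in

stopped_by_eos = list(generate([7, 3], max_new_tokens=10))
stopped_by_limit = list(generate([7, 9], max_new_tokens=4))
print(stopped_by_eos, stopped_by_limit)
```

The first call ends because the toy model emits the end-of-sequence token, and the second because it reaches the requested length limit.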
So that is what comes out at the end: a stream of tokens, generated one step at a time by the Worker, scheduled and batched by the EngineCore, and relayed back through the API server to the user who asked for them.