The Mechanics of PyTorch Inference

These are technical notes giving a quick introduction to the internals of inference in PyTorch. The goal is to communicate the basic pathway by which we can write models in a simple Python interface while still carrying out highly optimised inference on the target hardware.

PyTorch was originally built as a research tool at Facebook AI Research, with development starting around 2016 and a first public release in early 2017. Its great strength was that it made neural networks feel like ordinary Python programs. Tensors behaved a lot like familiar NumPy arrays. Models could be written as normal classes and functions. You could use Python control flow, print statements, debuggers, and all the usual tools of exploratory programming. You wrap your model logic in a subclass of torch.nn.Module, define a forward method, and PyTorch handles much of the machinery around parameters, state, and automatic differentiation.

That design was a huge advantage for research. When you are trying out new ideas, you want the model to be easy to change, easy to inspect, and easy to debug. Eager execution gives you that: Python remains in control, and operations run as the program reaches them.
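As a minimal sketch of what this looks like in practice, the module below mixes an ordinary Python branch into its forward pass. The class name `TinyBlock` and the layer sizes are made up for illustration and do not come from any real model.

```python
import torch
from torch import nn


class TinyBlock(nn.Module):
    """Hypothetical two-layer block; the sizes are arbitrary."""

    def __init__(self, dim: int = 8, hidden: int = 16):
        super().__init__()
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor, residual: bool = True) -> torch.Tensor:
        h = torch.relu(self.up(x))  # runs immediately; h is a real tensor
        out = self.down(h)
        if residual:                # plain Python branch, decided at run time
            out = out + x
        return out


block = TinyBlock()
x = torch.randn(4, 8)
y = block(x)
print(y.shape)  # torch.Size([4, 8])
```

Because every line executes as Python reaches it, a debugger or a print statement placed inside `forward` sees concrete tensors rather than symbolic placeholders.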

Inference asks for something different. Once a model is trained, we do not need gradient tracking. The weights are fixed, so we do not need to update parameters. We care less about interactive debugging and more about throughput, latency, memory use, and hardware utilisation. The flexibility that makes PyTorch pleasant for research can become overhead when the same model has to run repeatedly in production.
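The first of those savings is available without any compiler. The sketch below, which assumes a small stand-in model built from `nn.Sequential`, shows the usual way of switching off gradient tracking for a forward pass.

```python
import torch
from torch import nn

# Stand-in model for illustration; any nn.Module would do.
model = nn.Sequential(nn.Linear(8, 16), nn.GELU(), nn.Linear(16, 8))
model.eval()  # put layers such as dropout or batch norm into inference behaviour

x = torch.randn(4, 8)

y_tracked = model(x)          # autograd records what it needs for backward
with torch.inference_mode():  # no gradient tracking inside this block
    y_infer = model(x)

print(y_tracked.requires_grad)  # True
print(y_infer.requires_grad)    # False
```

The numerical result is the same in both cases; the difference is the bookkeeping that autograd no longer has to do.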

This creates the central tension in PyTorch inference. The model is written as a dynamic Python program, but efficient inference wants something closer to a static execution plan. It wants to know which tensor operations will run, how data will move between them, which intermediate results can be avoided, which operations can be fused, and which kernels should execute on the target hardware.

Modern PyTorch inference is largely about bridging that gap. It starts with ordinary eager execution, where Python dispatches individual tensor operations into compiled libraries. It then uses graph capture to recover a structured representation of the computation from Python code. Finally, it uses graph compilation to optimise that representation and lower it towards efficient kernels.

To understand the mechanics of PyTorch inference, we can follow a single forward pass down the stack: from Python code, to ATen operators, to captured graphs, to compiler transformations, and finally to the kernels that do the real work.

What’s the stack?

We’re going to work out what the layers of the stack really are, and along the way we’ll learn some of the techniques used to optimise inference. Let’s consider what an LLM consists of: an architecture (GLM, Qwen, Llama, etc.) and some weights. The architecture is the shape of the network: a directed acyclic graph of operations through which data flows until it reaches the output.

In PyTorch we usually write out the architecture in Python. Line by line we define modules such as attention mechanisms, multi-layer perceptrons, and transformer blocks, and we wire them together. Then we try to run a forward pass. The first problem is that Python is slow, and an LLM is a huge function with billions of parameters. Evaluating it would be far too much arithmetic for the Python interpreter to carry out itself, so that work is dispatched elsewhere.
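To get a feel for the gap, the sketch below multiplies two small matrices twice: once with a naive triple loop executed by the interpreter, and once with a single tensor operation. The helper `matmul_python` is written for this comparison only, and the timings it prints will vary from machine to machine.

```python
import time

import torch


def matmul_python(a, b):
    """Naive triple loop over nested Python lists."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc
    return out


a = torch.randn(64, 64, dtype=torch.float64)
b = torch.randn(64, 64, dtype=torch.float64)

t0 = time.perf_counter()
slow = matmul_python(a.tolist(), b.tolist())  # every multiply-add is interpreted
t1 = time.perf_counter()
fast = a @ b                                  # one call into compiled code
t2 = time.perf_counter()

print(f"python loops: {t1 - t0:.4f}s, torch: {t2 - t1:.6f}s")
```

Both paths compute the same numbers; only the place where the arithmetic happens differs.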

Delegating to ATen

Despite having “Py” in the name, PyTorch is mainly written in C++. Most of the Python that you interact with is a thin wrapper around a compiled C++ core that defines the key operations as well as how to track gradients during training. The main library to consider here is ATen (short for “A Tensor library”), which defines operations such as matrix multiplication, elementwise addition, and activation functions, as well as many more.

When we run a forward pass, each line of Python may dispatch to one of these compiled operations. Not every line does, but the most important operations, such as matrix multiplication, activation functions, and convolution, typically have an implementation in C++. At some point in the Python call stack there is a function bound to a shared library of ATen operations: the work is done in compiled code, and the result and control are then returned to Python.
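ATen operators are also reachable directly from Python through the `torch.ops.aten` namespace, which makes the mapping easier to see. The sketch below assumes 2-D inputs, for which `x @ w` is expected to end up at the `mm` operator; other shapes may take a different route.

```python
import torch

x = torch.randn(4, 8)
w = torch.randn(8, 16)
b = torch.randn(16)

# Python-level expression, as a model author would write it.
y_python = x @ w + b

# The same computation spelled as explicit ATen operator calls.
z = torch.ops.aten.mm.default(x, w)
y_aten = torch.ops.aten.add.Tensor(z, b)

print(torch.allclose(y_python, y_aten))
```

In everyday code there is no reason to call these operators by hand; they are shown here only to make the layer beneath the Python API visible.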

So how well does this perform? It’s enough for a lot of scripting and experimentation. But serving has a high bar. If you’re dealing with high demand, limited hardware, and expensive running costs, then passing control back and forth between Python and C++ for every operation is a luxury you cannot afford. The only reason we do it is that we are operating in “eager mode”. We don’t have a plan for where we are going, so we rely on the control flow of the Python program to direct us to the right operations as they come.

This isn’t really necessary. Once we have written our model and settled upon it, we already know all the operations we need to perform. We can write them as a graph, with nodes representing operations and edges representing the data that flows between them. With this plan in hand there’s no need to pass anything back to Python until the end of the graph.

Eager mode returns control to Python after each operation, while a captured graph runs the whole sequence as one plan before returning.
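To make the idea of a plan concrete, the sketch below uses `torch.fx.symbolic_trace` to turn a small function into a graph and print it. This tracer is simpler than the capture mechanism PyTorch uses for compilation, and it is used here only because its output is easy to read. In the `torch.fx` representation, each entry (`Node`) is an input, an operation, or the output, and its arguments name the values it consumes.

```python
import torch
import torch.fx
import torch.nn.functional as F


def model(x, w, b):
    z = x @ w
    z = z + b
    return F.gelu(z)


traced = torch.fx.symbolic_trace(model)

# Walk the plan in execution order.
for node in traced.graph.nodes:
    print(node.op, node.name, [str(arg) for arg in node.args])
```

The traced object is still callable, so `traced(x, w, b)` computes the same result as `model(x, w, b)`, but the sequence of operations is now data that a compiler can inspect and rewrite.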

So the next problem: we don’t have a graph yet. We only have PyTorch code written in Python. How do we translate that into a graph? This is done via graph capture using TorchDynamo. When we send some sample input into our Python program, Dynamo intercepts the bytecode that the interpreter is about to run, and from this it constructs the graph representation of the model. Each node in the resulting graph is a tensor operation, which is ultimately lowered to calls to ATen operators. Sometimes the program hits branching logic or other control flow that is hard for Dynamo to capture, for example a branch that depends on the values inside a tensor. In these cases it creates a graph break: the capture is split into separate graphs, one before the break and one after, stitched together with ordinary Python that handles the dynamic control flow.
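One way to watch Dynamo at work is to pass `torch.compile` a custom backend, which is simply a function that receives each captured graph. The sketch below counts how many graphs arrive for two toy functions. The names `recording_backend`, `straight_line`, and `data_dependent` are invented for this example, and the exact number of graphs produced by the branching function may differ between PyTorch versions.

```python
import torch
import torch.fx

captured = []  # one entry per graph that Dynamo hands to the backend


def recording_backend(gm: torch.fx.GraphModule, example_inputs):
    """Hypothetical backend: record the graph, then run it unoptimised."""
    captured.append(gm)
    return gm.forward


def straight_line(x):
    return torch.relu(x * 2.0) + 1.0


def data_dependent(x):
    y = x * 2.0
    if y.sum() > 0:  # depends on tensor values, so a graph break is expected
        return y + 1.0
    return y - 1.0


x = torch.ones(4)

out_straight = torch.compile(straight_line, backend=recording_backend)(x)
graphs_straight = len(captured)

captured.clear()
out_branchy = torch.compile(data_dependent, backend=recording_backend)(x)
graphs_branchy = len(captured)

print(graphs_straight, graphs_branchy)
```

Calling `print_readable()` on any of the recorded graph modules shows the operations that were captured in that piece.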

Once we have the graph, it’s time to figure out how to run it. That’s the job of the compiler.

Graph Capture and Compilation

Compilation is the process of taking high-level instructions written in one format and transforming them into a lower-level format closer to where they will be run. The benefit is that it is often quite difficult for humans to write in that low-level format directly (e.g. assembly or machine code). Instead we benefit from working in another representation such as C++ (compiled), Python (interpreted), or nowadays natural language (AI-assisted).

Compilers have existed since the early 1950s, when Grace Hopper and her team developed the A-0 System, an early program that translated symbolic instructions into machine code. Is that relevant for neural networks today? It is, because a compiler essentially goes from “What do we want to get done?” to “How do we do it?” In this frame, the graph of operations representing a neural network is the “what” and we need to figure out the “how” in the most efficient manner. How do we run a neural network on our chosen hardware as fast as possible? Answering this question is the job of a graph compiler.

An example. Suppose we have activations from a previous layer, and now we want to apply a matrix multiplication, add a bias, and pass the output through an activation function. This is a very common pattern throughout neural networks: a linear layer followed by a nonlinearity. These three operations can be written in code as:

import torch.nn.functional as F

def model(x, w, b):
    z = x @ w      # matmul
    z = z + b      # elementwise add / bias
    y = F.gelu(z)  # elementwise activation
    return y

Kernel fusion: the unfused path bounces each intermediate through DRAM, while the fused kernel keeps them in registers and writes the result once.

So what’s the problem and how can a compiler fix it? Without compilation, each of these three lines is a separate call to an ATen operator: first matmul, then elementwise addition, and finally elementwise GELU activation. Each operator takes some input data and produces some output data. The input data must be fetched from memory: either from on-chip storage, such as the L1, L2, or L3 caches on a CPU, or from main memory. Ideally the data will be available on-chip since this has much faster access, but that is not guaranteed. If the result of the matrix multiplication is written out as a full tensor, then read back in for the bias addition, then written out again, then read back in for the activation function, we have moved a lot of data simply to perform a small amount of extra elementwise work.

This is especially wasteful because the bias addition and activation function are not independent. They are just transformations applied to each output element of the matrix multiplication. Instead of treating them as separate stages, a compiler can combine them into one piece of generated code. In the ideal case, each output element is computed, the bias is added, the activation is applied, and the final value is written once.
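A back-of-the-envelope count shows how much traffic is at stake. The helper below, `elementwise_traffic`, is a made-up function that tallies reads and writes of the intermediate tensor only, under deliberately crude assumptions: every intermediate goes all the way to main memory, caches are ignored, and the reads of `x`, `w`, and `b` are left out.

```python
def elementwise_traffic(rows: int, cols: int, bytes_per_elem: int = 4) -> dict:
    """Bytes moved for the (rows x cols) intermediate tensor only.

    Simplifying assumptions: every intermediate goes to main memory,
    caches are ignored, and reads of x, w and b are not counted.
    """
    tensor_bytes = rows * cols * bytes_per_elem
    unfused = {
        "matmul write": tensor_bytes,
        "add read": tensor_bytes,
        "add write": tensor_bytes,
        "gelu read": tensor_bytes,
        "gelu write": tensor_bytes,
    }
    fused = {"single write": tensor_bytes}
    return {
        "unfused_bytes": sum(unfused.values()),
        "fused_bytes": sum(fused.values()),
    }


traffic = elementwise_traffic(rows=4096, cols=4096)
print(traffic["unfused_bytes"] / 2**20, "MiB unfused")  # 320.0 MiB unfused
print(traffic["fused_bytes"] / 2**20, "MiB fused")      # 64.0 MiB fused
```

Under these assumptions the unfused path moves five times as many bytes for the same arithmetic. Real hardware will land somewhere in between, depending on how much of the intermediate stays in cache.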

This is kernel fusion and it is one of the responsibilities of the TorchInductor graph compiler. Rather than launching several kernels and materialising intermediate tensors between them, the compiler tries to produce a smaller number of kernels that do more work per pass over the data. This can reduce memory traffic, reduce launch overhead on GPUs, and improve cache locality on CPUs.

There are some details hidden in that description. A large matrix multiplication is usually already handled by a highly tuned library such as cuBLAS, oneDNN, or MKL. These libraries contain carefully written kernels that know how to use the target hardware efficiently. A compiler will not necessarily replace those kernels with its own version. Often the best plan is to keep the matrix multiplication as a library call, while fusing cheaper elementwise operations around it or into an epilogue if the backend supports that.
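A hand-written analogue of folding the bias into the matrix multiplication already exists in the eager API as `torch.addmm`, which computes the bias add and the matmul in one operator call. The sketch below only checks that the two spellings agree numerically. Whether the single call really maps to one kernel depends on the backend and library in use, so treat it as an illustration of the idea rather than a guarantee about performance.

```python
import torch
import torch.nn.functional as F

x = torch.randn(32, 64)
w = torch.randn(64, 128)
b = torch.randn(128)

# Three separate operator calls, each producing a full intermediate tensor.
unfused = F.gelu((x @ w) + b)

# torch.addmm computes b + x @ w in one operator call, so the bias add
# travels with the matrix multiplication instead of being a separate step.
fused_bias = F.gelu(torch.addmm(b, x, w))

print(torch.allclose(unfused, fused_bias, atol=1e-4))
```

The GELU is still a separate call here; removing that last pass over the data is the kind of work left to the compiler.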

So TorchDynamo captures the Python program into a graph of tensor operations. TorchInductor then takes that graph and decides how to lower it. Some operations may be fused. Some may be simplified. Some intermediate tensors may be removed. Some parts may be compiled into generated kernels, while others may remain calls into existing vendor libraries.
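From the user's side, this whole pipeline sits behind a single call to `torch.compile`. The sketch below wraps that call in a helper, `run_compiled`, which is invented for this example and compares the compiled function against eager execution. The default `"inductor"` backend is assumed to need a working C++ compiler on CPU, or Triton on a GPU, so the backend is left as a parameter.

```python
import torch
import torch.nn.functional as F


def model(x, w, b):
    z = x @ w
    z = z + b
    return F.gelu(z)


def run_compiled(backend: str = "inductor") -> bool:
    """Compile `model` and check it against eager execution.

    `run_compiled` is a helper made up for this sketch, not a PyTorch API.
    """
    x, w, b = torch.randn(32, 64), torch.randn(64, 128), torch.randn(128)
    compiled = torch.compile(model, backend=backend)
    with torch.no_grad():
        expected = model(x, w, b)
        actual = compiled(x, w, b)        # first call captures and compiles
        actual_again = compiled(x, w, b)  # later calls reuse the compiled code
    same_as_eager = torch.allclose(expected, actual, atol=1e-4)
    stable = torch.allclose(actual, actual_again)
    return same_as_eager and stable
```

Calling `run_compiled()` exercises the full Dynamo and Inductor path. Calling `run_compiled("aot_eager")` still captures the graph but runs it with the ordinary eager kernels, which is a convenient way to separate capture problems from code generation problems.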

On NVIDIA GPUs, generated kernels may be emitted through Triton. On CPUs, PyTorch may generate C++ code using vectorised loops and threading. On other hardware, different backends and libraries come into play. This means that graph compilation is not a single optimisation. It is a pipeline of decisions: preserve this operation, fuse those operations, specialise for these input shapes, choose this backend, generate this code, and call this library where it is already better than anything we should generate ourselves.

Overview

So the stack looks something like this. At the top, we have ordinary Python code defining a model. In eager mode, Python drives execution operation by operation. Each important tensor operation dispatches into ATen, where compiled implementations do the real numerical work. With torch.compile, Dynamo observes the Python execution and captures a graph. Inductor then optimises that graph and lowers it towards efficient kernels. Finally, those kernels execute on the target hardware: CPU vector units, GPU streaming multiprocessors, matrix engines, memory hierarchies, and vendor libraries.

A forward pass moves down the stack, with each arrow showing the tool that bridges one layer to the next.

These are the basic mechanics of PyTorch inference. The pleasant research interface remains Python, but the performance-critical path is progressively pushed downward. First into C++. Then into graphs. Then into compiler transformations. Then into generated kernels or tuned libraries. The goal is to keep the model easy to write while making the repeated execution of that model look less like a dynamic Python program and more like a carefully scheduled computation.

For LLM serving, this compiler stack is only part of the story. Large language models add their own systems problems: batching requests, managing the KV cache, choosing attention kernels, quantising weights, arranging memory layouts, and keeping expensive hardware busy during both prefill and decode. Frameworks such as vLLM, TensorRT-LLM, llama.cpp, and others often specialise heavily around these serving concerns.

But the central idea is the same. Inference performance comes from turning flexibility into structure. PyTorch begins with a model written as ordinary Python. The inference stack tries to recover the plan hidden inside that Python program, optimise it, and lower it to the hardware without asking the user to write kernels by hand. That is the bargain PyTorch is trying to strike: keep the programming model expressive, but make the execution path increasingly static, compiled, and hardware-aware.