A sweeping architectural overhaul is underway inside one of the most widely used open-source AI inference engines, and the implications for developers, hardware vendors, and the broader machine learning community could be profound. The project, detailed in a GitHub discussion posted by core maintainer Diego Devesa (slaren), proposes replacing the current computational graph management system in llama.cpp with a fundamentally new approach—one that decouples model logic from backend execution and introduces a persistent, reusable graph scheduler.
For those unfamiliar, llama.cpp is the C/C++ inference framework originally created by Georgi Gerganov that allows large language models to run efficiently on consumer hardware, including CPUs, Apple Silicon, and various GPUs. It has become a cornerstone of local AI deployment, powering dozens of downstream applications and serving as a reference implementation for quantized model inference. Any major change to its internals sends ripples across the open-source AI world.
The Problem With the Current Architecture
According to Devesa’s detailed technical writeup on GitHub, the existing system rebuilds the computational graph on every inference call. Each time a user sends a prompt or generates a token, llama.cpp constructs a new graph from scratch, allocates memory, assigns operations to backends, and then executes. This approach, while functional, introduces overhead that becomes increasingly painful as models grow larger and multi-backend configurations become more common.
The current design also tightly couples model definition code with backend-specific concerns. Developers writing model architectures in llama.cpp must be aware of memory allocation strategies, tensor placement across devices, and graph splitting logic. This makes adding new models or new hardware backends more difficult than it needs to be. As Devesa writes, the goal is to make it so that “the model graph code does not need to know about the backends at all.”
A Persistent Graph That Survives Between Calls
The proposed replacement centers on what Devesa calls a “graph scheduler” (referred to internally as ggml_sched_v2 or the new scheduler design). Rather than rebuilding everything each inference step, the new system would create the computational graph once and then reuse it across multiple calls, updating only the parts that change—such as input tokens or KV-cache positions. This persistent graph model eliminates redundant work and opens the door to more aggressive optimization.
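To make the build-once, run-many idea concrete, here is a minimal Python sketch. It is not llama.cpp code: `PersistentGraph`, `Node`, and the toy two-operation graph are hypothetical names chosen for illustration, and the real proposal operates on ggml tensors rather than floats.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Node:
    name: str                  # name of the value this node produces
    fn: Callable[..., float]   # the operation itself
    inputs: List[str]          # names of the values it consumes


class PersistentGraph:
    """Hypothetical graph that is built once and reused across calls."""

    def __init__(self) -> None:
        self.nodes: List[Node] = []
        self.build_count = 0

    def build(self) -> None:
        # Graph construction happens a single time.
        self.build_count += 1
        self.nodes = [
            Node("scaled", lambda x, w: x * w, ["x", "w"]),
            Node("out", lambda s, b: s + b, ["scaled", "b"]),
        ]

    def run(self, inputs: Dict[str, float]) -> float:
        if not self.nodes:
            self.build()
        values = dict(inputs)  # only the input values differ between calls
        for node in self.nodes:  # nodes are stored in topological order
            values[node.name] = node.fn(*(values[i] for i in node.inputs))
        return values["out"]


graph = PersistentGraph()
results = [graph.run({"x": float(t), "w": 2.0, "b": 1.0}) for t in range(3)]
```

Three calls reuse the same node list, so `build_count` stays at 1 while `results` changes with the input.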
The new scheduler would handle all backend assignment, memory planning, and graph partitioning internally. Model code would simply declare the mathematical operations needed—matrix multiplications, attention computations, normalization steps—without specifying where or how they execute. The scheduler then analyzes the graph, determines optimal placement of each operation across available backends (CPU, CUDA, Metal, Vulkan, etc.), and manages data transfers between devices automatically.
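One way to picture this separation is a placement pass that runs over a plain list of operation names. The sketch below assumes a hypothetical capability table and a fixed priority order; the actual scheduler's placement policy is not described in this detail in the discussion.

```python
from typing import Dict, List, Set

# Hypothetical capability table: which operations each backend can run.
BACKENDS: Dict[str, Set[str]] = {
    "gpu": {"matmul", "softmax", "add"},
    "cpu": {"matmul", "softmax", "add", "rms_norm", "rope"},
}
# Assumed policy: prefer the GPU, fall back to the CPU.
PRIORITY: List[str] = ["gpu", "cpu"]


def assign_backends(ops: List[str]) -> List[str]:
    """Return one backend name per operation, in the same order as ops."""
    placement = []
    for op in ops:
        for backend in PRIORITY:
            if op in BACKENDS[backend]:
                placement.append(backend)
                break
        else:
            raise ValueError(f"no backend supports {op!r}")
    return placement


# The model side only declares operations; it never names a device.
model_ops = ["rms_norm", "matmul", "rope", "softmax", "matmul", "add"]
placement = assign_backends(model_ops)
```

The model description (`model_ops`) contains no device information; everything device-specific lives in the table and the policy.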
Why This Matters for Multi-GPU and Heterogeneous Hardware
One of the most significant motivations for this redesign is better support for multi-backend execution. Today, running a model split across, say, a GPU and a CPU requires careful manual configuration and relies on graph-splitting logic that is fragile and hard to maintain. The new scheduler would treat multi-device execution as a first-class concern, automatically inserting data copy operations where needed and optimizing the execution order to minimize synchronization stalls.
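Automatic copy insertion can be sketched as a single pass over an already-placed operation list. The tuple layout, the `tensor@backend` naming, and the assumption that graph inputs are available on every backend are all illustrative choices, not details from the proposal.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, str, List[str]]  # (output name, backend, input names)


def insert_copies(ops: List[Op]) -> List[Op]:
    """Insert a copy step wherever an input lives on a different backend."""
    location: Dict[str, str] = {}            # tensor name -> backend holding it
    copied: Dict[Tuple[str, str], str] = {}  # (tensor, backend) -> copy name
    scheduled: List[Op] = []
    for name, backend, inputs in ops:
        resolved = []
        for src in inputs:
            # Assumption: graph inputs (not produced by any op) are available anywhere.
            src_backend = location.get(src, backend)
            if src_backend == backend:
                resolved.append(src)
                continue
            key = (src, backend)
            if key not in copied:  # copy each tensor to a backend at most once
                copy_name = f"{src}@{backend}"
                scheduled.append((copy_name, f"copy:{src_backend}->{backend}", [src]))
                copied[key] = copy_name
            resolved.append(copied[key])
        scheduled.append((name, backend, resolved))
        location[name] = backend
    return scheduled


ops = [
    ("norm", "cpu", ["x"]),
    ("q", "gpu", ["norm"]),
    ("k", "gpu", ["norm"]),
    ("attn", "gpu", ["q", "k"]),
    ("out", "cpu", ["attn"]),
]
schedule = insert_copies(ops)
copies = [step[0] for step in schedule if step[1].startswith("copy:")]
```

Here `norm` is consumed twice on the GPU but copied only once, which is the kind of bookkeeping that is error-prone when written by hand in model code.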
This is particularly relevant as the hardware environment for local AI inference becomes more diverse. Users are running models on combinations of NVIDIA GPUs, AMD GPUs via Vulkan and ROCm, Apple M-series chips via Metal, and even specialized NPU accelerators. A clean separation between model logic and execution strategy makes it far easier to support new hardware without rewriting model code. As Devesa notes in the discussion thread, the current approach of having model code “manually manage tensor placement” is becoming unsustainable as the number of supported backends grows.
Static Shapes, Dynamic Inputs: A Compiler-Inspired Approach
The design borrows concepts from modern ML compilers like XLA and TVM. By treating the graph as a static structure with dynamic input values, the scheduler can perform optimizations that are impossible when the graph is rebuilt every time. These include operation fusion (combining multiple small operations into a single kernel launch), memory reuse planning (allocating the same buffer for tensors whose lifetimes don’t overlap), and pre-computed execution schedules that avoid runtime analysis overhead.
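Operation fusion is easiest to see on a linear chain of operations. The following sketch merges consecutive elementwise operations into one fused step; the operation names and the `ELEMENTWISE` set are assumptions, and a real fusion pass would work on the graph's data dependencies rather than on a flat list.

```python
from typing import List

# Assumed set of operations that can be merged into a single kernel.
ELEMENTWISE = {"add", "mul", "gelu", "scale"}


def fuse_elementwise(ops: List[str]) -> List[str]:
    """Merge each run of consecutive elementwise ops into one fused op."""
    fused: List[str] = []
    run: List[str] = []
    for op in ops:
        if op in ELEMENTWISE:
            run.append(op)
            continue
        if run:  # a non-elementwise op ends the current run
            fused.append("+".join(run))
            run = []
        fused.append(op)
    if run:
        fused.append("+".join(run))
    return fused


chain = ["matmul", "add", "gelu", "matmul", "scale"]
fused_chain = fuse_elementwise(chain)
```

Because the graph is static, a pass like this can run once at schedule time, and every later call benefits from the shorter list of kernel launches.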
Devesa’s proposal also addresses a long-standing pain point: memory allocation. The current system uses a combination of scratch buffers and manual allocation that has led to numerous bugs and performance issues over the project’s history. The new scheduler would implement a proper memory planner that analyzes the full graph, determines the minimum memory required, and assigns buffer offsets at schedule time rather than execution time. This should reduce peak memory usage and eliminate an entire class of allocation-related bugs.
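The memory planning idea can be illustrated with a small greedy allocator that assigns buffer offsets from tensor lifetimes. This is a sketch under stated assumptions: the tensor tuples, step numbers, and first-fit policy are hypothetical, and a greedy planner does not always find the true minimum.

```python
from typing import Dict, List, Tuple

Tensor = Tuple[str, int, int, int]  # (name, size in bytes, first step, last step)


def plan_memory(tensors: List[Tensor]) -> Tuple[Dict[str, int], int]:
    """Assign a buffer offset to each tensor; return the offsets and peak size."""
    placed: List[Tuple[int, int, int, int]] = []  # (offset, size, first, last)
    offsets: Dict[str, int] = {}
    peak = 0
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        # Regions held by tensors whose lifetimes overlap this one (inclusive).
        busy = sorted(
            (o, s) for o, s, f, l in placed if not (l < first or last < f)
        )
        offset = 0
        for o, s in busy:
            if offset + size <= o:
                break  # the gap before this region is large enough
            offset = max(offset, o + s)
        placed.append((offset, size, first, last))
        offsets[name] = offset
        peak = max(peak, offset + size)
    return offsets, peak


tensors = [
    ("a", 100, 0, 1),
    ("b", 50, 1, 2),
    ("c", 100, 2, 3),
    ("d", 50, 3, 4),
]
offsets, peak = plan_memory(tensors)
naive_total = sum(size for _, size, _, _ in tensors)
```

In this toy case `c` reuses the space freed by `a` and `d` reuses the space freed by `b`, so the planned peak is half of what separate allocations would need. All offsets are fixed before execution starts, which is the "schedule time rather than execution time" property described above.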
Impact on Model Developers and Downstream Projects
For developers who contribute model implementations to llama.cpp—and there are many, given the framework’s support for architectures ranging from LLaMA and Mistral to Qwen, Gemma, and Command R—the new design promises a substantially simpler programming model. Instead of worrying about which backend a tensor lives on or how to split attention heads across devices, developers would write pure computational graph code. The scheduler handles the rest.
This simplification could accelerate the pace at which new model architectures are added to the project. Currently, adding a new model requires understanding not just the model’s mathematics but also llama.cpp’s internal memory management and backend dispatch systems. Reducing that burden to just the mathematical description would lower the barrier to contribution significantly. Downstream projects like Ollama, LM Studio, Jan, and koboldcpp, which build user-facing applications on top of llama.cpp, would benefit from improved performance and stability without needing to change their own code.
Community Response and Open Questions
The GitHub discussion has attracted significant attention from the llama.cpp contributor community. Several participants raised questions about how the new scheduler would handle models with genuinely dynamic graph structures—such as mixture-of-experts architectures where different experts are activated for different tokens. Devesa acknowledged this challenge, noting that the initial implementation would focus on models with static graph shapes (which covers the vast majority of current transformer architectures) and that dynamic graphs would be addressed in a subsequent phase.
Other contributors asked about the transition plan. A change this fundamental cannot be landed in a single pull request without breaking existing functionality. Devesa indicated that the new scheduler would be developed alongside the existing one, with a gradual migration path that allows both systems to coexist during the transition period. This approach mirrors how other large open-source projects handle architectural migrations—running old and new systems in parallel until the new one is proven stable.
Performance Implications and the Road Ahead
Concrete benchmarks are not yet available, since the proposal is still in the design and early implementation phase, but the expected gains fall into three areas. Eliminating per-call graph construction overhead should improve token generation speed, particularly for interactive use cases where latency matters. Better memory planning should allow larger models to fit in the same amount of VRAM. And improved multi-backend scheduling should make split-device configurations faster and more reliable.
The timing of this proposal is notable. The open-source inference space has grown increasingly competitive, with projects like vLLM dominating server-side deployment and new entrants like MLX (from Apple) and TensorRT-LLM (from NVIDIA) targeting specific hardware platforms. llama.cpp’s strength has always been its portability and its ability to run on modest hardware, but maintaining that advantage requires continuous architectural investment. This redesign represents exactly that kind of investment—a willingness to rethink foundational assumptions in order to stay relevant as models and hardware evolve.
What This Signals About the State of Local AI Inference
The broader significance of this proposal extends beyond llama.cpp itself. It reflects a maturation of the open-source AI inference stack. Early frameworks were built quickly to meet urgent demand—people wanted to run LLaMA on their laptops, and they wanted it yesterday. The resulting code was functional but carried technical debt. Now, with the initial rush subsiding and the requirements becoming clearer, projects like llama.cpp are entering a phase of deliberate re-architecture.
This pattern is familiar from the history of software infrastructure. Linux, PostgreSQL, and other foundational open-source projects all went through similar phases—rapid initial development followed by careful internal restructuring that preserved external interfaces while dramatically improving internals. If the llama.cpp team executes this transition well, the project could emerge with an architecture capable of supporting not just today’s models but the significantly larger and more complex architectures that are already on the horizon. For the thousands of developers and millions of users who depend on llama.cpp for local AI inference, that prospect is worth watching closely.