Distributed LLM Inference on Deucalion’s ARM Partition with EPICURE Support
By Alícia Oliveira (INESC TEC / Deucalion)
Most Large Language Model (LLM) inference systems are designed for GPU clusters, especially in multi-node deployments. Yet ARM-based compute nodes are becoming increasingly relevant in research and production settings because of their energy efficiency. A key challenge is making inference engines and orchestration stacks hardware-agnostic enough to scale in CPU-only environments.
The main goal of this project is to scale LLM inference across multiple nodes of Deucalion’s ARM partition, combining vLLM (the inference serving layer) with Ray (an orchestration and distributed execution framework) to establish a baseline for distributed, CPU-only inference.
This publication is aimed at professionals with experience in Machine Learning and High-Performance Computing who are interested in scaling LLM inference on CPU-based clusters.
EPICURE support – What we did
We began by installing and validating vLLM on a single ARM CPU node. After that, we focused on adapting the vLLM code for multi-node execution using Ray. We experimented with two strategies for distributing the workload across machines: pipeline parallelism, in which consecutive groups of model layers run on separate nodes, and tensor parallelism, in which the computations within each layer are split across multiple nodes. Additionally, we tested a hybrid strategy that combines both types of parallelism.
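To make the setup concrete, a hybrid launch of this kind can be sketched as follows. This is an illustrative sketch, not a transcript of our exact commands: the head-node address is a placeholder, and the precise flag names depend on the vLLM and Ray versions installed.

```shell
# 1) Start Ray on the head node (port is illustrative):
ray start --head --port=6379

# 2) On each worker node, join the cluster:
ray start --address=<head-node-ip>:6379

# 3) From the head node, launch vLLM with hybrid parallelism:
#    2 pipeline stages x 2-way tensor parallelism = 4 nodes total.
vllm serve meta-llama/Llama-3.1-8B \
  --distributed-executor-backend ray \
  --pipeline-parallel-size 2 \
  --tensor-parallel-size 2
```

Setting both parallel sizes to 1 recovers the single-node case, which is how we validated the installation before scaling out.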
To gain insight into the sources of performance gains and losses, we collected measurements during our experiments. We tracked CPU utilization, memory usage, and the growth of the key-value (KV) cache, which stores intermediate attention data to speed up text generation as sequences get longer. Additionally, we monitored inter-node communication to analyze how data movement and synchronization influenced scaling. This approach enabled us to distinguish between limitations caused by computation and those arising from memory pressure. It also helped us understand how scheduling choices in pipeline execution contributed to communication overhead.
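KV-cache growth in particular can be reasoned about analytically before it is measured. The sketch below estimates the cache footprint for a Llama-3.1-8B-like architecture; the default values are the published model configuration (32 layers, 8 key-value heads under grouped-query attention, head dimension 128, 16-bit values), and the function itself is our illustration, not vLLM code.

```python
def kv_cache_bytes(seq_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Bytes of KV cache held for one sequence of length seq_len.

    Each layer stores two tensors (K and V) of shape
    [seq_len, n_kv_heads, head_dim] at dtype_bytes per element.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return per_token * seq_len

# Every generated token adds 128 KiB to the cache; an 8K-token
# sequence therefore holds 1 GiB of KV cache on its own.
print(kv_cache_bytes(1))             # 131072 bytes = 128 KiB
print(kv_cache_bytes(8192) / 2**30)  # 1.0 GiB
```

Numbers like these explain why memory pressure, not raw compute, often became the binding constraint as sequences grew longer or more requests ran concurrently.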
Value delivered through EPICURE support
We served Llama-3.1-8B on a 4-node ARM CPU setup and improved end-to-end throughput by a factor of 3.7. This gain came from optimizing CPU threading in PyTorch, ensuring stable KV-cache allocation under concurrent requests, and tuning pipeline micro-batching (i.e., splitting requests into smaller chunks to keep pipeline stages busy) to minimize unnecessary synchronization and communication overhead.
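The effect of micro-batching can be illustrated with the classic pipeline-bubble model: with p pipeline stages and a batch split into m micro-batches, the fraction of time each stage does useful work is m / (m + p − 1). The numbers below are illustrative of the trade-off, not measurements from our runs.

```python
def pipeline_efficiency(stages: int, micro_batches: int) -> float:
    """Ideal fraction of stage time spent computing (ignores
    communication cost): m / (m + p - 1) for m micro-batches
    flowing through p pipeline stages."""
    return micro_batches / (micro_batches + stages - 1)

# With 4 pipeline stages, more micro-batches shrink the idle "bubble" --
# but each extra micro-batch also adds per-chunk synchronization overhead,
# which is why the split size has to be tuned rather than maximized.
for m in (1, 4, 16, 64):
    print(f"m={m:>2}: efficiency {pipeline_efficiency(4, m):.3f}")
```

With a single batch (m=1), three of four stages sit idle at any time; at m=16 the ideal efficiency already exceeds 84%, after which communication overhead starts to dominate the remaining gains.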
Overall, these results establish a foundation for future optimizations, focusing on enhancing memory efficiency, reducing latency, and refining pipeline execution based on inter-node communication costs. Additionally, the orchestration overhead we observed raises an important question for future research: Is Ray the best orchestration option for vLLM-based LLM inference at scale?