Distributed LLM Inference on Deucalion’s ARM Partition with EPICURE Support
By Alícia Oliveira (INESC TEC / Deucalion)
Most Large Language Model (LLM) inference systems are designed for GPU clusters, especially in multi-node deployments. Yet ARM-based compute nodes are becoming increasingly relevant in research and production settings because of their energy efficiency. A key challenge is making inference engines and orchestration stacks hardware-agnostic enough to scale in CPU-only environments.
The main goal of this project is to scale LLM inference across multiple nodes of Deucalion’s ARM partition, combining vLLM (the inference serving layer) with Ray (an orchestration and distributed execution framework) to establish a baseline for distributed, CPU-only inference.
This publication is aimed at professionals with experience in Machine Learning and High-Performance Computing who are interested in scaling LLM inference on CPU-based clusters.
EPICURE support – What we did
We began by installing and validating vLLM on a single ARM CPU node. After that, we focused on adapting the vLLM code for multi-node execution using Ray. We experimented with two strategies for distributing the workload across machines: pipeline parallelism, in which consecutive groups of model layers run on separate nodes, and tensor parallelism, in which the computations within each layer are split across multiple nodes. Additionally, we tested a hybrid strategy that combines both types of parallelism.
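To make the setup concrete, a hybrid launch of this kind can be sketched as follows. This is an illustrative sketch, not a transcript of our exact commands: the head-node address is a placeholder, and the precise flag names depend on the vLLM and Ray versions installed.

```shell
# 1) Start Ray on the head node (port is illustrative):
ray start --head --port=6379

# 2) On each worker node, join the cluster:
ray start --address=<head-node-ip>:6379

# 3) From the head node, launch vLLM with hybrid parallelism:
#    2 pipeline stages x 2-way tensor parallelism = 4 nodes total.
vllm serve meta-llama/Llama-3.1-8B \
  --distributed-executor-backend ray \
  --pipeline-parallel-size 2 \
  --tensor-parallel-size 2
```

Setting both parallel sizes to 1 recovers the single-node case, which is how we validated the installation before scaling out.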
To gain insight into the sources of performance gains and losses, we collected measurements during our experiments. We tracked CPU utilization, memory usage, and the growth of the key-value (KV) cache, which stores intermediate attention data to speed up text generation as sequences get longer. Additionally, we monitored inter-node communication to analyze how data movement and synchronization influenced scaling. This approach enabled us to distinguish between limitations caused by computation and those arising from memory pressure. It also helped us understand how scheduling choices in pipeline execution contributed to communication overhead.
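KV-cache growth in particular can be reasoned about analytically before it is measured. The sketch below estimates the cache footprint for a Llama-3.1-8B-like architecture; the default values are the published model configuration (32 layers, 8 key-value heads under grouped-query attention, head dimension 128, 16-bit values), and the function itself is our illustration, not vLLM code.

```python
def kv_cache_bytes(seq_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Bytes of KV cache held for one sequence of length seq_len.

    Each layer stores two tensors (K and V) of shape
    [seq_len, n_kv_heads, head_dim] at dtype_bytes per element.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return per_token * seq_len

# Every generated token adds 128 KiB to the cache; an 8K-token
# sequence therefore holds 1 GiB of KV cache on its own.
print(kv_cache_bytes(1))             # 131072 bytes = 128 KiB
print(kv_cache_bytes(8192) / 2**30)  # 1.0 GiB
```

Numbers like these explain why memory pressure, not raw compute, often became the binding constraint as sequences grew longer or more requests ran concurrently.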
Value delivered through EPICURE support
We served Llama-3.1-8B on a 4-node ARM CPU setup and improved end-to-end throughput by a factor of 3.7. This gain came from optimizing CPU threading in PyTorch, ensuring stable KV-cache allocation under concurrent requests, and tuning pipeline micro-batching (i.e., splitting requests into smaller chunks to keep pipeline stages busy) to minimize unnecessary synchronization and communication overhead.
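The effect of micro-batching can be illustrated with the classic pipeline-bubble model: with p pipeline stages and a batch split into m micro-batches, the fraction of time each stage does useful work is m / (m + p − 1). The numbers below are illustrative of the trade-off, not measurements from our runs.

```python
def pipeline_efficiency(stages: int, micro_batches: int) -> float:
    """Ideal fraction of stage time spent computing (ignores
    communication cost): m / (m + p - 1) for m micro-batches
    flowing through p pipeline stages."""
    return micro_batches / (micro_batches + stages - 1)

# With 4 pipeline stages, more micro-batches shrink the idle "bubble" --
# but each extra micro-batch also adds per-chunk synchronization overhead,
# which is why the split size has to be tuned rather than maximized.
for m in (1, 4, 16, 64):
    print(f"m={m:>2}: efficiency {pipeline_efficiency(4, m):.3f}")
```

With a single batch (m=1), three of four stages sit idle at any time; at m=16 the ideal efficiency already exceeds 84%, after which communication overhead starts to dominate the remaining gains.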
Overall, these results establish a foundation for future optimizations, focusing on enhancing memory efficiency, reducing latency, and refining pipeline execution based on inter-node communication costs. Additionally, the orchestration overhead we observed raises an important question for future research: Is Ray the best orchestration option for vLLM-based LLM inference at scale?