29/01/2026

Developing a Heart Language Foundation Model with support from EPICURE

By Marjan Gusev and Dimitar Mileski (Innovation DOOEL)

 

Heart language is unique in its voice, attitude, accent, and grammar. We aim to develop a foundational heart language model (HLM) that analyzes the specifics of each heart. Creating a generative AI solution to detect and classify arrhythmia using heart language analysis is challenging, as research activities require extensive high-performance computing (HPC) resources and expertise.

 

Our business proposal is to develop a prototype heart-language model for a heart-monitoring solution in the EU, utilising wearable ECG sensors and generative AI. This solution automatically alerts outpatients to dangerous arrhythmias, a service that is not currently available.

 

The benefits include earlier hospital discharges, reduced hospitalisation time, improved patient monitoring during everyday activities, prevention of severe heart damage, improved healthcare, and increased life expectancy. Such solutions would boost our sales by reaching more customers, including large healthcare providers.

 

We certified ViewECG as a software medical device (CE Mark) for heart monitoring in 2020. During development, we implemented various signal processing and AI-based algorithms to meet the performance requirements set by medical standards. The AI Cardiologist project (part of the ELISE Open Call, funded by the EU H2020 project No 951847) delivered a proof-of-concept and prototype of a generative AI solution for detecting atrial fibrillation, while the CardioHPC experiment (part of the EU HPC Joint Undertaking through the FF4EuroHPC project under grant agreement No 951745) provided the HPC expertise required to develop machine learning (ML) algorithms, supporting the development of additional generative AI models and the heart language model.

 

Our objective is to train the algorithm using a comprehensive set of benchmark ECG data and various tokenisers, which necessitates significant computing resources. In the ELISE project, we utilised annotations rather than ECG samples, resulting in a dataset more than 125 times smaller. Developing an initial model took six months of iterative experimentation, using two modern GPUs (NVIDIA Ampere A100).

 

We understood that creating a new generative AI solution would require 100 times more resources, achievable within a reasonable timeframe only through large-scale HPC. Through the EuroHPC Benchmark Access call, we received 400 node hours over two months on the MeluXina GPU partition at LuxProvide in Luxembourg; this benchmark project was completed in January 2025. We were later granted six months of Development Access on MeluXina, completed in December 2025.

 

 

EPICURE HPC support for the HLM project

 

We requested EPICURE support to address scalability and performance challenges in running PyTorch-based foundation model training on EuroHPC infrastructure. The main challenge was migrating from single-node and DataParallel execution to an efficient multi-node, multi-GPU setup using PyTorch Distributed Data Parallel (DDP) under SLURM.

 

EPICURE provided technical guidance and concrete implementation, including a containerised training environment using Apptainer, SLURM job submission scripts, and a torchrun-based launcher that correctly configures NCCL, ranks, and world size across nodes. Additionally, EPICURE optimised the training code by tuning DDP communication (e.g. bucket_cap_mb), enabling gradient accumulation and mixed precision, and integrating profiling tools. This support enabled scalable, synchronised training on MeluXina with improved throughput, stability, and resource utilisation.
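The torchrun-based setup described above can be illustrated with a minimal sketch. Under torchrun, each worker process receives the rendezvous variables RANK, LOCAL_RANK, and WORLD_SIZE in its environment, so the training script only reads them and initialises the NCCL process group; the specific model and dimensions below are placeholders, not the project's actual code.

```python
# Minimal torchrun-compatible DDP bootstrap (illustrative sketch).
import os

def ddp_env():
    """Read the per-worker variables that torchrun exports (defaults for single-process runs)."""
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
    }

def main():
    import torch
    import torch.distributed as dist

    env = ddp_env()
    torch.cuda.set_device(env["local_rank"])      # bind one GPU per process
    dist.init_process_group(backend="nccl")       # NCCL for GPU collectives

    model = torch.nn.Linear(128, 128).cuda()      # placeholder model
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[env["local_rank"]]
    )
    # ... training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A SLURM job script would then launch one torchrun instance per node (e.g. via srun), pointing all nodes at the same rendezvous address so ranks and world size are configured consistently.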

 

 

Benefits from the EPICURE support

 

The support from EPICURE enabled optimised multi-node, multi-GPU distributed training on the MeluXina HPC platform across multiple CPU, GPU, and sliding window configurations, evaluated on 16 datasets, delivering significant HPC-level performance gains. Using the provided Apptainer container and SLURM/torchrun launcher, we achieved full GPU utilisation with correct device binding across nodes.

 

Performance analysis revealed that a sliding window size of 128 consistently delivered optimal throughput, while CPU scaling beyond 16 cores did not provide additional speedup. In contrast, GPU scaling achieved sub-linear performance gains, with two GPUs yielding a 1.6× speedup and four GPUs reaching a maximum of 1.9×, below the theoretical 4× linear scaling. This limitation is primarily due to an I/O bottleneck in the data generation pipeline during tokenizer training, which limited data feeding rates and led to underutilisation of GPU resources, compounded by pipeline stalls and synchronisation overheads.
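The sub-linear GPU scaling above can be quantified as parallel efficiency, the ratio of measured speedup to the ideal linear speedup; the figures below are taken from the reported results.

```python
def scaling_efficiency(speedup, n_gpus):
    """Parallel efficiency: measured speedup divided by ideal linear speedup."""
    return speedup / n_gpus

# Reported results: 1.6x on two GPUs, 1.9x on four GPUs.
eff_2 = scaling_efficiency(1.6, 2)   # -> 0.8  (80% efficiency)
eff_4 = scaling_efficiency(1.9, 4)   # -> 0.475 (under 50% efficiency)
```

The drop from 80% to under 50% efficiency when doubling GPUs is consistent with the I/O bottleneck described above: once the data pipeline saturates, additional GPUs spend more time waiting for input than computing.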

 

Key optimisations included reducing gradient communication overhead via DDP bucket tuning, gradient accumulation to maximise the effective batch size, and bfloat16 mixed precision to reduce the memory footprint and improve throughput. Real-time profiling with the Torch Profiler enabled the identification of CPU, GPU, and memory bottlenecks, while an efficient DataLoader design (pinned memory, persistent workers, DistributedSampler) ensured high data throughput.
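The knobs listed above can be sketched together as follows; the batch size, worker count, and bucket size are illustrative values, not the project's actual settings.

```python
# Sketch of the DDP/DataLoader optimisation knobs described above (illustrative values).
def effective_batch_size(per_gpu_batch, accum_steps, world_size):
    """Gradient accumulation multiplies the batch each optimiser step effectively sees."""
    return per_gpu_batch * accum_steps * world_size

def build_training(model, dataset, local_rank, rank, world_size):
    import torch
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # Larger buckets -> fewer, larger all-reduce calls during backward.
    ddp_model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], bucket_cap_mb=100
    )
    # DistributedSampler shards the dataset so each rank sees a disjoint slice.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(
        dataset,
        batch_size=32,
        sampler=sampler,
        pin_memory=True,            # faster host-to-device copies
        num_workers=8,
        persistent_workers=True,    # keep worker processes alive across epochs
    )
    return ddp_model, loader
```

In the training loop, the forward pass would be wrapped in torch.autocast("cuda", dtype=torch.bfloat16) for mixed precision, with the optimiser stepped only every accum_steps iterations so gradients accumulate in between.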

 

Collectively, these optimisations enabled successful experimentation across all 16 datasets with diverse model configurations, reduced wall-clock training time, and improved energy efficiency per sample, while highlighting both the strong potential of HPC for accelerating AI model development in healthcare and the need for further code optimisation to improve GPU scaling efficiency.

 

 

Learn more about the project:

https://www.ecgalert.com/software.html

https://aicardiologist.innovation.com.mk/

https://www.nature.com/articles/s41598-024-84270-x
