
How HPC scales AI for fire and flood detection on EuroHPC systems

By Jaime Santos (AiTecServ)

 

 

As climate-related disasters intensify in both frequency and severity, reliable early-detection systems are critical for protecting communities, infrastructure and ecosystems. With support from EPICURE, AiTecServ, an AI company based in Portugal, has developed computer vision models to detect fire, smoke, and floodwater from camera feeds in real time, automatically, accurately, and at scale.

 

By leveraging high-performance computing (HPC), the project addresses the challenge of accurately detecting fire and flood events while ensuring that model training scales efficiently on modern systems. The goal is to support early warning systems that help mitigate the impact of climate-related disasters.

 

Fire and flood detection using artificial intelligence (AI) can reduce response times and potential damage but requires high-quality data and substantial computational power. As datasets and model sizes grow, traditional training environments quickly become limiting.

 

To overcome these constraints, AiTecServ used EuroHPC systems to ensure that detection models are accurate, scalable, reproducible, and ready for real-world deployment.

 

 

 

Scientific and technical challenges

 

Training large deep learning models efficiently across multiple HPC nodes introduces significant technical challenges.

 

The project had to address:

 

  • GPU memory constraints;
  • Instability in multi-node Distributed Data Parallel (DDP) training;
  • Communication bottlenecks across nodes;
  • NCCL communication issues;
  • Efficient resource allocation on HPC systems.
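To make the GPU memory constraint concrete: one common way to keep a large effective batch size when each GPU can only hold a small micro-batch is gradient accumulation. The article does not state which techniques the team used, so the sketch below is only an illustration of the arithmetic. The function name `plan_batches` and all numbers are hypothetical.

```python
def plan_batches(global_batch, world_size, max_micro_batch):
    """Split a target global batch across GPUs and accumulation steps.

    Hypothetical helper for illustration only; not taken from the project.
    Satisfies: micro_batch * accum_steps * world_size == global_batch.
    """
    if global_batch <= 0 or world_size <= 0 or max_micro_batch <= 0:
        raise ValueError("all arguments must be positive integers")
    if global_batch % world_size != 0:
        raise ValueError("global batch must divide evenly across GPUs")

    # Samples each GPU must process per optimizer step
    per_gpu = global_batch // world_size

    # Find the fewest accumulation steps whose micro-batch fits in memory
    # and divides the per-GPU share exactly
    for accum_steps in range(1, per_gpu + 1):
        micro_batch = per_gpu // accum_steps
        if per_gpu % accum_steps == 0 and micro_batch <= max_micro_batch:
            return {
                "micro_batch": micro_batch,
                "accum_steps": accum_steps,
                "effective_batch": micro_batch * accum_steps * world_size,
            }
    # Unreachable: accum_steps == per_gpu always yields micro_batch == 1
    raise RuntimeError("no valid plan found")


if __name__ == "__main__":
    # Assumed example: 16 GPUs, at most 8 samples fit on one GPU at a time
    print(plan_batches(global_batch=512, world_size=16, max_micro_batch=8))
```

The trade-off is that each optimizer step takes several forward and backward passes, so memory pressure falls at the cost of wall-clock time per step.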

 

 

 

EuroHPC resources and EPICURE support

 

EPICURE provided expert guidance on distributed deep learning and HPC optimisation to help overcome these limitations. The collaboration focused on resolving memory and stability issues, enabling efficient multi-node training, and improving NCCL performance. With customised SLURM configurations and refined Python scripts, the team improved training robustness and performance.
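As an illustration of how SLURM configuration and Python scripts typically meet in multi-node training, the sketch below maps the variables SLURM sets for each task onto the ones PyTorch's default `env://` initialisation reads. It assumes one task per GPU launched with `srun`. The helper name `ddp_env_from_slurm`, the port, and the default NCCL log level are assumptions for illustration, not the project's actual scripts.

```python
import os


def ddp_env_from_slurm(master_addr, env=None, master_port=29500):
    """Build torch.distributed-style variables from SLURM task variables.

    Hypothetical helper; assumes one SLURM task per GPU.
    """
    env = os.environ if env is None else env

    rank = int(env["SLURM_PROCID"])         # global task index
    world_size = int(env["SLURM_NTASKS"])   # total number of tasks
    local_rank = int(env["SLURM_LOCALID"])  # task index within its node

    if not 0 <= rank < world_size:
        raise ValueError("rank %d outside world size %d" % (rank, world_size))

    return {
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
        "LOCAL_RANK": str(local_rank),
        # The batch script usually resolves the first node's hostname, e.g.
        # scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        # Keep any level already set; INFO is more verbose when debugging
        "NCCL_DEBUG": env.get("NCCL_DEBUG", "WARN"),
    }


if __name__ == "__main__":
    # Simulated environment for task 5 of 8 (values are made up)
    fake = {"SLURM_PROCID": "5", "SLURM_NTASKS": "8", "SLURM_LOCALID": "1"}
    print(ddp_env_from_slurm("node001", env=fake))
```

In a training script, the returned values would be applied with `os.environ.update(...)` before calling `torch.distributed.init_process_group(backend="nccl")`. Returning a dictionary rather than mutating the environment directly keeps the mapping easy to test without a cluster.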

 

This enabled the team to run large-scale distributed training on MareNostrum 5, accelerating experimentation and evaluating the scalability limits of their models.
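Scalability limits of this kind are usually expressed as speedup and parallel efficiency relative to the smallest run. The sketch below shows that calculation. The function name `scaling_report` is hypothetical and the throughput figures are invented for illustration; they are not measurements from the project.

```python
def scaling_report(throughput_by_gpus):
    """Return (gpus, speedup, efficiency) tuples relative to the smallest run.

    throughput_by_gpus maps a GPU count to measured samples per second.
    """
    if not throughput_by_gpus:
        raise ValueError("need at least one measurement")

    base_gpus = min(throughput_by_gpus)
    base_throughput = throughput_by_gpus[base_gpus]

    report = []
    for gpus in sorted(throughput_by_gpus):
        # Speedup: how much faster than the baseline run
        speedup = throughput_by_gpus[gpus] / base_throughput
        # Efficiency: speedup divided by the increase in GPU count
        efficiency = speedup / (gpus / base_gpus)
        report.append((gpus, speedup, efficiency))
    return report


if __name__ == "__main__":
    # Invented numbers, for illustration only
    measurements = {4: 1000.0, 8: 1900.0, 16: 3400.0}
    for gpus, speedup, efficiency in scaling_report(measurements):
        print("%3d GPUs: speedup %.2f, efficiency %.0f%%"
              % (gpus, speedup, efficiency * 100))
```

Efficiency that falls as GPUs are added typically points to communication overhead between nodes, which is where NCCL tuning becomes relevant.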

 

 

A partially submerged car sits in a flooded urban street, surrounded by high water levels that have disrupted traffic and infrastructure. The image illustrates the severe impact of flooding events on cities, transportation and public safety.

 

 

Results and impact for public safety

 

These optimisations enabled stable deep learning training on EuroHPC systems, improved GPU utilisation, and significantly reduced memory-related failures. The team now has a reliable, reproducible training pipeline capable of scaling efficiently, along with deeper insights into performance bottlenecks and communication behaviour in distributed AI workloads.

 

Beyond the technical achievements, the project illustrates the societal relevance of combining HPC and AI. Improved AI models for fire and flood detection can support faster response times, enhance disaster preparedness and contribute to more resilient cities and regions.

 

The Scaling Deep Neural Networks for Public Safety project demonstrates how HPC can support scalable AI for real-world applications, strengthening the role of EuroHPC infrastructure in public safety.

 

The project is now moving from research to real-world use. The technology is being prepared for deployment in four pilot projects: two in Portugal, focused on asset protection for the insurance and forest industries, and two with governmental entities in Spain for civil protection use cases.

 

 

 

Next steps for the project

 

Building on these results, the next phase will focus on scaling experiments further, improving inference efficiency, and integrating with real-time or near-real-time fire monitoring systems. The team also plans to explore additional datasets and architectures, while refining energy-efficient training strategies on HPC systems.

 

To learn more about the Scaling Deep Neural Networks for Public Safety project, visit the project page on the European HPC application support portal and explore the product page on the AiTecServ website.

 

 

Reference: Cheriyan, J. and Santos, J. (2025). “Scalable Detection of Environmental Events on EuroHPC MeluXina”, Procedia Computer Science, 267, pp. 246–255.
