The last 18 months with EPICURE: why ask for help?
by João Barbosa (IT4Innovations National Supercomputing Center)
If you’re about to submit a request for computational resources, or have just received access, select Application Support (EPICURE) in your request and you will receive the EPICURE link with your award; you can also apply later via our portal. You may find that the quickest progress comes from not tackling the most complex parts alone.
If you’re running on EuroHPC systems and something feels harder than it should (porting to GPUs, scaling past a stubborn node count, keeping ML training stable at scale), EPICURE was built for exactly that moment. Over the last 18 months we treated advanced support as a collaboration, not a queue, and in many cases the results spoke for themselves: faster starts, reproducible runs, and measurable improvements that carried beyond the engagement.
EPICURE in practice: speed and outcomes
Our approach is straightforward. From the instant a request arrives, the clock starts ticking. On average, projects move from “submitted” to a named lead partner in about 2.9 days, and into their first working meeting roughly a week later. Engagements typically last around three months, with approximately two person-months of focused effort applied where it matters most: profiling first, changing code second, and validating continuously.
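To make “profiling first” concrete, here is a minimal sketch in Python using the standard library’s cProfile; the run_step function in the usage comment is a hypothetical stand-in for whatever a project’s hot loop actually is.

    import cProfile
    import io
    import pstats

    def profile_hotspots(func, *args, top=10, **kwargs):
        """Run func under cProfile and print the top cumulative-time hotspots."""
        prof = cProfile.Profile()
        result = prof.runcall(func, *args, **kwargs)
        report = io.StringIO()
        pstats.Stats(prof, stream=report).sort_stats("cumulative").print_stats(top)
        print(report.getvalue())
        return result

    # Hypothetical usage: profile one step before changing anything.
    # profile_hotspots(run_step, batch)

The point is the ordering: a report like this, captured before any edit, turns an engagement’s first week into evidence rather than guesswork.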
Users ultimately gave the assistance a 4.7/5 rating for helpfulness and responsiveness and a 4.8/5 rating for recommendation. In practice, those scores translate into fewer lost weeks, fewer enigmatic regressions, and code that keeps working when the system or toolchain inevitably changes.
What changed for EPICURE-supported projects?
What changed for the projects we touched was disciplined engineering delivered quickly. Fragile builds that only worked on a lucky node were transformed into portable, version-locked environments, often containerized, allowing teams to iterate without chasing ABI gremlins.
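As an illustration of what “version-locked” buys you in practice, the sketch below fails fast when the runtime environment drifts from the pinned one. The package names and versions are invented for the example; a real project would generate the pins from its lock file.

    from importlib.metadata import PackageNotFoundError, version

    # Illustrative pins only; real values come from the project's lock file.
    PINNED = {"numpy": "1.26.4", "mpi4py": "3.1.6"}

    def check_environment(pins=PINNED):
        """Raise immediately if any installed version drifts from its pin."""
        drift = {}
        for name, wanted in pins.items():
            try:
                found = version(name)
            except PackageNotFoundError:
                found = None
            if found != wanted:
                drift[name] = (wanted, found)
        if drift:
            raise RuntimeError(f"Environment drift (wanted, found): {drift}")

Run at job start, a guard like this converts a mysterious mid-run ABI failure into a one-line error before any node-hours are spent.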
Jobs that quietly left GPUs idle were re-launched with topology-aware bindings and telemetry that made under-utilisation obvious. Memory blowups that initially appeared to be “bad luck at scale” became tractable once we mapped batch size, staging buffers, and host–device transfers to the realities of each machine. And for groups pushing into AI, the gains often came from everything surrounding the kernels: staging data so the filesystem kept up, launching in ways that respected the node layout, and choosing precision modes that preserved scientific validity while still allowing throughput to climb.
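As a sketch of what topology-aware launching and cheap telemetry can look like, the snippet below assumes a Slurm launch and PyTorch; SLURM_LOCALID is Slurm’s per-node task index, and everything else is illustrative rather than a recipe for any particular machine.

    import os
    import torch

    def bind_local_gpu():
        """Pin each Slurm task to the GPU matching its local rank on the node."""
        local_rank = int(os.environ.get("SLURM_LOCALID", "0"))
        torch.cuda.set_device(local_rank)
        return local_rank

    def log_gpu_memory(tag):
        """Cheap telemetry that makes idle or leaking GPUs visible in the logs."""
        dev = torch.cuda.current_device()
        print(f"[{tag}] GPU {dev}: "
              f"{torch.cuda.memory_allocated(dev) / 2**30:.2f} GiB allocated, "
              f"{torch.cuda.memory_reserved(dev) / 2**30:.2f} GiB reserved")

    # Precision modes: bfloat16 autocast often lifts throughput, but any such
    # change should be validated against a full-precision baseline, e.g.:
    # with torch.autocast("cuda", dtype=torch.bfloat16):
    #     loss = model(batch)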
Across 135 support requests, that rhythm repeated. Not every project finishes within an evaluation window, but the cadence remains consistent: days to get started, weeks to turn measurements into changes, and months to lock in improvements and hand over artifacts the team can keep.
Just as important, fixes stopped being one-offs. As patterns emerged (collectives stalling on a particular fabric, GPU occupancy sinking for a known reason, I/O saturating in predictable ways), we wrote them down: Slurm templates for actual node topologies, MPI and NCCL settings that help far more often than they hurt, memory-tuning checklists that prevent the classic accidental out-of-memory, and known-good container stacks pairing CUDA or ROCm with specific framework versions. The next project began further ahead because the last one shared what it had learned.
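To give a flavour of those shared settings, here is a hedged Python sketch that applies conservative NCCL-related defaults before joining a process group. The values shown are placeholders, the right ones are machine-specific, and the sketch assumes the launcher has already exported the usual rendezvous variables (MASTER_ADDR and friends).

    import os
    import torch.distributed as dist

    # Placeholder defaults; a site's known-good template supplies real values.
    NCCL_DEFAULTS = {
        "NCCL_DEBUG": "WARN",            # raise to INFO when diagnosing stalls
        # "NCCL_SOCKET_IFNAME": "hsn0",  # the interface name is site-specific
    }

    def init_distributed(defaults=NCCL_DEFAULTS):
        """Apply conservative NCCL defaults, then join the process group."""
        for key, value in defaults.items():
            os.environ.setdefault(key, value)  # never override site settings
        dist.init_process_group(backend="nccl")

The setdefault is the design choice: a template should lose to anything the site or the user has already set deliberately.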
Application support for projects of all sizes
If you’re wondering whether your problem is big enough, that’s the point: you don’t need to be a lighthouse code to benefit. We’ve worked shoulder-to-shoulder with teams modernising legacy CPU-only solvers, with SMEs who need reliable speedups without a six-month detour into tooling, and with researchers whose ML pipelines are brilliant but brittle at supercomputing scale. The heterogeneity across EuroHPC systems isn’t going away, and it shouldn’t; EPICURE can offer a more straightforward path through it, guided by people who have already hit the same walls this year.
How to engage with EPICURE
The best time to involve us is during your allocation request: choose EPICURE Support so that you receive the EPICURE link with your award. If you decide to apply later, you can do so via our support portal. A short, focused performance clinic can turn “it’s slow” into a specific hotspot with concrete next steps. And if your work spans multiple centres or architectures, that’s fine too; collaboration across sites is normal for us, and it’s part of why our solutions generalise.
If you take one message from the last eighteen months, make it this: advanced support isn’t a last resort; it can be an accelerator. Ask for it when you’re planning a port, when you’re designing a scaling experiment, or when you’ve hit the kind of problem that feels like it will eat the next month. We bring measurement, playbooks, and guardrails, and we aim to leave you with artifacts you can reuse after the engagement ends.
Partners
We gratefully acknowledge the collaboration of the EPICURE partners (alphabetical order): Academic Computer Centre Cyfronet AGH (CYFRONET); Barcelona Supercomputing Center – Centro Nacional de Supercomputación (BSC); CINECA Consorzio Interuniversitario; CINES – Centre Informatique National de l’Enseignement Supérieur; CSC – Tieteen tietotekniikan keskus Oy; Danish e-Infrastructure Consortium (DeiC/DTU); Forschungszentrum Jülich GmbH (FZJ); GENCI – Grand Équipement National de Calcul Intensif; INESC TEC – Instituto de Engenharia de Sistemas e Computadores, Tecnologia e Ciência; Institut informacijskih znanosti (IZUM); Jožef Stefan Institute (JSI); Kungliga Tekniska högskolan (KTH); LuxProvide S.A.; Sofia Tech Park JSC; Universiteit Antwerpen (UAntwerpen); and VSB – Technical University of Ostrava (IT4I@VSB).