EPICURE Hackathon: speeding up your code
By Bert Jorissen (University of Antwerp)
Research software is often difficult to maintain. Over time, scientific software evolves organically, with new features being rapidly introduced. As a result, performance optimisation can be overlooked when the focus is on obtaining accurate results. These codes are typically written by the researchers themselves, and running them efficiently on HPC systems is not always straightforward.
The EPICURE hackathons give you, the researchers, an opportunity to collaborate directly with HPC experts to optimise your code. Over several days, we analyse your code, identify bottlenecks, and help you implement improvements. Even small optimisations made during a hackathon can lead to noticeable performance gains.
In December 2025, EPICURE organised a hackathon focused on Python optimisations. Several projects received support, whether they were written in plain Python or in more advanced combinations of languages. Below, we highlight one example and invite you to join the next EPICURE Hackathon in April 2026, dedicated to code optimisations for heterogeneous HPC environments.
Case study: GPU-accelerated binning
One of the projects at the hackathon was performing a binning operation on large 1D datasets to compute per-bin means using GPUs. A small example was provided in Python, but the main code was written in C++. The aim was to implement a binning algorithm on the GPU to reduce data movement between the CPU and GPU, while speeding up the binning process.
The input to the binning problem consists of x-values and corresponding data values. Each data value is assigned to a bin based on its x-value. The output is the mean of the data values in each bin.
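To make the problem concrete, here is a small NumPy sketch (with made-up toy data, not the project's actual dataset) that assigns values to bins and computes the per-bin means by hand:

```python
import numpy as np

# Toy data: x-positions and the values associated with them.
x = np.array([0.1, 0.4, 0.6, 0.9, 0.95])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Three equal-width bins over [0, 1).
n_bins = 3
edges = np.linspace(0.0, 1.0, n_bins + 1)

# Assign each x-value to a bin index (0, 1, or 2).
bin_idx = np.digitize(x, edges) - 1

# Per-bin sums and counts, then the mean of each bin.
sums = np.bincount(bin_idx, weights=values, minlength=n_bins)
counts = np.bincount(bin_idx, minlength=n_bins)
means = sums / counts  # per-bin means
```

With this toy data, the first bin holds one value, and the second and third bins hold two values each, so the means are simply the averages of those groups.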
Baseline implementation in Python
In Python, this can easily be computed with scipy.stats.binned_statistic(…, statistic='mean'). The first step was extracting the data from the research C++ code and analysing it using the Python function mentioned above. This provided a baseline Python implementation, which we could use to compare the results of the C++ and GPU implementations.
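A minimal baseline along these lines (again with toy data standing in for the project's real input) looks like this; binned_statistic returns the per-bin statistic, the bin edges, and the bin assignment for each point:

```python
import numpy as np
from scipy.stats import binned_statistic

x = np.array([0.1, 0.4, 0.6, 0.9, 0.95])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Per-bin means over three equal-width bins on [0, 1].
means, edges, bin_idx = binned_statistic(
    x, values, statistic='mean', bins=3, range=(0.0, 1.0)
)
```

Such a reference implementation is slow on large data, but it gives a trusted answer to validate the C++ and GPU versions against.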
Making a GPU kernel
On the GPU, a small kernel performs the data binning. In this kernel, a list of counts and sums for all bins must be maintained. The first naive implementation of the kernel used atomic operations to update the counts and sums for each bin.
A GPU has many threads running in parallel, processing different data points at the same time. With the atomic approach, a shared array of counts and sums is used by all threads, and each thread updates the counts and sums for its data points. When many data points fall into the same few bins, contention can occur: if two threads try to update the count for the same bin, one thread has to wait for the other to finish.
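The logic of the naive atomic kernel can be mimicked on the CPU with NumPy's unbuffered np.add.at, which, like an atomic add, guarantees that every contribution is applied even when many updates target the same bin. This is only a sketch of the accumulation pattern, not the actual CUDA kernel:

```python
import numpy as np

x = np.array([0.1, 0.4, 0.6, 0.9, 0.95])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n_bins = 3
edges = np.linspace(0.0, 1.0, n_bins + 1)
bin_idx = np.digitize(x, edges) - 1

# One global array of sums and counts, shared by all "threads".
sums = np.zeros(n_bins)
counts = np.zeros(n_bins)

# np.add.at applies every update even for repeated indices,
# analogous to an atomicAdd per data point in the GPU kernel.
np.add.at(sums, bin_idx, values)
np.add.at(counts, bin_idx, 1)
means = sums / counts
```

On the GPU, every one of these per-point updates is an atomic operation on global memory, which is exactly where the contention described above comes from.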
The second implementation of the kernel used a different approach to reduce the number of atomic operations. Shared memory allows threads to communicate and share data without needing atomic operations. Each thread maintains its own local counts and sums for the bins in shared memory. Only at the end of the kernel, when all threads have finished processing their data points, are the local counts and sums combined into the global counts and sums using atomic operations.
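The structure of this second approach can be sketched in plain Python, with each chunk of data playing the role of a thread block that keeps its own local copy of the per-bin counts and sums. The function name and chunking scheme here are illustrative, not taken from the project's code:

```python
import numpy as np

def binned_mean_blockwise(x, values, n_bins, n_blocks=4):
    """Emulate the shared-memory strategy: each 'block' of data
    accumulates local counts and sums, which are combined into the
    global arrays only at the end."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    bin_idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)

    global_sums = np.zeros(n_bins)
    global_counts = np.zeros(n_bins)
    for chunk_idx, chunk_vals in zip(np.array_split(bin_idx, n_blocks),
                                     np.array_split(values, n_blocks)):
        # Local accumulation: on the GPU this happens in fast
        # shared memory, private to one thread block.
        local_sums = np.bincount(chunk_idx, weights=chunk_vals,
                                 minlength=n_bins)
        local_counts = np.bincount(chunk_idx, minlength=n_bins)
        # Combination step: one update per bin per block, instead of
        # one atomic operation per data point.
        global_sums += local_sums
        global_counts += local_counts
    # Guard against empty bins to avoid division by zero.
    return global_sums / np.maximum(global_counts, 1)

x = np.array([0.1, 0.4, 0.6, 0.9, 0.95])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
means = binned_mean_blockwise(x, values, n_bins=3)
```

The key saving is visible in the comments: the number of contended global updates drops from one per data point to one per bin per block.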
Finally, the kernel was sped up further by optimising the combination step that merges the shared-memory counts and sums into the global counts and sums. This resulted in a well-performing kernel that the researchers can add to their GPU code.
Relative speedup
On a simple test case, the GPU kernel was about 20× faster than an unoptimised CPU baseline. With compiler optimisations enabled on the CPU, the speedup stabilised around 10×.
For the naive kernel using atomics, performance also depended on the input data. If the data is clustered around a few bins, or if the number of bins is small, the speedup is reduced. When the data is more evenly distributed across the bins, the speedup is higher.
We achieved a significant speedup and kept the data on the GPU, which was the main goal of this project.
See you in Porto?
Bring your code to the next EPICURE Hackathon and speed up your science. The next Hackathon will take place in Porto, Portugal, from 27 to 29 April, but you can also join remotely. The main focus will be on code optimisation for heterogeneous HPC environments. More information is available on the EPICURE Hackathon event page.