EPICURE Hackathon: speeding up your code
By Bert Jorissen (University of Antwerp)
Research software is often difficult to maintain. Over time, scientific software evolves organically, with new features being rapidly introduced. As a result, performance optimisation can be overlooked when the focus is on obtaining accurate results. These codes are typically written by the researchers themselves, and running them efficiently on HPC systems is not always straightforward.
The EPICURE hackathons give you, the researchers, an opportunity to collaborate directly with HPC experts to optimise your code. Over several days, we analyse your code, identify bottlenecks, and help you implement improvements. Even small optimisations made during a hackathon can lead to noticeable performance gains.
In December 2025, EPICURE organised a hackathon focused on Python optimisations. Several projects received support, whether they were written in plain Python or in more advanced combinations of languages. Below, we highlight one example and invite you to join the next EPICURE Hackathon in April 2026, dedicated to code optimisations for heterogeneous HPC environments.
Case study: GPU-accelerated binning
One of the projects at the hackathon was performing a binning operation on large 1D datasets to compute per-bin means using GPUs. A small example was provided in Python, but the main code was written in C++. The aim was to implement a binning algorithm on the GPU to reduce data movement between the CPU and GPU, while speeding up the binning process.
The input to the binning problem consists of x-values and corresponding data values. Each data value is assigned to a bin based on its x-value. The output is the mean of the data values in each bin.
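To make the problem concrete, here is a small NumPy sketch (with made-up toy data, not the project's actual dataset) that assigns values to bins and computes the per-bin means by hand:

```python
import numpy as np

# Toy data: x-positions and the values associated with them.
x = np.array([0.1, 0.4, 0.6, 0.9, 0.95])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Three equal-width bins over [0, 1).
n_bins = 3
edges = np.linspace(0.0, 1.0, n_bins + 1)

# Assign each x-value to a bin index (0, 1, or 2).
bin_idx = np.digitize(x, edges) - 1

# Per-bin sums and counts, then the mean of each bin.
sums = np.bincount(bin_idx, weights=values, minlength=n_bins)
counts = np.bincount(bin_idx, minlength=n_bins)
means = sums / counts  # per-bin means
```

With this toy data, the first bin holds one value, and the second and third bins hold two values each, so the means are simply the averages of those groups.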
Baseline implementation in Python
In Python, this can easily be computed with scipy.stats.binned_statistic(…, statistic='mean'). The first step was extracting the data from the research C++ code and analysing it using the Python function mentioned above. This provided a baseline Python implementation, which we could use to compare the results of the C++ and GPU implementations.
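A minimal baseline along these lines (again with toy data standing in for the project's real input) looks like this; binned_statistic returns the per-bin statistic, the bin edges, and the bin assignment for each point:

```python
import numpy as np
from scipy.stats import binned_statistic

x = np.array([0.1, 0.4, 0.6, 0.9, 0.95])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Per-bin means over three equal-width bins on [0, 1].
means, edges, bin_idx = binned_statistic(
    x, values, statistic='mean', bins=3, range=(0.0, 1.0)
)
```

Such a reference implementation is slow on large data, but it gives a trusted answer to validate the C++ and GPU versions against.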
Making a GPU kernel
On the GPU, a small kernel performs the data binning. In this kernel, a list of counts and sums for all bins must be maintained. The first naive implementation of the kernel used atomic operations to update the counts and sums for each bin.
A GPU has many threads running in parallel, processing different data points at the same time. With the atomic approach, a shared array of counts and sums is used by all threads, and each thread updates the counts and sums for its data points. When many data points fall into the same few bins, contention can occur: if two threads try to update the count for the same bin, one thread has to wait for the other to finish.
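The logic of the naive atomic kernel can be mimicked on the CPU with NumPy's unbuffered np.add.at, which, like an atomic add, guarantees that every contribution is applied even when many updates target the same bin. This is only a sketch of the accumulation pattern, not the actual CUDA kernel:

```python
import numpy as np

x = np.array([0.1, 0.4, 0.6, 0.9, 0.95])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n_bins = 3
edges = np.linspace(0.0, 1.0, n_bins + 1)
bin_idx = np.digitize(x, edges) - 1

# One global array of sums and counts, shared by all "threads".
sums = np.zeros(n_bins)
counts = np.zeros(n_bins)

# np.add.at applies every update even for repeated indices,
# analogous to an atomicAdd per data point in the GPU kernel.
np.add.at(sums, bin_idx, values)
np.add.at(counts, bin_idx, 1)
means = sums / counts
```

On the GPU, every one of these per-point updates is an atomic operation on global memory, which is exactly where the contention described above comes from.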
The second implementation of the kernel used a different approach to reduce the number of atomic operations. Shared memory allows threads to communicate and share data without needing atomic operations. Each thread maintains its own local counts and sums for the bins in shared memory. Only at the end of the kernel, when all threads have finished processing their data points, are the local counts and sums combined into the global counts and sums using atomic operations.
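The structure of this second approach can be sketched in plain Python, with each chunk of data playing the role of a thread block that keeps its own local copy of the per-bin counts and sums. The function name and chunking scheme here are illustrative, not taken from the project's code:

```python
import numpy as np

def binned_mean_blockwise(x, values, n_bins, n_blocks=4):
    """Emulate the shared-memory strategy: each 'block' of data
    accumulates local counts and sums, which are combined into the
    global arrays only at the end."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    bin_idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)

    global_sums = np.zeros(n_bins)
    global_counts = np.zeros(n_bins)
    for chunk_idx, chunk_vals in zip(np.array_split(bin_idx, n_blocks),
                                     np.array_split(values, n_blocks)):
        # Local accumulation: on the GPU this happens in fast
        # shared memory, private to one thread block.
        local_sums = np.bincount(chunk_idx, weights=chunk_vals,
                                 minlength=n_bins)
        local_counts = np.bincount(chunk_idx, minlength=n_bins)
        # Combination step: one update per bin per block, instead of
        # one atomic operation per data point.
        global_sums += local_sums
        global_counts += local_counts
    # Guard against empty bins to avoid division by zero.
    return global_sums / np.maximum(global_counts, 1)

x = np.array([0.1, 0.4, 0.6, 0.9, 0.95])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
means = binned_mean_blockwise(x, values, n_bins=3)
```

The key saving is visible in the comments: the number of contended global updates drops from one per data point to one per bin per block.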
Finally, the kernel was sped up further by optimising the combination step that merges the shared-memory counts and sums into the global counts and sums. This resulted in a well-performing kernel that the researchers can add to their GPU code.
Relative speedup
On a simple test case, the GPU kernel was about 20× faster than an unoptimised CPU baseline. With compiler optimisations enabled on the CPU, the speedup stabilised around 10×.
For the naive kernel using atomics, performance also depended on the input data. If the data is clustered around a few bins, or if the number of bins is small, the speedup is reduced. When the data is more evenly distributed across the bins, the speedup is higher.
We achieved a significant speedup and kept the data on the GPU, which was the main goal of this project.
See you in Porto?
Bring your code to the next EPICURE Hackathon and speed up your science. The next Hackathon will take place in Porto, Portugal, from 27 to 29 April, but you can also join remotely. The main focus will be on code optimisation for heterogeneous HPC environments. More information is available on the EPICURE Hackathon event page.