INVITED KEY TALKS

Connected Components Revisited on Kepler
Gernot Ziegler – NVIDIA
Locating connected regions in images and volumes is a substantial building block in image and volume processing pipelines. We demonstrate how the Connected Components problem strongly benefits from a new feature in the Kepler architecture: direct thread data exchange through the SHUFFLE instruction.

GCN Architecture, HSA Platform Evolution and the AMD Developer Ecosystem
Benjamin Coquelle – AMD
This presentation introduces the Graphics Core Next architecture and shows how to achieve maximum performance on it. To that end we will go through actual OpenCL examples and show how OpenCL is mapped to AMD GPUs. Most of the hardware components will be covered (scheduler, compute unit architecture, cache access…) through different examples. We will also briefly talk about CodeXL, the AMD tool for debugging and profiling code. Finally, we will present how HSA can be used to easily take advantage of a heterogeneous platform.

One Can Simply Walk Into Heterogeneous Parallelism
Alex Voicu – Microsoft
“Parallel programming is hard!” has become a seldom questioned truism, but what lies beyond this statement? This presentation focuses on clarifying exactly what it takes to write clean, efficient code that can run on various classes of accelerators that can be subsumed under a common abstract machine model. Starting from two fundamental parallel programming primitives, reduce and scan, we introduce what we regard as the minimal yet sufficient programming model for heterogeneous parallelism. Along the way, we try to establish which are the currently dominant “schools of thought”, outlining their medium- and long-term shortcomings. Upon concluding, we should be in a position to re-assess the introductory truism, and dismiss it as, at best, temporary.
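As a minimal illustration of the two primitives named in the abstract above, reduce and scan, the following Python sketch shows how each can be organized into steps whose elements are independent of one another. It is not taken from the talk. The function names `tree_reduce` and `inclusive_scan` are hypothetical, the scan follows the well-known Hillis-Steele scheme, and both assume an associative operator.

```python
from operator import add


def tree_reduce(values, op=add):
    """Reduce by pairwise combination, level by level, as a parallel machine could."""
    items = list(values)
    if not items:
        raise ValueError("tree_reduce needs at least one element")
    while len(items) > 1:
        # Every pair on this level is independent, so a level could run in parallel.
        paired = [op(items[i], items[i + 1]) for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:
            paired.append(items[-1])  # the odd element carries over to the next level
        items = paired
    return items[0]


def inclusive_scan(values, op=add):
    """Hillis-Steele style inclusive scan: about log2(n) sweeps with doubling stride."""
    items = list(values)
    stride = 1
    while stride < len(items):
        # Each output depends only on the previous sweep, so a sweep is data-parallel.
        items = [
            items[i] if i < stride else op(items[i - stride], items[i])
            for i in range(len(items))
        ]
        stride *= 2
    return items
```

Both functions keep the left-to-right order of operands, so they also work for associative operators that are not commutative, such as string concatenation.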
SYCL: An Abstraction Layer for Leveraging C++ and OpenCL
Alastair Murray – Codeplay Software Ltd., Edinburgh
The performance and power advantages of parallel hardware have caused it to be adopted in domains ranging from embedded systems to data centres. This means that increasingly complex software must be run on parallel hardware, which in turn has increased developers' desire for more powerful but simpler parallel programming languages and models. The SYCL abstraction layer is a solution to this problem. It adds the power and flexibility of C++ to the performance and efficiency of the existing OpenCL ecosystem. SYCL uses a shared source design to allow various powerful C++ features, such as templates or inheritance, on heterogeneous devices without the need for a separate device language. This talk will describe the SYCL abstraction layer and how it can be used to create complex programs or libraries. Codeplay's own implementation will also be discussed, including its use of OpenCL SPIR to provide portability.

SPECIAL TOPICAL TALKS

Methodology for Heterogeneous Programming on OpenCL and Its Applications
Olga Sudareva, Pavel Bogdanov – Institute of System Research, Russian Academy of Sciences, Moscow (НИСИИ РАН)
During the last few years, hybrid computers based on massively parallel accelerators have gained great popularity. The most common massively parallel accelerators are GPGPUs (general-purpose graphics processing units). At the same time, Intel MIC appears promising. In the latest Top500 list (November 2014), first place is held by Tianhe-2 [1] with 55 PFlops, based on Intel MIC, and second place by Cray Titan [2] with 27 PFlops, based on NVidia Tesla K20X. However, there is still no general programming model which could be used to program all processors of one node. Therefore, there is no general model to program the entire distributed system. Investigations of such models are being conducted.
A classification of current heterogeneous programming models, such as StarPU [3], OpenACC [4], OmpSs [5], OpenMP 4.0 [6], VexCL [7], SYCL [8], etc., will be given. It is important to note that, in the contemporary development of scientific and applied engineering software, one can seldom find theoretical estimates for the computational capacity of algorithms and their expected performance on the target hardware platforms. We propose a method for estimating the expected performance of a task prior to the implementation stage, which allows deciding whether solving the task on such systems is sensible. We then propose a technique for analyzing the original problem for a compute node model and for the distributed system as a whole within the formalized notation of the OpenCL standard. Models are defined by a set of parameters, such as the number of massively parallel coprocessors, the memory size of each one, memory bandwidth, peak performance, etc. Each computational task can be analyzed within the models' frameworks and the expected performance can be deduced as a formula depending on the parameters. We demonstrate that a number of well-known task classes can be implemented efficiently on heterogeneous parallel systems. Moreover, we present our own infrastructure for heterogeneous computing, which is implemented as a command scheduler based on the OpenCL model and is able to exchange data with other schedulers via MPI. A program for this scheduler is a dependency graph of commands executed on compute devices. The general way to program the entire distributed system using this infrastructure is the following: one device is controlled by one scheduler, the schedulers on one node are synchronized by adding dependencies between their commands, and the nodes communicate via MPI calls.
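The idea of a scheduler program as a dependency graph of commands can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' infrastructure: commands are plain Python callables standing in for device commands, no OpenCL or MPI calls are made, and the names `run_command_graph`, `upload_a`, `upload_b`, `kernel` and `download` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_command_graph(commands, depends_on):
    """Execute callables in an order that respects their dependencies.

    commands:   dict name -> zero-argument callable (stands in for a device command)
    depends_on: dict name -> iterable of command names that must finish first
    Returns the list of names in the order they were executed.
    """
    sorter = TopologicalSorter({name: depends_on.get(name, ()) for name in commands})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    executed = []
    while sorter.is_active():
        # All commands returned here have their dependencies satisfied;
        # a real scheduler could dispatch them concurrently.
        for name in sorted(sorter.get_ready()):
            commands[name]()
            executed.append(name)
            sorter.done(name)
    return executed


log = []
commands = {
    "upload_a": lambda: log.append("upload_a"),
    "upload_b": lambda: log.append("upload_b"),
    "kernel": lambda: log.append("kernel"),
    "download": lambda: log.append("download"),
}
depends_on = {"kernel": ["upload_a", "upload_b"], "download": ["kernel"]}
order = run_command_graph(commands, depends_on)
```

In this toy graph the two uploads have no dependencies and could run at the same time, while the kernel waits for both and the download waits for the kernel.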
The methodology was applied to a wide range of problems, including running the HPL [9] tests and NAS Parallel Benchmarks [10] (FFT, MG, CG), solving a model hydrodynamic problem, modeling toroidal plasma evolution, and 3D modeling of the transient dynamics of perturbations in astrophysical disc flows. All related codes were implemented and launched on distributed heterogeneous systems. All necessary compute kernels were written and general algorithms were developed, which allow these problems to be solved on distributed systems, utilizing all OpenCL devices. As a "side effect" we obtained a heterogeneous BLAS version, which can speed up an application on OpenCL devices via a simple re-link with the new library. Scalability of up to eight accelerators in one node was achieved [11]. Test launches were performed on a wide range of processor architectures and node configurations: one node with 2 Intel Xeon Sandy Bridge/Ivy Bridge CPUs and 8 accelerators (AMD Radeon/FirePro GPU, NVidia TITAN GPU), the supercomputer K100 (64 nodes with 2 Intel Xeon X5690 CPUs and 3 NVidia Tesla M2050 GPUs per node) [12], the supercomputer K10 (6 nodes with 2 Intel Xeon E5-2620 CPUs and 3 NVidia Tesla 2090 GPUs per node) [13], and promising mini-supercomputers configured at ISR RAS. Implementation of any task consists of three stages: estimating theoretical performance, writing compute kernels in OpenCL C, and developing the upper logic (dependency graph) as a scheduler program. In our presentation we will not go into details of writing compute kernels (which are a subject for a separate discussion), but rather focus on the technique for working out theoretical estimates of expected performance, writing the corresponding dependency-graph programs, and discussing the obtained results and further prospects of the approach. The simplicity and transparency of the programming model, the relative ease of development of real-world codes and their excellent scalability demonstrate the viability of the chosen approach.
In addition, one of the major achievements is the portability of the code to all currently known hardware platforms. In some cases, of course, critical OpenCL C compute kernels require special porting to a particular accelerator. However, the upper logic level remains the same for all hardware platforms. In the future, we plan to expand the package of applied software employing the program infrastructure for heterogeneous computing.

1. J. J. Dongarra: Visit to the National University for Defense Technology Changsha, China. Technical report, Oak Ridge National Laboratory, 18 p., 2013.
2. http://www.olcf.ornl.gov/titan/
3. http://runtime.bordeaux.inria.fr/StarPU/
4. http://www.openacc-standard.org/
5. http://pm.bsc.es/ompss
6. http://openmp.org/wp/openmp-specifications/
7. https://github.com/ddemidov/vexcl
8. https://www.khronos.org/opencl/sycl
9. http://www.netlib.org/benchmark/hpl/
10. http://www.nas.nasa.gov/publications/npb.html
11. http://devgurus.amd.com/thread/159457
12. http://www.kiam.ru/MVS/resourses/k100.html

Efficient Large Scale Simulation of Stochastic Lattice Models on GPUs
Jeffrey Kelling – Helmholtz-Zentrum Dresden-Rossendorf
With the growing importance of nano-patterned surfaces and nano-composite materials in many applications from energy technologies to nanoelectronics, a thorough understanding of the self-organized evolution of nano-structures needs to be established. Modelling and simulations of such processes can help in this endeavor and provide predictions for the outcome of manufacturing processes. In this talk GPGPU-enabled implementations of two stochastic lattice models will be discussed, shedding light on the complications which arise when simulations of stochastic processes are to make efficient use of massively parallel GPU architectures.
A single-GPU implementation of the (2+1)-dimensional Roof-Top model allows very precise large-scale studies of surface growth processes in the Kardar-Parisi-Zhang universality class [1]. Furthermore, a multi-GPU enabled version of the 3D kinetic Metropolis lattice Monte Carlo method [2] provides the capability to study the evolution of nano-structures both towards and out of equilibrium at the spatio-temporal scales of experiments using only small to medium-sized GPU clusters.

[1] J. Kelling, G. Ódor: Extremely large-scale simulation of a Kardar-Parisi-Zhang model using graphics cards, Physical Review E 84, 061150 (2011)
[2] J. Kelling, G. Ódor, F. Nagy, H. Schulz, K. Heinig: Comparison of different parallel implementations of the 2+1-dimensional KPZ model and the 3-dimensional KMC model, The European Physical Journal - Special Topics 210, 175-187 (2012)

GPU computing at ELI-ALPS
Sándor Farkas – Extreme Light Infrastructure - Attosecond Light Pulse Source, Szeged (ELI-ALPS)
ELI-ALPS, the Attosecond Light Pulse Source, will make possible extremely short experiments performed at an outstandingly high frequency. Data are produced on-line at the laser diagnostics benches, secondary sources and experimental end stations at rates of 10 Hz-100 kHz. The predicted peak data rate can increase to 1 Tb per second and the peak data volume to tens of petabytes per year. Scientific computing engineers are designing efficient storage together with a sound processing solution that will be able to receive the big data flow, perform the processing and finally store the data. Several state-of-the-art technologies have been investigated and evaluated for high performance computing, cost-effective scale-out data storage, robust virtualization and management: Ceph, OpenStack, Torque and others. One important aspect of the data processing chain is the access to and efficient utilization of the local GPU cluster from the virtual machines managed by OpenStack.
As the most promising option, PCI pass-through technology is being evaluated, tested and benchmarked with real physical simulation (PIC) codes. The current status, results and future plans of the project will be presented.

CONTRIBUTED TALKS

Comparison of GPUs and Embedded Vision Coprocessors for Automotive Computer Vision Applications
Zoltán Prohászka – Kishonti Informatics Ltd., Budapest
Current trends in the ADAS (Advanced Driver Assistance Systems) functionality of production passenger vehicles, and the corresponding regulations, forecast that new vehicles in the upcoming years will require computing hardware in the tera-operations-per-second class. This article compares GPUs and more specialised DSP-like accelerators based on the following aspects: estimated transistor count and utilization of arithmetic units, data feed problems and data access patterns, functional safety, programming model, and portability of algorithms resulting from academic/private research. A case study will be presented on our autonomous car project demonstrating vision-only driverless cruising.

PERSEUS: Image and Video Compression in the Age of Massive Parallelism
Balázs Keszthelyi – V-NOVA Ltd., Cambridge
This presentation is an introduction to V-Nova’s Perseus technology, covering its background, use cases and the most important, distinctive design decisions which are driving its evolution. Perseus is a (currently) software-based family of image and video compression codecs, targeting both contribution and distribution markets, as well as OTT. The foundations of Perseus are analogous to the hierarchical nature of human vision, and in this way it offers a seamless, multi-scale experience. Using global operators, massive parallelism could be maintained without compromising compression efficiency. V-Nova’s engineers succeeded in making great use of high-end GPUs as well as their low-power counterparts, balancing between GPU utilization/throughput and the low latency required by contribution applications.
Fisher Information-based Classification on GPUs
Bálint Daróczy – Institute for Computer Science and Control, Hungarian Academy of Sciences, Budapest (SZTAKI)
With kernel methods we can solve classification and regression problems with linear methods by projecting the data into a high-dimensional space, where a linear surface can have a complicated shape when projected back into the original space. However, kernel methods in their general form are hard to implement on GPUs because of memory limitations. The talk has two parts. On the one hand, we show that with the help of a model based on Fisher information we can transform objects with complicated similarity structures, for example time series from multiple sensors, into linear ones. On the other hand, we present the details of the GPU implementation of our Fisher information-based method.

A GPU-based program suite for numerical simulations of planetary system formation
Emese Forgács-Dajka, Áron Süli, László Dobos – Eötvös University, Budapest (ELTE)
An important stage of planetary system formation is the growth of planetesimals via collisions. For a realistic model, numerous other effects beyond the gravitational force need to be taken into account: gas drag force, migrational forces resulting from the gravitational interaction of the gas disk and larger solid bodies, electromagnetic forces, etc. Collisions among bodies are of paramount importance for the final architecture of the emerging planetary system. For the sake of simplicity, however, collisions are usually described as perfectly inelastic, and since close multi-body interactions are extremely rare, only two-body collisions are considered, with perfect conservation of momentum. Our software package implements Runge-Kutta integrators of various orders for scalable precision and can adaptively switch between GPU and CPU execution.
This becomes important at later stages of planetary system formation, when the number of bodies drops below a certain limit and CPU-based execution becomes more efficient. Our software is primarily designed for integrating systems of interacting planetesimals but is also applicable to other n-body problems that require precise integration of the gravitational force, such as the motion of small objects of the Solar System.

Semi-analytic modelling of galaxy spectra using GPUs
Dezső Ribli – Eötvös University, Budapest (ELTE)
Large sky surveys have measured the high-resolution spectra of hundreds of thousands of galaxies. These measurements provide the possibility to infer the physical properties of very distant galaxies. The most frequently used framework to interpret galaxy spectra is the stellar population synthesis (SPS) model. In SPS modelling, galaxy spectra are constructed as the superposition of spectra of stellar populations with different ages. Usually, in SPS modelling, only integrated spectral properties (Lick indices) are used for parameter inference, or non-rigorous fitting methods are used (fitting by hand, MOPED). We constructed a command-line application (SPS-FAST) which uses the Markov chain Monte Carlo (MCMC) method to fit the SPS model parameters. Exploring the parameter likelihood space with MCMC, instead of simply finding the best-fitting parameters, is particularly useful in the case of SPS, because the model suffers from serious degeneracies and other uncertainties. The application is written in C++, with the crucial parts written in OpenCL. It can run on both CPU and GPU, and it is able to fit a galaxy spectrum in 10-20 seconds.

Optimization possibilities of inhomogeneous cosmological simulations on massively parallel architectures
Gábor Rácz, István Csabai, László Dobos – Eötvös University, Budapest (ELTE)
Cosmological n-body simulations are done on large scales to understand the evolution of the distribution of dark and ordinary matter in the universe.
The largest simulations can reproduce observations fairly well even though they all make a non-obvious assumption: homogeneous expansion of space. While we know that the distribution of matter, and hence the expansion of space, is not homogeneous, the approximation is necessary due to the lack of numerical techniques for the direct solution of Einstein’s equations. Nevertheless, toy models can be constructed that account for the inhomogeneous expansion of space and can be integrated over time to determine the average of the scale factor as a function of time. As large-scale simulation codes are written with the homogeneous expansion encoded into their core, a brand new n-body code had to be developed for our toy model. We show some preliminary results from our test runs and discuss the role of parallelization in a simulation where brute-force n-body force kernels need to be combined with complex algorithms like Voronoi tessellation.

GPUs in a global Earth-based space weather monitoring network
Dávid Koronczay – Geodetic and Geophysical Institute, Research Centre for Astronomy and Earth Sciences, Hungarian Academy of Sciences, Sopron (CSFKI GGI)
AWDANet (Automatic Whistler Detector and Analyzer Network) is a real-time plasmasphere monitoring tool based on whistler-mode propagation of ELF/VLF impulses through the plasmasphere. The high data rate and low bandwidth necessitate on-site processing at each monitoring station, and real-time results are achieved through GPUs. We present this network and its nodes.

Forecasting Gamma-Ray Bursts with gravitational wave detectors
Gergely Debreczeni – Wigner Research Centre for Physics, Hungarian Academy of Sciences, Budapest (Wigner RCP)
Modern gravitational wave (GW) detectors are hunting for GWs originating from various sources, among others from binary neutron star (BNS) coalescence.
It is assumed that in some cases it is the coalescence (merging) of such a BNS system which is responsible for the creation of gamma-ray bursts (GRBs), routinely detected by electromagnetic observatories. Since the binary system emits GWs well before the merger, during the inspiral phase, analysis groups of GW detectors are performing in-depth searches for such events around the time window of known, already detected GRBs. These joint analyses are very important in increasing the confidence of a possible GW detection. It is widely expected that the first direct detection of GWs will happen in the next few years, and it is a matter of fact that the sensitivity of the next generation of GW detectors will allow us to 'see' a few hundred seconds of the inspiral of the binary system before the merger, for a specific mass parameter range. From the two facts above it naturally follows that one can (should!) turn the logic around and use the GWs emitted during the inspiral phase of a BNS coalescence to predict, in advance, the time and sky location of a GRB and to set constraints on the physical parameters of the system. Given the limited sensitivity of the detectors and the high computational power required, no such prediction algorithm exists as of today. Although it is not yet feasible to use this new method with the current GW detectors, it will be of utmost importance in the late Advanced LIGO/Virgo era and definitely for the Einstein Telescope. The goal of the research presented in this talk is to develop the zero-latency BNS coalescence 'forecasting' method described above and to set up and organize the associated alert system to be used by the next generation of gravitational wave detectors and collaborating EM observatories.
NIIF HPC services for research and education
Tamás Máray – National Information Infrastructure Development Institute, Budapest (NIIF)
The National Information Infrastructure Development Institute (NIIF) operates the largest supercomputers in Hungary, serving the HPC needs of all kinds of Hungarian research projects since 2001. Currently 6 supercomputers with more than 7000 CPU cores and more than 200 GPUs provide nearly 300 Tflop/s of total computing performance for the benefit of users at universities and academic research institutes. By the end of 2015 the performance will be doubled according to the current development plans. The NIIF HPC infrastructure is part of the European HPC ecosystem called PRACE. In line with international trends, a growing portion of our computing capacity is provided by GPGPUs. That is why deep knowledge of GPUs and GPU applications, as well as the ability to write efficient code for the accelerators, has become crucially important for our user community. The presentation briefly introduces the HPC infrastructure and services of NIIF and gives information about how users can get access to the supercomputers.

GPU-based Iterative Computed Tomography Reconstruction
Zsolt Balogh – Mediso Ltd., Budapest
In the last decades, iterative 3D image reconstruction methods have become an intensively developed research area of medical image processing. In general, these methods require more computing capacity than other reconstruction methods, but they can reduce the noise level and increase the quality of the reconstructed image. The algorithms developed in this area are very complex, but they can be made more efficient and faster using GPU parallel computing. At Mediso Ltd. we started to develop a GPU-based iterative Computed Tomography reconstruction algorithm. To harness the full potential of the GPUs we use CUDA, a parallel computing platform that enables using a GPU for general-purpose computing.
In my talk I would like to present some techniques and problems that occur during the implementation process.

Fast patient-specific blood flow modelling on GPUs
Gábor Závodszky – Budapest University of Technology and Economics, Department of Hydrodynamic Systems, Budapest (BME HDS)
Pathologic vessel malformations represent a significant portion of cardiovascular diseases (the leading cause of death in modern societies). The pathogenesis of these diseases, as well as their treatment methods, is strongly related to the properties of the blood flow emerging inside the affected vessels. Thus, our main objective is to acquire the emergent flow field accurately using a patient-specific geometry within a clinically relevant time frame. For carrying out the computations we employed an implementation of the lattice Boltzmann method (LBM). The key difference of this technique lies in its highly parallel nature. Because the statistical description of the dynamics of the particle ensembles holds more information at every space coordinate than the usual macroscopic description of the flow, more computation can be done on a single numerical cell without requiring information from other cells. Thus many parts of the computation can be split into perfectly parallelizable chunks, making the method nearly ideal to run on highly parallel hardware. Programmable video cards (GPUs) are a great match, as highly parallel hardware capable of delivering raw computational performance previously only available on supercomputers. Using this approach, the typical computational time frame can be reduced from hours to minutes on a desktop machine, which opens the possibility of integrating computational fluid dynamics (CFD) based examinations into medical workstations.
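The locality argument in the abstract above can be made concrete with a small sketch of a lattice Boltzmann collision step. It is an illustration under stated assumptions rather than the implementation from the talk: it uses the common two-dimensional D2Q9 lattice and the BGK single-relaxation-time collision operator, and the names `moments`, `equilibrium` and `collide_cell` are hypothetical. The point is that `collide_cell` reads and writes only the nine populations of one cell, so every cell could be handled by its own GPU thread.

```python
# D2Q9 lattice: nine discrete velocities per cell and their standard weights.
VELOCITIES = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
              (1, 1), (-1, 1), (-1, -1), (1, -1)]
WEIGHTS = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def moments(f):
    """Density and velocity of one cell, computed from its nine populations."""
    rho = sum(f)
    ux = sum(fi * cx for fi, (cx, _) in zip(f, VELOCITIES)) / rho
    uy = sum(fi * cy for fi, (_, cy) in zip(f, VELOCITIES)) / rho
    return rho, ux, uy


def equilibrium(rho, ux, uy):
    """Second-order equilibrium populations for the given density and velocity."""
    usq = ux * ux + uy * uy
    feq = []
    for w, (cx, cy) in zip(WEIGHTS, VELOCITIES):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq


def collide_cell(f, tau):
    """BGK collision for a single cell; uses only that cell's own data."""
    rho, ux, uy = moments(f)
    feq = equilibrium(rho, ux, uy)
    return [fi - (fi - fe) / tau for fi, fe in zip(f, feq)]
```

The collision relaxes the populations toward equilibrium while leaving the cell's mass and momentum unchanged. Only the separate streaming step, not shown here, exchanges data between neighbouring cells.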
Pattern Generation for Retina Stimulation Experiments
László Szécsi – Budapest University of Technology and Economics, Department of Control Engineering and Information Technology, Budapest (BME HDS)
The mechanisms by which the cells in the retina process light are varied, complex, and under heavy research. Retina cells are studied by placing retinas surgically removed from test animals on multi-electrode arrays, while projecting carefully designed light patterns onto them. This talk elaborates on the process of light pattern generation. We study the system requirements arising from the measurement setup and classify the kinds of visual stimuli widely used in research. In order to provide the image processing capabilities required to meet researcher needs, we introduce a GPU framework and propose filtering methods for the required stimuli.

QCD on the lattice
Ferenc Pittler – Hungarian Academy of Sciences - Eötvös University Lendület Lattice Gauge Theory Group, Budapest
In numerical simulations of quantum chromodynamics we use the GPU cluster at Eötvös Loránd University. Most of the simulation time is spent solving equations of the form 𝐴𝑥 = 𝑏 with a sparse matrix 𝐴. The commonly used technique for solving this equation is some version of the conjugate gradient algorithm. In the present talk we discuss an alternative algorithm for overlap fermions, which is based on a general idea: preconditioning with a cheaper fermionic discretization, augmented with a domain decomposition multigrid method.

Hierarchical Bayesian Method with MCMC Integration on GPUs
János Márk Szalai-Gindl – Eötvös University, Department of Complex Systems, Budapest (ELTE)
We present a general hierarchical Bayesian method for estimating (latent) characteristics of each object in a population and population-level parameters. The observed data are measurements of the characteristics with some noise.
This method can be useful, for example, when the first goal is the inference of the distribution of population parameters based on the observed data and the other goal is the computation of the conditional expected value of each object characteristic. A posterior sampling approach based on MCMC algorithms can be used for these purposes. The next state of the Markov chain of the estimated characteristic can be computed on GPU cores in parallel for each object, because these are independent. The presentation will outline the models and applied methods.

Efficient parallel generation of random p-bits for multispin MC simulation
István Borsos – Institute for Technical Physics and Materials Science, Centre for Energy Research, Hungarian Academy of Sciences, Budapest (EK)
Multispin MC simulations of certain stochastic models often require decisions with the same probability p for many simultaneous state transitions. As an example, consider the deterministic and stochastic versions of Wolfram's various cellular automata rules. Usually the deterministic version is easily and very efficiently simulated by multispin coding, exploiting the bit parallelism in long computer words. In the stochastic version, however, each of the deterministically computed state changes, before being accepted, is subjected to a random acceptance decision of fixed probability p. The straightforward standard solution is to serialize this part of the simulation by making decisions one by one on the deterministically computed transitions. This approach, however, loses the speed gains of multispin coding. To solve this, we present an algorithm to generate vectors of bits, each of them set with probability p, efficiently (in time) and economically (in terms of its use of fair random bits). These vectors can be directly and simply combined with the multispin deterministic part of the simulation, maintaining the advantages of bit parallelism.
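To make the notion of a word of p-bits concrete, the sketch below uses a well-known textbook construction, which is not the algorithm of the talk: it walks through the binary digits of p from least to most significant and combines one fair random word per digit with OR (digit 1) or AND (digit 0). It therefore makes no claim about the economy in fair random bits that the abstract mentions. The function name `p_bit_word` and its parameters are hypothetical.

```python
import random


def p_bit_word(p_digits, width=64, rng=random):
    """Return a `width`-bit integer whose bits are independently 1 with probability p.

    p_digits: binary digits of p after the point, most significant first,
              e.g. [0, 1, 1] means p = 0.011 (binary) = 0.375.
    rng:      anything with a getrandbits(k) method (the random module by default).
    """
    word = 0  # every bit starts with probability q = 0 of being 1
    for digit in reversed(p_digits):     # least significant digit of p first
        fair = rng.getrandbits(width)    # one fair random word per digit of p
        if digit:
            word |= fair                 # q becomes (1 + q) / 2
        else:
            word &= fair                 # q becomes q / 2
    return word
```

A word produced this way can be ANDed directly with a word of deterministically computed transitions, so that each transition is accepted with probability p without leaving the bit-parallel representation.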
The presented algorithm needs very few hardware resources, so it is usable in resource-limited GPU cores as well.

Accelerated Monte-Carlo Particle Generators for the LHC
Gergely Gábor Barnaföldi – Wigner Research Centre for Physics, Hungarian Academy of Sciences, Budapest (Wigner RCP)
AliROOT is a Monte Carlo based event generator and simulation framework of the CERN LHC ALICE experiment that plays a central role in theoretical investigations and detector design simulations. As simulation and reconstruction of particle tracks consume a large amount of computing power, any acceleration is very welcome in this field. One of the central parts of the Monte Carlo simulation is the pseudo-random number generator (PRNG). In this work we ported the Mersenne Twister algorithm to GPUs and added it as a new selectable generator to the AliROOT framework. This makes it possible to utilize GPUs in the LHC Computing Grid system.

Accelerating the GEANT Particle Transport System with GPUs
Gábor Bíró – Eötvös University, Budapest (ELTE) / Wigner Research Centre for Physics, Hungarian Academy of Sciences, Budapest (Wigner RCP)
High Energy Physics (HEP) needs a huge amount of computing resources. In addition, data acquisition, transfer, and analysis require a well-developed infrastructure too. In order to probe new physics it is necessary to increase the luminosity of the accelerator facilities, by which more and more data will be produced in future experimental detectors. Both testing new theories and detector R&D are based on complex simulations. We have already reached the level where the Monte Carlo detector simulation takes much more time than real data collection. This is why speeding up calculations and simulations has become important in the HEP community. The Geant Vector Prototype (GeantV) project aims to optimize the most-used particle transport code by applying parallel computing and exploiting the capabilities of modern CPU and GPU architectures as well.
With maximized concurrency at multiple levels, GeantV is intended to be the successor of the Geant4 particle transport code, which has been used successfully for two decades. Here we present our latest results on GeantV performance tests, comparing the CPU/GPU-based vectorized GeantV geometry code to the standard Geant4 version.

Code-Generation for Differential Equation Solvers
Dániel Berényi – Wigner Research Centre for Physics, Hungarian Academy of Sciences, Budapest (Wigner RCP)
In modern HPC, code generation is becoming a ubiquitous tool for minimizing program development time while maximizing the effectiveness of hardware utilization. In this talk we present a framework under development at the Wigner GPU Lab for generating parallelized numerical solver codes targeting GPUs. One part of the research targets the representation and manipulation of the symbolic set of equations given by the user; the other focuses on the abstract representation of the numerical solver program. The GPU code generation part currently supports C++/OpenCL as its back-end. We review some implementation considerations and early applications in the area of statistical physics.