HeteroBench: Multi-kernel Benchmarks for Heterogeneous Systems
As Moore's Law and Dennard scaling reach their limits, heterogeneous computing with CPUs, GPUs, and FPGAs has become increasingly important, yet comparing these platforms remains challenging due to their distinct architectures and programming environments. To address this, we present HeteroBench, a comprehensive benchmark suite that enables cross-platform evaluation of multi-compute kernel applications across various accelerators and programming frameworks, including Python, C++, OpenMP, OpenACC, CUDA, and Vitis HLS. Our vendor-agnostic suite covers diverse domains from image processing to machine learning, providing practical insights for both academic researchers and industry practitioners in optimizing their HPC and AI/ML deployments.
An MLIR-based Compiling Flow for Heterogeneous Architecture
The original PyLog is a Python-based compilation flow that links high-level Python algorithm development to FPGA hardware designs by compiling Python to HLS C++ code through an IR implemented in Python. This work extends that functionality to heterogeneous hardware accelerators by transitioning to MLIR (Multi-Level Intermediate Representation), whose back-end optimizations can be tailored to the needs of each hardware platform. The approach features its own dialect for high-level representation, which is lowered to various MLIR dialects and ultimately to LLVM IR supported by HPE’s Cray Compiler Environment (CCE). For CPUs and GPUs, it generates machine code using CCE; for FPGAs, it uses ScaleHLS to generate HLS C++ code, which is then taken through Xilinx tools. Future updates can integrate QIR (an LLVM-based IR for quantum programming) to support quantum computing.
In progress
A Sparsity-Aware Autonomous Path Planning Accelerator with Algorithm-Architecture Co-Design
Path planning in autonomous driving systems is crucial but challenging because it must meet real-time constraints while relying on computationally intensive solvers. The quadratic programming (QP) solver, central to most path planners, demands significant CPU resources. This paper presents an FPGA-based acceleration framework for path-planning problems using an operator splitting solver for quadratic programs (OSQP) and the preconditioned conjugate gradient (PCG) method for solving linear systems. The approach includes customized memory management, parallel processing, and pipelining for enhanced throughput. The FPGA implementation achieves state-of-the-art performance, with speedups of 1.98x over an Intel i7-11800H CPU, 3.90x over an ARM Cortex-A57 CPU, and 12.3x over an NVIDIA RTX 3090 GPU.
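The PCG step named above can be illustrated with a minimal software sketch. It assumes a dense symmetric positive-definite matrix and a Jacobi (diagonal) preconditioner; the abstract does not say which preconditioner or data layout the accelerator uses, so the function name `pcg` and every parameter here are illustrative only.

```python
def pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A (list of lists).

    Assumption: Jacobi preconditioning, M = diag(A). This is a generic
    textbook choice, not necessarily the one used in the paper.
    """
    n = len(b)

    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    inv_diag = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)                                   # residual b - A x for x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]    # preconditioned residual
    p = list(z)                                   # first search direction
    rz = dot(r, z)

    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:                # stop once the residual is small
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)                   # step length along p
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x


# Small example system; the exact solution is x = [1/11, 7/11].
solution = pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Each iteration is dominated by one matrix-vector product and a few vector updates, which is the kind of regular work that parallel processing and pipelining target.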
PyAIE: a Python-based Programming Framework for Versal ACAP AI Engines
To bridge the gap in programming abstractions between applications and AI Engines, we propose PyAIE, a Python-based programming framework specifically targeting AI Engines in the Versal ACAP. PyAIE allows users to focus on algorithm-level designs without knowledge of the underlying low-level details. PyAIE automatically translates Python code into optimized AI Engine kernel C/C++ code, host code, and configuration script files, thereby completing the entire AI Engine-based system design. To the best of our knowledge, this is the first Python-based programming and compilation flow designed specifically for Versal AI Engines.
Proposed Architecture
In-memory computing saves the time and energy spent moving data between memory and processor, avoiding the memory-wall bottleneck of the traditional von Neumann architecture. The associative processor (AP) is one architecture proposed to implement in-memory computing. Content-addressable memory (CAM), as a critical part of in-memory computing, plays an important role in an AP. In this paper, we propose a novel FPGA implementation of the AP, including the CAM and its peripheral circuits, such as the controller, data cache, instruction cache, and program counter. To the best of our knowledge, this is the first work that implements an associative processor on a real-world FPGA platform.
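The role the CAM plays in an AP can be sketched with a toy software model. It assumes the usual compare-then-write style of associative processing: a masked key is compared against every row at once, and the matching (tagged) rows are then updated in place. The class `CAM` and its methods are hypothetical and are not taken from the paper's FPGA design.

```python
class CAM:
    """Toy content-addressable memory holding one integer word per row."""

    def __init__(self, rows):
        self.rows = list(rows)

    def search(self, key, mask):
        """Return one tag per row; only bit positions set in mask are compared."""
        return [((row ^ key) & mask) == 0 for row in self.rows]

    def write(self, tags, value, mask):
        """Overwrite the masked bits of every tagged row with the bits of value."""
        self.rows = [
            ((row & ~mask) | (value & mask)) if tag else row
            for row, tag in zip(self.rows, tags)
        ]


cam = CAM([0b1010, 0b0110, 0b1011, 0b0010])

# Compare phase: tag every row whose two low bits equal 0b10.
tags = cam.search(key=0b0010, mask=0b0011)

# Write phase: set bit 3 in all tagged rows without reading them out.
cam.write(tags, value=0b1000, mask=0b1000)
```

In hardware the comparison happens in all rows simultaneously; the Python loops only model that behavior sequentially.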