Rigel: 1000+ Core Architectures for Throughput-Oriented Computing
Participants: Sanjay Patel, Daniel Johnson, Matthew Johnson, William Tuohy
The Rigel Project is focused on the architectural and programming interface issues surrounding chips that scale to 1000s of cores. Using client‐side workloads that scale to high degrees of parallelism, we have defined a prototype 1000‐core architecture based on MIMD execution, shared caching, and software‐assisted coherence.
Goals - Extended Description
To date, our work is mainly at an infrastructural level - defining the architectural framework, understanding the application space, creating a simple run-time layer, and building compilation, simulation, and measurement tools.
There is a wealth of projects that can be undertaken now that we have the infrastructure largely in hand. Many of the open questions come from re-thinking, down to the microarchitectural level, how a processor system should be designed when the workload is expected to exhibit massive parallelism. Several are likely to yield results within the next academic year:
- Understanding the implications of hardware multithreading from a performance, area, and energy perspective. Should the HW-MT degree be small (a few contexts per processor) or large (100s)? Different chips take widely differing approaches to this question, and with our infrastructure we can approach it more fundamentally. Furthermore, there is an interplay between the degree of multithreading and the level of parallel decomposition supported by the chip's programming model: finer-grained models can potentially support higher degrees of multithreading. Opportunities for development include hardware optimizations that reduce the cost of large register files through software-controlled sharing.
- The predominant execution model for massively parallel chips is SIMD-based execution. However, several well-studied and important applications demonstrate poor SIMD efficiency, even at modest vector lengths. In this research thrust we explore the tradeoffs between SIMD and MIMD execution at the hardware level and propose several optimizations that significantly reduce the costs of MIMD execution via shared caching and code transformations that increase instruction run length.
- A dominant performance factor for many-core chips is memory bandwidth. Given the intrinsic factors that govern bandwidth (pin counts, signaling rates, and interface power), the situation is likely to worsen in successive technology nodes. We examine hardware, programming model, and software optimization strategies for improving effective memory bandwidth through smart scheduling, coalescing of requests, and automatic transformations that improve data reuse in task-based data-parallel programming models.
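The SIMD-efficiency concern above can be made concrete with a small utilization model. In a SIMD machine, all lanes step in lockstep, so lanes with less work sit masked off until the longest-running lane finishes; efficiency is useful lane-cycles divided by total issued lane-cycles. This is an illustrative sketch, not a measurement from our simulator:

```python
def simd_efficiency(trip_counts):
    """Utilization of a SIMD machine mapping one data element per lane.

    trip_counts gives the data-dependent loop iteration count of each
    lane; the vector advances at the pace of the slowest lane, so
    efficiency = sum(work) / (width * max(work)).
    """
    width = len(trip_counts)
    cycles = max(trip_counts)      # all lanes step together
    useful = sum(trip_counts)      # lane-cycles doing real work
    return useful / (width * cycles)

# Uniform work across lanes: perfect efficiency.
print(simd_efficiency([8, 8, 8, 8]))    # -> 1.0
# Divergent work: poor efficiency even at vector length 4.
print(simd_efficiency([1, 16, 2, 3]))   # -> 0.34375
```

A MIMD organization avoids this lockstep penalty because each core retires its own instruction stream, at the cost of per-core fetch/decode resources - the cost that shared caching and run-length-increasing code transformations aim to reduce.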
We expect considerable synergy between this project and several other UPCRC Illinois applications projects. Within the next year, we will port some of the application kernels and libraries developed in the DIBR and teleimmersion work to the Rigel programming APIs. We also expect synergy between Rigel and Wen-mei Hwu's work on tools and compilation frameworks for task-level programming models. We are currently using LLVM as our low-level code generation framework.
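To illustrate the style of task-level programming model referred to above, the sketch below runs independent data-parallel tasks from a shared queue across MIMD workers. The names (`run_data_parallel`, `num_cores`) are hypothetical and do not reflect the actual Rigel APIs; this is only a host-side analogy for the enqueue/dequeue execution pattern:

```python
from queue import Queue, Empty
from threading import Thread

def run_data_parallel(task_fn, items, num_cores=8):
    """Apply task_fn to every item using a shared task queue.

    Each 'core' (thread here) repeatedly dequeues an independent task
    until the queue is drained - the basic pattern of a task-based
    data-parallel model on a MIMD machine.
    """
    tasks = Queue()
    results = [None] * len(items)
    for i, x in enumerate(items):
        tasks.put((i, x))

    def core():
        while True:
            try:
                i, x = tasks.get_nowait()
            except Empty:
                return              # queue drained; core retires
            results[i] = task_fn(x)

    workers = [Thread(target=core) for _ in range(num_cores)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results

print(run_data_parallel(lambda x: x * x, list(range(10))))
# -> [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Because tasks are independent and results are written to disjoint slots, no per-task synchronization is needed; dynamic dequeue also load-balances tasks with uneven run lengths, which is exactly where SIMD execution loses efficiency.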