The continued scaling of feature sizes will enable systems with hundreds of conventional cores, and possibly thousands of light-weight cores, within a decade. Current cache coherence protocols do not scale to such numbers. A major reason is that, with current protocols, each shared memory access by a core is considered to be a potential communication/synchronization with any other core. In fact, parallel programs communicate and synchronize in stylized ways. A key to shared memory scaling is the adjustment of coherence protocols to leverage the prevalent structure of shared memory codes for performance. We are exploring three approaches to do so.
The Bulk architecture executes coherence operations in bulk, committing large groups of loads and stores at a time. In this architecture, memory accesses appear to interleave in a total order, even in the presence of data races, which helps software debugging and productivity. Performance remains high because loads and stores are aggressively reordered within each group.
In the DeNovo architecture, coherence and communication hardware is co-designed with a concurrency-safe programming model, and the Rigel architecture plans to shift more coherence activities to software. In addition, a higher-level view of communication and synchronization across threads enables the architecture to help program development, e.g., by supporting deterministic replay of parallel programs, or by tracking races in codes that do not prevent them by design.
- The Bulk Multicore: High-Performance, Programmable Shared Memory
- DeNovo: Rethinking Hardware for Disciplined Parallelism
- Rigel: 1000+ Core Architectures for Throughput-Oriented Computing