The Bulk Multicore: High-Performance, Programmable Shared Memory

Participants: Josep Torrellas

Executive Summary

The goal of the Bulk Multicore Architecture is to enable a scalable shared-memory substrate that provides a highly programmable environment, while delivering high performance and keeping the hardware simple. The Bulk Multicore is described in a white paper and in a set of slides. It builds on recent work on shared-memory multiprocessor architectures in the i-acoma research group of Josep Torrellas.

Goals - Extended Description

The Bulk Multicore advances usability and programmability by efficiently supporting both novel, disciplined software and existing software stacks, including those for which performance is paramount. It provides novel hooks and support for a sophisticated development and debugging environment. Such an environment will have overhead low enough to stay on all the time, including when the user runs production codes. In addition, by providing hardware cache coherence, it removes the burden of managing data sharing from the programmer or run-time system software. Finally, by supporting high-performance sequential consistency, it provides a more usable platform for software.

The Bulk Multicore builds on the recently-proposed BulkSC architecture fabric. It also builds on mechanisms for state buffering and undo, as in thread-level speculation designs, and on scalable coherence protocol designs.

The idea behind the Bulk Multicore is to eliminate the need to commit one instruction at a time, which is an important source of design complexity in a multiprocessor environment. In the Bulk Multicore, the default execution mode of a processor is to commit only whole chunks of instructions at a time. A chunk is a dynamically defined group of consecutively executed instructions, for example 2,000 consecutive instructions. This chunked mode of execution and commit is invisible to the software running on the processor. The proposed architecture supports cache-coherent shared memory in a scalable way. Cache coherence is maintained with hardware-supported address signatures, without the need to send individual cache-line invalidation messages. Hardware cache coherence thus combines high performance with ease of programming.
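To make the role of address signatures more concrete, the following is a minimal software sketch, not the hardware design itself. It assumes a signature is a Bloom-filter-style bit vector split into fields, and that a committing chunk's write signature is checked against the read and write signatures of other in-flight chunks. The names `Signature`, `Chunk`, `commit`, and the constants `LINE_BYTES`, `FIELD_BITS`, and `NUM_FIELDS` are hypothetical and chosen only for illustration.

```python
LINE_BYTES = 64    # assumed cache-line size in bytes
FIELD_BITS = 512   # assumed width of each signature field
NUM_FIELDS = 2     # assumed number of fields per signature


def field_indices(addr):
    """Map a byte address to one bit position per signature field."""
    line = addr // LINE_BYTES
    return [(line // FIELD_BITS ** i) % FIELD_BITS for i in range(NUM_FIELDS)]


class Signature:
    """Compact, conservative encoding of a set of cache-line addresses."""

    def __init__(self):
        self.fields = [0] * NUM_FIELDS

    def insert(self, addr):
        for i, bit in enumerate(field_indices(addr)):
            self.fields[i] |= 1 << bit

    def intersects(self, other):
        # The intersection is non-empty only if every field shares a set bit.
        return all(a & b for a, b in zip(self.fields, other.fields))


class Chunk:
    """A group of consecutive instructions with its read and write signatures."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.read_sig = Signature()
        self.write_sig = Signature()

    def load(self, addr):
        self.read_sig.insert(addr)

    def store(self, addr):
        self.write_sig.insert(addr)


def commit(committer, in_flight):
    """Return the in-flight chunks that conflict with the committing chunk."""
    w = committer.write_sig
    return [
        c for c in in_flight
        if c is not committer
        and (w.intersects(c.read_sig) or w.intersects(c.write_sig))
    ]


a = Chunk(cpu=0)
a.store(0x1000)
b = Chunk(cpu=1)
b.load(0x1000)      # reads a line that chunk a wrote
c = Chunk(cpu=2)
c.load(0x2000)      # touches an unrelated line
squashed = commit(a, [a, b, c])
```

In this sketch one signature comparison per chunk stands in for per-line invalidation messages. Because distinct addresses can map to the same bits, such an encoding can report a conflict that did not occur, but it never misses a real one.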

A key goal of the project is to develop novel architectural designs to provide a highly programmable environment for general-purpose shared memory. We are building on recent work in the i-acoma research group. The Bulk Multicore provides a highly programmable environment for two reasons. The first stems from its support for sequential memory consistency, even for programs with data races. The second is that development and debugging tools do not need to record or be concerned with individual loads and stores, only with chunks. This enables very low-overhead debugging techniques, such as those for deterministic replay of parallel programs and for data race detection, opening the door to a sophisticated, always-on debugging framework for production runs.
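As an illustration of why chunk-level recording can be cheap, the sketch below logs only the order in which chunks commit and then replays that order. It is a toy model under stated assumptions: each chunk is a Python function applied atomically to a shared dictionary, and the names `make_programs` and `run` are hypothetical. It is not the replay mechanism of the Bulk Multicore itself.

```python
import random


def make_programs():
    """Two CPUs, each with a list of chunks that race on the variable x."""
    def incr(mem):
        mem["x"] = mem.get("x", 0) + 1

    def double(mem):
        mem["x"] = mem.get("x", 0) * 2

    return {0: [incr, incr], 1: [double, double]}


def run(programs, commit_order=None, seed=None):
    """Commit chunks one at a time; return (final memory, commit log).

    With commit_order=None the interleaving is chosen pseudo-randomly
    (the recording run). Otherwise the given order is enforced (replay).
    """
    rng = random.Random(seed)
    pending = {cpu: list(chunks) for cpu, chunks in programs.items()}
    memory, log = {}, []
    while any(pending.values()):
        if commit_order is None:
            ready = [cpu for cpu, queue in pending.items() if queue]
            cpu = rng.choice(ready)
        else:
            cpu = commit_order[len(log)]
        chunk = pending[cpu].pop(0)
        chunk(memory)      # the whole chunk takes effect atomically
        log.append(cpu)    # one log entry per chunk, not per load or store
    return memory, log


recorded_mem, log = run(make_programs(), seed=1)
replayed_mem, replayed_log = run(make_programs(), commit_order=log)
```

The log holds one CPU identifier per committed chunk, so its size grows with the number of chunks rather than with the number of memory accesses. Replaying the same commit order reproduces the same final memory state even though the two programs race on `x`.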

Our goal is to develop this flexible architecture further and integrate it with the other layers in UPCRC Illinois.
