I2PC Distinguished Speaker Series
2012-2013 Seminar Details

The Distinguished Speaker weekly seminar series features the latest I2PC research developments. Speakers include I2PC research faculty, industry research affiliates, post-doctoral researchers, and graduate research assistants.

The seminars are held on Thursdays from 4-5 pm in the Siebel Center. Unless otherwise noted, seminars will be held in Room 2405. Speakers, dates, and locations will be added as the year progresses, so check back for updates!

Note: Some talks are I2PC-internal, meaning only people who have signed the Informed Participation Agreement can attend. If you would like to attend or access archived media from these talks, please ensure you have signed the agreement, then email the I2PC Team for the username and password to the archive. If you are an Intel employee, you are not required to sign the Participation Agreement. Please email from your Intel account for the username and password.


For those unable to physically attend the talk, a live webcast is available. Additionally, a chat interface has been set up for those who want to ask the speaker questions during the talk.

In the event that a talk is I2PC-internal, a single-use webcast address will be emailed to I2PC members prior to the talk.

Upcoming Seminars


May 16: Brad Chamberlain (Cray, Inc.)
Hierarchical Locales: Exposing Node-Level Locality in Chapel
NOTE LOCATION CHANGE: This talk will be held in room 3405.

Abstract: Chapel is an emerging programming language whose design and development is being led by Cray Inc. with collaboration from academia, government, and industry. Chapel has the goal of significantly improving programmer productivity on HPC systems.

When I had the opportunity to visit UIUC last spring, I gave a general overview of Chapel, its feature set, and its multiresolution philosophy. In this talk, I'll start with a brief introduction to Chapel for those who weren't in attendance last year and then dive into a topic that we've been working on over the past year or so to introduce hierarchical locality concepts into Chapel.

The motivation for this work comes from the increasing amount of hierarchy and heterogeneity that is present in compute node architectures as clock speeds have levelled off. As a result of these trends, performance-minded programmers must increasingly pay attention to locality within a compute node in order to maximize their throughput. A further consequence is that the abstract machine model to which the HPC community has programmed for decades will need to undergo a fairly radical shift to reflect such architectures.

Our extant programming models are not particularly prepared for this shift, as they typically have conflated parallelism and locality (as in SPMD models like MPI, UPC, or Co-Array Fortran), or have ignored locality altogether (as in shared memory models like OpenMP, Pthreads, etc.). The consequence is that many programmers are increasingly turning to hybrid parallel programming models in order to make use of the multiple levels of hardware parallelism --- an approach that leaves much to be desired.

The hypothesis of this talk is that languages which support distinct concepts for talking about parallelism versus locality, like Chapel, are better positioned for the next generation of parallel computing. In this talk, I'll detail a new concept that we are pursuing in Chapel, "hierarchical locales", which is designed to address these increasingly complex node architectures.

Bio: Bradford Chamberlain is a Principal Engineer at Cray Inc. where he works on parallel programming models, focusing primarily on the design and implementation of the Chapel language in his role as technical lead for that project. Brad received his Ph.D. in Computer Science & Engineering from the University of Washington in 2001 where his work focused on the design and implementation of the ZPL parallel array language. His thesis explored the concept of 'regions' in ZPL --- a first-class index set supporting global-view data parallelism and a syntactic performance model. While at UW, he also worked on developing algorithms for accelerating the rendering of complex 3D scenes. Brad remains associated with the University of Washington as an affiliate faculty member; most recently, he taught a class for the department's Professional Masters Program on Parallel Computation during Winter quarter 2013. In the past, Brad has also worked briefly on languages for embedded reconfigurable processors. He received his Bachelor's degree in Computer Science with honors from Stanford University in 1992.

Past Seminars

May 02: James Laudon (Google)
Warehouse-scale Computing: Challenges and Opportunities
NOTE TIME CHANGE: This talk will be held from 3-4pm, in room 2405.

Abstract: Warehouse-scale computers power the services offered by companies such as Google, Facebook, Amazon, Yahoo, and Microsoft’s online services division. They differ significantly from traditional datacenters: they belong to a single organization, use a relatively homogeneous hardware and system software platform, and share a common systems management layer. Most importantly, WSCs run a smaller number of very large applications (or Internet services), and the common resource management infrastructure allows significant deployment flexibility. The requirements of homogeneity, single-organization control, reliability, and enhanced focus on cost-efficiency motivate designers to take new approaches in constructing and operating these systems. In this talk, I'll discuss some of the challenges with WSCs, such as maintaining high availability and low "tail" latency for all the services on which your computation depends. WSCs also bring many opportunities, and I'll outline some of those opportunities and discuss a couple of ongoing projects at Google Madison pursuing them.

Bio: James Laudon is the Site Director for the Madison Google office. His areas of expertise include multithreading, multiprocessors, system software, distributed computing, and performance modeling. He is currently focused on advanced hardware and system software development for Google’s datacenters. Prior to Google, James was a Distinguished Engineer with Sun Microsystems and led the architecture of several generations of the UltraSPARC Tx chip multiprocessor line. He joined Sun through their acquisition of Afara Websystems, where he managed the architecture and performance team. Prior to Afara, he worked at Broadcom on wired and wireless networking chips, at a superscalar DSP startup, and at Silicon Graphics, where he architected the SGI Origin 2000. James has a B.S. in Electrical Engineering from the University of Wisconsin – Madison and an M.S. and Ph.D. in Electrical Engineering from Stanford University. While at Stanford, James was co-architect of the Stanford DASH multiprocessor and in his Ph.D. dissertation he proposed interleaved multithreading, the multithreading technique employed by the original UltraSPARC T1 chip multiprocessor.

May 01: Dr. Xiaowei Shen (IBM Research - China)
Innovation in China: From Internet-of-Things to Big Data
NOTE DAY AND TIME CHANGE: This talk will be held on a Wednesday, in room 2405, from 3:30-4:30pm. This talk will not be broadcast.

Abstract: I will give an introduction to IBM Research - China, including our research focus, our technical agenda, and our innovation model. I will discuss some of our research projects, from workload optimized systems to software defined environment, from renewable energy forecasting to connected vehicles service platform, from internet-of-things infrastructure to internet-of-things data analytics. All students are welcome to attend to learn more about IBM Research and innovations in growth markets. IBM Research - China, located in Beijing and Shanghai, was founded in 1995 as the first research institute established by a multinational corporation in China. The lab has a broad research agenda including systems, software, services, and industry solutions, with close collaboration with academia, industry clients, and governments. IBM Research - China focuses on technical innovation and go-to-market exploration of big data, internet-of-things, social analytics, and cloud computing. IBM Research - China seeks PhD and Master's students from top schools with strong backgrounds in engineering and computer science.

Bio: Dr. Xiaowei Shen is the Director of IBM Research - China. Before his assignment in China, Dr. Shen was a Research Staff Member at IBM T. J. Watson Research Center. Dr. Shen received his PhD degree in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology, and his BS degree in Computer Science from the University of Science and Technology of China. His research interests include computer architectures, software and hardware co-design, distributed computing, and innovations for big data and cloud computing. Dr. Shen led IBM’s Global Technology Outlook on Internet-of-Things, and is currently the Principal Investigator of the Research Big Bet on Internet-of-Things at IBM.

Apr 30: Luke Tierney (University of Iowa)
Some New Developments for the R Engine

Abstract: R is a dynamic language for statistical computing and graphics. In recent years R has become a major framework for both statistical practice and research. This talk presents a very brief outline of the R language and its evolution and describes some current efforts on improvements to the core computational engine, including work on compilation of R code, efforts to take advantage of multiple processor cores, and modifications to support working with larger data sets.

Bio: Luke Tierney is the Ralph E. Wareham Professor of Mathematical Science and Chair, Department of Statistics and Actuarial Science, at the University of Iowa.

He received his Ph.D. in Operations Research from Cornell University in 1980. He was at Carnegie Mellon University from 1980-1984 and the University of Minnesota from 1984-2002. His main areas of interest are computational methods for Bayesian data analysis and computing environments for statistics.

He is the developer of Lisp-Stat and is a member of the R Core development team.


Apr 25: Xuehai Qian (University of Illinois)
Volition: Scalable and Precise Sequential Consistency Violation Detection

Abstract: Sequential Consistency (SC) is the most intuitive memory model, and SC Violations (SCVs) produce unintuitive, typically incorrect executions. Most prior SCV detection schemes have used data races as proxies for SCVs, which is highly imprecise. Other schemes that have targeted data-race cycles are either too conservative or are designed only for two-processor cycles and snoopy-based systems. In this talk, I will present Volition, the first hardware scheme that detects Sequential Consistency Violations (SCVs) in a relaxed-consistency machine precisely, in a scalable manner, and for an arbitrary number of processors in the cycle. Volition enhances programmability, while inducing negligible traffic and execution overhead.

Bio: Xuehai Qian is a Ph.D. candidate in the Department of Computer Science at the University of Illinois, Urbana-Champaign. His research focuses on multicore and parallel computer architecture, and programming models for parallelism. He received an MS in Computer Science from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), and a BS in Computer Engineering from Beihang University, Beijing.

Apr 18: Milos Gligoric (University of Illinois)
Model Checking Database Applications

Abstract: We describe the design of DPF, an explicit-state model checker for database-backed web applications. DPF interposes between the program and the database layer, and precisely tracks the effects of queries made to the database. We experimentally explore several implementation choices for the model checker: stateful vs. stateless search, state storage and backtracking strategies, and dynamic partial-order reduction. In particular, we define independence relations at different granularity levels of the database (at the database, relation, record, attribute, or cell level), and show the effectiveness of dynamic partial-order reduction based on these relations. We apply DPF to look for atomicity violations in web applications. Web applications maintain shared state in databases, and typically there are relatively few database accesses for each request. This implies concurrent interactions are limited to relatively few and well-defined points, enabling our model checker to scale. We explore the performance implications of various design choices and demonstrate the effectiveness of DPF on a set of Java benchmarks. Our model checker was able to find new concurrency bugs in two open-source web applications, including in a standard example distributed with the Spring framework.

Bio: Milos Gligoric is a PhD student in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He works with Prof. Darko Marinov on Software Testing and Software Model Checking. Milos' work has been supported by the C.L. & Jane W-S. Liu Award, Max Planck Institute internship, Saburo Muroga fellowship, NASA/MCT internship, IBM X10 innovation award, and Intel internship.

Apr 04: Konstantine Karantasis (University of Illinois)
Parallel Reordering: Graph Processing on the Dark Side of BFS
Video | Slides

Abstract: Reordering of sparse matrices and graphs is a standard preprocessing procedure for a plethora of applications, most notably those for sparse linear algebra. Many heuristic algorithms have been proposed for addressing this computationally hard problem. Since these algorithms have been considered efficient in comparison to subsequent computations, such as factorization, little work has gone into creating parallel implementations. However, the introduction of highly parallel numerical methods for sparse matrices as well as the demand for applying reordering to increasingly large graphs are rendering reordering algorithms a potential bottleneck.

In this talk, I'll present the first parallel implementations of two widely used reordering algorithms: Reverse Cuthill-McKee (RCM) and Sloan. These algorithms are based on performing specific graph traversals. Recent work has successfully parallelized breadth-first search (BFS), a less constrained graph traversal than either RCM or Sloan. We evaluated our parallelizations on multicore systems and with large matrices from several domains. Our implementations present significant performance improvements compared to a state-of-the-art sequential library, without compromising the quality of reordering. Our results demonstrate that it is possible to achieve parallel speedup for several graph traversal-based reordering algorithms beyond BFS, even for highly constrained ones such as RCM and Sloan.

Bio: Konstantine Karantasis is a visiting lecturer in ECE at the University of Illinois, Urbana-Champaign and a postdoctoral research associate in the Coordinated Science Lab. He received his PhD in Computer Engineering at the University of Patras, Greece in 2011. His ongoing research focuses on runtime, compiler and system optimizations that improve the performance of irregular algorithms and accelerate applications with big-data requirements.

Mar 28: Amin Ansari (University of Illinois)
Illusionist: Transforming Lightweight Cores into Aggressive Cores on Demand
Video | Slides

Abstract: Power dissipation limits combined with increased silicon integration have led microprocessor vendors to design chip multiprocessors (CMPs) with relatively simple (lightweight) cores. While these designs provide high throughput, single-thread performance has stagnated or even worsened. Asymmetric CMPs offer some relief by providing a small number of high-performance (aggressive) cores that can accelerate specific threads. However, threads are only accelerated when they can be mapped to an aggressive core, and such cores are restricted in number due to the power and thermal budgets of the chip. Rather than using the aggressive cores to accelerate threads, this paper argues that the aggressive cores can have a multiplicative impact on single-thread performance by accelerating a large number of lightweight cores and providing an illusion of a chip full of aggressive cores.

Specifically, we propose an adaptive asymmetric CMP, Illusionist, that can dynamically boost the system throughput and get a higher single-thread performance across the chip. To accelerate the performance of many lightweight cores, those few aggressive cores run all the threads that are running on the lightweight cores and generate execution hints. These hints are then used to accelerate the execution of the lightweight cores. However, the hardware resources of the aggressive core are not large enough to allow the simultaneous execution of a large number of threads. To overcome this hurdle, Illusionist performs aggressive dynamic program distillation to execute small, critical segments of each lightweight-core thread. A combination of dynamic code removal and phase-based pruning distills programs to a tiny fraction of their original contents. Experiments demonstrate that Illusionist achieves 35% higher single-thread performance for all the threads running on the system, compared to a CMP with all lightweight cores, while achieving almost 2X higher system throughput compared to a CMP with all aggressive cores.

Bio: Amin Ansari is currently a National Science Foundation Computing Innovation Fellow and Postdoctoral Research Associate in the Computer Science Department of the University of Illinois at Urbana-Champaign, working with Prof. Josep Torrellas. His research interests lie in the area of computer architecture, with a focus on reliability and low-power design. He is working on microarchitectural solutions for on-chip caches, processor pipeline, and network-on-chip to tackle deep sub-micron technology challenges such as power density, process variation, manufacturing defects, and wearout. He received the Ph.D. degree in Computer Science and Engineering from the University of Michigan under Prof. Scott Mahlke in 2011. He received the B.S. degree in computer engineering from Sharif University of Technology in 2007. In addition, Amin has published more than 20 papers in top-tier journals and international conferences such as IEEE Transactions on Computers, ISCA, HPCA, MICRO, and DSN. His academic achievements were recognized by the 2010 College of Engineering Distinguished Achievement Award during his graduate studies at the University of Michigan. He received the best paper award at the 27th IEEE International Conference on Computer Design in 2009.

Mar 07: Hyojin Sung (University of Illinois)
DeNovoND: Efficient Hardware Support for Disciplined Parallelism
Video | Slides

Abstract: Recent work has shown that disciplined shared-memory programming models that provide deterministic-by-default semantics can simplify both parallel software and hardware. Specifically, the DeNovo hardware system has shown that the software guarantees of such models (e.g., data-race-freedom and explicit side-effects) can enable simpler, higher performance, and more energy-efficient hardware than the current state-of-the-art for deterministic programs. Many applications, however, contain non-deterministic parts; e.g., using lock synchronization. For commercial hardware to exploit the benefits of DeNovo, it is therefore necessary to extend DeNovo to support non-deterministic applications.

This paper proposes DeNovoND, a system that supports lock-based, disciplined non-determinism, with the simplicity, performance, and energy benefits of DeNovo. We use a combination of distributed queue-based locks and access signatures to implement simple memory consistency semantics for safe non-determinism, with a coherence protocol that does not require transient states, invalidation traffic, or directories, and does not incur false sharing. The resulting system is simpler, shows comparable or better execution time, and has 33% less network traffic on average (translating directly into energy savings) relative to a state-of-the-art invalidation-based protocol for 8 applications designed for lock synchronization.

Bio: Hyojin Sung is a Ph.D. student in Computer Science at the University of Illinois, Urbana-Champaign. Her research interests are in parallel computer architecture, compilers and programming, especially SW/HW co-design based on parallel programming patterns. She earned her undergraduate degrees in Literature and Computer Science at Seoul National University in South Korea and worked for Samsung Electronics as a research engineer for two years, before she obtained her M.S. in Computer Science at UC San Diego in 2008 with her thesis on parallelizing compilers.

Mar 01: Natalie Enright Jerger (University of Toronto)
Optimizations for Cache-Coherent Networks-on-Chip

Abstract: As transistors continue to scale according to Moore's Law, efficient and scalable communication mechanisms will be required to realize the performance potential of many-core architectures. The increased demand for on-chip communication and the poor scaling of long global wires have made packet-switched networks-on-chip (NoC) a compelling choice for the communication backbone in these next-generation systems. Current NoC architectures are largely agnostic to the communication demands of the applications and the underlying architecture. In this talk, I will discuss research which explores increasing the functionality within the NoC to better match the demands of the coherence protocol. First, I will present a novel flow control technique that improves performance and buffer utilization in the face of short coherence control packets. Short control packets arise in NoCs due to abundant wiring resources. Second, I will present NoC support for routing collective communication. Collective communication, such as broadcast, multicast, and reduction, is often required by the coherence protocol. We propose light-weight multicast-reduction support that reduces network load, which in turn improves overall performance. Our multicast-reduction support allows NoCs to better match the needs of current and emerging applications. These NoC optimizations for cache coherence protocols can provide low-latency, high-bandwidth communication with low overhead.

Bio: Natalie Enright Jerger joined the Edward S. Rogers Sr. Department of Electrical and Computer Engineering at the University of Toronto as an Assistant Professor in 2009. Prior to joining the University of Toronto, she received her MSEE and PhD from the University of Wisconsin-Madison in 2004 and 2008 respectively. She received her Bachelor's degree from Purdue University in 2002. Her current research explores performance and power optimizations for sharing and communication patterns in on-chip networks and cache coherence protocols for many-core architectures. She is also interested in improving the programmability of many-core architectures. In 2009, she co-authored a book on On-Chip Networks with Li-Shiuan Peh. Her research is supported by NSERC, Intel, CFI, AMD and Qualcomm.

Feb 20: Danny Dig (University of Illinois)
Interactive Program Transformations

Talk details can be found here.

Jan 24: Nima Honarmand (University of Illinois)
Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism
Video | Slides

Abstract: Architectures for deterministic record-replay (R&R) of multithreaded codes are attractive for program debugging, intrusion analysis, and fault-tolerance uses. However, very few of the proposed designs have focused on maximizing replay speed -- a key enabling property of these systems. In addition, those that have either require intrusive hardware or software modifications, or target whole-system R&R rather than the more useful application-level R&R. This paper presents the first hardware-based scheme for unintrusive, application-level R&R that explicitly targets high replay speed. Our scheme, called Cyrus, requires no modification to commodity snoopy cache coherence. It introduces the concept of an on-the-fly software Backend Pass during recording which, as the log is being generated, transforms it for high replay parallelism. This pass also fixes up the log, and can flexibly trade off replay parallelism for log size. We analyze the performance of Cyrus using full system (OS plus hardware) simulation. Our results show that Cyrus has negligible recording overhead. In addition, for 8-processor runs of SPLASH-2, Cyrus attains an average replay parallelism of 5, and a replay speed that is, on average, only about 50% lower than the recording speed.

This is joint work with Nathan Dautenhahn, Samuel King, Josep Torrellas (UIUC), Gilles Pokam and Cristiano Pereira (Intel, Santa Clara) and will appear in ASPLOS 2013.

Bio: Nima Honarmand is a PhD student in the iacoma group at the Computer Science Department of the University of Illinois at Urbana-Champaign. He received his Master's degree from the University of Tehran and his Bachelor's from Sharif University of Technology, both in Tehran, Iran. He is currently researching co-designed mechanisms for hardware and system software to improve programmability of multicore systems.

Jan 17: Wonsun Ahn (University of Illinois)
Loop-Based Alias Speculation Using Atomic Region Support
Video | Slides

Abstract: Alias analysis is a critical component in many compiler optimizations. A promising approach to reduce the complexity of alias analysis is to use speculation. The approach consists of performing optimizations assuming the alias relationships that are true most of the time, and repairing the code when such relationships are found not to hold.

This work proposes alias speculation that leverages hardware support for atomic regions, which is becoming increasingly popular today. The use of atomic regions eliminates the need for recovery code, which limits the scope and aggressiveness of past speculative alias schemes. In addition, it greatly decreases the number of alias checks that have to be performed at runtime, which used to slow down execution. The potential of the new alias speculation is tested with Loop Invariant Code Motion (LICM) and Global Value Numbering (GVN) optimization passes.

This is work with Yuelu Duan and Josep Torrellas that will appear in ASPLOS 2013.

Bio: Wonsun Ahn is a postdoctoral research scientist at the Computer Science Department of the University of Illinois at Urbana-Champaign. He received his PhD degree from the same department in 2012 and his bachelor's degree from Seoul National University in Korea in 2004. His current research efforts are focused on making the hardware-software barrier more permeable by proposing hardware extensions to enable more compiler optimizations based on runtime information.

Nov 29: Rudolf Eigenmann (Purdue University)
Is OpenMP All You Need for Accelerators?

Abstract: New computer architectures have a tendency to introduce new programming models. CUDA was introduced for programming GPUs. Newer efforts, such as OpenCL and OpenACC, tried to introduce generality and portability across a broader class of accelerators. Now, the integration of OpenACC into the OpenMP standard is being considered. In this talk, I will present the results of an effort that pursued the suitability of OpenMP for GPUs from the start. Building on an advanced parallelizing compiler, OpenMP programs are translated to CUDA. By adding directives that control CUDA-specific features, we study the importance of adding such features to future OpenMP standards. An automatic tuner is a key capability of the translation process. It addresses the "Achilles heel" of optimizing compilers, which is the challenge of making optimization decisions based on runtime knowledge. Such dynamic optimization support is particularly important given the complexity of today's heterogeneous architectures.

Bio: Rudolf Eigenmann is a professor at the School of Electrical and Computer Engineering at Purdue University. He is a co-principal investigator for Information Technology in the Network for Earthquake Engineering Simulation (NEES) Operations project, 2009-2014. His research interests include optimizing compilers, programming methodologies and tools, performance evaluation for high-performance computers and applications, and cyberinfrastructures. Dr. Eigenmann received a Ph.D. in Electrical Engineering/Computer Science in 1988 from ETH Zurich, Switzerland.

Nov 8: Semih Okur (University of Illinois)
How do developers use parallel libraries?
Video | Slides

Abstract: Parallel programming is hard. The industry leaders hope to convert the hard problem of using parallelism into the easier problem of using a parallel library. Yet, we know little about how programmers adopt these libraries in practice. Without such knowledge, other programmers cannot educate themselves about the state of the practice, library designers are unaware of API misusage, researchers make wrong assumptions, and tool vendors do not support common usage of library constructs.

We present the first study that analyzes the usage of parallel libraries in a large-scale experiment. We analyzed 655 open-source applications that adopted Microsoft’s new parallel libraries – Task Parallel Library (TPL) and Parallel Language Integrated Query (PLINQ) – comprising 17.6M lines of code written in C#. These applications were developed by 1609 programmers. Using this data, we answer 8 research questions and we uncover some interesting facts. For example, (i) for two of the fundamental parallel constructs, in at least 10% of the cases developers misuse them so that the code runs sequentially instead of concurrently, (ii) developers make their parallel code unnecessarily complex, (iii) applications of different sizes have different adoption trends. The library designers confirmed that our findings are useful and will influence the future development of the libraries.

Bio: Semih Okur is a second-year PhD student in computer science at the University of Illinois at Urbana-Champaign. He works with Prof. Danny Dig. He is currently focusing on understanding how programmers use parallelism in their code. Based on this understanding, his research goal is to develop techniques and software tools that improve parallel programmer productivity and make parallel programming accessible to all programmers. He received his bachelor's degree in computer engineering from Koç University, Turkey in 2011.

Nov 1: Milos Prvulovic (Georgia Tech)
Performance Debugging Support for the Many-Core Era

Abstract: In recent years, the performance increase trend has shifted from exponential increase in single-core performance to an exponential increase in the number of cores. Unfortunately, most software developers are unprepared or even incapable of creating applications whose performance scales well to many cores, and this is threatening to marginalize the benefits that users expect from new hardware. Instead of hoping that all programmers will suddenly choose to become hardware experts, our approach in addressing this problem is to design low-cost hardware mechanisms and software tools that identify scaling problems automatically and then report them to programmers in a way they can act on, i.e. in terms of what to do in the code to fix the problem.

In this talk, I will present two specific instances of such mechanisms and tools. First, I will present LIME, a framework for analyzing parallel programs and reporting the cause of load imbalance in application source code. This framework uses statistical techniques to pinpoint load imbalance problems stemming from both control flow issues (e.g., unequal iteration counts) and interactions between the application and hardware (e.g., unequal cache miss counts). We evaluate LIME on applications from widely used parallel benchmark suites, and show that LIME accurately reports the causes of load imbalance, their nature and origin in the code, and their relative importance. Second, I will present a set of hardware mechanisms for reporting the causes of excessive cache misses in an actionable way.

In the remainder of the talk, I will discuss possible future directions for this work, and briefly outline several other research directions that I am currently pursuing, such as hardware support for bidirectional debugging, dynamic information flow tracking, protection from physical tampering and snooping attacks, and advanced interconnects for many-core processors.

Bio: Milos Prvulovic is an Associate Professor in the School of CS at Georgia Tech. He received his PhD from the University of Illinois at Urbana-Champaign in 2003 and an NSF CAREER award in 2005, and is a Senior Member of IEEE and ACM. His research area is computer architecture, with emphasis on support for programmability in multi- and many-core architectures, reliability, and security.

Oct 30: Robert Geva (Intel Corporation)
Vector Loops in Heterogeneous Cilk™ Plus

Abstract: Cilk™ Plus is a C/C++ language extension for parallel programming. It provides several programming constructs to express parallelism for task and data parallel programming. It is being proposed towards the next C++ language standard and is implemented by commercial compiler products. This presentation focuses on a relatively less well understood part of the parallel programming solution: vector programming. Vector execution has been achieved so far mostly by scalar programming and reliance on the compiler’s auto-vectorization, sometimes with directives such as ivdep. Cilk Plus offers innovative support for explicit vector programming. This includes a language construct to express a vector loop, a language construct to express elemental functions, array notations and precise language semantics for vector execution, using the C++ standard terminology of “sequenced before”. The semantics clarify how vector loops are different from both scalar loops on the one hand and parallel loops on the other hand. Elemental functions are a novel programming construct that adds modularity to vector programming. They allow the programmer to write a function such that when invoked from a vector loop, the code inside the function executes as if it were part of the loop body. Elemental functions can also be used as a standalone programming construct, adding the well-known SPMD programming model to C/C++. Viewed in the SPMD light, the programmer writes an elemental function to express the operations to be performed on a single set of elements and deploys it on a data parallel collection of elements. The compiler projects the operations on short vectors and generates code that vectorizes across consecutive invocations of the function.

As CPU vendors are integrating processor graphics into the same die, programming systems are emerging that support unified programming across the whole platform. Intel’s 3rd generation Core™ processor integrates a graphics engine which is a vector machine, exposing vector registers with known length to software. Therefore, this engine benefits from explicit parallel and vector programming. The Cilk™ Plus language supports both the CPU and the processor graphics, where the vector programming language constructs have identical syntax and semantics for both sides of the platform. The unified support for programming the CPU and processor graphics allows productive optimization of placement of code to execution units with relatively minor changes around the loops. The presentation includes several case studies of how vector programming is used to implement well-known algorithms, and the vast difference between the usages manifests the generality of the language construct. The presentation also shows measured data from production systems showing the performance gains achieved with vector programming. In conclusion, language support for vector programming is a distinct, essential and performant portion of parallel programming.

Bio: Robert Geva is a principal engineer at Intel’s software and services group. Robert joined Intel in 1991 and has since developed an expertise in compilers and performance analysis and tuning for microarchitectures. Robert has worked on compiler optimizations for a variety of Intel microprocessor-based systems, including the 80486, the Pentium Processor, the Pentium Pro Processor, Itanium, the Pentium 4, the Pentium M and the Core 2 Duo.

Currently, Robert is an architect in the development products division responsible for driving language extensions and programming models for parallel and heterogeneous programming. Robert has been involved with the development of Intel Cilk™ Plus and the offloading model for Intel® Xeon® Phi™. Robert has a BA and an MSc from the Technion, Israel Institute of Technology.

Oct 18: Hironori Kasahara (Waseda University)
Green Computing Using Automatic Parallelizing and Power Reducing Compiler with Multiplatform API for Homogeneous and Heterogeneous Multicores

Abstract: Low power multicore and manycore processors have been attracting much attention in a wide variety of areas from smartphones, medical systems, automobiles, to cloud servers and Exa-scale computers. Waseda University founded the Green Computing Systems Research and Development Center in 2011 with the support of the Japanese Ministry of Economy, Trade and Industry (METI) to develop solar powered super-low-power multicore processors, compilers, multiplatform API and applications with industry such as Fujitsu, Hitachi, NEC, Mitsubishi, Renesas, Olympus, Denso and Toyota. Currently, the OSCAR Multigrain Parallelizing Automatic Power Reduction Compiler and the OSCAR API for homogeneous and heterogeneous multicore processors give us 92 times speedup using 128 core Power 7 SMP server Hitachi SR16000 against sequential processing for Earthquake Simulation Program GMS written in Fortran, 55 times speedup using 64 cores for Cancer Treatment using Carbon Ion Radiotherapy Program written in C also on SR16000, 52 times speedup using 64 cores for JPEG-XR encoder on Tilera Tile64, 1.73 times speedup for MPEG2 encoding using 2 cores ARMv7 Qualcomm MSM8960 (Snapdragon) Android 4.0 for Smart Phones, 1.95 times speedup for engine control program using 2 core V850 multicore, 74% power reduction of real-time MPEG2 Decoding on 8 core homogeneous multicore RP-2, 70% power reduction of real-time optical flow computation using 12 cores (8 processors and 4 accelerators) on heterogeneous multicore RP-X and so on. This talk introduces the OSCAR compiler, OSCAR API version 2.0 and API analyzer which allows us to generate parallel machine codes for various multicore processors having just sequential Fortran or C compilers.

Bio: Dr. Hironori Kasahara is a Professor at the Department of Computer Science and Engineering and Director of the Advanced Multicore Processor Research Institute, Waseda University, Tokyo, Japan, and a member of the IEEE Computer Society Board of Governors.

He received a Ph.D. degree from Waseda University in 1985, and was a visiting scholar at the University of California at Berkeley in 1985, a fulltime assistant professor in 1986, associate professor in 1988 and professor in 1997 at Waseda University. Also, he was a visiting researcher at the University of Illinois at Urbana-Champaign, Center for Supercomputing R&D in 1989-90.

He led several Japanese National Projects such as METI/NEDO Advanced Parallelizing Compiler, Multicore for Real-time Consumer Electronics, Leading Research for Low Power Manycores. He served as a member of MEXT Earth Simulator Architecture Advisory Board, Next Generation Supercomputer Evaluation Committee, High Performance Computing Infrastructure Committee and so on.

He is currently leading the METI Green Computing Systems Research and Development program aiming at developing solar powered super-low-power multicore and manycore processors for smartphones, automobiles, medical systems, cloud servers and supercomputing with industry. He has published over 189 reviewed papers, 29 symposium papers, 136 technical reports, with 106 invited talks, 19 patents (39 patent applications) and 467 articles of newspapers, TV, magazines, web news and so on.

Also, he has served as a PC chair, a PC or a Publication Chair of many conferences supported by IEEE, ACM, IPSJ, such as SC, ICS, ASPLOS, PPoPP, ICPP, IPDPS, ICPADS, CONPAR, JSPP, LCPC and so on. He has received the IFAC World Congress Young Author Prize, the IPSJ Sakai Special Research Award, the Grand Prix runner-up prize at the 2008 LSI of the Year, Best Research Award at the Intel Asia Academic Forum and IEEE Computer Society Golden Core Member.

His research interests include parallelizing compilers, multicore and manycore architectures, green computing systems and their application to automobile, consumer electronics, medical systems, supercomputing and so on.

Oct 11: David Raila (University of Illinois)
A view of the Avascholar system from above and below, and experiences implementing a hard real-time robotic vision and image processing system on modern and older architectures

Abstract: Avascholar is a real-world application testbed within I2PC that integrates 3D capture and processing, computer vision, human analytics such as emotion, age/race/gender and facial tracking, within an educational application. This talk will provide a high-level overview of key components and algorithms followed by a low-level discussion of performance-sensitive portions of the code based on results from VTune.

The InvertNet Imaging System combines a hard real-time robotic camera positioning system integrated with real-time computer vision and image processing. I will give a brief overview of the system and present experiences running the system on an E2200 Allendale vs. an i7-2600 Sandy Bridge.

Bio: David Raila is a Research Programmer for I2PC where he works on the Avascholar system, and for the NSF InvertNet project, designing and building the InvertNet imaging system. David has contributed to a variety of research projects and groups at Illinois over the last 25 years, including the Choices Object-Oriented Operating System, the Vosaic Multimedia streaming system, the Gaia Active Spaces system, the Illinois Security Laboratory, and NCSA's Innovative Systems Lab. David has an M.S. in Computer Science and a B.S. in Electrical and Computer Engineering from the University of Illinois.

Oct 4: Hrabri Rajic and Abhishek Agrawal (Intel)
Optimizing Applications Idle Power

Abstract: The success of a mobile device in the marketplace is determined mainly by user experience, which is correlated to device responsiveness and battery life in addition to esthetics and other subjective preferences. How to diagnose whether an application is power optimized, and how to achieve that, is the subject of this talk. After the proper background is given, power profiles from multiple apps will be shown to demonstrate how software can increase the total platform idle power consumption. Multiple case studies involving real world apps will be discussed in depth to demonstrate how the issues causing high power were root-caused, possible solutions to fix the issues discovered, and the reduction in total power achieved by using low power software techniques.

Bio: Hrabri Rajic has extensive industry experience in parallel linear algebra and distributed and parallel computing. He was a Chair of the GGF DRMAA working group. He is currently involved at Intel in power optimizing Android on Intel® Atom™ platforms.

Abhishek R. Agrawal is a senior technical lead at Intel, driving Intel's initiatives on power efficiency for client & Intel® Atom™ processor-based platforms. He chairs the industry-wide power management working group for the Climate Savers Computing Initiative. He is one of the authors of the book "Energy Aware Computing".

Sep 27: Josep Torrellas (University of Illinois)
Vulcan: Hardware Support for Detecting Sequential Consistency Violations in Programs Dynamically
Video | Slides

Abstract: Past work has focused on detecting data races as proxies for Sequential Consistency (SC) violations. However, most data races do not violate SC. In addition, lock-free data structures and synchronization libraries often explicitly employ data races but rely on SC semantics for correctness. Consequently, to uncover SC violations, we need to develop a more precise technique.

This paper presents Vulcan, the first hardware scheme to precisely detect SC violations at runtime, in programs running on a relaxed-consistency machine. The idea is to leverage cache coherence protocol transactions to dynamically detect cycles in memory-access orders across threads. When one such cycle is about to occur, an exception is triggered. For the conditions considered in this work and with enough hardware, Vulcan suffers neither false positives nor false negatives. In addition, Vulcan induces negligible execution overhead, requires no help from the software, and only takes as input the program executable. Experimental results show that Vulcan detects three new bugs that are SC violations in the Pthread and Crypt libraries, and in the fmm code from SPLASH-2. Moreover, Vulcan's negligible execution overhead makes it suitable for on-the-fly use.
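The cycle-detection idea can be illustrated in software. The sketch below is not Vulcan's hardware mechanism, which works on cache coherence transactions; it only assumes that an execution can be summarized as a directed graph whose edges are program order within a thread plus observed dependences between threads, and that a cycle in that graph means no sequentially consistent interleaving explains the execution. The node names and the two example executions (the classic store-buffering pattern) are hypothetical.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:  # back edge: we returned to the current path
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)


# Program order: each thread writes one variable, then reads the other.
program_order = [("T0:W(x)", "T0:R(y)"), ("T1:W(y)", "T1:R(x)")]

# Hypothetical relaxed execution: both reads return the old value, so each
# read is ordered before the other thread's write to the same variable.
relaxed_deps = [("T0:R(y)", "T1:W(y)"), ("T1:R(x)", "T0:W(x)")]

# Hypothetical SC-compatible execution: T1's read sees T0's write.
sc_deps = [("T0:R(y)", "T1:W(y)"), ("T0:W(x)", "T1:R(x)")]

sc_violation = has_cycle(program_order + relaxed_deps)
sc_ok = not has_cycle(program_order + sc_deps)
```

In the first execution the four accesses form a cycle, so it is flagged; in the second they do not.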

This work is done with Abdullah Muzahid and Shanxiang Qi, and will appear in the International Symposium on Microarchitecture (MICRO) in December 2012.

Bio: Josep Torrellas is a professor at the Computer Science Department of the University of Illinois. His research interests are in multiprocessor computer architecture.

Sep 12: Ali-Reza Adl-Tabatabai (Facebook)
HipHop: High-Performance PHP
Poster | Video | Slides

Abstract: To enable fast software development, Facebook uses the PHP programming language. Although easy to learn and quick to develop with, PHP has significant performance overhead because it is interpreted. This talk presents several tools that the HipHop team at Facebook has developed to improve PHP's performance. We first describe the HipHop compiler, a static compiler that translates Facebook's PHP codebase into a C++ program. The HipHop compiler has been deployed in Facebook's production environment since 2010. We then present the HipHop Virtual Machine, a new language VM that aims to bring high-performance JIT compilation to PHP.

Bio: Ali-Reza Adl-Tabatabai is a software engineer at Facebook, where he's a member of the HipHop compiler team. Prior to Facebook, Ali was a Director and Senior Principal Engineer at Intel Labs where he led a research group developing new programming language technologies and their hardware support for future Intel Architectures. Ali holds 37 patents and has published over 40 papers in leading conferences and journals. Ali received his Ph.D. in Computer Science from Carnegie Mellon University, and a Bachelor of Science in Computer Science and Engineering from the University of California, Los Angeles.

Sep 6: Haohui Mai (University of Illinois)
A Case for Parallelizing Web Pages
Poster | Video

Abstract: Mobile web browsing is slow. With advancement of networking techniques, future mobile web browsing is increasingly limited by serial CPU performance. Researchers have proposed techniques for improving browser CPU performance by parallelizing browser algorithms and subsystems. We propose an alternative approach where we parallelize web pages rather than browser algorithms and subsystems. We present a prototype, called Adrenaline, to perform a preliminary evaluation of our position. Adrenaline is a server and a web browser for parallelizing web workloads. The Adrenaline system parallelizes current web pages automatically and on the fly – it maintains identical abstractions for both end-users and web developers.

Our preliminary experience with Adrenaline is encouraging. We find that Adrenaline is a perfect fit for modern browsers’ plug-in architecture, requiring only minimal changes to implement in commodity browsers. We evaluate the performance of Adrenaline on a quad-core ARM system for 170 popular web sites. For one experiment, Adrenaline speeds up web browsing by 3.95x, reducing the page load latency by 14.9 seconds. Among the 170 popular web sites we test, Adrenaline speeds up 151 out of 170 (89%) sites, and reduces the latency for 39 (23%) sites by two seconds or more.

Bio: Haohui Mai is in the fifth year of his Ph.D. at the University of Illinois at Urbana-Champaign. He is working with Professor Sam King. His current research focuses on improving both security and performance of mobile systems.

Aug 30: Stephen Heumann (University of Illinois)
The Tasks with Effects Model for Safe Concurrency
Poster | Video

Abstract: Concurrent programming is difficult, and today's widely-used concurrent programming models provide few safety guarantees, making it easy to write code with subtle errors. Proposed models offering stronger guarantees are often limited in the class of programs that they can express. In this talk, I will present a new concurrent programming model called tasks with effects, which offers strong safety guarantees while still providing the flexibility needed to support the many ways that concurrency is used in complex applications. This model has significantly greater expressivity than previous safe parallel languages, and can support actor-like programs and programs that combine concurrent and parallel components.

The core unit of work in the tasks with effects model is a dynamically-created task. The model's key feature is that each task has programmer-specified effects, and a runtime scheduler is used to ensure that two tasks are run concurrently only if they have non-interfering effects. Through the combination of statically verifying the declared effects of tasks and using an effect-aware runtime scheduler, the tasks with effects model is able to guarantee strong safety properties, including data race freedom and atomicity. It is also possible to statically prove that some programs in this model behave deterministically. I will describe the semantics of the tasks with effects model, as well as an implementation of it in an extended version of Java. I will also discuss an evaluation showing that it can express several programs exhibiting various patterns of concurrency, and that substantial parallel speedups can be achieved.
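The runtime half of this idea can be sketched as follows. This is a minimal illustration, not the talk's Java-based implementation: it assumes an effect is a read or a write on a named region, that two effects interfere when they touch the same region and at least one is a write, and that the scheduler groups tasks into batches that run one after another. The `Effect`, `Task`, and `schedule` names and the region names are hypothetical, and the static verification of declared effects is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effect:
    kind: str    # "read" or "write"
    region: str  # hypothetical name for a region of shared state


@dataclass(frozen=True)
class Task:
    name: str
    effects: tuple  # programmer-declared effects of this task


def interferes(a, b):
    """Two effects interfere if they share a region and at least one writes."""
    return a.region == b.region and "write" in (a.kind, b.kind)


def tasks_conflict(t1, t2):
    return any(interferes(a, b) for a in t1.effects for b in t2.effects)


def schedule(tasks):
    """Group tasks into batches; tasks within a batch never interfere.

    Each task goes into the batch right after the last batch that holds a
    conflicting task, so conflicting tasks keep their submission order.
    """
    batches = []
    for task in tasks:
        slot = 0
        for i, batch in enumerate(batches):
            if any(tasks_conflict(task, other) for other in batch):
                slot = i + 1
        if slot == len(batches):
            batches.append([])
        batches[slot].append(task)
    return batches


demo_tasks = [
    Task("A", (Effect("write", "x"),)),
    Task("B", (Effect("read", "x"),)),
    Task("C", (Effect("read", "x"),)),
    Task("D", (Effect("write", "y"),)),
    Task("E", (Effect("write", "x"),)),
]
demo_batches = [[t.name for t in batch] for batch in schedule(demo_tasks)]
```

Here the two readers of `x` share a batch, the writer of `y` runs alongside the first writer of `x`, and the second writer of `x` waits for both readers.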

Bio: Stephen Heumann is a Ph.D. candidate in the Department of Computer Science at the University of Illinois at Urbana-Champaign. His research interests include models and techniques for safe parallel programming, and he has worked in the past on the Deterministic Parallel Java project. Before joining UIUC, he received his BS degree in Computer Science from Caltech in 2008.

May 17: Ehsan Totoni (University of Illinois)
Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs

Abstract: Power dissipation and energy consumption are becoming increasingly important architectural design constraints in different types of computers, from embedded systems to large-scale supercomputers. To continue the scaling of performance, it is essential that we build parallel processor chips that make the best use of exponentially increasing numbers of transistors within the power and energy budgets. Intel SCC is an appealing option for future many-core architectures. In this paper, we use various scalable applications to quantitatively compare and analyze the performance, power consumption and energy efficiency of different cutting-edge platforms that differ in architectural build. These platforms include the Intel Single-Chip Cloud Computer (SCC) many-core, the Intel Core i7 general-purpose multi-core, the Intel Atom low-power processor, and the Nvidia ION2 GPGPU. Our results show that the GPGPU has outstanding results in performance, power consumption and energy efficiency for many applications, but it requires significant programming effort and is not general enough to show the same level of efficiency for all the applications. The “light-weight” many-core presents an opportunity for better performance per watt over the “heavy-weight” multi-core, although the multi-core is still very effective for some sophisticated applications. In addition, the low-power processor is not necessarily energy-efficient, since the runtime delay effect can be greater than the power savings.
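The last point follows from energy being power multiplied by runtime. The sketch below works through that arithmetic; the power and runtime figures are hypothetical placeholders chosen for illustration, not measurements from the paper.

```python
def energy_joules(power_watts, runtime_seconds):
    """Energy consumed by a run: average power times runtime."""
    return power_watts * runtime_seconds


# Hypothetical figures, not from the paper: the low-power chip draws 8x less
# power but takes 10x longer to finish the same job.
low_power = energy_joules(power_watts=5.0, runtime_seconds=100.0)
multi_core = energy_joules(power_watts=40.0, runtime_seconds=10.0)

low_power_uses_more_energy = low_power > multi_core
```

With these assumed numbers the low-power processor uses 500 J against 400 J for the faster one, because the slowdown outweighs the power saving.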

Bio: Ehsan Totoni is currently working toward a Ph.D. at the University of Illinois at Urbana-Champaign, advised by Professor Laxmikant Kale. He received a B.S. degree in computer engineering from Sharif University of Technology and an M.S. degree in computer science from the University of Illinois at Urbana-Champaign. His research areas include power and energy efficiency, performance analysis and modeling, and Exascale design.

May 10: John Hart (University of Illinois)

Abstract: AvaScholar is an I2PC Application framework designed to demonstrate applications of scalable parallel programming for the task of supporting remote instruction and education, ranging from academic classes to industry seminars. With AvaScholar we seek to find future technologies that will rely critically on improved computational speeds that will require parallel programming, and support a testbed of parallel applications we can use to motivate, test and demonstrate our parallel programming tools from the I2PC SafeSpeed and Acrobatics projects. AvaScholar consists of two components. The AvaScholar Instructor focuses on real-time 3-D capture and reconstruction, whereas the AvaScholar Client focuses on collecting demographics and tracking the engagement of large numbers of remote participants. We will examine the components of this system, demonstrate current progress in parallel implementation and scalability, and discuss current and future opportunities for leveraging I2PC parallel programming tools for further performance gains.

Bio: John C. Hart is a Professor in the Department of Computer Science at the University of Illinois, Urbana-Champaign where he studies computer graphics and computational topology. Prof. Hart is a past Editor-in-Chief of ACM Transactions on Graphics. He is a co-author of "Real-Time Shading" and a contributing author for "Texturing and Modeling: A Procedural Approach." He served from 1994 to 1999 on the ACM SIGGRAPH Executive Committee, and is an Executive Producer of the documentary "The Story of Computer Graphics." Prof. Hart received his B.S. from Aurora University in 1987, and an M.S. (1989) and Ph.D. (1991) from the Electronic Visualization Laboratory at the University of Illinois at Chicago. He interned with Alan Norton at the IBM T.J. Watson Research Center in 1989 and with Pixel Machines at AT&T Bell Labs in 1990. He was a Postdoctoral Research Associate at the EVL and NCSA until 1992, and an Assistant then Associate Professor in the School of EECS at Washington State University until 2000.

May 3: Peng Wu (IBM Research)
Reusing-JITs are from Mars, Dynamic Scripting Languages are from Venus
Video | Slides

Abstract: Whenever there is an interest in compiling a new dynamically typed language, reusing an existing statically typed language compiler (reusing JITs) is always an appealing option. One popular trend is to use the Java JIT to compile JVM languages such as Jython, JRuby, Scala, Clojure, Groovy, and JavaScript. Others have tried to leverage LLVM to compile Python (Unladen Swallow) and Ruby (Rubinius).

Existing reusing JITs, however, do not deliver the kind of performance boost that proponents have hoped for. The performance of JVM languages, for instance, often lags behind standard interpreter implementations. Even more customized solutions that extend the internals of a JIT compiler for the target language cannot compete with those designed specifically for dynamically typed languages. Our own Fiorano JIT compiler, a reusing JIT based on IBM's product JIT, is a living example of such a phenomenon. As a state-of-the-art reusing JIT for Python, the Fiorano JIT compiler outperforms two other reusing JITs (Unladen Swallow and Jython), but still has a noticeable performance gap with PyPy, a custom designed Python JIT and VM.

This talk offers an in-depth look at the reusing JIT phenomenon. I will discuss techniques that have proved effective based on our own experience of building the Fiorano JIT. More importantly, I will talk about common pitfalls of today's reusing JITs, the most important of which is not focusing sufficiently on specialization, an abundant optimization opportunity unique to dynamically typed languages.

Bio: Peng Wu joined the IBM T.J. Watson Research Center in Yorktown Heights, NY, as a research staff member shortly after receiving her Ph.D. in computer science from UIUC in 2001. Her research interests include optimization of dynamic scripting languages and Java, trace compilation, and software exploitation and support for TM, SIMD and GPGPU. Part of her research focuses on unleashing the power of new hardware features via compilation, such as for CELL BE, BlueGene, and POWER; part is devoted simply to advancing the frontiers of performance-driven compilation. In the past 10 years, she has actively contributed to IBM's product compilers including the XL compiler for C/C++/Fortran and J9 JVM/JIT, and she holds more than a dozen patents.

Apr 26: Josh Fryman (Intel)
UHPC/ExaScale: "There Ain't No Such Thing As A Free Lunch"
I2PC-internal only: Video

Abstract: The ExaScale machine criteria put forth by DARPA and DoE have set a very aggressive goal for Exa-FLOP computation by 2018 in a 20 MW power budget. No current commercial approach appears to be capable of achieving this design target due to the energy-efficiency barrier. The root causes for failing this future vision encompass the first principles behind circuits, networks, architectures, runtimes, and programming systems. Unless the entire community moves in new directions immediately, no ExaScale machine will be built. This talk will explore exactly what the Ubiquitous High Performance Computing (UHPC) challenge from DARPA requires, starting from measured test chip data on the first principles. These results imply that many research efforts may be moving in the wrong direction.

Bio: Joshua Fryman is the chief CPU architect of Intel's design for the DARPA UHPC challenge. The program began in 2010, and confronts the challenges of building potentially Exa-Scale systems on highly constrained energy budgets, problems with resiliency, and the challenge of programmability and usability. He received his B.S. in Computer Engineering from the University of Florida. He spent several years designing cable and satellite TV platforms used by consumers the world over. He then obtained his PhD from Georgia Tech in Computer Architecture before joining Intel as a research scientist. He has been involved in many programs at Intel (Tera-scale research, Larrabee/MIC, Ct, V-ISA, UHPC).

Apr 12: Uzi Vishkin (University of Maryland)
Time to Right the Transition to General-Purpose Many-Core Parallelism

Abstract: The challenge of reinventing general-purpose computing for parallelism, which came into focus in 2003 once processor clock frequencies generally stopped improving, is yet to be met. I will present evidence that, under some assumptions, the effective utilization of potential desktop computing speed in 2012 is around 1% of what desktop machines could have provided had they been built and programmed differently; one such assumption is that typical programmers’ background does not exceed a Bachelor's degree in CS. During the last decade, high-performance general-purpose application innovations, excluding graphics, were minimal in comparison, for instance, to internet and mobile. Perhaps since programming many-cores effectively is too difficult and hence too costly, not enough applications have been developed. This led to a vicious cycle. Lacking applications, an insufficient business case for building the best possible many-core preempts both competition among vendors and a renewed programming model contract between vendors and application developers. The resulting threats to validity/robustness of system, programming and application research and education when only suboptimal hardware is used must also be better recognized.

I will claim that the explicit multi-threaded (XMT) on-chip platform, developed by my research team, can do better by an order of magnitude over vendors’ many-cores on both ease-of-programming and speedups over best serial solutions, and support both claims by experimental data. For ease-of-programming the data include more advanced parallel algorithms, comparable problems taught at earlier developmental stages (e.g., high-school vs. graduate school), a report by a DoD employee of minimal effort for XMT programming beyond a serial version for several parallel graph algorithms, and a joint UIUC/UMD course in which no student was able to get speedups over serial on OpenMP running on commercial SMP hardware, while their speedups on XMT were in the range 7X to 25X. For speedups, stress tests of XMT relative to state-of-the-art CPUs and GPUs for irregular fine-grained problems show speedups of up to 43X; these results assume similar silicon area and power, but much simpler algorithms. To facilitate these advantages, XMT was set up as a clean-slate design supporting the foremost theory of parallel algorithms.

Bio: Uzi Vishkin has been a Professor at the University of Maryland Institute for Advanced Computer Studies (UMIACS) since 1988. His prior affiliations included Technion, IBM T.J. Watson, NYU, and Tel Aviv University, where he was also CS Chair. Per his ACM Fellow citation, he “played a leading role in forming and shaping what thinking in parallel has come to mean in the fundamental theory of Computer Science”. The presentation framework in several parallel PRAM algorithms textbooks, which also include quite a few parallel algorithms he co-authored, is the 1982 Shiloach-Vishkin work-depth methodology. His research team’s recent work on his explicit multi-threaded (XMT) many-core ‘PRAM-On-Chip’ architecture refuted the common wisdom that PRAM algorithms are irrelevant for practice. He is an ISI-Thomson Highly Cited Researcher, was named a Maryland Innovator of the Year for his PRAM-On-Chip venture whose main patent was cited in quite a few patents of major vendors, and his proposal for Reinvention of Computing for Parallelism was ranked first among 49 proposals in a University System of Maryland competition for Maryland Research Centers of Excellence.

Apr 9: Brad Chamberlain (Cray, Inc.)
Chapel: Striving for Productivity at Petascale, Sanity at Exascale

Abstract: Chapel is an emerging open-source programming language whose design and development is being led by Cray Inc. with the goal of making parallel programming more productive. Chapel was developed as part of Cray's entry in the DARPA High Productivity Computing Systems program (HPCS) and is designed to support diverse application areas and system scales, from multicore desktops to the largest HPC systems. Chapel strives to improve programmability and generality compared to current parallel programming models without sacrificing performance and scalability. In this talk, I'll start by motivating Chapel and providing an overview of its core features. I'll then describe high-level features that permit advanced users to define abstractions like parallel loop schedules and distributed arrays. Finally I'll describe current work to extend Chapel to make it suitable for emerging hierarchical and heterogeneous processor architectures such as those being considered for exascale systems. In wrapping up, I'll provide an overview of the project's status and future.

Bio: Bradford Chamberlain is a Principal Engineer at Cray Inc. where he works on parallel programming models, focusing primarily on the design and implementation of the Chapel language in his role as technical lead for that project. Brad received his Ph.D. in Computer Science & Engineering from the University of Washington in 2001 where his work focused on the design and implementation of the ZPL parallel array language, particularly on its concept of the region, a first-class index set supporting global-view data parallelism. In the past, he has also worked on languages for embedded reconfigurable processors and on algorithms for accelerating the rendering of complex 3D scenes. Brad remains associated with the University of Washington as an affiliate faculty member. He received his Bachelor's degree in Computer Science with honors from Stanford University in 1992.

Apr 5: Hyesoon Kim (Georgia Tech)
When GPUs meet CPUs: opportunities, challenges and solutions in heterogeneous architectures

Abstract: The last decade has seen a paradigm shift in the architecture of computing platforms: uniprocessors have given way to multi-core (many-core) processors, and now the industry is moving towards heterogeneous architectures that combine CPUs and GPUs on the same chip. Heterogeneous architectures are especially attractive as they can provide high performance and energy efficiency for both general-purpose applications and high-throughput applications. However, these architectures introduce several new challenges, including programming, determining power and performance trade-offs, and developing hardware solutions that exploit the underlying heterogeneity.

In this talk I will present some of our recent work that reduces the software effort required in programming such architectures, and provides hints to estimate the performance and power behavior of CPU+GPU systems. I will also discuss architecture solutions that improve overall system performance by taking into account the difference in characteristics of CPU and GPU applications, and optimizing the cache partitioning, prefetching, and DRAM scheduling to best suit the workload needs.

Bio: Hyesoon Kim is an Assistant Professor in the School of Computer Science at Georgia Institute of Technology. Her research interests include high-performance energy-efficient heterogeneous architectures, programmer-compiler-microarchitecture interaction, and developing tools to help parallel programming. She received a BA in mechanical engineering from Korea Advanced Institute of Science and Technology (KAIST), an MS in mechanical engineering from Seoul National University, and an MS and a Ph.D. in computer engineering from The University of Texas at Austin. She received the NSF CAREER award in 2011.

Mar 29: Kath Knobe (Intel)
Concurrent Collections (CnC): Application parallelism via coordination
Poster | Slides

Abstract: Explicitly parallel languages and explicitly serial languages are each over-constrained, though in different ways. Concurrent Collections (CnC), on the other hand, maximizes the scheduling freedom for a given target (efficiency) and also among distinct targets (portability). The domain expert writing a CnC program focuses on the meaning of the application, not on how to schedule it.

To prepare an application for parallel execution, we first need to answer two questions: “How should the data and computation be divided into chunks that are potentially parallel?” and “What are the scheduling constraints among these chunks?” A CnC program specifies exactly this information. The resulting program is “ready for parallelism.” CnC isolates the work of the domain expert (interested in finance, chemistry, gaming, …) from the tuning expert (interested in load balance, locality, scalability, …). This isolation minimizes the need for the domain expert to think about all the complications of parallel systems. CnC is a coordination language that specifies the required orderings among potentially parallel chunks of an application. As a coordination language, it must be paired with a computation language. Intel® Concurrent Collections for C++ supports C++ programs.

The talk will include an introduction to the CnC domain specification, an overview of an entirely separate approach for specifying the tuning of the domain spec and performance results for the Intel distributed CnC/C++ system.

Bio: Kathleen Knobe worked at Compass (aka Massachusetts Computer Associates) from 1980 to 1991 designing compilers for a wide range of parallel platforms including Thinking Machines, MasPar, Alliant, Numerix, and several government projects. In 1991 she decided to finish her education. After graduating from MIT in 1997, she joined Digital Equipment’s Cambridge Research Lab (CRL). She stayed through the DEC/Compaq/HP mergers and when CRL was acquired by Intel. She currently works in the Software Solutions Group / Developer Products Group at Intel.

In addition to CnC, her major projects include the Subspace Model of computation (a compiler internal form for parallelism), Data Optimization (compiler transformations for locality), Array Static Single Assignment form (a method of achieving for array elements the advantages that SSA has for scalars), Weak Dynamic Single Assignment form (a global method for eliminating overwriting of data to maximize scheduling flexibility), and Stampede (a programming model for streaming media applications).

Mar 28: Tom Wenisch (University of Michigan)
Efficiency Challenges in Warehouse-Scale Computers
Poster | Slides

Abstract: Architects and circuit designers have made enormous strides in managing the energy efficiency and peak power demands of processors and other silicon systems.  Sophisticated power management features and modes are now myriad across system components, from DRAM to processors to disks. And yet, despite these advances, typical data centers today suffer embarrassing energy inefficiencies: it is not unusual for less than 20% of a data center's multi-megawatt total power draw to flow to computer systems actively performing useful work.   Managing power and energy is challenging because individual systems and entire facilities are conservatively provisioned for rare utilization peaks, which leads to energy waste in underutilized systems and over-provisioning of physical infrastructure. Power management is particularly challenging for Online Data Intensive (OLDI) services---workloads like social networking, web search, ad serving, and machine translation that perform significant computing over massive data sets for each user request but require responsiveness in sub-second time scales.  These inefficiencies lead to worldwide energy waste measured in billions of dollars and tens of millions of metric tons of CO2.

In this talk, I discuss what, if anything, can be done to make OLDI systems more energy-proportional. Specifically, through a case study of Google's Web Search application, I will discuss the applicability of existing and proposed active and idle low-power modes to reduce the power consumed by the primary server components (processor, memory, and disk), while maintaining tight response time constraints, particularly on 95th-percentile latency.  Then, I will briefly discuss our work on PowerRouting, a proposal to dynamically switch servers among redundant power feeds to reduce overprovisioning in data center power delivery infrastructure.  Finally, I will close with brief comments on our new and ongoing work in power, performance, and thermal management at the other end of the computing spectrum, namely Smart Phone devices. 

Bio: Thomas Wenisch is the Morris Wellman Faculty Development Assistant Professor of Computer Science and Engineering at the University of Michigan, specializing in computer architecture. Tom's prior research includes memory streaming for commercial server applications, store-wait-free multiprocessor memory systems, memory disaggregation, and rigorous sampling-based performance evaluation methodologies.  His ongoing work focuses on data center architecture, energy-efficient server design, smartphone architecture, and multi-core / multiprocessor memory systems. Tom received an NSF CAREER award in 2009, two papers selected in IEEE Micro Top Picks, and a Best Paper Award at HPCA 2012. Prior to his academic career, Tom was a software developer at American Power Conversion, where he worked on data center thermal topology estimation. He is co-inventor on six patents. Tom received his Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University.

Mar 15: Matt Pharr (Intel)
ispc: A High-Performance SPMD Compiler for The CPU
Poster | Video | Slides

Abstract: SIMD parallelism has become an increasingly important mechanism for delivering performance in modern CPUs, due to its power efficiency and relatively low cost in die area compared to other forms of parallelism. Unfortunately, languages and compilers for CPUs haven't kept up with the hardware's capabilities. Existing CPU parallel programming models have almost entirely focused on multicore parallelism, neglecting the substantial computational capabilities available in SIMD vector units, while GPU-oriented languages like OpenCL suffer from constraints that impair ease of use on CPUs and lack capabilities needed to achieve maximum efficiency on CPUs due to their focus on GPU architectures.

This talk describes a new compiler, Intel SPMD Program Compiler (ispc), that delivers very high performance on CPUs thanks to effective use of both multiple processor cores and SIMD vector units. ispc draws from GPU programming languages, which have shown that for many applications, the easiest way to program SIMD units is to use a single-program, multiple-data (SPMD) model, with one instance of the program mapped to each SIMD lane. This talk will discuss language features that make ispc easy to adopt and use productively with existing software systems and present results showing that ispc delivers up to 35x speedups on a 4-core system and up to 240x speedups on a 40-core system for complex workloads compared to serial C++ code.

Bio: Matt Pharr is a Principal Engineer at Intel where he most recently has built ispc, the Intel SPMD Program Compiler. He previously was the lead graphics architect in the Advanced Rendering Technology group, which researched new interactive graphics algorithms and programming models. Prior to coming to Intel, he co-founded Neoptica, which developed new programming models for graphics on heterogeneous CPU+GPU systems; Neoptica was acquired by Intel.

Before Neoptica, Matt was in the Software Architecture group at NVIDIA, co-founded Exluna (acquired by NVIDIA), worked in Pixar's Rendering R&D group, and received his Ph.D. from the Stanford Graphics Lab.  With Greg Humphreys, he wrote the textbook "Physically Based Rendering: From Theory to Implementation", which has been used for graduate-level graphics classes at over 20 universities.

Mar 01: Forest Godfrey (Cray)
The Cray Gemini Network: Architecture & Resiliency Analysis
Poster | Video | Slides

Abstract: In the early days of high performance computing, supercomputers were assembled from a small number of extremely fast and customized processors attached to a pool of memory. This has given way to highly scaled networks of commodity microprocessors. To achieve this scaling, challenges in communication and reliability need to be overcome.

This talk will focus on two areas. The first part will provide an overview of the Cray(R) Gemini network, which is one such highly scaled computer architecture. It is used in the Cray XE and XK systems such as "Blue Waters" currently being installed at NCSA and "Titan" currently being installed at Oak Ridge National Labs.

The second part of the talk will focus on the resiliency analysis techniques used to analyze the expected hardware failure rates (as typically measured in units of failures per one billion hours, or Failures In Time).  This analysis is intended to motivate further study into software techniques for mitigation of hardware failures.
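As a rough illustration of the FIT unit mentioned above, the following sketch converts per-component FIT rates into a system-level mean time between failures (MTBF). The component names, counts, and FIT values are hypothetical and are not taken from the talk. The sketch also assumes independent failures with constant rates, so that FIT rates simply add.

```python
# Minimal sketch: converting FIT rates to a system-level MTBF.
# All component names, counts, and FIT values below are hypothetical.

HOURS_PER_FIT_UNIT = 1_000_000_000  # 1 FIT = 1 failure per 10^9 device-hours


def system_fit(components):
    """Sum FIT over all parts, assuming independent, constant failure rates."""
    return sum(count * fit for count, fit in components.values())


def mtbf_hours(total_fit):
    """Mean time between failures, in hours, for a given total FIT rate."""
    return HOURS_PER_FIT_UNIT / total_fit


# name -> (number of parts in the system, FIT per part)
example_components = {
    "network_chip": (10_000, 500),
    "memory_module": (80_000, 100),
    "processor": (20_000, 250),
}

total = system_fit(example_components)
print(f"total FIT: {total}")
print(f"system MTBF: {mtbf_hours(total):.1f} hours")
```

Under these made-up numbers, parts that are individually very reliable still add up to a system that fails every few days, which is the scaling effect the abstract alludes to.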

Bio: Forest Godfrey is a principal engineer at Cray Inc, having been with the company for more than 10 years.  He is currently serving as the lead software architect for GPUs at Cray as well as the lead architect for future system control environments.

Over his career, Mr. Godfrey has worked as a kernel and system controller programmer on the Cray(R) X1, X1E, X2, and XT series of supercomputers. He served on the system architecture teams for the X2, XE and XK series of supercomputers as well as several other as-yet-unshipped systems.  In his system architecture role, Mr. Godfrey has focussed on the Reliability, Availability and Serviceability (RAS) of Cray systems.

Mr. Godfrey received his Bachelor's in Computer Science from Carnegie Mellon University in 1999.  He holds two US patents for his work with Cray.

Feb 23: Zhiyuan Li (Purdue University)
How to Make Loop-level Data Dependence Profiling Fast and Memory Efficient?
Website | Video | Slides

Abstract: Execution-driven data dependence profiling for nonnumerical programs has gained significant interest in recent years because it can resolve memory access ambiguity exactly through program execution, which allows data dependences to be analyzed exactly in spite of complicated aliases and referencing expressions which often defeat the compiler. Dependence results obtained through profiling are exact for specific inputs only. Nonetheless, they provide valuable insights both to compiler designers (who may discover ways to improve their compilers) and programmers (who may discover ways to parallelize the program by hand).

Unfortunately, dependence profiling itself can take tremendous memory and machine time. For practical use, both the memory efficiency and the execution speed need to be improved by at least one or two orders of magnitude from the state of the art. We explore ways to make such improvements with the help of several compiler and runtime techniques. These methods include a) parallelized profiling (with various granularities), b) analysis of type consistency and aliasing to allow a method to embed memory tag with the original data structure instead of using the conventional hash table, c) a partial dependence graph that is proven to be sufficient for loop transformation and parallelization. Experimental data for SPEC CPU 2006 benchmarks are presented.
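To make the baseline concrete, here is a minimal sketch of the conventional hash-table approach to loop-level dependence profiling that the abstract contrasts against. It is not the speaker's technique. The trace format and the function name are hypothetical, chosen only for illustration.

```python
# Minimal sketch of hash-table-based loop-level dependence profiling.
# The trace format and function name are hypothetical, for illustration only.

def profile_loop_dependences(trace):
    """Return the set of loop-carried dependences seen in a memory trace.

    trace: iterable of (iteration, op, address), with op "R" or "W",
    in execution order. Each result is (kind, source_iter, sink_iter, address).
    """
    last_write = {}   # address -> iteration of the most recent write
    last_reads = {}   # address -> iterations that read since the last write
    deps = set()

    for iteration, op, addr in trace:
        if op == "R":
            src = last_write.get(addr)
            if src is not None and src != iteration:
                deps.add(("RAW", src, iteration, addr))  # flow dependence
            last_reads.setdefault(addr, set()).add(iteration)
        elif op == "W":
            src = last_write.get(addr)
            if src is not None and src != iteration:
                deps.add(("WAW", src, iteration, addr))  # output dependence
            for reader in last_reads.get(addr, ()):
                if reader != iteration:
                    deps.add(("WAR", reader, iteration, addr))  # anti dependence
            last_write[addr] = iteration
            last_reads[addr] = set()
        else:
            raise ValueError(f"unknown op: {op!r}")
    return deps


# Trace of: for i in range(1, 4): a[i] = a[i - 1] + 1   (addresses = indices)
example_trace = []
for i in range(1, 4):
    example_trace.append((i, "R", i - 1))
    example_trace.append((i, "W", i))

for dep in sorted(profile_loop_dependences(example_trace)):
    print(dep)
```

Every memory access costs at least one hash-table lookup here, and the tables grow with the number of distinct addresses touched. That is the kind of memory and time overhead the abstract says must be reduced.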

Bio: Dr. Zhiyuan Li joined Purdue University in 1997 where he is now a professor in the Department of Computer Science. He received his PhD degree in Computer Science in 1989 from University of Illinois, Urbana-Champaign where he returned to work in 1990-1991 as a senior software engineer in the Center for Supercomputing Research and Development after one year teaching at York University, Canada. From 1991 to 1997 he taught at the Department of Computer Science, University of Minnesota. Dr. Li currently takes on several research projects sponsored by NSF and by Intel, covering petascale numerical applications, compiler tools for multicore machines, compiler tools for reliable networked embedded systems, and solutions for memory bottleneck problems on multicore.

Feb 16: Shan Lu (University of Wisconsin)
Concurrency-bug detection, diagnosis, and fixing
Poster | Video | I2PC-internal only: Slides

Abstract: Synchronization mistakes in multi-threaded software (i.e., concurrency bugs) are big threats to system reliability in the multi-core era. They are difficult to detect before code release because of the huge state space of multi-threaded software. Once released, they lead to non-deterministic production-run failures that are difficult to diagnose. When eventually detected, they cost a lot of manual effort to fix because of the inherent complexity of synchronization.

In this talk, I will describe our research on detecting, diagnosing, and fixing concurrency bugs. I will first present our effect-oriented concurrency-bug detection tools, ConMem and ConSeq. These tools identify potential failure sites in a program and look for suspicious interleavings that could lead to these failures. The unique effect-oriented perspective enables ConMem and ConSeq to detect concurrency bugs before they manifest with higher coverage and accuracy than traditional cause-oriented approaches. I will then briefly discuss our sampling-based production-run concurrency-bug failure diagnosis tool, CCI, and our automatic concurrency-bug fixing tool, AFix. I will conclude the talk by discussing other on-going research in my group.

Bio: Shan Lu is an Assistant Professor of Computer Sciences at University of Wisconsin, Madison.

She earned her Ph.D. at University of Illinois, Urbana-Champaign, in 2008, where she completed a thesis on "Understanding, Detecting, and Exposing Concurrency Bugs".

At University of Wisconsin, her group works on detecting, diagnosing, and fixing concurrency bugs and performance bugs. Shan Lu won an NSF CAREER Award in 2010. Her research group is currently supported by a Clare Boothe Luce faculty fellowship and NSF grants.

Feb 9: Andrew Chien (University of Chicago)
Technology Scaling and the Future of Microprocessors: The 10x10 Approach
Website | Poster | Video

Abstract: In the waning days of Moore’s Law, Dennard scaling (density, speed, and energy) has given way to density scaling with incremental improvements in transistor speed and energy. In response to energy constraints, architects are pursuing two main approaches – parallelism (multicore CPU, GPU) and heterogeneity (accelerators, SoC) – with major challenges in programmability. While thousands of papers have been written on parallel programming, there has been little research on how to balance programmability with the energy-performance benefits of customization.

We are pursuing a new paradigm, “10x10”, which enables systematic exploration of applications, programmability, and customization in a new model of general-purpose computer architecture – federated accelerators. 10x10 moves beyond the general purpose architecture implementation paradigm (90/10 optimization) by dividing workloads into clusters and systematically exploiting customization separately for each cluster in a federated accelerators architecture. We call this new paradigm “10x10” because it partitions application workloads and optimizes for 10 different 10% cases, not a monolithic 90/10.

We are applying 10x10 to a range of general-purpose workloads and DOE Science applications. Research challenges include how to cluster workloads, architect programmable accelerators, federate accelerators, and program them effectively. We will present the 10x10 paradigm and initial results. The ultimate goal is to enable a stable perspective on application clusters and requirements, and accelerators that enable long-term programmability and a predictable roadmap of computer architectures.

Bio: Dr. Andrew A. Chien is the William Eckhardt Professor in Computer Science and Senior Fellow in the Computation Institute at the University of Chicago, and Senior Computer Scientist at the Argonne National Laboratory. Dr. Chien served as Vice President of Research at Intel Corporation, leading long-range and “disruptive technologies” research at Intel Research. At Intel, he also led Intel’s external research programs, including government and higher education engagements. In this role, Chien launched imaginative new efforts in robotics, wireless power, sensing and perception, nucleic acid sequencing, networking, cloud, and ethnography. Working with external partners, Chien was instrumental in creation of the Universal Parallel Computing Research Centers (UPCRC) focused on parallel software, the Open Cirrus Consortium focused on Cloud computing, and Intel’s Exascale Research program.

For more than 20 years, Chien has been a global research and education leader, and an active researcher in parallel computing, computer architecture, programming languages, networking, clusters, grids, and cloud computing. Previous academic positions include the SAIC Chair Professor in Computer Science and Engineering, and founding Director of the Center for Networked Systems at the University of California at San Diego. While at UCSD, he also founded Entropia, a widely-known Internet Grid computing startup. From 1990 to 1998, Chien was a Professor of Computer Science at the University of Illinois at Urbana-Champaign with joint appointments at the National Center for Supercomputing Applications (NCSA) where he was a research leader for parallel computing software and hardware, and developed the well-known Fast Messages, HPVM, and Windows NT Supercluster systems.

Dr. Chien is a Fellow of the American Association for Advancement of Science (AAAS), Fellow of the Association for Computing Machinery (ACM), Fellow of Institute of Electrical and Electronics Engineers (IEEE), and has published over 130 technical papers. Chien served on the Board of Directors for the Computing Research Association (CRA), Advisory Board of the National Science Foundation’s Computing and Information Science and Engineering (CISE) Directorate, and is currently on the Editorial Board of the Communications of the Association for Computing Machinery (CACM). Chien received his Bachelor's in electrical engineering, Master's and Ph.D. in computer science from the Massachusetts Institute of Technology.

Feb 2: CJ Newburn (Intel)
Many Integrated Core (MIC): What can you do with an IA processor on the other end of the wire?
Poster | I2PC-internal only: Video

Abstract: The motivation for microarchitectures that make different trade-offs to achieve more power-efficient, throughput-oriented processing units is clear. But how do you want to program it, and how do you want that to fit into your overall ecosystem? Whether you’re a researcher or a commercial practitioner, whether you’re thinking short term or looking ahead to exascale, there are ample reasons why you may want the benefits of having a fully-capable IA processor on the other end of the PCIe wire, or as an endpoint on your network.

This talk will explore some of the execution models and system capabilities that having a many-core IA enables.  It also draws on early experience on the Knights family of processors to take a look at the tuning implications that such a system has. I’ll wrap up with tastings from several areas that we’d like help with from the research community.

Bio: Chris (CJ) Newburn is currently focused on the software stack and its performance for Intel’s Many Integrated Core (MIC) architecture products, from programming tools down to middleware.  He has served as a feature architect for Intel's Intel64 platforms, and has contributed to a combination of hardware and software technologies that span heterogeneous compiler optimizations, middleware, JVM/JIT/GC optimization, acceleration hardware, ISA changes, microcode and microarchitecture over the last fourteen years.  Performance analysis and tuning have figured prominently in the development and production readiness work that he's done.  He wrote a binary-optimizing, multi-grained parallelizing compiler as part of his Ph.D. at Carnegie Mellon University.  Before grad school, in the 80s, he did stints in a couple of start-ups, working on a voice recognizer and a VLIW mini-super computer.  He's glad to be working on volume products that his Mom uses.  And he’ll soon be upgrading to have the fastest chip in the world in a machine under his desk.

Jan 26: Franck Cappello (INRIA and the University of Illinois)
Redesigning Fault tolerance for High Performance Computing
Poster | Video | Slides

Abstract: Fault tolerance is already a major concern for users of large-scale message passing HPC applications. Future HPC systems with their projected shorter MTBF will make this problem even more difficult to address. The current fault tolerance approach in HPC essentially relies on concepts defined 30 years ago for generic distributed systems. Several recent studies question its applicability to next generation HPC systems and applications and advocate for the exploration of novel, potentially disruptive fault tolerance techniques.

In this talk, we will analyze two important components of the fault tolerance design: the HPC applications and the HPC systems. We will show how we can conceive more efficient fault tolerance based on the fundamental characteristics of the HPC applications and the dynamic behaviors of the HPC systems. In particular we will present the notions of send-determinism, partial restart, hybrid fault tolerance protocols and processes clustering in HPC applications. We will also show how we can improve fault tolerance by exploring the mine of information generated dynamically by HPC systems about the state of their components.

Bio: Franck Cappello holds a research director position at INRIA and is a visiting professor in Computer Science at University of Illinois at Urbana Champaign. He is co-director with Marc Snir of the INRIA-Illinois Joint-Laboratory on PetaScale Computing, where he is also leading the Resilience/Fault Tolerance effort. He is leading the roadmapping effort on Resilience/Fault Tolerance for IESP (International Exascale Software Project) and EESI (European Exascale Software Initiative). Before 2009, he initiated and directed the Grid5000 project, a nationwide computer science platform for research in large scale parallel and distributed systems used by hundreds of researchers. He was Technical paper co-chair of IEEE/ACM SC2011, Program chair of HiPC 2010, Program co-Chair of IEEE CCGRID’2009 and General Chair of IEEE HPDC’2006.

Jan 19: Zhiqiang Ma (Intel)
Improving Performance and Robustness of Intel® Inspector
Poster | Slides

Abstract: Concurrency bugs, especially data races, are notorious in parallel software development. They are difficult to reproduce and debug. Automatic bug detection tools are a must-have for parallel software development. Intel® Thread Checker is a powerful tool currently shipped in Intel® Inspector for automatically detecting data races and other multithread programming errors. Because it is a dynamic software tool, it has to instrument the program and watch every memory access, and it suffers from high runtime overhead.

In addition, Thread Checker runs in the same address space as the program under analysis. Bugs in the program under analysis can potentially corrupt the data structures in Thread Checker and cause the tool to crash. This can limit the usability of the tool.

We have been working to address both the performance and robustness challenges. In this talk, I will present various techniques already built into the product and techniques currently in research phase for improving the performance and robustness of Thread Checker.

Bio: Zhiqiang Ma is the lead of the Intel® Thread Checker team in the Software and Services Group of Intel Corporation. He is the lead designer and developer of Intel® Thread Checker. His work and interests mainly focus on parallel program analysis, especially dynamic analysis techniques for concurrency bug detection. He holds 6 USPTO granted patents and a few pending patent applications.


Dec 8 2011: Xuehai Qian (UIUC)
BulkSMT: Designing SMT Processors for Atomic-Block Execution
Website | Poster | Slides

Abstract: Multiprocessor architectures that continuously execute atomic blocks of instructions can improve performance and software productivity. However, all of the proposals for such architectures assume single-context cores as their building blocks --- rather than the widely-used Simultaneous Multithreading (SMT) cores. As a result, they are effectively wasting hardware resources.

This paper presents the first SMT design that supports continuous atomic-block (or transactional) execution of its contexts. Our design, called BulkSMT, can be used either in a single-core platform or in a multi-core of SMTs.

We present a set of BulkSMT configurations with different cost and performance. We also describe the architectural primitives that enable atomic-block execution in an SMT core and in a multicore of SMTs. Our results, based on simulations of SPLASH-2 and PARSEC codes, show that BulkSMT supports atomic-block execution cost-effectively. In a 4-core multicore with eager atomic-block execution, BulkSMT reduces the execution time of the applications by an average of 26% compared to running on single-context cores.

Bio: Xuehai Qian is a PhD student in computer science at the University of Illinois at Urbana-Champaign. He works with Prof. Josep Torrellas. His research lies within the fields of cache coherence and memory models for multi- or many-core systems. He is currently focusing on scalable and efficient architectures for improved programmability. He received his bachelor's degree in computer engineering from Beihang University, and his master's degree in computer science from Institute of Computing Technology (ICT), Chinese Academy of Sciences. This particular work has been accepted at HPCA 2012.

Dec 1: Rob Van der Wijngaart (Intel)
A Reproducible, Verifiable Parallel Graph Analysis Benchmark
Poster | Video | Slides

Abstract: Good scientific experiments are done such that all their relevant conditions and parameters can be reproduced in a laboratory. Good scientific experiments are defined such that it can be determined unambiguously that the experiment has indeed been done according to specification. In short, good scientific experiments are reproducible and verifiable. The well-known parallel graph analysis benchmark SSCA2 does not constitute a good scientific experiment, because it fails in both respects. We need to fix that, because graph processing is becoming increasingly important in many areas of computing, and graph analysis benchmarks will be used in future procurements, as well as in problem suites used for system optimizations.

This presentation describes a simple method to accomplish that, while providing a computation and communication efficiency that is higher than that of existing implementations, especially with respect to the creation of the graph data structures. We will also briefly review how the new graph benchmark under development in the graph500 project is addressing these issues.

Bio: Rob is a senior software engineer in Intel’s Software and Services Group. His main interest is parallel computing architecture, algorithms, and software. During his current tenure he has worked on Intel’s 80- and 48-core research processors and a variety of exa-scale projects, among others. Before joining Intel 6 years ago, he worked at NASA Ames Research Center for a dozen years in parallel computing research, focusing on High Performance Computing applications, programming tools for clusters, and benchmarking.

Nov 17: Paul Petersen (Intel)
The Hardest Part of Parallel Programming is Understanding the Limitations of your Serial Algorithms

Abstract: Serial algorithms typically run very inefficiently on parallel machines. This may sound like an obvious statement, but it is the root cause of why parallel programming is considered to be difficult. The current state of the computer industry is still that almost all programs in existence are serial.  To address this situation, Intel has created Parallel Studio, and in particular Parallel Advisor.

This talk will describe the techniques used in Parallel Advisor to provide a developer with the tools necessary to understand the limitations of the existing serial algorithms. Once the limitations are known, the developer can refactor the algorithms and reanalyze the resulting code to see if it could run effectively on parallel hardware. Almost all implementations of serial algorithms are serial for a reason, and the tools available in Parallel Advisor help the user expose these reasons so that appropriate rewrites can be done.

Bio: Paul Petersen is a Sr. Principal Engineer in the Software and Solutions Group (SSG) at Intel. He received a Ph.D. degree in Computer Science from the University of Illinois in 1993. After UIUC, he was employed at Kuck and Associates, Inc. (KAI) working on its auto-parallelizing compiler (KAP), and was involved in the early definition and implementations of OpenMP. While at KAI, he developed the Assure line of parallelization/correctness products for Fortran, C++ and Java. In 2000, Intel Corporation acquired KAI, and he joined the software tools group. At Intel, he worked with the tools group to create the Thread Checker products, which evolved into the Inspector and Advisor components of the Intel® Parallel Studio. Inspector uses dynamic binary instrumentation to detect memory and concurrency bugs, and Advisor uses similar techniques along with performance measurement and modeling to assist developers in transforming existing serial applications to be ready for parallel execution.

Nov 10: Michael Voss (Intel)
Intel(R) Threading Building Blocks 4.0: Go with the flow!
Poster | Blog | Video

Abstract: Many applications can be naturally expressed as computational graphs, where vertices represent computations and the edges express either ordering relationships or the passing of data between these computations. Computational graphs appear across many domains including digital content creation, gaming, finance, mobile computing and technical computing. In this talk, I will present a new feature in the Intel® Threading Building Blocks (Intel® TBB) library that allows users to easily express and execute parallel computational graphs. Intel TBB is a widely used, award-winning C++ template library for creating reliable, portable, and scalable shared-memory parallel applications. The Intel TBB flow graph leverages the library’s task scheduler to create computational graphs that compose with the tasks and generic parallel algorithms provided by the library, allowing users to easily create hierarchical applications with nested parallelism.

After a brief introduction to the Intel® Threading Building Blocks library, I will present an overview of the new flow graph feature, describing the graph object, its node types and edges. I will then present examples of flow graphs including a dependency graph that performs a blocked wave-front computation across a matrix, an image processing application that performs feature recognition, and an implementation of the dining philosophers problem. I will also compare the new flow graph with other existing features in Intel TBB: directed acyclic graphs of tasks and the pipeline class. I will conclude with a summary and pointers to additional information about the flow graph and other new features in Intel® Threading Building Blocks 4.0.
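
The blocked wave-front dependency graph mentioned above can be illustrated with a small sketch. It is plain Python rather than the Intel TBB API, and the function name `run_wavefront` and the toy node body are assumptions made for illustration. Each block of the matrix is a node, edges run from a block to its right and lower neighbors, and a node fires once all of its predecessors have signaled it. The sketch runs ready nodes one at a time; a task scheduler could run all ready nodes in parallel.

```python
from collections import deque

def run_wavefront(n_blocks):
    """Run an n x n grid of blocks as a dependency graph.

    Block (i, j) may run only after blocks (i-1, j) and (i, j-1) finish.
    """
    # Number of unfinished predecessors for every block.
    pending = {(i, j): (i > 0) + (j > 0)
               for i in range(n_blocks) for j in range(n_blocks)}
    value = {}
    order = []
    ready = deque([(0, 0)])  # only the top-left block has no predecessors
    while ready:
        i, j = ready.popleft()
        # Toy node body: combine the results of the upper and left blocks.
        up = value.get((i - 1, j), 0)
        left = value.get((i, j - 1), 0)
        value[(i, j)] = 1 if (i, j) == (0, 0) else up + left
        order.append((i, j))
        # Signal successors along the outgoing edges.
        for succ in ((i + 1, j), (i, j + 1)):
            if succ in pending:
                pending[succ] -= 1
                if pending[succ] == 0:
                    ready.append(succ)
    return value, order
```

Calling `run_wavefront(3)` visits all nine blocks, and every block appears in `order` after both of its predecessors.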

Bio: Michael Voss is a Software Architect in Technical Computing, Analyzers and Runtimes at Intel. He is one of the lead developers of Intel® Threading Building Blocks (Intel® TBB) and the architect of the Intel TBB flow graph. Prior to joining Intel in 2005, he was an Assistant Professor at the Edward S. Rogers Sr. Department of Electrical and Computer Engineering at the University of Toronto. He received his Ph.D. in Electrical Engineering from Purdue University in 2001. His interests include parallel computing, adaptive program optimization, and optimizing compilers.

Nov 3: Doug Carmean (Intel)
Improving Permeability in System Architecture
Poster | Video (I2PC-internal only)

Abstract: Hardware complexity has outpaced software development by a wide margin. Long gone are the days where well written applications and compilers could extract every drop of performance in a computing platform. Software developers are faced with the daunting task of parallelizing their applications using archaic tools that have not kept pace with hardware.  Further, programmers attempting to utilize performance of specialized hardware must become proficient using hybrid environments like OpenCL.

The principle of hardware/software codesign is often cited as the panacea for closing the complexity gap and improving programmer productivity. While conceptually simple, codesign is not well defined and does not necessarily lead to systems that lend themselves to higher productivity. The concept of improving permeability through the hardware/software barrier is introduced as a technique to reduce overall system architecture complexity.

In this talk I will explore tradeoffs that can move specific functions from software to hardware, both for productivity and for efficiency. I will also look at examples where moving a function from hardware to software improved flexibility without compromising efficiency. Using recent product experience, we will discuss the software interfaces to hardware functions and attempt to make sense of hardware/software codesign with heterogeneous hardware.

Bio: Doug Carmean is an Intel Fellow and Researcher At Large at Intel Labs. He is responsible for creating the vision and concept for a fully programmable graphics pipeline based on IA processors that supports highly visual and parallel workloads. Carmean led the team that founded a new group at Intel to define, build and productize products from an architecture that targets the high-end discrete graphics business. He is responsible for growing the development of Larrabee from an early concept to a core piece of Intel's graphics strategy. Carmean enlisted and included key industry software developers in Larrabee's definition to ensure a compelling product.

Since joining Intel in 1989, he has held several key roles and provided leadership in Intel's microprocessor architecture development and product roadmap. As the first chief architect of Nehalem, a next-generation x86 flagship processor, he led the team during the early phases of architecture definition. Prior to this position, he was a principal architect for the Pentium 4 processor, where he completed the memory cluster and power architecture definition including algorithms, structures and overall functionality.

Carmean holds more than 25 patents, with many more pending, in processor architecture and implementation, memory subsystems and low power design. He has published more than a dozen technical papers. Doug enjoys fast cars, Canadian bicycles and scary, Italian motorcycles.

Oct 27: Edward Suh (Cornell University)
Hardware-Assisted Run-Time Monitoring for Trustworthy Computing Systems
Website | Poster | Video | Slides

Abstract: Hardware-enabled security techniques promise to greatly enhance the trustworthiness of future computing systems through their efficiency and tamper resistance. As an example, parallel run-time monitoring in hardware can help ensure a range of security and correctness properties such as memory safety, information flow restrictions, and others with minimal overheads. In practice, however, fixed-function hardware is often difficult to justify due to its inflexibility and high development costs.

This talk will discuss how selective hardware reconfigurability and heterogeneity can enable a flexible yet efficient platform for instruction-grained run-time monitoring, and show how the monitoring capabilities can be utilized for trust. For monitoring of explicit program properties, our architecture utilizes on-chip reconfigurable fabric (FPGA) along with dedicated logic in order to provide flexibility and efficiency. The reconfigurable fabric can dynamically adapt to a range of monitoring and bookkeeping functions based on application needs without expensive hardware re-design and fabrication. At the same time, the bit-level reconfigurable logic is often more efficient for simple checks than traditional processing cores. In addition to explicit checks, software properties can also be implicitly checked through comparing behaviors of diverse program replicas. In this context, our design introduces a heterogeneous multi-core architecture that can effectively exploit redundancy among replicas. Experimental results suggest that run-time monitoring on these architecture designs can closely match performance and energy efficiency of dedicated hardware mechanisms while providing programmability. Along with the architecture optimizations, the talk will also discuss a set of run-time program monitoring techniques including dynamic information flow tracking, memory safety checks, and an extension to data races for concurrency bug detection, illustrating benefits of fine-grained monitoring.

Bio: G. Edward Suh is an Assistant Professor in the School of Electrical and Computer Engineering at Cornell University. He received a Ph.D. degree in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology (MIT) in 2005. Following graduate school, he spent a year at Verayo Inc., leading the development of unclonable RFIDs and secure embedded processors before joining Cornell. His research interests span computer systems in general with particular focus on developing architectural techniques to improve efficiency, security, and correctness of future computing systems. Ongoing research topics include parallel and reconfigurable architecture for security and reliability, embedded cyber-physical systems, flash memory security, and mathematically optimized on-chip network design and management. He is a recipient of an NSF CAREER award, an Air Force Office of Scientific Research (AFOSR) Young Investigator Program award, and an Army Research Office (ARO) Young Investigator Program award.

Oct 20: Ingo Wald (Intel)
IVL and (R)IVL -- An experimental SPMD Compiler for SSE, AVX, and MIC/Knights*, and its Application in a High-Quality (R)endering Framework
Website | Poster | Video (I2PC-internal only)

Abstract: Unleashing the full potential of modern CPUs requires making good use of the architecture's parallelism in terms of both multi-core and, ever more importantly, SIMD. In this talk, we first introduce IVL, an experimental SPMD compiler in which the user writes a scalar program that the compiler then maps to SIMD by running a separate instance of this program in each SIMD lane. IVL uses a scalar C(++)-like syntax with some simple additional keywords to express parallelism and data layouts, and currently has back-ends for SSE, AVX, and MIC/Knights*, as well as support for "device offload" where host and device code run on different devices (eg, on a separate KNF card). To demonstrate the power of this framework we then briefly summarize results from (R)IVL, a complete real-time-high-quality renderer in which all rendering code has been written exclusively in IVL, and that is fully portable between SSE, AVX, and MIC.
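
The SPMD-to-SIMD mapping described above can be sketched generically. The code below is plain Python, not IVL syntax, and the names `scalar_program` and `spmd_execute` are hypothetical. It assumes the common approach in which a branch in the scalar program becomes a per-lane execution mask, so that both sides of the branch run over the whole vector and each lane keeps only the result from its own side.

```python
def scalar_program(x):
    # What the user writes: ordinary scalar code with a branch.
    if x < 0:
        return -x
    return x * 2

def spmd_execute(xs):
    """Emulate one SIMD vector running one program instance per lane."""
    lanes = len(xs)
    result = [None] * lanes
    # The scalar branch becomes a mask with one bit per lane.
    mask = [x < 0 for x in xs]
    # "then" side: executed for the vector, active where the mask is set.
    for lane in range(lanes):
        if mask[lane]:
            result[lane] = -xs[lane]
    # "else" side: executed for the vector, active where the mask is clear.
    for lane in range(lanes):
        if not mask[lane]:
            result[lane] = xs[lane] * 2
    return result
```

For any input vector, `spmd_execute(xs)` gives the same values as calling `scalar_program` on each element in turn, which is the guarantee an SPMD compiler has to preserve.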

Bio: Ingo Wald holds a PhD in engineering from Saarland University and is currently a research scientist at Intel Labs. After his PhD, he first worked as a post-doctoral research associate at the Max Planck Institute for Informatics in Saarbruecken, Germany, after which he joined the Scientific Computing and Imaging Institute (SCI) and School of Computing at the University of Utah as a Research Assistant Professor. His work concentrates on all aspects of real time ray tracing and photo-realistic rendering, high-performance graphics, throughput computing, and hard- and software architectures for high-performance computing.

Oct 13: Laxmikant (Sanjay) Kale and Osman Sarood (UIUC)
Controlling Core Temperatures and Saving Energy with an Adaptive Runtime System
Website | Poster | Video

Abstract: Thermal issues and energy issues will present dominant challenges for at least the next decade. This is true for client-side computing, and even more important for high-end exascale computing. I will present my research group's approach to this problem, based on adaptive runtime systems.

Cooling energy is a significant component of energy spent by a data center. The room temperature is maintained at a very low level to avoid hot-spots --- because some processors are likely to heat up more than the others. On-chip temperature sensors permit real-time monitoring. DVFS capabilities allow one to change the frequency and voltage while the computation is running, at a very low cost. So, in principle one could simply monitor the temperature continuously and change frequencies up and down depending on whether the chip is getting too hot or not. However, for parallel applications running on clusters, this presents a major challenge: slowing down a core slows down the entire parallel application because of dependencies. We handle this challenge by migrating objects away from the slowed-down processors under a sophisticated load-balancing strategy.
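
A minimal sketch of the control loop described above, under stated assumptions: it is plain Python rather than Charm++ code, the function names, frequency range and step size are hypothetical, and a simple greedy placement stands in for the load-balancing strategy, which the abstract does not specify. Hot cores are slowed down, and objects are then reassigned so that slower cores receive proportionally less work.

```python
def adjust_frequencies(temps, freqs, limit, step=0.2, fmin=1.0, fmax=3.0):
    """Lower the frequency of cores above the limit, raise the others."""
    new = []
    for t, f in zip(temps, freqs):
        f = f - step if t > limit else f + step
        new.append(round(min(fmax, max(fmin, f)), 2))
    return new

def rebalance(object_loads, freqs):
    """Greedily place objects so estimated time (load / frequency) stays even."""
    finish = [0.0] * len(freqs)
    placement = [[] for _ in freqs]
    # Place the heaviest objects first.
    for obj, load in sorted(enumerate(object_loads), key=lambda p: -p[1]):
        # Choose the core that would finish earliest with this object added.
        core = min(range(len(freqs)), key=lambda c: finish[c] + load / freqs[c])
        finish[core] += load / freqs[core]
        placement[core].append(obj)
    return placement, finish
```

With two cores at 2.0 GHz and only the first above the temperature limit, `adjust_frequencies` returns `[1.8, 2.2]`, and a following call to `rebalance` moves work toward the faster core so the application is not held back by the slowed one.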

The machine energy itself presents additional sets of challenges. Although, in principle, the power increases nonlinearly with frequency, the idle power is a significant component of the total power consumed. Saving energy by slowing down processors is elusive in this context. We show how techniques based on runtime instrumentation are helping us tackle this problem. These techniques are helpful in client-side settings, data centers, and HPC clusters.

Bio: Professor Laxmikant Kale has been working on various aspects of parallel computing, with a focus on enhancing performance and productivity via adaptive runtime systems, and with the belief that only interdisciplinary research involving multiple CSE and other applications can bring back well-honed abstractions into Computer Science that will have a long-term impact on the state-of-art. His collaborations include the widely used Gordon-Bell award winning (SC'2002) biomolecular simulation program NAMD, and other collaborations on computational cosmology, quantum chemistry, rocket simulation, space-time meshes, and other unstructured mesh applications. He takes pride in his group's success in distributing and supporting software embodying his research ideas, including Charm++, Adaptive MPI and the ParFUM framework.

L. V. Kale received the B.Tech degree in Electronics Engineering from Benares Hindu University, Varanasi, India in 1977, and an M.E. degree in Computer Science from the Indian Institute of Science in Bangalore, India, in 1979. He received a Ph.D. in computer science from the State University of New York, Stony Brook, in 1985.

Oct 7: Cliff Click (Azul Systems)
The Pauseless GC Algorithm

Abstract: Modern transactional response-time sensitive applications have run into practical limits on the size of garbage collected heaps. The heap can only grow until GC pauses exceed the response-time limits. Sustainable, scalable concurrent collection has become a feature worth paying for.

Azul Systems has built a custom system (CPU, chip, board, and OS) specifically to run garbage collected virtual machines; we support heaps as large as 600Gig and allocation rates up to 35Gig/sec.  The custom CPU includes a read barrier instruction. The read barrier enables a highly concurrent (no stop-the-world phases), parallel and compacting GC algorithm.

The Pauseless algorithm is designed for uninterrupted application execution and consistent mutator throughput in every GC phase. Beyond the basic requirement of collecting faster than the allocation rate, the Pauseless collector is never in a “rush” to complete any GC phase. No phase places an undue burden on the mutators nor do phases race to complete before the mutators produce more work. Portions of the Pauseless algorithm also feature a “self-healing” behavior which limits mutator overhead and reduces mutator sensitivity to the current GC state.

We present the Pauseless GC algorithm and the supporting hardware features that enable it.

Bio: With more than thirty years of experience developing compilers, Cliff serves as Azul Systems' Chief JVM Architect. Cliff joined Azul in 2002 from Sun Microsystems where he was the architect and lead developer of the HotSpot Server Compiler, a technology that has delivered dramatic improvements in Java performance since its inception.

Previously he was with Motorola where he helped deliver industry leading SpecInt2000 scores on PowerPC chips, and before that he researched compiler technology at HP Labs. Cliff has been writing optimizing compilers and JITs for over 25 years. He is invited to speak regularly at industry and academic conferences including JavaOne, ECOOP, JVM and VEE; serves on the Program Committee of many conferences (including PLDI and OOPSLA); and has published many papers about HotSpot technology and more than a dozen related patents. Cliff holds a PhD in Computer Science from Rice University.

Oct 6: Karu Sankaralingam (University of Wisconsin)
Defying Dark Silicon with Idempotent Processors
Website | Poster | Slides (I2PC-internal only)

Abstract: Technology constraints are radically changing as we scale to the end of silicon technology. In this talk, I will first describe how "dark silicon" may bring Moore's law to an end. Unless we embrace radical microprocessor designs, performance improvements in the next 10 years will be a meager 8X, more than half the transistors on chips must remain turned off, and the decreasing reliability of transistors further exacerbates these problems. Transistor innovations, architecture innovations, or application innovations alone are insufficient to deliver the "expected" Moore's law speedup of 32X. While these predictions are dire, synergistically exploiting the changing application trends can help overcome these challenges. My research fundamentally rethinks microprocessor designs with energy and reliability as primary constraints. In this talk, I will describe formal concepts and practical systems we have built in my group, spanning FPGA prototypes to full-fledged compilers.

I will introduce and discuss in depth the concept of application idempotence, i.e., programs naturally decompose into a continuous set of regions, where each region can be re-executed to produce the same result. This property can be leveraged to provide efficient recovery from various types of faults. I will describe a compiler framework that can automatically identify such regions. Using such a compiler, we develop the Idempotent Processor Architecture, whose primitive execution block is an idempotent code region. With the capability of recovery by simply re-executing, an Idempotent Processor design executes efficiently and correctly under various constraints, including faulty hardware, control mis-speculation, out-of-order retirement, etc., all while avoiding energy-consuming hardware support like checkpoints, re-order buffers, load-store queues, etc. I will conclude with long-term thoughts on extending this work to post-CMOS technologies.
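
Recovery by re-execution can be illustrated with a short sketch. It is plain Python and a software analogy only, and the names `TransientFault`, `run_idempotent` and `make_flaky_sum` are hypothetical. The region reads its inputs but never overwrites them, so when a fault is detected partway through, running the region again from the start produces the same result without a checkpoint.

```python
class TransientFault(Exception):
    """Stands in for a fault detected while a region is executing."""

def run_idempotent(region, inputs, max_retries=10):
    """Recover from a detected fault by running the region again."""
    for _ in range(max_retries):
        try:
            return region(inputs)
        except TransientFault:
            continue  # no state to restore: the inputs are untouched
    raise RuntimeError("region kept faulting")

def make_flaky_sum(fail_times):
    """Build a region that faults a fixed number of times, then succeeds."""
    state = {"left": fail_times}

    def region(inputs):
        total = 0  # local state is rebuilt on every execution
        for i, x in enumerate(inputs):
            if i == 1 and state["left"] > 0:
                state["left"] -= 1
                raise TransientFault()  # fault detected mid-region
            total += x
        return total

    return region
```

A region that updated `inputs` in place before faulting would not be safe to rerun this way, which is why identifying idempotent region boundaries is the compiler's job in the approach described above.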

Bio: Karu Sankaralingam is an assistant professor in the computer sciences department at the University of Wisconsin-Madison, where he also leads the Vertical Research Group. His research interests include microprocessor design and VLSI. He is a recipient of the NSF CAREER award. He earned a PhD from The University of Texas at Austin in December 2006.

Sep 29: Milind Kulkarni (Purdue)
Automatically Enhancing Locality in Irregular Applications
Website | Video | Slides

Abstract: Over the past several decades of compiler research, there have been great successes in automatically enhancing locality for regular programs, which operate over dense matrices and arrays. Tackling locality in irregular programs, which operate over pointer-based data structures such as trees and graphs, has been much harder, and has mostly been left to ad hoc, application specific methods. In this talk, I will describe efforts by my group to automatically improve locality in a broad class of irregular applications, those that traverse trees. The key insight behind our approach is an abstraction of data structure traversals as operations on vectors. This abstraction lets us design transformations, predict their behavior and determine their correctness. I will present two specific transformations we are developing, "point blocking" and "traversal splicing," and show that they can deliver substantial performance improvements when applied to several real-world irregular kernels.
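
The abstract names "point blocking" without defining it. The sketch below assumes it means letting a block of points traverse the tree together, so that each node is fetched once per block instead of once per point. This is analogous to loop tiling. The tree, the `near` test and all function names are hypothetical, and the code is plain Python rather than the transformation as implemented by the speaker's group.

```python
class Node:
    def __init__(self, lo, hi, left=None, right=None):
        self.lo, self.hi, self.left, self.right = lo, hi, left, right

def build(lo, hi):
    """Build a balanced binary tree over the integer range [lo, hi)."""
    if hi - lo == 1:
        return Node(lo, hi)
    mid = (lo + hi) // 2
    return Node(lo, hi, build(lo, mid), build(mid, hi))

def near(node, p, radius):
    # True if p lies within `radius` of the node's range [lo, hi - 1].
    return node.lo - radius <= p <= node.hi - 1 + radius

def count_point(node, p, radius):
    """Original traversal: one point walks the tree alone."""
    if not near(node, p, radius):
        return 0
    if node.left is None:
        return 1
    return count_point(node.left, p, radius) + count_point(node.right, p, radius)

def count_block(node, block, radius, counts):
    """Blocked traversal: a block of (index, point) pairs visits nodes together."""
    # Keep only the points that would have visited this node on their own.
    active = [(i, p) for i, p in block if near(node, p, radius)]
    if not active:
        return
    if node.left is None:
        for i, _ in active:
            counts[i] += 1
        return
    count_block(node.left, active, radius, counts)
    count_block(node.right, active, radius, counts)
```

Both traversals compute the same per-point counts. The blocked version touches each tree node at most once per block, which is where a locality benefit would come from.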

Bio: Milind Kulkarni is an assistant professor in the School of Electrical and Computer Engineering at Purdue University. His research focuses on developing languages, compilers and systems that can be used to harness the power of emerging, complex computation platforms. Before joining Purdue, he was a postdoctoral research associate at the University of Texas at Austin from May 2008 to August 2009. He received his Ph.D. in Computer Science from Cornell University in 2008. Prior to that, he received his M.S. in Computer Science from Cornell University in 2005, and BS degrees in Computer Science and Computer Engineering from North Carolina State University in 2002. While at Cornell, he was a Department of Energy High Performance Computer Science (HPCS) Fellow from 2004 to 2008. He is a member of the ACM and the IEEE Computer Society.

Sep 22: Pranav Garg (UIUC)
Scalable checkpointing for coherent shared memory
Website | Video | Slides

Abstract: As we move to large manycores, the hardware based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced workloads. Scalable checkpointing mandates tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors. To address this problem, this talk will present Rebound, the first hardware-based scheme for Coordinated Local Checkpointing in multi-processors with directory-based cache coherence. Simulations of parallel programs with up to 64 threads show that Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.
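
The idea of building checkpoints around dynamic groups of communicating processors can be sketched in software. The code below is plain Python and not the Rebound hardware design: the class `DependenceTracker` is hypothetical, and treating every dependence as symmetric is a simplifying assumption. It records which processors have exchanged data since their last checkpoint and computes the group that must checkpoint or roll back together.

```python
from collections import defaultdict

class DependenceTracker:
    def __init__(self):
        # For each processor, the processors it has communicated with.
        self.peers = defaultdict(set)

    def record(self, producer, consumer):
        """Note that `consumer` read data written by `producer`."""
        self.peers[producer].add(consumer)
        self.peers[consumer].add(producer)

    def group(self, proc):
        """Processors that checkpoint or roll back together with `proc`."""
        seen, stack = {proc}, [proc]
        while stack:
            for q in self.peers[stack.pop()]:
                if q not in seen:
                    seen.add(q)
                    stack.append(q)
        return seen

    def checkpoint(self, proc):
        """Checkpoint the group of `proc` and clear its recorded dependences."""
        members = self.group(proc)
        for m in members:
            self.peers.pop(m, None)
        return members
```

Processors outside the group are not disturbed, which is the contrast with a global scheme in which every processor takes part in every checkpoint and rollback.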

Bio: Pranav Garg is a Ph.D. candidate in the department of Computer Science at the University of Illinois at Urbana-Champaign. His research interests include program analysis for automatic software verification. Before joining UIUC, Pranav Garg received his B.Tech degree from the Indian Institute of Technology Kanpur in 2009.

Sep 15: Amin Ansari (UIUC)
Overcoming hard-faults in high-performance microprocessors
Website | Video | Slides

Abstract: As devices get smaller, they become more susceptible to hard faults. To address this challenge, this talk will present two approaches: Archipelago and Necromancer. To protect the cache from hard faults, we present Archipelago, a highly reconfigurable cache design that resizes to provide spare elements. Furthermore, to maximize the effective cache capacity in low-power mode, a near optimal minimum clique-covering configuration algorithm is introduced. To protect the core area against hard-faults, we introduce Necromancer, a robust and heterogeneous core coupling execution scheme. Although a faulty core cannot be trusted, we observe that, for most defects, execution traces on a defective core coarsely resemble those of fault-free executions. Consequently, Necromancer exploits a functionally-dead core to improve system throughput by supplying hints regarding high-level program behavior.

Bio: Amin is an associate at the Computer Science Department of the University of Illinois. He received his Ph.D. from the Department of Electrical Engineering and Computer Science at the University of Michigan, Ann Arbor. He received the College of Engineering Distinguished Achievement Award at the University of Michigan. He also received the Best Paper Award from the 2009 International Conference on Computer Design. His research interests include designing architectural and microarchitectural techniques for enhancing reliability and power efficiency of high-performance microprocessors in deep submicron technologies. Dr. Ansari received the B.S. degree in Computer Engineering from Sharif University of Technology, Iran in 2007 and the M.S.E. degree in Computer Science and Engineering from the University of Michigan in 2008. He is a member of the IEEE and the ACM.