Abstract

Title of dissertation: Microarchitecture and Compilation Support for
                       Clustered Instruction-level Parallel Processors

     Krishnan Kunjunny Kailas, Doctor of Philosophy, 2001

Dissertation directed by: Professor Ashok K. Agrawala 
                          Department of Computer Science, and
                          Dr. Kemal Ebcioglu
                          IBM Thomas J. Watson Research Center, NY.


Clustered instruction-level parallel (ILP) processors achieve short cycle
time and high performance by decentralizing and partitioning the complex
structures and centralized resources of traditional monolithic ILP processors.
This dissertation presents microarchitectural ideas and compilation
techniques that are essential to realizing such high-performance
clustered ILP processors.

The primary goal of a code generator for clustered ILP processors is to
minimize the schedule length by scheduling operations in clusters, subject
to a number of interdependent constraints.  Traditional code generation
schemes handle these constraints in multiple phases of code generation,
viz., cluster assignment, register allocation, and instruction scheduling,
or a pair-wise combination thereof, often making assumptions such as
infinite resources or zero-time copy operations.  These phase-ordered
solutions have several drawbacks that result in poorly performing code.
This dissertation presents CARS, a new code generation framework for
clustered ILP processors, which combines the three phases into a single
code generation phase, thereby eliminating the problems associated with
phase-ordered solutions.  A performance evaluation study using a prototype
implementation of a CARS-based code generator shows that it is scalable
across a wide range of clustered ILP processor configurations and
generates efficient code for a variety of benchmark programs.
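
To make the contrast with phase-ordered code generation concrete, the
following is a minimal sketch of a greedy list scheduler that chooses the
cluster and the issue cycle of each operation in a single step, charging a
copy latency whenever an operand lives in another cluster. It is an
illustration under stated assumptions, not the CARS algorithm: the names
Op, schedule and COPY_LATENCY, the one-cycle copy cost and the tie-breaking
rule are all hypothetical, copies do not occupy issue slots, and register
allocation is not modeled.

```python
from dataclasses import dataclass, field

COPY_LATENCY = 1  # assumed cycles to move a value between clusters


@dataclass
class Op:
    name: str
    deps: list = field(default_factory=list)  # names of producer operations
    latency: int = 1


def schedule(ops, n_clusters, width):
    """Greedy list scheduling; `ops` must be in topological order.

    Returns {op name: (cluster, cycle)}. Each cluster issues at most
    `width` operations per cycle.
    """
    by_name = {op.name: op for op in ops}
    placed = {}  # op name -> (cluster, cycle)
    used = {}    # (cluster, cycle) -> issue slots already taken
    for op in ops:
        best = None
        for c in range(n_clusters):
            # Earliest cycle at which all operands are available in cluster c.
            ready = 0
            for d in op.deps:
                d_cluster, d_cycle = placed[d]
                t = d_cycle + by_name[d].latency
                if d_cluster != c:
                    t += COPY_LATENCY  # operand must cross clusters
                ready = max(ready, t)
            # First cycle at or after `ready` with a free issue slot.
            cycle = ready
            while used.get((c, cycle), 0) >= width:
                cycle += 1
            # Keep the earliest cycle; ties go to the lower cluster index.
            if best is None or cycle < best[1]:
                best = (c, cycle)
        placed[op.name] = best
        used[best] = used.get(best, 0) + 1
    return placed


ops = [Op("a"), Op("b"), Op("c", deps=["a", "b"])]
result = schedule(ops, n_clusters=2, width=1)
```

In this toy run, "a" and "b" issue in parallel on different clusters, but
"c" still has to wait for one operand to be copied, which is exactly the
kind of cost that a zero-time-copy assumption hides.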

We present a new global on-the-fly register allocation scheme developed
for CARS. Our scheme neither requires nor depends on prior relative
ordering of operations.  This feature of our register allocation scheme
opens up a new class of optimization opportunities such as aggressive
out-of-order scheduling of non-excepting operations (akin to dynamic
scheduling) along with register allocation.

This dissertation presents a new partitioned register file architecture
which uses a caching register buffer (CRB) to cache frequently used remote
registers. Instead of allocating and copying remote register values to the
architected name space of a local register file as in prior schemes,
copies of the remote registers are stored in the CRB.  A new send-broadcast
(SENDB) instruction primitive is proposed for concurrently writing to
multiple CRBs.  We present a fast on-the-fly scheduling algorithm to
schedule SENDB operations.  Experimental results show that a small
CRB-based partitioned register file can outperform traditional schemes.
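
The idea of keeping remote values in a CRB rather than in local architected
registers can be sketched with a small software model. This is a toy under
stated assumptions, not the proposed hardware: the Cluster class, the sendb
function, the (owner, register) tag and the evict-oldest-written policy are
all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class Cluster:
    def __init__(self, cid, crb_entries):
        self.cid = cid
        self.regs = {}            # local architected registers
        self.crb = OrderedDict()  # (owner cluster, reg) -> cached remote value
        self.crb_entries = crb_entries

    def crb_write(self, owner, reg, value):
        key = (owner, reg)
        self.crb.pop(key, None)               # a rewrite refreshes the entry
        if len(self.crb) >= self.crb_entries:
            self.crb.popitem(last=False)      # evict oldest-written (assumed)
        self.crb[key] = value

    def read(self, owner, reg):
        if owner == self.cid:
            return self.regs[reg]             # local register file
        return self.crb[(owner, reg)]         # KeyError models a missing copy


def sendb(src, reg, targets):
    """Write src's register value into the CRB of every target cluster."""
    for t in targets:
        t.crb_write(src.cid, reg, src.regs[reg])


c0, c1, c2 = (Cluster(i, crb_entries=2) for i in range(3))
c0.regs["r1"] = 7
sendb(c0, "r1", [c1, c2])
value_seen_by_c1 = c1.read(0, "r1")
```

Note that the broadcast leaves the local register name space of c1 and c2
untouched; only their CRBs hold the copy of r1.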