Abstract

Title of dissertation: Microarchitecture and Compilation Support for Clustered Instruction-level Parallel Processors

Krishnan Kunjunny Kailas, Doctor of Philosophy, 2001

Dissertation directed by: Professor Ashok K. Agrawala, Department of Computer Science, and Dr. Kemal Ebcioglu, IBM Thomas J. Watson Research Center, NY

Clustered instruction-level parallel (ILP) processors achieve short cycle time and high performance by decentralizing and partitioning the complex structures and centralized resources of traditional monolithic ILP processors. In this dissertation, we present a number of microarchitecture ideas and compilation techniques that are essential to realizing such high-performance clustered ILP processors.

The primary goal of a code generator for clustered ILP processors is to minimize the schedule length by scheduling operations in clusters, subject to a number of dependent constraints. Traditional code generation schemes handle these constraints in multiple phases of code generation, viz. cluster assignment, register allocation, and instruction scheduling, or a pair-wise combination thereof, often making assumptions such as infinite resources or zero-time copy operations. These phase-ordered solutions have several drawbacks, which result in poorly performing code. This dissertation presents CARS, a new code generation framework for clustered ILP processors, which combines the three phases into a single code generation phase, thereby eliminating the problems associated with phase-ordered solutions. A performance evaluation study using a prototype implementation of a CARS-based code generator shows that it is scalable across a wide range of clustered ILP processor configurations and generates efficient code for a variety of benchmark programs.

We present a new global on-the-fly register allocation scheme developed for CARS. Our scheme neither requires nor depends on prior relative ordering of operations.
This feature of our register allocation scheme opens up a new class of optimization opportunities such as aggressive out-of-order scheduling of non-excepting operations (akin to dynamic scheduling) along with register allocation.

This dissertation presents a new partitioned register file architecture which uses a caching register buffer (CRB) to cache frequently used remote registers. Instead of allocating and copying remote register values to the architected name space of a local register file as in prior schemes, copies of the remote registers are stored in the CRB. A new send-broadcast (SENDB) instruction primitive is proposed for concurrently writing to multiple CRBs. We present a fast on-the-fly scheduling algorithm to schedule SENDB operations. Experimental results show that a small CRB-based partitioned register file can outperform traditional schemes.
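
The CRB idea described above can be illustrated with a small toy model. The sketch below is not the dissertation's design: the class and function names (`CachingRegisterBuffer`, `Cluster`, `sendb`), the buffer capacity, the least-recently-used replacement policy, and the error raised on a missing entry are all assumptions made for illustration. It shows only the two properties stated in the abstract: remote values are held in the CRB rather than in the local register name space, and a single SENDB writes the same value to several CRBs.

```python
from collections import OrderedDict


class CachingRegisterBuffer:
    """Toy CRB holding copies of remote registers.

    Keys are (owner_cluster_id, register_name) pairs. The LRU
    replacement policy is an assumption of this sketch.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def write(self, remote_reg, value):
        # Refresh the position of an existing entry, then store the value.
        if remote_reg in self.entries:
            self.entries.move_to_end(remote_reg)
        self.entries[remote_reg] = value
        # Evict the least recently used entry when over capacity.
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)

    def read(self, remote_reg):
        # A missing entry raises KeyError; how a real design avoids or
        # handles this case is not modeled here.
        value = self.entries[remote_reg]
        self.entries.move_to_end(remote_reg)
        return value


class Cluster:
    """Toy cluster with a local register file and a CRB."""

    def __init__(self, cluster_id, crb_capacity=4):
        self.cluster_id = cluster_id
        self.local_regs = {}
        self.crb = CachingRegisterBuffer(crb_capacity)

    def read_operand(self, owner_id, reg):
        # Local registers come from the local file, remote ones from the CRB.
        if owner_id == self.cluster_id:
            return self.local_regs[reg]
        return self.crb.read((owner_id, reg))


def sendb(source, reg, destinations):
    """Hypothetical model of SENDB: one operation, many CRB writes."""
    value = source.local_regs[reg]
    for dest in destinations:
        dest.crb.write((source.cluster_id, reg), value)


# Usage: cluster 0 produces r5 and broadcasts it to clusters 1 and 2.
c0, c1, c2 = Cluster(0), Cluster(1), Cluster(2)
c0.local_regs["r5"] = 42
sendb(c0, "r5", [c1, c2])
result = c1.read_operand(0, "r5") + c2.read_operand(0, "r5")
```

In this model the local register files of clusters 1 and 2 stay empty after the broadcast, which mirrors the abstract's point that remote copies do not consume the architected name space of the local register file.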