POWER2 Fixed-Point, Data Cache, and Storage Control Units

D.J. Shippy, T.W. Griffith, and Geordie Braceras

An early version of this paper has been submitted to the IBM Journal of Research and Development.




Introduction

The multichip POWER2 processor implementation provides industry-leading performance in both floating-point and fixed-point applications [1]. Three of the chips on the multichip module, the Fixed-Point Unit (FXU), Data Cache Unit (DCU), and Storage Control Unit (SCU), provide a tightly integrated subsystem that avoids bottlenecks in the cache, memory, and I/O interfaces. The balanced system design allows POWER2 systems to excel on both technical and commercial applications.

The FXU, DCU, and SCU functionality and system structure are similar to those in a POWER implementation [2, 3]. This paper presents the FXU, DCU, and SCU designs, as well as the memory and I/O interfaces, found in the POWER2 system. The new implementation includes the following improvements:

Fixed-Point Unit (FXU)

Figure 1 shows a block diagram of the POWER2 FXU. The FXU decodes and executes all instructions, except for branch and Condition Register logical instructions, which never leave the Instruction Cache Unit (ICU), and floating-point arithmetic instructions, which are executed by the FPU. Fixed-point and floating-point instructions are dispatched by the ICU on the instruction bus (IBUS) to the FXU and FPU simultaneously and are executed in parallel in the FXU and FPU. The FXU contains the address translation, data protection, and data cache directories for all load/store instructions.

The FXU receives up to four instructions from the ICU over the IBUS shown in the upper left section of Figure 1. The instruction buffer unit queues instructions for the two decode units. The decode units decode the instructions and issue them to the two execution units. The decode units also control the general purpose registers (GPRs). The architecture defines thirty-two 32-bit GPRs. Hardware keeps two copies of the GPRs consistent, one for each execution unit. For load/store operations, the address translation logic converts virtual addresses to real addresses, and the data cache control unit controls the data cache and its directory. The Processor Bus (PBUS) unit provides the interface to other POWER2 processor chips.

Instruction Buffer

The ICU dispatches instructions to the FXU and FPU across the four-instruction-wide (4 x 36 bits) IBUS. Associated with each instruction is a valid bit and a set of three tag bits that provide additional information about the instruction. On each cycle, the FXU moves valid instructions and their tags from the IBUS to the eight-entry FXU instruction buffer and queues them for decoding and execution. The FXU limits the number of instructions transferred by informing the ICU of how many entries exist in its buffer.

An instruction's valid bit from the ICU is further qualified by the status of pending branches, ICU instruction cancels, and other related conditions, creating a real valid bit for the instruction. The FXU may cancel the instruction by resetting the real valid bit when:

Instruction Decode

The two instruction decode units have the following responsibilities:

The two decode units are identical. For each instruction, a decode unit combines the primary and extended opcode fields into a single 10-bit field. The decode and execution units use this field to decode all instructions. At the end of the decode cycle, this combined opcode is latched for use during the execute cycle.

During the decode cycle, three (or four for the second execution unit) GPR values are read according to the specified register source and target fields of the instruction. If the data required during the execute cycle is not in the GPR, performance can be improved by routing the required data from its source directly to the execution unit (as well as to the GPR). Three instances of this technique, called a bypass, are implemented.

Results from the arithmetic logic unit (ALU) can bypass the GPRs when a register to register (RR) operation is dependent on the results of an RR operation that executed in the previous cycle. Data from the PBUS can bypass the GPRs when a load from I/O or a move from special purpose register type instruction is followed by a dependent operation. Data from the cache can bypass the GPRs when a load from memory space occurs.

The decode units also manage the issuing of instructions to the two execution units. In particular, the units resolve register dependencies and issue some operations (such as string and load/store multiple operations) to both execution units. The decode units also control the setup for the three-leg adder operations, as explained in the next section.

Execution Control Unit

The two execution control units control load and store execution, manage holdoffs for operands that have not arrived at the execution unit, and write results to the GPRs. Each unit controls its corresponding execution unit.

Execution Unit

The FXU contains two fixed-point execution units, which provide the capability to execute two fixed-point instructions per cycle, twice that of the POWER processor. Both units contain one adder and one logic unit functional block, which give each unit the capability to execute all fixed-point arithmetic (except multiply and divide) and Boolean operations. Execution Unit 0 also performs special operations such as cache operations and all privileged operations. Execution Unit 1 performs multiply and divide. Other responsibilities for the execution units include performing the data transformations required by fixed-point RR operations, computing the effective address for all storage references, and providing data flow controls during the execution of "move to" and "move from" special purpose register instructions. Each execution unit is controlled by its corresponding execution control unit and decode unit, using the derived 10-bit field as the primary control interface. Performance enhancements to the POWER2 execution unit include improved multiply and divide performance, as well as support for parallel execution of dependent add operations.

The multiply/divide unit has been enhanced over POWER [4]. The multiply array supports 2-cycle operations for all multiply instructions (mul, muls, muli), an improvement over POWER, which takes three to five cycles for a multiply. The two divide instructions (div, divs) execute in 13-14 cycles on POWER2 compared to 19-20 cycles for POWER. The div instruction may require three extra cycles if the algorithm converges from above. When the divisor for the div instruction is the most negative number (0x80000000), two extra cycles are required.

A three-leg adder, implemented in the second execution unit, improves performance by allowing parallel execution of dependent add operations. In the following code sequence, Execution Unit 0 adds R2 and R3 and stores the result in R1. Instead of waiting for the result of the first add, the second execution unit's three-leg adder adds R2, R3, and R5, storing the result in R4.

		A R1,R2,R3
		A R4,R1,R5
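
The effect of the three-leg adder on this sequence can be illustrated with a minimal Python sketch. It assumes 32-bit wraparound (modulo 2**32) arithmetic for the GPRs; the function names add2, add3, and run_pair are hypothetical and say nothing about the actual adder circuit.

```python
MASK32 = 0xFFFFFFFF  # GPRs are 32 bits wide


def add2(a, b):
    """Two-input add, as performed for the first instruction."""
    return (a + b) & MASK32


def add3(a, b, c):
    """Three-input add, modelling the three-leg adder."""
    return (a + b + c) & MASK32


def run_pair(r2, r3, r5):
    """Model A R1,R2,R3 and A R4,R1,R5 issued together."""
    r1 = add2(r2, r3)      # Execution Unit 0
    r4 = add3(r2, r3, r5)  # Execution Unit 1: does not wait for R1
    return r1, r4
```

Because addition modulo 2**32 is associative, the value placed in R4 is the same one a serial execution of the two adds would produce, which is why the dependent add does not need to wait.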

Synchronization of FXU and FPU

Synchronization between the FXU and FPU ensures the integrity of the association between data and the instruction that operates upon it. For example, on a floating-point load (lfd) instruction, it ensures that the data fetched by the FXU is loaded into the correct floating-point register (FPR). In both the POWER and POWER2 implementations, data integrity is maintained by synchronizing on all floating-point loads; a floating-point load executes in the FXU during the same cycle that the rename stage in the FPU is selecting a new physical register for the load's target register. Synchronization also helps preserve precise interrupts by ensuring that the FPU does not execute an interruptible operation (IOP), or subsequent instructions, before the FXU indicates that the execution may proceed. POWER implementations use two mechanisms to preserve precise interrupts [4]. An interruptible instruction latch in the FPU ensures that the FPU never executes an IOP ahead of the FXU. The FXU may not execute an IOP until the instruction reaches the FPU rename stage. A counter, indicating the relative execution positions of the FXU and FPU, limits how far either unit can be ahead of the other. The counter-based synchronization scheme relies on the FXU and FPU seeing all instructions on the IBUS.

In POWER2 implementations, the FXU does not see FPU arithmetic operations and the FPU does not see FXU arithmetic operations. Therefore, a queueing scheme was devised to allow precise interrupts. As in POWER, the FPU may not execute IOPs ahead of the FXU. However, the synchronization has been relaxed to allow the FXU to execute all operations, except the floating-point loads, ahead of the FPU [5]. Thus, the FXU can execute all operations except floating-point loads ahead of the FPU, and the FPU can execute all operations except IOPs ahead of the FXU. As a result, the POWER2 FXU can execute further along the instruction stream and, under certain conditions, provide data to the FPU in fewer cycles.

Address Translation

POWER2 implements a load/store architecture. Only load/store operations transfer data between the CPU and memory. No arithmetic operations reference storage. To provide the high data rates required by the pipelined processor, a high-speed data cache is placed next to the CPU. Most load/store operations can be served by the cache without degrading the FXU pipeline utilization. If the data is not in the cache, the data is fetched from main memory. If the data is not in main memory, a page fault is taken and the data is retrieved from mass storage (hard disk). As in POWER systems, the major features of the POWER2 storage mechanism are:

For memory references, an effective address (EA) is first translated to a virtual address and then to a real memory address. The POWER2 address translation has been improved over that of the POWER design. In the new scheme, a single memory reference often fetches the required page table entries (PTEs), and PTEs can be cached. POWER implementations require two memory references and cannot cache PTEs. The POWER2 address translation scheme is shown in Figure 2. The 32-bit EA is converted to a 52-bit virtual address as follows:

  1. Use EA bits 0-3 to address into one of the sixteen Segment Registers (SR[0..15]).
  2. Concatenate the 24-bit SID field of the accessed Segment Register with EA bits 4-31.

This 52-bit virtual address is then converted into a 32-bit real address (RA) using the Hashed Page Table (HTAB).
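
The two steps above can be sketched in Python as follows. The sketch assumes IBM bit numbering (bit 0 is the most significant bit of the 32-bit EA) and, as a simplification, that each Segment Register is represented only by an integer whose low 24 bits hold the SID; the function name and that representation are hypothetical.

```python
def ea_to_virtual(ea, segment_regs):
    """Form the 52-bit virtual address from a 32-bit effective address.

    segment_regs is a list of 16 integers; only the low 24 bits of each
    are used as the SID (an assumption made for this sketch).
    """
    sr_index = (ea >> 28) & 0xF              # EA bits 0-3 select SR[0..15]
    sid = segment_regs[sr_index] & 0xFFFFFF  # 24-bit SID
    low28 = ea & 0x0FFFFFFF                  # EA bits 4-31
    return (sid << 28) | low28               # 24 + 28 = 52 bits
```

For example, with SR[3] holding SID 0xABCDEF, the effective address 0x31234567 maps to the virtual address 0xABCDEF1234567.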

The HTAB contains a maximum of 2**19 Hash Table Entry Groups (HTEGs). Hashing the virtual address produces a pointer to the first of two HTEGs that could contain the translation for the virtual address. If the translation is not found in the initial HTEG, the virtual address is rehashed and a secondary HTEG is searched. Each cache-line aligned HTEG contains eight PTEs. Each 2-word PTE contains fields for the Segment ID (SID), abbreviated Virtual Page Index (AVPI), Real Page Number (RPN), page protection bits, and reference and change bits.

The translation between virtual address and real address is defined by the PTEs in the HTAB. Conceptually, the address relocation hardware searches this table to translate each reference. However, for performance reasons, the hardware keeps a translation look-aside buffer (TLB) which holds PTEs that have been recently used. Each TLB entry is organized like a PTE; hence, the TLB can be considered a cache that contains a subset of the HTAB. The TLB is searched before referring to the HTAB in storage. As a consequence, when software makes changes to the HTAB, the software must issue the appropriate TLB invalidate instructions to maintain the consistency of the TLB and the HTAB.

When a TLB miss occurs (that is, no matching PTE is found in the TLB), the FXU searches the HTAB for the desired PTE. To find the first HTEG that might contain the PTE, the FXU calculates the HTEG address using the hashed virtual address, as well as an HTAB origin address and mask from Storage Description Register 1. The FXU searches through the first HTEG until a matching PTE is found. A "match" occurs if the valid bit is active, the SID in the Segment Register is identical to the SID of the PTE (bits 1-24; word 0), and bits 4-8 of the EA match the AVPI of the PTE (bits 27-31; word 0). If a matching entry is found, the RPN (bits 0-19) contained in word 1 of the PTE is concatenated with the offset (bits 20-31) of the EA to form the 32-bit real address. However, if no match is found in the first set of eight PTEs, then a secondary HTEG address is hashed and the search repeats as previously described. The eight PTEs in the secondary HTEG are checked until a matching entry is found. If no match is found, the translation fails, a page fault occurs, and a Data/Instruction Storage Interrupt is generated.
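
The HTEG search and the formation of the real address can be sketched in Python as shown below. This is an illustration under stated assumptions, not the hardware algorithm: the hash function is not described here, so the caller supplies the primary and secondary HTEGs directly, and each PTE is modelled as a hypothetical named tuple of already-extracted fields rather than as two packed words.

```python
from collections import namedtuple

# Hypothetical unpacked view of a PTE; the real PTE is two packed words.
PTE = namedtuple("PTE", "valid sid avpi rpn")


def search_hteg(hteg, sid, ea):
    """Return the first matching PTE in one HTEG, or None."""
    avpi = (ea >> 23) & 0x1F  # EA bits 4-8 (bit 0 is the MSB)
    for pte in hteg:
        if pte.valid and pte.sid == sid and pte.avpi == avpi:
            return pte
    return None


def translate(sid, ea, primary_hteg, secondary_hteg):
    """Search the primary HTEG, then the secondary one.

    Returns the 32-bit real address, or None to signal a page fault.
    """
    for hteg in (primary_hteg, secondary_hteg):
        pte = search_hteg(hteg, sid, ea)
        if pte is not None:
            # 20-bit RPN concatenated with the 12-bit offset (EA bits 20-31)
            return (pte.rpn << 12) | (ea & 0xFFF)
    return None
```

A return value of None stands in for the Data/Instruction Storage Interrupt described above.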

If both execution units receive a load/store, a TLB miss from Execution Unit 0 will always be resolved before a miss from Execution Unit 1 can be resolved. Furthermore, any interrupt caused by one execution unit will not interfere with the translation of a miss for the other unit.

Translation Look-Aside Buffer (TLB)

The POWER2 data TLB implementation contains several improvements over the POWER implementation. First, the number of entries has been increased from 128 entries to 512 entries. Second, the TLB contains dual ports, allowing the two execution units to access TLB entries in parallel. The data TLB is a two-way set-associative design; each entry consists of two words (word 0 and word 1 of the PTE). Hardware automatically reloads TLB entries on a miss and updates the reference, change, and Data Locking bits in the HTAB. Because the FXU contains the only path to the D-cache, the FXU performs all TLB reloading and HTAB updating for the ICU.

Data Cache Control and Directory

The Data Cache Control and Directory unit is responsible for controlling loads and stores for the FXU and FPU. Design features include increased data path bandwidth and D-cache support for multiple capacities and line sizes in a nonblocking store-back design.

The D-cache pipeline for a fixed-point load begins in the execution cycle with the access of the directory and status arrays. The cache address tag and the TLB real page number are compared (along with the associated control, TLB hit, and cache valid bits) to form the cache late selects. Also in this cycle, the D-cache address is launched from the FXU and captured in the DCU. In the next cycle, the cache access cycle, the D-cache is read and the FXU's late select signal instructs the DCU multiplexer to send the desired data to the FXU. This is the only two-chip crossing path in the entire processor complex. The data which arrives at the FXU is formatted and latched in the D-latch (where it can update the GPRs) and may be bypassed to the execution cycle input latches.

The Data Cache Control and Directory unit supports two design points. The high performance design point incorporates a 256-byte line, 256KB D-cache with an 8-word memory interface. The low-cost design point incorporates a 128-byte line, 128KB D-cache with a 4-word memory interface. Both design points implement a four-way set-associative D-cache; therefore, the corresponding cache sets are 64KB and 32KB, respectively. The longer line size improves performance on sequential data accesses. However, because a cache set is 16 pages in size, the lower 4 bits of the page address are required to index into the cache. This scheme places the following restriction on the operating system: any data referenced with translation on, and then again with translation off, must keep the cache address portion of the virtual and real addresses equal. This aliasing restriction eliminates the chance of the same data being located in the cache in two locations.
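
The arithmetic behind these figures can be checked with a short Python sketch. It assumes a 4KB page, which follows from the 12-bit page offset (EA bits 20-31) used in address translation; the function name and the returned tuple layout are hypothetical.

```python
PAGE_SIZE = 4096  # assumed from the 12-bit page offset (EA bits 20-31)


def cache_geometry(capacity, line_size, ways=4):
    """Return (set_size, lines_per_set, pages_per_set, page_index_bits).

    "Set" follows the usage in the text: one of the four associativity
    classes of the D-cache.
    """
    set_size = capacity // ways
    lines_per_set = set_size // line_size
    pages_per_set = set_size // PAGE_SIZE
    page_index_bits = pages_per_set.bit_length() - 1  # log2 for powers of two
    return set_size, lines_per_set, pages_per_set, page_index_bits


high = cache_geometry(256 * 1024, 256)  # high performance design point
low = cache_geometry(128 * 1024, 128)   # low-cost design point
```

For the high performance design point this yields a 64KB set spanning 16 pages, so 4 bits of the page address take part in the cache index. By the same arithmetic the low-cost design point gives 8 pages and 3 bits; that figure is derived here and is not stated in the text.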

Memory bus bandwidth is augmented by a store-back D-cache design with two change bits per line. To handle a cache store-back operation, a 256/128 byte Store-Back Buffer is implemented to hold the data until the memory bus is available.

With the increased POWER2 computing power, the data path bandwidth has been increased to prevent data access from becoming a bottleneck. The D-cache logic path is fully dual ported from the directory arrays and D-TLB to the D-cache itself. This allows the processor to execute two load/store instructions per cycle. A three-port adder in the EA generation path provides the capability to execute two update form Load/Store instructions in parallel. Each data port to the FXU is a single word wide, allowing two independent data accesses. The new floating-point quad-word load and store instructions, matched with two quad-word wide buses to the FPU, give the processor the ability to move 4 double-words per cycle into the FPU. The D-cache custom array is capable of aligning data on a double-word boundary so quad-word accesses need only be on a double-word boundary.

The POWER2 D-cache is a nonblocking design; the D-cache can still be accessed on one port while the other port resolves a cache miss. A second miss blocks all accesses. Each port functions similarly to the POWER single-port design, with many of the same dataflow structures duplicated for the additional port. The following sections describe the data flow in more detail.

Load/Store Dataflow

Figure 3 shows the load FXU/DCU data flow. The D-cache logic (including directories), the D-cache arrays, and the status array (not shown), are fully dual ported. The design is dynamic; either port may be driven from E-Unit 0 or E-Unit 1. This is a non-blocking D-cache; data can still be accessed from the D-cache with one outstanding miss.

Figure 4 shows the fixed-point store FXU/DCU data flow. Fixed-point stores can be executed two per cycle. The addresses for a store are placed into the FXU Pending Store Queue (XPSQ). Entries from the XPSQ can be written into the D-cache either one or two per cycle using any available port. The XPSQ is a non-overrunable queue; its first priority is clearing entries. Data is transferred from the FXU to the DCU during the cycle after a store is executed; the data is placed into an FXU Store Data Register (XSDR) on the DCU. Whenever a load/cache operation is executed, its address is compared to all entries in the XPSQ to check for a match. All compares are performed using bits 14-29 of the EA.

Floating-point stores can be executed two per cycle. The addresses for stores are placed in the floating-point Pending Store Queue (FPSQ) located in the FXU. Store requests in the FPSQ can be written into the D-cache either one or two per cycle using any available port. The FPSQ is an overrunable queue capable of stopping execution of floating-point stores. The FPSQ has a lower priority than the XPSQ. Data can be transferred from the FPU to the DCU after the FXU receives a data-ready signal from the FPU, depending upon bus availability. Data is placed into a floating-point Store Data Register (FSDR) on the DCU. Whenever a fixed-point load/store/cache operation is executed, it is compared to all entries in the FPSQ to check for a match. All compares are performed using bits 14-29 of the EA.
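
The pending store queue compare can be sketched in Python as follows. The sketch assumes IBM bit numbering (bit 0 is the most significant bit), so EA bits 14-29 form a 16-bit field that excludes the two low-order byte-offset bits; the queue is modelled as a plain list of effective addresses, and both function names are hypothetical.

```python
def compare_field(ea):
    """Extract EA bits 14-29 (bit 0 is the MSB): a 16-bit field."""
    return (ea >> 2) & 0xFFFF


def hits_pending_store(ea, queue):
    """True if ea matches any queued store address on EA bits 14-29."""
    field = compare_field(ea)
    return any(compare_field(entry) == field for entry in queue)
```

Because only 16 address bits are compared, two addresses that differ only in bits 0-13 also report a match, so in this model the check can flag a conflict that does not exist but cannot miss one that does.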

Reload and Store-Back Dataflow

When the data is not found in the D-cache, a reload operation moves data from memory to the D-cache. If the D-cache destination for the new data contains data which has been previously modified, then a store-back operation moves the modified data to memory. The D-cache reload function is accomplished through a third (reload) port on the D-cache array. The DCU is given a reload command specifying the address and set. As memory data arrives, it is written into the D-cache in the second half of the cycle. Load-through data is bypassed from the memory data latch and sent to the FXU or FPU in the first data cycle of all loads. If an ECC error is detected for bypassed data, the FXU or FPU will retry the request; the second data cycle will contain corrected data from the D-cache. D-cache reloads are based on a true LRU algorithm with the memory bus delivering 8 (4) words of data per cycle to fill a 256 (128) byte cache line in eight cycles. The cache line is fetched in a wrap-around fashion in which the first 8 (4) words from memory contain the referenced data. The memory data is loaded directly into the D-cache on the reload port. This additional cache port provides minimal processor performance loss during a reload operation.
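
The wrap-around fetch order can be illustrated with a small Python sketch. It numbers the eight bus-wide packets of a line from 0 and assumes 4-byte words, so the bus delivers 32 bytes per cycle in the 8-word system and 16 bytes per cycle in the 4-word system; the function name is hypothetical.

```python
def reload_order(miss_offset, line_size, bus_bytes):
    """Return the order in which the packets of a cache line arrive.

    miss_offset is the byte offset of the referenced data within the
    line; the packet containing it is delivered first, and the fetch
    wraps around to the start of the line after reaching the end.
    """
    packets = line_size // bus_bytes
    first = (miss_offset % line_size) // bus_bytes
    return [(first + i) % packets for i in range(packets)]
```

For a miss at byte offset 0xA4 of a 256-byte line on the 8-word bus, reload_order(0xA4, 256, 32) gives packets 5, 6, 7, 0, 1, 2, 3, 4, which is eight transfers in total.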

The D-cache store-back function supports half-line granularity by maintaining one change bit per half line. The DCU contains two cache line Store-Back Buffers, SBB0 and SBB1. The two buffers allow optimal performance on reloads. The SBB0 is used to postpone write functions to the memory. The FXU cache control will pass the reload command to the SCU immediately on a miss with one outstanding store-back in SBB0. The SCU will perform reads before writes and postpone store-back operations to give additional reload performance.

The SBB1 has the additional capability of being able to be written from the XPSQ and the FPSQ. The XPSQ and FPSQ are not checked before issuing a reload command to the SCU. Once the reload has been given to the SCU, the control logic moves the replacement line to SBB1. The control logic checks the XPSQ and FPSQ against the replacement line. If there is a match, the stores are done to SBB1. SBB1 is then moved to SBB0 if it is available. If not, the control logic holds until SBB0 is unloaded to memory and then moves SBB1 to SBB0.

Data Cache Unit (DCU)

The DCU consists of four identical chips which provide a four-way set-associative multiport store-back cache. The DCU supports two design points. The first consists of a 256-byte line, 256KB D-cache with an 8-word memory interface. The second consists of a 128-byte line, 128KB D-cache with a 4-word memory interface. As shown in Figure 5, the DCU also provides several buffers for cache and DMA operations, as well as error detection/correction and bit steering for all data sent to and received from memory.

The DCU provides a 128-byte instruction reload buffer (IRB) for transferring instruction cache lines to the ICU, as well as store-back buffers for data cache operations. Data cache reload buffers are built into the data cache array macro. The DCU also provides an I/O cache for DMA operations. This cache holds up to four I/O cache lines and is controlled by the SCU. The following section describes the data cache array macro. The other DCU functions are described in more detail in the SCU section.

Data Cache Array Macro

The data cache array is a four-way set-associative 64KB dual-port array, with support for half-line store-back operations, as well as support for quad-word access on a double-word boundary. To meet the demands of the dual execution units, the data cache array macro has been enhanced over the previous cache designs [6]. The cache is a multiported design which uses a virtual multiport technique and a standard single-port cell macro. This technique has kept the size of the array small while providing multiple ports. Other features of the array are line zeroing, port swapping, unaligned access, and an array built-in self test (ABIST).

The data cache array macro has three unique ports. There are two 36-bit read/write ports (Port 0 and Port 1) and one 72-bit write-only port (CRB port). The cache also has a 288-bit read-only port designed for storing back cache lines. The virtual multiport technique provides a full three-port array to the outside logic, while internally it pipelines three sequential cycles within one processor cycle. The first two read/write cycles are always performed. The third cycle is for the CRB port and is only performed during cache reloads. The CRB port is used for loading data from the memory bus into the cache. Memory data is loaded one word per cache macro per cycle in a four-word memory system, and two words per cache macro per cycle in an eight-word system.

A port-swap feature minimizes the delay for a read to port 1 when preceded by a write to port 0. The feature swaps the port operations to guarantee that a port 1 read will never follow a port 0 write. This swap allows the RAM to take advantage of the faster array recovery following a read, and start the port 1 access earlier than if it had followed a write cycle. The port-swap circuitry identifies when port 0 is writing, and then it simply reverses the internal port clocks. Forcing port 0 to occur second, whenever a write occurs, allows it to become the priority port during the double write case. The reload write maintains priority over the two execution unit ports.

The cache supports both aligned and unaligned accesses. The cache can read (write) from (onto) the cache-to-processor buses on double-word boundaries. Because each DCU chip provides four bytes of a quad-word, each array requires access to data on either a word or half-word boundary. The cache is organized to read/write "aligned data," such as a word (bytes A1, B1, C1, D1), or data on a half-word boundary. In this case, bytes C and D from word 1 can be merged with bytes A and B from word 2 to form the word C1|D1|A2|B2. The RAM increments the address for half of the bytes and then swaps the data between the upper and lower bytes for proper alignment. The last word on the line is not valid for this function; therefore, data misaligned across cache lines requires two RAM accesses.

The virtual multiport cache is designed to be logically equivalent to a real multiport array. This requires a compare-bypass feature for the two read/write ports to guarantee that the execution unit receives the last data written when the ports simultaneously read and write the same address. Because the port-swapping feature will force the read access to occur first, an address comparator and data-in/data-out multiplexor is included to identify when an address collision has occurred so it can bypass written data to the previously read port. The comparator not only identifies when the addresses are equal but also when they are adjacent along a cache line. This is necessary when one access is aligned and the other is misaligned. Although the addresses are different, portions of the two accesses may overlap, and the comparator must be able to bypass half of the bytes during a read/write cycle.
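
The half-word-boundary read can be modelled with a short Python sketch. It is one possible reading of the mechanism, not a description of the array circuitry: the per-chip data of a cache line is modelled as a list of 4-byte words, and the function name and its arguments are hypothetical.

```python
def read_word(words, index, misaligned=False):
    """Read a word from a line modelled as a list of 4-byte values.

    Aligned: returns words[index] (bytes A, B, C, D).
    Misaligned: returns C|D of words[index] followed by A|B of
    words[index + 1], i.e. the word starting on the half-word boundary.
    """
    if not misaligned:
        return words[index]
    if index + 1 >= len(words):
        # Last word on the line: the access crosses into another line
        raise ValueError("crosses a cache line; two accesses are needed")
    upper = words[index + 1][:2]   # A, B read at the incremented address
    lower = words[index][2:]       # C, D read at the original address
    raw = upper + lower            # array output order: A2 B2 C1 D1
    return raw[2:] + raw[:2]       # swap halves to get C1 D1 A2 B2
```

With words [b"abcd", b"efgh"], a misaligned read at index 0 returns b"cdef", and a misaligned read at the last index raises an error, mirroring the two-access case for data that spans cache lines.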

To permit the software to initialize lines in the cache, a mechanism is provided where the FXU can zero-out cache lines. This feature allows lines in the cache to be initialized without requiring the line to be transferred from memory. This initialization is significantly faster than a series of stores with zeroes for data.

Storage Control Unit (SCU)

The main function of the SCU is to control the communication between the processor complex and the other system units: the I/O control units, the main memory unit, and the IPL read-only storage (ROS) unit. The SCU interfaces with the FXU and ICU processor chips across the PBUS, with I/O and ROS over the SIO bus, and with main memory using the memory address and control buses. Each of these interfaces has a unique set of control signals. In addition to managing these interfaces, the SCU contains logic for external interrupts and the performance monitor.

Figure 6 shows a high-level block diagram of the SCU. The SCU logic consists of the following areas: PBUS interface, SIO bus interface, memory interface, ROS interface, performance monitor, and external interrupts. The memory interface is further broken down into cache reload and store-back operations, memory scrub operations, error handling, and bit steering.

PBUS Interface

The PBUS interface supports three types of operations: memory, I/O load/store, and move. Memory addresses are moved from the PBUS into the PBUS Memory Queue and then moved out to main memory via the memory row/column address generation logic. I/O load/store operations to the SCU are used to read and write SCU registers, DCU registers, and I/O registers. The only SCU registers which can be read and written from the PBUS are the performance monitor registers, bank configuration registers, external interrupt registers, SCU control registers, and error registers. For I/O load/store operations to the DCU and I/O registers, the 128-byte PIO buffer is used to move data between the PBUS and SIO bus. Move operations transfer the Interrupt Level Control Register (ILCR) to/from the PBUS in a single cycle.

Memory Interface

The memory interface is a high-speed synchronous split address/data bus which allows the processor, as well as I/O devices, to access main memory. The two primary changes to the memory interface are support for both a four-word and eight-word memory to D-cache interface, capable of transfer rates over 2000 MB/s, and support for three cache line sizes. The memory interface also improves performance through its memory request queuing schemes and reload/store-back strategies. Similar to POWER, this implementation enhances reliability with memory scrubbing, ECC, and bit steering.

D-Cache Line Size Support

The POWER memory interface supports 64-byte lines for the I/O and instruction caches and 128-byte lines for the D-cache. The POWER2 design supports both 128-byte and 256-byte D-cache lines while providing both 64-byte support for the POWER2 I/O cache line and 128-byte support for the I-cache. A new protocol was required for not only four-cycle and eight-cycle transfers, but also two-cycle transfers. In addition, a new real-to-DRAM address translation was required by the SCU for the 8-word system. This translation is generated in a single cycle as will be described later.

Memory Configuration

The memory interface supports both a lower-cost 4-word configuration and a high performance 8-word configuration. The 4-word interface maintains compatibility with the existing memory cards and POWER's I/O subsystem, creating a stable interface for debugging the POWER2 processor chips. The high performance 8-word interface supports POWER2's improved processing capabilities. This unique memory interface selects the mode based on the number of installed memory cards.

Memory cards are two words wide and a minimum configuration consists of two cards. When two additional cards are installed, the memory interface becomes an 8-word bus. The design provides the customer an opportunity to buy a system with a minimal set of memory cards. With no changes to the planar, hardware, or software, the customer can add memory cards, providing both a wider data bus and a larger memory. The wider data bus doubles the memory performance.

The hardware automatically detects the number of memory cards present and establishes the width of the memory data bus. The On Card Sequencer (OCS) monitors a Memory Card Detect signal to determine how many memory cards are present. The OCS interfaces with the common on-chip processor (COP) logic [7] in the processor chips over a COP bus to configure the system. During IPL, the OCS initializes a mode latch in the FXU, SCU, and DCU that the processor chips use to determine a memory transfer's data width and number of cycles.

Memory Request Queues / Controls

To improve storage bandwidth and latency, the SCU queues and prioritizes memory requests and controls memory access. The SCU maintains three memory request queues. The first queue holds up to three processor requests, the second holds two I/O DMA requests, and the third holds one memory scrub request. Three corresponding address generators create the address and the bank selects for the next request on the queue. The SCU arbitrates for the memory bus in parallel with the address generation. The SCU's memory arbiter grant logic selects one of three requests. The memory arbiter prioritizes the requests in the following order: DMA requests are highest in priority, followed by processor requests, followed by memory scrub operations. If back-to-back DMA requests are active along with processor requests, the arbiter grants the two DMA requests first, followed by one processor request.
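
The queueing and priority order can be sketched in Python as a fixed-priority arbiter. This is a minimal model under stated assumptions: the class and method names are hypothetical, each request is reduced to an address, and the timing overlap between arbitration and address generation is not modelled.

```python
from collections import deque


class MemoryArbiter:
    """Fixed-priority model of the three SCU memory request queues."""

    # Queue depths as given in the text
    CAPACITY = {"dma": 2, "processor": 3, "scrub": 1}
    # Highest priority first
    PRIORITY = ("dma", "processor", "scrub")

    def __init__(self):
        self.queues = {name: deque() for name in self.PRIORITY}

    def request(self, source, address):
        """Queue a request; return False if that queue is full."""
        queue = self.queues[source]
        if len(queue) >= self.CAPACITY[source]:
            return False
        queue.append(address)
        return True

    def grant(self):
        """Grant the oldest request of the highest-priority source."""
        for source in self.PRIORITY:
            if self.queues[source]:
                return source, self.queues[source].popleft()
        return None
```

With one processor request and two DMA requests pending, successive calls to grant() return the two DMA requests and then the processor request, matching the back-to-back DMA case described above.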

While the arbiter is generating the bus grant, the bank select logic determines which one of 16 memory banks to activate. This logic compares the upper address bits of the real address with the base address bits in the bank configuration registers. The number of bits that are compared depends on the size field in the configuration register. If the addresses match, the bank select for that register is activated and the transfer is completed.
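
A minimal sketch of this comparison follows. The register layout is an assumption made for illustration: each configuration register is modeled as a base address plus a size field holding log2 of the bank size in bytes, so the size field directly sets how many upper address bits are compared. The real register encoding is not given here.

```python
from typing import NamedTuple, Optional, Sequence


class BankConfig(NamedTuple):
    base: int       # base real address of the bank (assumed size-aligned)
    size_log2: int  # hypothetical size field: log2 of the bank size in bytes


def select_bank(real_addr: int, banks: Sequence[BankConfig]) -> Optional[int]:
    """Return the index of the bank whose range contains real_addr, if any."""
    for index, bank in enumerate(banks):
        # Compare only the address bits above the bank's size boundary.
        if (real_addr >> bank.size_log2) == (bank.base >> bank.size_log2):
            return index
    return None
```

A larger bank compares fewer upper bits, which is why the number of compared bits depends on the size field.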

The memory interface control reduces latency on back-to-back requests. By allowing two memory operations to be pending at any given time, the memory card begins to process the second request before the first is complete.

Cache Reloads and Storebacks

When cache misses occur, cache reload and store-back operations move instructions and data between memory and the ICU, DCU, and FXU. These operations are jointly executed by the FXU, DCU, ICU, and SCU. D-cache miss performance is improved by implementing a load-through path for reloads, a store-back buffer that allows reloads to occur in parallel with a store-back operation to the buffer, and a high-priority reload feature. As in POWER, the I-cache miss sequence routes the data through the DCU to the ICU, reducing the pin count and providing the DCU's error detection and correction (ECC) coverage.

D-cache miss and store-back requests are initiated by the FXU and are sent as a processor request to the SCU. The SCU controls the 4-word or 8-word transfers from memory to the DCU. As shown in the top of Figure 5, data passes through the bit steering logic before being sent along two data paths. The ECC logic path goes to the D-cache. The load-through path bypasses the D-cache, sending data directly to the FXU and FPU data buses. When a new line of data is brought into the DCU, the word that satisfies the request is brought in first, minimizing latency. When the end of the line is reached, the first word of the line and the remaining sequential words are fetched until all eight 4-word or 8-word data packets arrive in the DCUs.
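
The wrap-around reload order can be illustrated with a short sketch. It works at the granularity of the eight data packets mentioned above; the function name and the packet-index view of the line are assumptions made for illustration.

```python
def reload_order(requested_packet: int, packets_per_line: int = 8) -> list:
    """Return packet indices in arrival order, requested packet first.

    After the end of the line is reached, the order wraps to packet 0 and
    continues sequentially until every packet has been fetched.
    """
    return [(requested_packet + i) % packets_per_line
            for i in range(packets_per_line)]
```

For a miss that falls in packet 5, the sketch yields the order 5, 6, 7, 0, 1, 2, 3, 4, so the data that satisfies the request arrives on the first transfer.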

The high-priority reload operation provides a performance advantage for all D-cache miss operations that require the cast-out of a dirty line. For these operations, two events need to occur: data from the D-cache must be written back to memory and data from memory must be stored into the D-cache. From a programmer's view, the data returned from memory is highest in priority. The data written back to memory is no longer needed. The high-priority reload design hides the cache line store-back penalties on the memory bus. When the SCU queues the cache reload and store-back requests, the reload and store-back addresses are monitored. If the addresses are not for the same cache line, the reload is given higher priority. The store-back operation must wait to access the memory bus until there are no reload requests pending.
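
The address check can be sketched as below. The function name and the 128-byte default line size are assumptions, and servicing the store-back first when both addresses fall in the same line is an inference from the text rather than a stated rule.

```python
def order_requests(reload_addr: int, storeback_addr: int,
                   line_bytes: int = 128) -> list:
    """Return the order in which the two queued requests use the memory bus.

    line_bytes is an assumed parameter; the D-cache line size is not
    specified in this section.
    """
    same_line = (reload_addr // line_bytes) == (storeback_addr // line_bytes)
    if same_line:
        # Assumed: the store-back goes first so the reload sees current data.
        return ["storeback", "reload"]
    # Different lines: the reload is given higher priority.
    return ["reload", "storeback"]
```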

I-cache miss requests are initiated by the ICU and are sent as a processor request to the SCU. The SCU controls the 4-word or 8-word data transfers from memory to the 128-byte I-cache reload buffer in the DCU. The SCU controls the order in which data is loaded (quad-word 0 or quad-word 1 first) and when data is sent on the I-cache reload bus. The I-cache reload memory data passes through the DCU's bit steering and ECC logic. To reduce latency, the first quad-word of the memory data includes the instruction requested by the ICU. A wrap-around load of the instructions is performed.

Memory Scrubbing

To reduce the chances of an unrecoverable failure, the processor hardware provides a software-controlled memory-scrubbing function that attempts to find and correct single-bit errors before they become double-bit errors. The software uses three registers, located in local I/O space, to control the scrub function: the Scrub Start Address Register (SSAR), the Scrub End Address Register (SEAR), and the Scrub Timer Value Register (STVR).

The scrub sequence consists of three memory transfers: a read operation to detect errors, a write operation to correct the errors, and a read operation to verify that the data has been corrected. If no errors are detected during the initial read operation, the subsequent write and read operations are not executed. When an error is detected, the hardware records the type of error (soft, hard, or uncorrectable) and the address where it was detected.
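
The three-transfer sequence can be sketched against a toy memory model. Everything here is hypothetical scaffolding: `FakeMemory` and its status strings are invented for the example, the soft/hard distinction follows the common convention that a soft error disappears after a rewrite while a hard error persists, and skipping the write and verify steps for an uncorrectable error is an assumption.

```python
class FakeMemory:
    """Toy memory: each address is tagged 'soft', 'hard', or 'double'."""

    STATUS = {"soft": "single", "hard": "single", "double": "double"}

    def __init__(self, faults):
        self.faults = dict(faults)
        self.transfers = 0  # counts read and write transfers

    def read(self, addr):
        self.transfers += 1
        status = self.STATUS.get(self.faults.get(addr), "ok")
        return 0, status  # data value is irrelevant to this sketch

    def write(self, addr, data):
        self.transfers += 1
        if self.faults.get(addr) == "soft":
            del self.faults[addr]  # rewriting clears a soft error


def scrub_word(mem, addr):
    """Scrub one location; return (error_type, addr) or None if clean."""
    data, status = mem.read(addr)      # 1. read to detect errors
    if status == "ok":
        return None                    # no write or verify needed
    if status == "double":
        return ("uncorrectable", addr)
    mem.write(addr, data)              # 2. write the corrected data
    _, status = mem.read(addr)         # 3. read to verify the correction
    return ("soft" if status == "ok" else "hard", addr)
```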

Error Detection and Correction

The ECC logic allows the DCU to correct all single-bit errors and to detect all double-bit errors. The system memory bus is divided into either eight or four ECC words; each DCU chip receives either one or two words per data transfer. The ECC word contains 32 data bits, 7 check bits, and 1 spare bit. Each word is encoded with a modified Hamming code when written to memory and is checked for errors when read from memory. If a single-bit error is detected, the DCU corrects the data and writes the ECC syndrome (an 8-bit code that indicates which bit failed) into a register. This register and the corresponding failing address register in the SCU can be read by software to isolate the memory failure to a particular memory bit.
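
To illustrate single-error correction and double-error detection over 32 data bits with 7 check bits, the sketch below uses a textbook extended Hamming code. It is not the modified Hamming code used by the DCU: the bit layout, the syndrome format, and the function names are assumptions, and the spare bit is not modeled.

```python
PARITY_POS = [1, 2, 4, 8, 16, 32]            # 6 Hamming check bits
DATA_POS = [p for p in range(1, 39) if p not in PARITY_POS]  # 32 data bits
# Bit 0 of the codeword holds the 7th check bit (overall parity).


def encode(data: int) -> int:
    """Return a 39-bit codeword for a 32-bit data value."""
    word = 0
    for i, pos in enumerate(DATA_POS):
        if (data >> i) & 1:
            word |= 1 << pos
    for p in PARITY_POS:
        parity = 0
        for pos in range(1, 39):
            if pos & p and (word >> pos) & 1:
                parity ^= 1
        if parity:
            word |= 1 << p
    if bin(word).count("1") & 1:
        word |= 1                            # make total parity even
    return word


def decode(word: int):
    """Return (data, status, syndrome) for a possibly corrupted codeword."""
    syndrome = 0
    for pos in range(1, 39):
        if (word >> pos) & 1:
            syndrome ^= pos                  # XOR of set positions
    odd = bin(word).count("1") & 1
    if syndrome == 0 and not odd:
        status = "ok"
    elif odd and syndrome <= 38:
        status = "corrected"                 # single-bit error
        word ^= 1 << syndrome                # syndrome 0 is the parity bit
    else:
        status = "uncorrectable"             # double-bit error detected
    data = 0
    for i, pos in enumerate(DATA_POS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, status, syndrome
```

In this sketch the syndrome is simply the position of the failing bit, which plays the same role as the syndrome register described above: software can read it to identify which bit failed.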

Bit Steering

Bit steering improves memory reliability by providing a "hot stand-by" bit if a memory bit fails. The SCU enables bit steering when a hard error in memory is detected through ECC and memory scrubbing. The SCU sets one of the Bit Steering Configuration Registers (BSCR) to indicate to which data word position the spare bit should be steered. Each DCU contains 16 BSCR registers, one for each memory bank. Each 8-bit BSCR contains the ECC syndrome. The DCU steers the spare bit into any data or check bit position within the ECC word during transfers to memory or from memory.
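
The steering operation can be pictured with the following sketch. The 40-bit layout (bits 0 to 38 for data and check bits, bit 39 for the spare) and the use of a plain bit index in place of the BSCR syndrome are assumptions made for illustration only.

```python
SPARE_BIT = 39                      # hypothetical position of the spare bit
WORD_MASK = (1 << SPARE_BIT) - 1    # bits 0..38: data and check bits


def steer_on_write(word: int, failed_pos):
    """Copy the bit bound for the failed position into the spare bit."""
    word &= WORD_MASK
    if failed_pos is None:
        return word
    bit = (word >> failed_pos) & 1
    return word | (bit << SPARE_BIT)


def steer_on_read(stored: int, failed_pos):
    """Substitute the spare bit for the failed position on the way back."""
    if failed_pos is None:
        return stored & WORD_MASK
    spare = (stored >> SPARE_BIT) & 1
    repaired = (stored & ~(1 << failed_pos)) | (spare << failed_pos)
    return repaired & WORD_MASK
```

As a usage example, if bit 5 of a bank is stuck at zero, writing with `steer_on_write(word, 5)` and reading with `steer_on_read(stored, 5)` returns the original word even though the memory cell for bit 5 lost its value.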

Read-Only Storage

The ROS, located on the system planar, provides the code and data required to initialize the system and perform various diagnostic tests. This type of memory is typically separate from the larger main memory discussed earlier. For example, the POWER ROS interface used a separate address and data bus. Packaging the POWER2 on a multichip module limits the processor to 512 functional signals, leaving few signals available for the ROS interface. POWER2 uses the SIO bus as a ROS address bus, eliminating the need for a unique bus.

POWER2 reserves the upper 1MB of the system address space for ROS. When the SCU detects an address in this range, it arbitrates for the SIO bus and drives the ROS address. The SCU controls the transfer of data from the ROS to the DCUs. When a full memory bus width (four or eight words) has been received, the data is written to either the I-cache or D-cache. The fact that the data has come from ROS is transparent to the ICU and FXU.
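
The address check the SCU performs can be sketched as a simple range test. The 32-bit width of the address space is an assumption of this sketch, and the constant and function names are hypothetical.

```python
ADDRESS_BITS = 32                           # assumption: 32-bit real addresses
ROS_SIZE = 1 << 20                          # 1MB, from the text
ROS_BASE = (1 << ADDRESS_BITS) - ROS_SIZE   # start of the upper 1MB


def is_ros_address(real_addr: int) -> bool:
    """Return True if the address falls in the range reserved for ROS."""
    return ROS_BASE <= real_addr < (1 << ADDRESS_BITS)
```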

System I/O Bus

The SIO bus is a dedicated internal bus used for communication between the processing units and the I/O control units. The bus contains a total of 98 signal I/Os. The 86 bidirectional signals consist of a 72-bit multiplexed address and data bus (which includes 8 bits of parity), and an 8-bit control bus with one parity bit and five control tags (address valid, data valid, acknowledge, processor lock, and checkstop). The other 12 unidirectional signals (four bus requests, four bus grants, and four busy signals) are used for SIO bus arbitration and by the I/O control units to hold off I/O transfers.
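
As a quick consistency check, the signal counts above can be tallied directly. The variable names are illustrative only.

```python
# Signal counts as given in the text.
ADDRESS_DATA = 72         # multiplexed address/data, including 8 parity bits
CONTROL = 8               # control bus
CONTROL_PARITY = 1        # parity bit for the control bus
CONTROL_TAGS = 5          # address valid, data valid, acknowledge,
                          # processor lock, checkstop
ARBITRATION = 4 + 4 + 4   # bus requests, bus grants, busy signals

bidirectional = ADDRESS_DATA + CONTROL + CONTROL_PARITY + CONTROL_TAGS
total = bidirectional + ARBITRATION
```

The bidirectional group sums to 86 signals, and adding the 12 arbitration signals gives the 98 signal I/Os stated above.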

The SIO bus supports the following transfers: I/O loads and stores, DMA block transfers, and I/O interrupt requests to the processing unit. All DMA transfers and I/O store transfers are single envelope; the current transaction in progress must be completed before a new request is honored. All I/O load transfers have a disjoint reply packet. The processor issues the I/O load request for one of the I/O control units and then releases the SIO bus while it waits for the load data. When the load data is ready, the I/O control unit requests the SIO bus and transfers its data to the processor.

System I/O Direct Memory Access

The POWER2 I/O system overcomes many of the POWER bottlenecks in moving data between main memory and I/O devices, such as disk controllers and LAN adapters. Improvements include increased I/O transfer rates, support for more I/O controllers, I/O prefetch, and an I/O cache. First, a DMA sequence is described.

Data is moved between memory and I/O devices using 64-byte DMA read and write requests on the SIO bus. These operations are initiated by sending a command and address to the SCU over the SIO bus. Data is routed through the I/O buffers in the DCU. The SCU uses a round-robin buffer selection scheme to choose which I/O buffer will be used for the operation. For DMA write requests, the SCU receives the 32-bit real address and 64 bytes of data from the I/O control unit. It then loads the data into an I/O buffer two words at a time and unloads the data either eight or four words at a time to memory.
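
The round-robin selection and the transfer counts implied above can be sketched as follows. The number of I/O buffers is not given in this section, so it is left as a parameter; the class and function names are hypothetical.

```python
DMA_BYTES = 64       # DMA request size, from the text
BYTES_PER_WORD = 4   # 32-bit words


class RoundRobinSelector:
    """Hand out I/O buffer indices in rotating order."""

    def __init__(self, num_buffers: int):
        self.num_buffers = num_buffers  # assumed parameter
        self._next = 0

    def next_buffer(self) -> int:
        index = self._next
        self._next = (self._next + 1) % self.num_buffers
        return index


def dma_beats(words_per_beat: int) -> int:
    """Number of transfers needed to move one 64-byte DMA block."""
    return DMA_BYTES // (BYTES_PER_WORD * words_per_beat)
```

A 64-byte block is 16 words, so loading a buffer two words at a time takes eight transfers, while unloading to memory takes four transfers on the 4-word bus and two on the 8-word bus.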

The POWER2 SIO bus supports up to four I/O control units; each unit manages four to eight Micro Channel channels (or slots). Each channel can transfer up to 80 MB/s using the Micro Channel Streaming Data protocol [8].

To support the increased I/O bandwidth, which results from the multiple I/O control units, the POWER2 DCU contains an I/O cache. The POWER implementation of a single I/O buffer [4] cannot sustain the high data transfer rates on Micro Channel. POWER2's I/O cache improves both read and write performance. The I/O cache consists of I/O buffers in which data is prefetched ahead for DMA read requests from I/O bus units. This removes the memory card DRAM latency for DMA data and provides a continuous stream on the SIO bus. For prefetch operations, the SCU fetches the 64 bytes requested as well as the next sequential 64 bytes. For each new 64-byte request, the SCU unloads the data that was previously prefetched and fetches the next 64 bytes in parallel.
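
The prefetch behavior for a sequential DMA read stream can be modeled with a small sketch. It is a software analogy only: the class name, the callback used to stand in for a memory fetch, and the single-entry prefetch slot are assumptions, and the parallelism of the real hardware is not modeled.

```python
LINE_BYTES = 64  # DMA request size, from the text


class IOPrefetcher:
    def __init__(self, fetch_from_memory):
        self._fetch = fetch_from_memory  # callable: address -> data
        self._next_addr = None           # address of the prefetched block
        self._next_data = None
        self.hits = 0                    # requests served without a DRAM wait

    def dma_read(self, addr):
        if addr == self._next_addr:
            data = self._next_data       # unload previously prefetched data
            self.hits += 1
        else:
            data = self._fetch(addr)     # first request of a stream
        # Fetch the next sequential 64 bytes ahead of the next request.
        self._next_addr = addr + LINE_BYTES
        self._next_data = self._fetch(self._next_addr)
        return data
```

For three sequential requests at addresses 0, 64, and 128, the sketch issues memory fetches for 0, 64, 128, and 192, and the second and third requests are served from the prefetched data.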

In addition, the I/O cache can buffer several I/O cache lines during a stream of DMA writes, hiding the latency of memory bus interference from the CPU. When access to the memory bus is obtained, the data is written in parallel with the loading of new cache lines.

External Interrupt Logic

The external interrupt structure provides a mechanism for some external event, such as an I/O device requiring service, to break the normal flow of instructions. POWER2 provides a new high-performance external interrupt mechanism that incorporates a hardware high priority detect, a priority mask, and a minimal set of single-cycle instructions.

POWER's external interrupt structure has two primary bottlenecks. First, the 64 interrupt bits are masked on an individual basis. When an interrupt occurs, the software interrupt handler iteratively loops through each bit of the interrupt register until the highest-priority set bit is found. Second, the interrupt register is mapped into I/O space, requiring a Segment Register to be set up every time a load or store to this register occurs. These I/O load/store instructions are inherently slow operations due to the setup and the handshake between the FXU and SCU.

To improve this scheme, the POWER2 external interrupt structure moves high-priority detection, previously handled in software, into hardware. Additionally, the instructions used to interface to the interrupt logic have changed from slow I/O load/stores to single-cycle move operations that operate on a set of special purpose registers. A data field in one of these control registers provides the capability to set, reset, and update the external interrupt hardware.
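
The difference between the two schemes can be sketched in software. The first function mimics a handler that tests one interrupt bit per iteration; the second stands in for a hardware priority encoder that produces the answer in a single step. Treating bit 0 (the most significant bit) as the highest priority is an assumption of this sketch, as are the function names.

```python
NUM_LEVELS = 64  # 64 interrupt bits, from the text


def highest_pending_loop(pending: int):
    """POWER-style handler: scan the register one bit at a time."""
    for level in range(NUM_LEVELS):
        if pending & (1 << (NUM_LEVELS - 1 - level)):
            return level
    return None


def highest_pending_encoder(pending: int):
    """Priority-encoder equivalent: one operation instead of a loop."""
    if pending == 0:
        return None
    return NUM_LEVELS - pending.bit_length()
```

Both functions return the same level for any 64-bit register value; the loop may take up to 64 iterations, while the encoder form corresponds to what the hardware detect logic delivers directly.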

Performance Monitor

The SCU implements a centralized performance monitor, enabling a wide variety of POWER2 performance measurements [9]. The monitor consists of 22 software-accessible counters that monitor activity in each of the eight chips that make up the POWER2 processor. A performance monitor control register in the SCU selects the events to be monitored. This data could not be obtained using an external monitor because the processor is packaged on a multichip module where only the memory and SIO buses can be probed.

Summary

One of the primary goals of POWER2 is to improve performance by adding more execution units than those found in the POWER implementation while maintaining a balanced system that avoids bottlenecks in the cache, memory, and I/O interfaces. POWER2 achieves this goal in the FXU, DCU, and SCU by the addition of a fixed-point execution unit, a larger multiported data cache, an improved bus organization, and an improved I/O interrupt and DMA subsystem.

Acknowledgments

The authors would like to acknowledge several people who contributed to the POWER2 FXU, DCU, and SCU chips. Larry Thatcher, David Ray, Alex Spencer, Warren Maule, Roger Bailey, Mir Ali, Bert Williams, and Jennifer Le worked on the FXU logic design; Joaquin Fentanes, Jr. worked on the physical design. Geordie Braceras worked on the data cache array macro. Larry Howell, Gary Countryman, Tao Brown, and Robert Wagner worked on the DCU logic design; Mike Chung worked on the physical design. Doug Moran, Kurt Feiste, and Hakim Mosleh worked on the SCU logic design; Adrienne Kokoszka worked on the physical design. These teams also documented many of the technical details found in this article. Ed Silha and John O'Quin were hardware and software architects for the POWER2 chip set.

References

  1. Steven W. White and Sudhir Dhawan, "POWER2: Next Generation of the RISC System/6000 Family," PowerPC and POWER2: Technical Aspects of the New IBM RISC System/6000, IBM Corporation, SA23-2737, pp. 8-18.
  2. H. B. Bakoglu, G. F. Grohoski, R. K. Montoye, "The IBM RISC System/6000 Processor: Hardware Overview," IBM Journal of Research and Development, Vol. 34, number 1, January 1990, pp. 12-22.
  3. G. F. Grohoski, "Machine Organization of the IBM RISC System/6000," IBM Journal of Research and Development, Vol. 34, number 1, January 1990, pp. 47-52.
  4. G. F. Grohoski, J. A. Kahle, L. E. Thatcher, C. R. Moore, "Branch and Fixed-Point Instruction Execution Units," IBM RISC System/6000 Technology, SA23-2619, IBM Corporation, 1990, pp. 24-32.
  5. Troy Hicks, Richard Fry, Paul Harvey, "POWER2 Floating-Point Unit: Architecture and Implementation," PowerPC and POWER2: Technical Aspects of the New IBM RISC System/6000, IBM Corporation, SA23-2737, pp. 45-54.
  6. William R. Hardell et al., "Data Cache and Storage Control Units," IBM RISC System/6000 Technology, SA23-2619, IBM Corporation, 1990, pp. 44-51.
  7. Ion M. Ratiu, "Pseudorandom Built-In Self-Test," IBM RISC System/6000 Technology, SA23-2619, IBM Corporation, 1990, pp. 74-77.
  8. James O. Nicholson, "Micro Channel Features," IBM RISC System/6000 Technology, SA23-2619, IBM Corporation, 1990, pp. 52-55.
  9. E.H. Welbon, C.C. Chan-Nui, D.J. Shippy, and D.A. Hicks, "POWER2 Performance Monitor," PowerPC and POWER2: Technical Aspects of the New IBM RISC System/6000, IBM Corporation, SA23-2737, pp. 55-63.

Copyright (C) 1994 International Business Machines Corporation. All rights reserved.