REMAP-γ: A Scalable SIMD VLSI Architecture with Hierarchical Control
Doctoral thesis, 1997
While the clock speed of general-purpose (uni-)processors has risen dramatically during recent years, this is not true for SIMD (Single Instruction stream Multiple Data streams) parallel processors. The reason is to be found in the structure of this type of architecture: long-range broadcasting of data, clock and control signals. Also, as synchronization in the architecture has traditionally been done using broadcasting of a global clock, clock skew has been one limiting factor to the clock speed. The advent of deep-submicron VLSI (Very Large Scale Integration) technology, where interconnect delay dominates over gate delay, further emphasizes the importance of resolving this "broadcasting bottleneck" problem, since broadcasting normally implies the distribution of signals using long resistive (high delay) wires.
As a solution, this thesis presents and evaluates new principles for the organization of SIMD architectures, using a two-level hierarchical organization of the control path. A global Control Unit (CU) broadcasts instructions to all Processing Elements (PEs) in the array at the word-level. These instructions are then executed by local CUs (one in each PE) at the bit-level. An up-shift in clock frequency from the global CU (low speed) to the local CUs (high speed) is thus used, relieving the global CU from broadcasting the high fan-out control signals at high speed.
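The two-level control scheme described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the REMAP-γ design itself: the 8-bit word length, the single `ADD` instruction, the register names and the classes `PE` and `GlobalCU` are all hypothetical, and the bit-serial adder is one plausible way a local CU might expand a word-level instruction into bit-level steps. The point it shows is the clock up-shift: one global broadcast corresponds to several local bit-level cycles in every PE.

```python
from dataclasses import dataclass, field

WORD_BITS = 8  # assumed word length; not taken from the thesis


@dataclass
class PE:
    """One processing element with a hypothetical local control unit."""
    regs: dict = field(default_factory=dict)
    local_cycles: int = 0  # bit-level steps executed by the local CU

    def execute(self, op, dst, a, b):
        # The local CU expands one word-level instruction into
        # WORD_BITS bit-level steps (here: a bit-serial addition).
        if op != "ADD":
            raise ValueError(f"unsupported instruction: {op}")
        x, y = self.regs[a], self.regs[b]
        carry, result = 0, 0
        for i in range(WORD_BITS):
            xa = (x >> i) & 1
            yb = (y >> i) & 1
            result |= (xa ^ yb ^ carry) << i
            carry = (xa & yb) | (carry & (xa ^ yb))
            self.local_cycles += 1  # one fast local clock tick per bit
        self.regs[dst] = result  # result wraps modulo 2**WORD_BITS


class GlobalCU:
    """Hypothetical global control unit broadcasting word-level instructions."""

    def __init__(self, pes):
        self.pes = pes
        self.global_cycles = 0  # slow global clock ticks

    def broadcast(self, op, dst, a, b):
        self.global_cycles += 1  # one broadcast per word-level instruction
        for pe in self.pes:      # models all PEs working in parallel
            pe.execute(op, dst, a, b)


pes = [PE(regs={"a": i, "b": 2 * i}) for i in range(4)]
cu = GlobalCU(pes)
cu.broadcast("ADD", "c", "a", "b")
print([pe.regs["c"] for pe in pes], cu.global_cycles, pes[0].local_cycles)
```

In this model the global CU ticks once while each local CU ticks `WORD_BITS` times, which is the sense in which the global broadcast can run at a lower frequency than the PE clock.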
Not only does this scheme enable the use of a high PE clock frequency, it also creates the possibility for greater local freedom at the PE level, enabling very efficient "bit-level pipelined" instructions. These instructions permit various types of "global operations", e.g. data broadcasting, inner product calculations and minimum/maximum searches, all done with very low latency.
Introducing hierarchical control in SIMD architectures also enables locally generated phase-locked PE clocks to be used. This approach has the potential of offering a low (array size independent) clock skew between adjacent PEs. The simulation results of one such implementation ("connected ring oscillators") are presented and discussed.
The REMAP-γ SIMD architecture, using the hierarchical control organization, is presented and discussed.
Basic matrix calculations (e.g. matrix-by-vector multiplication), which form the most basic parts of signal processing algorithms, are mapped onto the REMAP-γ architecture and their performance is discussed. Also shown is how some frequently used signal processing algorithms are performed: convolutions, FIR filtering and the Discrete Fourier Transform.
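As a rough illustration of how a matrix-by-vector multiplication can be laid out on a SIMD array, the sketch below assumes a linear array with one PE per matrix row, where each vector element is broadcast in turn and every PE performs the same multiply-accumulate on its own row. This mapping and the function name `simd_matvec` are assumptions chosen for illustration; the abstract does not state which mapping the thesis uses.

```python
def simd_matvec(matrix, vector):
    """Matrix-by-vector product on a modelled linear SIMD array.

    Assumption: PE i stores row i of the matrix and one accumulator.
    """
    n_pes = len(matrix)
    acc = [0] * n_pes  # one accumulator per PE
    for j, x in enumerate(vector):
        # One broadcast of vector element x; all PEs then execute the
        # same multiply-accumulate in lockstep on their local row data.
        acc = [acc[i] + matrix[i][j] * x for i in range(n_pes)]
    return acc


print(simd_matvec([[1, 2], [3, 4]], [5, 6]))  # prints [17, 39]
```

With this layout the number of broadcast steps equals the vector length, independent of the number of rows, as long as there is one PE per row.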
Two application domains in which the REMAP-γ architecture may be used are presented and evaluated: 1) Artificial Neural Networks, and 2) Phased array multi-channel radar signal processing.
The results of a VLSI test prototype chip implementation are presented, together with an analysis of the delay in chip-to-chip interconnections using MCM (Multi-Chip Module) mounting technology.
array signal processing
distributed synchronous clocking