Multidimensional Piecewise Regular Arrays
Regular arrays, particularly systolic arrays, have been the subject of continuous interest for the past 15 years. One reason is that they are an excellent example of the unity between hardware and software, especially for application-specific computations. This unity permits cost-effective implementation of systolic algorithms in hardware, in VLSI chips or on FPGAs. To date, systolic/regular arrays have primarily been considered as 2-D structures.
The chief purposes of this work are: (i) to develop methods to transform an algorithm into a form that fits the 3-D physical construction of the processor and is easy to fabricate; (ii) to find ways of increasing the available degree of parallelism and thus improve scalability and latency. To this end, we propose multidimensional piecewise regular arrays: arrays of loosely connected subarrays of lower dimensionality in which two different clock rates are used. A high clock rate is used inside the subarrays, e.g. inside VLSI chips, and a low clock rate is used on the interconnections between subarrays.
These properties permit easy physical realization of large n-D arrays, as the n-D array is formed from (n-1)-D subarrays that are connected to each other only by edges operating at the low clock rate. Thus, 3-D arrays that consist of 2-D arrays are easily fabricated, e.g. using multichip modules or wafer-scale integration.
While several of the approaches that we use to achieve our aims have been considered in the literature, they have been studied separately and without a unified treatment. We combine our approach with the commonly used synthesis methods for regular arrays, namely space-time transformations on polytopes. The approach we propose can be used for all problems based on associative and commutative operations.
The thesis presents the synthesis of a large variety of new, higher-dimensional arrays. Two main techniques are added to the existing methods in the polytope model: (1) In order to achieve a higher degree of parallelism and to decrease latency, we increase the dimensionality of the source representation of the program by partitioning the range of indices. (2) We introduce a method for developing pipestructures (an extension of pipelines) for spreading the shared data between distinct computations and for gathering partial results in the case of a reduction operator.
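The effect of partitioning the index range in (1) can be illustrated in software with a minimal sketch. It is not the synthesis method of the thesis; the function name `partitioned_reduce` and the parameter `block_size` are hypothetical. A 1-D index i is rewritten as a pair (b, j) with i = b * block_size + j, so that each block yields an independent partial result, and the partial results are then gathered by the same operator.

```python
from functools import reduce
import operator


def partitioned_reduce(op, values, block_size):
    """Reduce `values` with `op` after partitioning the 1-D index range.

    Hypothetical helper for illustration: index i becomes the pair (b, j)
    with i = b * block_size + j, which adds one dimension to the source
    representation of the reduction.
    """
    if not values:
        raise ValueError("cannot reduce an empty sequence")
    if block_size < 1:
        raise ValueError("block_size must be positive")

    # Split the index range into contiguous blocks (the new outer dimension b).
    blocks = [values[start:start + block_size]
              for start in range(0, len(values), block_size)]

    # Each block is reduced independently of the others; on an array of
    # subarrays these partial reductions could proceed in parallel.
    partials = [reduce(op, block) for block in blocks]

    # Gather the partial results with the same operator.
    return reduce(op, partials)


total = partitioned_reduce(operator.add, list(range(10)), 3)
largest = partitioned_reduce(max, [4, 9, 1, 7, 3], 2)
```

Regrouping into contiguous blocks as above relies only on associativity of the operator; commutativity additionally allows the partial results to be gathered in any order.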
As an example, we consider template matching on systolic arrays. A 2-D mesh of linear arrays (conventional systolic arrays for 1-D convolution) that exploits two different clock rates is presented.
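To make the structure of this example concrete, the following sketch shows how a 2-D template match can be assembled from 1-D row computations. It assumes that the matching score is a plain sum of products (cross-correlation), which the abstract does not specify, and the names `correlate_1d` and `match_template` are hypothetical. It models only the arithmetic, not the timing or the two clock rates.

```python
def correlate_1d(signal, kernel):
    """Sliding sum of products of `kernel` over `signal` (valid positions only)."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(n - k + 1)]


def match_template(image, template):
    """Score every placement of `template` on `image`.

    Each (image row, template row) pair is handled by one 1-D computation,
    standing in for one linear array; the row results are then summed
    to form the 2-D score.
    """
    t_rows, t_cols = len(template), len(template[0])
    out_rows = len(image) - t_rows + 1
    out_cols = len(image[0]) - t_cols + 1
    scores = [[0] * out_cols for _ in range(out_rows)]
    for r in range(out_rows):
        for t in range(t_rows):
            # 1-D work for template row t placed on image row r + t.
            row_scores = correlate_1d(image[r + t], template[t])
            # Gather the partial result into the 2-D score.
            for c in range(out_cols):
                scores[r][c] += row_scores[c]
    return scores


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
template = [[1, 0],
            [0, 1]]
scores = match_template(image, template)
```

Here the inner 1-D computations correspond to the work done inside a subarray, while the summation over template rows corresponds to gathering partial results across subarrays.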
piecewise regular arrays