
Systolic Modular VLSI Architecture for Multi-Model Neural Network Implementation +

J.M. Moreno *, J. Madrenas, J. Cabestany
Departament d'Enginyeria Electrònica
Universitat Politècnica de Catalunya, Barcelona, Spain

+ This work has been partially funded by ESPRIT III Project ELENA-Nerves 2 (no. 6819) and by the Spanish CICYT action TIC92-629.
* Holder of an FI Research Grant from the Generalitat de Catalunya's Educational Department.

Abstract

In this paper we review the basic principles to be taken into account when mixed analog/digital alternatives for implementing neural models are considered. Starting from a generic systolic architecture, we adapt its internal structure in order to permit the modular implementation of a wide range of artificial neural network models. After analyzing the basic computational resources required by the considered neural models, some basic building blocks have been identified and implemented. Our results show that the proposed approach is suitable for building high-throughput physical realizations capable of adapting their resources so as to emulate a wide variety of neural network models.

1. Introduction

Traditionally there have been two main trends in the physical implementation of neural network models: the analog and the digital approach. If the drawbacks and benefits offered by these alternatives are taken into account, one can conclude that there is no clear-cut advantage of one approach over the other. For instance, it is well known that compactness and high speed are basic characteristics of analog implementations. However, the lack of reliable analog memories imposes a serious drawback on this approach when learning capabilities are considered. On the other hand, digital architectures possess good flexibility and adaptability at the expense of speed and area. Mixed analog/digital implementation schemes, combining the flexibility and easy memory storage of digital alternatives with the speed and compactness of analog approaches, could therefore provide an efficient way of building practical hardware realizations capable of covering most of the requirements imposed by artificial neural network paradigms.

Starting from the systolic analog/digital VLSI architecture presented in [1], we show in this paper how it is possible to construct a modular and compact system capable of emulating a wide range of neural network models. First we review the main features of the systolic architecture considered throughout this paper. Afterwards we determine, by analyzing the computational requirements of a wide range of neural models, the basic building blocks to be incorporated in the individual processing units in order to provide multi-model emulation capabilities. Some considerations about these basic building blocks are then given. Finally, conclusions and future work are presented.

2. Considerations about the architecture

The architecture considered for implementing neural network models is depicted in figure 1. As can be seen, it is composed of a linear array of processing elements which are locally connected, constituting a systolic ring array. This architecture is able to emulate neural models using a distributed synapse scheme. This means that there is not a one-to-one mapping between the neurons to be emulated and the physical units in the array. In fact, at each time step only one neuron is emulated, and all the units in the array collaborate in calculating this neuron's output activation by emulating one of its synaptic connections.

Figure 1: Analog systolic A/D architecture

Four different building blocks can be identified in this structure: the master unit, the processing units, the analog transfer cells (ATCs) and the array common current line. A pair of analog transfer cells is placed between every two processing units in the array and temporarily stores the analog values associated with the input and output activations of each unit. One of the most efficient ways to implement these cells is by means of current copiers [2], which transfer analog values in current mode and store them in a capacitor connected to the gate of a MOS transistor. There are, however, other possible schemes for implementing these store-and-transfer units, among them the charge transfer approach [3] or voltage transfer by means of comparators. Two ATCs are used per processing unit in order to provide, as will be explained later, the ability to emulate multilayered network structures.

The internal organizations of the master unit and of the processing units are depicted in figures 2(a) and 2(b), respectively. At each time step, each processing unit performs the function of one synapse, while the whole array performs the function of a neuron. This synaptic connection is obtained by processing, in the synaptic function building block, the input activation found at the IN1 analog voltage input with the corresponding weight component obtained from the internal register bank. The resulting current is placed, through the OUT1 output, onto the array's common current line, which adds together all the currents so as to obtain the output potential of the unit currently being emulated. This output potential is then fed back to the master unit through the IN3 input, and is converted by the activation function block into the output activation of the neuron currently being emulated. Signal SEL1 disables the input to the activation function block (and creates a transmission line in the processing units) when the output potential is ready on the current line. An external shift register controls the processing unit on whose ATC the output activation will then be stored. This shift register is also used to select, before the emulation process begins, the unit where the corresponding component of the input activation is to be stored. The functionality of the synaptic function and activation function building blocks depends on the neural model to be emulated, as will be explained in the next section.

Figure 2: Internal organization of (a) master unit and (b) processing units

The multiplexer controlled by the SEL2 signal allows the architecture to emulate multilayered neural network structures efficiently by means of a one-dimensional array of processing units. When emulating one layer of the neural network, the SEL2 line selects the IN1 input, corresponding to one of the input activations associated with the layer currently being emulated. After the emulation of this layer has completed, all the output activations produced are converted into input activations for the next layer. To do this, the SEL2 signal now selects the IN2 input, which corresponds to the output activation value stored in the analog transfer cell. Afterwards, a transfer process through the OUT2 output is carried out. Thus, the new set of input activations is ready for the next layer, which is emulated following this operation. As can be deduced, the array of processing units is converted into a systolic ring for the emulation of each layer. The size of the systolic ring will in general be different for each layer, since it depends on the number of inputs attached to each unit in the layer. However, the systolic ring size can be easily configured, since only three lines (array common current, input and output activation transfer lines) are needed to close the ring to the master unit. In order to perform this dynamic configuration of the ring size, an array status register determines the unit in charge of closing the systolic array.
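As a behavioural illustration of this distributed synapse scheme (our sketch, not part of the original design description), the following Python fragment mimics the execution flow: at each time step one neuron is emulated, every processing unit contributes one synaptic current to the common line, and the master unit applies the activation function. The function names are illustrative assumptions.

```python
import numpy as np

def emulate_neuron(x, w_row, g):
    """One time step of the ring: every processing unit i places the current
    w_row[i] * x[i] onto the array common current line; the master unit turns
    the summed output potential into an output activation."""
    common_current = float(np.dot(w_row, x))
    return g(common_current)

def emulate_layer(x, W, g):
    """Emulate a whole layer neuron by neuron; the resulting activations are
    written back to the ATCs (the IN2 path) as inputs for the next layer."""
    return np.array([emulate_neuron(x, W[j], g) for j in range(W.shape[0])])
```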

3. Multi-model implementation

3.1. Operation identification in master and processing units

Depending on the algorithm to be implemented, the operations that the master and processing units should perform can vary. Thus, each algorithm must be analyzed separately to obtain the elementary operations needed in each case. The following models are considered: Multi-Layer Perceptron (MLP) [4], Radial Basis Functions (RBF) [5], Grow And Learn (GAL) [6], Restricted Coulomb Energy (RCE) [7] and Kernel-Based Functions (KBF) [8].

3.2. Multi-Layer Perceptron

In its execution phase, the well-known Multi-Layer Perceptron (MLP) performs the scalar product of the input vector and the synaptic weight vector of the current unit. The output activation $a_j$ is obtained by applying a non-linear function $g()$ (sigmoid, ...) to this result, as indicated in the following formula:

$a_j = g\left( \sum_i w_{ji} x_i \right)$   (1)

A way to implement learning capabilities on the proposed systolic architecture for MLP-like structures generated by some incremental neural algorithms has already been described [1].
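As a minimal numerical rendering of equation (1), assuming a sigmoid for $g()$ (an illustrative sketch, not a description of the hardware itself):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def mlp_execute(x, weight_matrices, g=sigmoid):
    """Execution phase of equation (1), layer by layer: a_j = g(sum_i w_ji x_i).
    In the architecture the scalar products run on the systolic ring, while g()
    is applied serially by the master unit."""
    a = np.asarray(x, dtype=float)
    for W in weight_matrices:
        a = g(W @ a)
    return a
```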
3.3. Radial Basis Functions

Regarding the implementation of the RBF, RCE, GAL and KBF models, we can consider that they share the common property of being constituted by a first hidden layer in charge of calculating the distance from the input vector to some exemplar vectors stored in the units composing this layer. The differences between these algorithms appear in the way the obtained distances are processed.

The RBF model has two levels. In the first level, after calculating each distance, a Gaussian transfer function is applied to these values, obtaining a measure of the degree of similarity between the input pattern and the stored vectors. Then a linear combination of these values is performed in the second layer, which finally provides the network outputs. A block diagram illustrating this operation is shown in figure 3. Since the Gaussian transfer function in our architecture is serially performed by the master cell, these results are stored in the analog transfer cells (ATCs) of the systolic array. Afterwards these values are used as input activations for the second layer, which is emulated in the same systolic array.

Figure 3: RBF implementation

3.4. Kernel Based Functions

In KBF classification, a kernel function is applied to the previously calculated distances (as already mentioned). There is an optimal kernel function depending on the number of dimensions of the input space. For instance, it is close to the Gaussian shape for 2 dimensions, but the function changes significantly for higher dimensionality. The outputs of the kernel functions which belong to the same class are added, and the obtained figures are then multiplied by their respective a priori probabilities. Finally, a Winner-Take-All circuit decides the most likely class for the input vector. The implementation of the KBF algorithm in the execution phase is shown in figure 4.

Figure 4: KBF implementation
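To make the shared distance-layer structure concrete, the following Python sketch models the RBF and KBF execution phases described above; the width, default kernel and prior values are illustrative assumptions (priors is taken to be a mapping from class label to a priori probability):

```python
import numpy as np

def rbf_execute(x, centers, width, W2):
    """RBF: hidden layer of distances passed through a Gaussian (master unit),
    then a linear second layer emulated on the same systolic array."""
    d = np.linalg.norm(centers - x, axis=1)      # distance layer
    h = np.exp(-(d / width) ** 2)                # Gaussian transfer function
    return W2 @ h                                # linear combination (2nd layer)

def kbf_classify(x, exemplars, labels, priors, kernel=lambda d: np.exp(-d ** 2)):
    """KBF: kernel of the distances, per-class sums weighted by the a priori
    probabilities, then a Winner-Take-All over the class scores."""
    d = np.linalg.norm(exemplars - x, axis=1)
    k = kernel(d)
    classes = np.unique(labels)
    scores = [priors[c] * k[labels == c].sum() for c in classes]
    return classes[int(np.argmax(scores))]       # WTA decision
```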

3.5. Grow and Learn

The GAL algorithm is based on the incremental partitioning of the input space by adding new units when an input pattern is not correctly classified. The execution step of the algorithm consists of a distance calculation followed by a Winner-Take-All (WTA) operation to detect the minimum of the distances. The class to which the winning unit belongs can be obtained by means of a lookup table. Figure 5(a) shows the mapping of the algorithm onto our architecture. The distances calculated by the master unit are stored serially in the first analog transfer cell (ATC1) of the figure. The present value is compared with the previous value stored in ATC2 by a 2-input WTA (simply a comparator). If the present value wins, it is transferred to ATC2; otherwise the old value remains in ATC2. Then the next calculated distance is stored in ATC1, and the operation is repeated until all the distances have been compared. The results of the WTA circuit are stored in a shift register, and by simply detecting the last winner, the position of a register array containing the class code of the winner is selected and the class is obtained. If desired, a reject factor can easily be included to avoid classification when the minimum distances to different classes are similar, in order to reduce the classification errors which are more likely to appear in that case.

3.6. Restricted Coulomb Energy

The RCE algorithm is similar to the GAL algorithm. However, in this case an input pattern is said to belong to a given unit if its distance to the centroid is smaller than a given radius. This radius can be different for each unit. Thus, it is possible in this algorithm to find input space regions that belong to more than one unit, which is not a problem if those units map the same class. If they map different classes, the input pattern is not classified and the radii of the involved centroids are reduced. Empty regions can exist as well, and if an input pattern appears in such a region, a new unit containing that pattern as its centroid is created. The execution of RCE can be performed by the systolic architecture as shown in figure 5(b), in a similar way to GAL. Now the comparison is not done between distances; instead, each distance is compared with the corresponding radius. When the input vector appears to belong to more than one class, or does not fall in any influence region, it is not classified.

Figure 5: (a) GAL algorithm implementation, (b) RCE algorithm implementation
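The serial minimum-distance search with the ATC1/ATC2 pair and the RCE radius test can be sketched as follows (illustrative Python under the same assumptions as before; unclassified patterns are returned as None):

```python
import numpy as np

def gal_classify(x, exemplars, classes):
    """GAL execution: ATC2 holds the running winner, ATC1 the newly computed
    distance; a 2-input WTA (comparator) keeps the smaller of the two."""
    winner, atc2 = None, np.inf
    for e, c in zip(exemplars, classes):
        atc1 = np.linalg.norm(x - e)     # distance computed by the master unit
        if atc1 < atc2:                  # the present value wins the comparison
            winner, atc2 = c, atc1       # transfer it to ATC2
    return winner                        # class code read from the register array

def rce_classify(x, centroids, radii, classes):
    """RCE execution: each distance is compared with its own radius; ambiguous
    or empty regions leave the pattern unclassified."""
    hits = {c for cen, r, c in zip(centroids, radii, classes)
            if np.linalg.norm(x - cen) < r}
    return hits.pop() if len(hits) == 1 else None
```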

3.7. Summary of operations

Table 1 summarizes the operations needed to implement the different algorithms, grouped into synaptic (all cells of the systolic array), activation (only the master cell) and external (to the systolic array) functions.

Algorithm   Synaptic function   Activation function   External function
MLP         Product             Sigmoid               --
RBF         Distance, Product   Gaussian              --
GAL/RCE     Distance            WTA                   --
KBF         Distance            Kernel                Product, WTA

Table 1: Summary of needed operations for each algorithm

4. Basic Building Blocks

From the analysis carried out in the previous section of the basic operations required by the different neural models considered, three main building blocks have been identified: a multiplier (needed to carry out the vector-vector and vector-scalar products), a subtractor/squarer (Euclidean distance calculation) and an exponential function generator (Gaussian function for RBF networks and kernel function for KBF networks).

Regarding the implementation of the Euclidean distance calculator, the first step consists of subtracting the digital synaptic weight from an analog input activation. This operation can easily be implemented by connecting the outputs of two current mirrors. The first mirror simply copies to its output the current value given as input, while the second performs a weighted sum of currents obtained from a transistor ladder (with W/L values scaled by factors of 2), in which transistors are selected depending on the values of the corresponding synaptic weight bits. Once the subtraction has been performed, the output current can be squared using the cells presented in [9].

There have been several proposals for implementing the analog/digital multiplication scheme (see [10], for instance), but their very limited dynamic range makes them suitable only for very specific applications. In order to allow for a larger dynamic range, we have adopted a different alternative. The basic idea is depicted in figure 6(a), and corresponds to the classical resistor-ladder D/A conversion scheme. The main problem associated with this approach is that resistors are inaccurate and area-consuming devices when a VLSI implementation is considered. So as to provide compact and reliable physical implementations, the resistors in this figure can be realized by means of switched capacitors. In this way, we could easily implement the A/D multiplier of figure 6(a) by replacing the resistors R, ..., 8R by switched capacitors of values C, ..., 8C. However, if the number of bits required to store the weight is large (as is usually the case for most common neural learning algorithms), the silicon area needed to implement the required capacitors could also become quite large.
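As a numeric illustration of the multiplying-D/A principle of figure 6(a), here is a behavioural model under the assumption of ideal binary-weighted branches (the function name and bit convention are ours, not part of the original circuit description):

```python
def ad_multiply(i_in, weight_bits):
    """Each weight bit switches in a branch whose conductance is a power-of-two
    fraction of the MSB branch, so the output current is the analog input scaled
    by the digital weight: I_out = I_in * code / 2^(n-1), MSB first."""
    n = len(weight_bits)                    # e.g. 4 bits for the R, ..., 8R ladder
    code = sum(b << (n - 1 - k) for k, b in enumerate(weight_bits))
    return i_in * code / (1 << (n - 1))

# Example: a 10 uA input with weight bits [1, 0, 1, 0] yields 10 uA * 10/8 = 12.5 uA
print(ad_multiply(10e-6, [1, 0, 1, 0]))    # 1.25e-05
```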

Figure 6: (a) A/D multiplier principle and (b) resistor ladder implementation

This area requirement can be avoided if we change the clock frequency associated with each capacitor switch. Using this strategy, figure 6(b) shows how the resistor ladder of figure 6(a) can be implemented using four capacitors of values C, C, C and C/2 together with a 3-bit counter. The drawback of this strategy is that, by reducing the frequency applied to the capacitor switches, we also limit the maximum frequency allowed for the input voltage. However, our simulation results show that it is possible to reach an input bandwidth of 50 MHz with reasonable system sizes (unit capacitors of 1 pF require about 38x38 µm² in the 1.2 µm double-polysilicon CMOS process we have used).

Finally, the cell we have used for implementing the exponential function generator is depicted in figure 7. As can be seen, it is based on a flash A/D conversion scheme (resistor ladder and comparators on the left side of the figure). However, in our case we convert the thermometer code given by the comparator array into an output current, which approximates the desired exponential function. This current is obtained by adding the individual current branches of the output current mirrors, depending on the output values given by the comparators. In order to avoid relying on tight tolerances of the W/L ratios of the transistors composing the output current mirrors, we have adopted a non-uniform quantization approach, which consists in using resistors of different values in the input ladder. In this way we can use the same W/L ratio for all the transistors in the output stage. In order to keep the required area as small as possible, the comparators are based on a simple differential pair. A precision of 6 bits (64 comparators) is enough for a total approximation error below 2%. Figure 8 depicts the simulation results corresponding to the output current obtained from a Gaussian function generator cell in response to an input current ramp, with an equivalent bandwidth of approximately 40 MHz. Moreover, by modifying the input resistor ladder, any monotonic function can be constructed using the scheme proposed for this cell.

Figure 7: Exponential function generator
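The non-uniform quantization can be illustrated numerically. With 64 identical unit-current branches, each comparator threshold must sit where the target function crosses the corresponding equal output level; the spacings between consecutive thresholds then fix the (non-uniform) ladder resistor values. Below is a sketch for a one-sided Gaussian exp(-x^2); all names and numbers are illustrative assumptions, not values from the actual cell:

```python
import numpy as np

def gaussian_thresholds(n_comp=64):
    """Threshold k is placed where exp(-t^2) crosses the equal output level
    (k + 0.5)/n_comp, so that identical W/L output mirrors suffice."""
    levels = (np.arange(n_comp) + 0.5) / n_comp
    return np.sqrt(-np.log(levels))          # solve exp(-t^2) = level for t

t = np.sort(gaussian_thresholds())           # comparator trip points, ascending
ladder = np.diff(np.concatenate(([0.0], t))) # ladder resistors ~ threshold spacing
```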

5. Conclusions and future work

In this paper we have reviewed some of the principles to be considered for a mixed analog/digital implementation of neural network models. Starting from a general systolic analog/digital architecture, we have indicated how its internal organization can be modified in order to permit the modular implementation of a wide range of neural network models. Once the modular principle had been stated, a careful analysis was performed of the different neural models considered. The goal of this analysis was to identify the basic arithmetic resources required to implement these models, so as to detect the basic cells which may be shared among them in a general multi-model realization. Some basic building blocks have been developed according to the functional requirements imposed by the systolic architecture considered and by the particular arithmetic schemes the emulated models have to fulfil. Since the basic cells required to build the system are very compact and fast, the modular implementation principle used for the systolic architecture permits two alternatives for the final physical realization: high-density systems, with a large number of processing units per chip but capable of emulating only a few neural models, or low-density systems with more complex processing units, capable of emulating a wide range of neural models.

We are currently performing an exhaustive analysis of the precision to be used in the digital section of the basic cells, according to the convergence constraints of the learning algorithms used by the considered neural models. The basic cells presented throughout this paper are also undergoing the design and characterization phases.

Figure 8: Simulation results for the Gaussian function generator (output current, about 305 µA full scale, over a 1.6 µs time interval)

6. References

[1] J.M. Moreno, F. Castillo, J. Cabestany, A. Napieralski, J. Madrenas, "An Analog Systolic VLSI Architecture for Implementing Neural Network Models", IEEE Micro, Vol. 14, No. 3, pp. 51-59, June 1994.
[2] S.J. Daubert, D. Vallancourt, "Operation and Analysis of Current Copier Circuits", IEE Proceedings, Vol. 137, Pt. G, No. 2, pp. 109-115, 1990.
[3] R. Melen, D. Buss (eds.), "Charge-Coupled Devices: Technology and Applications", IEEE Press, 1977.
[4] D.E. Rumelhart, J.L. McClelland, "Parallel Distributed Processing: Explorations in the Microstructure of Cognition", MIT Press, 1986.
[5] D.S. Broomhead, D. Lowe, "Multivariable Functional Interpolation and Adaptive Networks", Complex Systems 2, pp. 321-355, 1988.
[6] A.I. Alpaydin, "Neural Models for Incremental Supervised and Unsupervised Learning", Ph.D. Thesis, EPFL, Lausanne (Switzerland), 1990.
[7] D.L. Reilly, L.N. Cooper, C. Elbaum, "A Neural Model for Category Learning", Biological Cybernetics 45, pp. 35-41, 1982.
[8] E. Parzen, "On Estimation of a Probability Density Function and Mode", Ann. Math. Stat. 33, pp. 1065-1076, September 1962.
[9] K. Bult, H. Wallinga, "A Class of Analog CMOS Circuits Based on the Square-Law Characteristic of an MOS Transistor in Saturation", IEEE Journal of Solid-State Circuits, Vol. 22, No. 3, pp. 357-365, 1987.
[10] P.W. Hollis, J.J. Paulos, "Artificial Neural Networks Using MOS Analog Multipliers", IEEE Journal of Solid-State Circuits, Vol. 25, No. 3, pp. 849-855, 1990.