CHAPTER 4 GALS ARCHITECTURE

The aim of this chapter is to implement an application on a GALS architecture. Synchronous and asynchronous implementations of an FFT design are compared. When the FFT design is implemented in a GALS architecture, the power consumption is reduced and the maximum frequency of operation increases.

4.1 CLOCK DISTRIBUTION

Distributing a high-frequency clock signal across the die is a problem because skew becomes inevitable. In most processors, a PLL (Phase-Locked Loop) generates a high-frequency clock signal from a slow external clock. There are two approaches to distributing the clock:

- H-tree distribution
- Buffer-driven clock distribution

A combination of a metal grid and a tree of buffers is used to distribute the clock throughout the chip. The metal grid provides a low-skew path because it directly interconnects the clock pins. In most clock hierarchies, the global clock is sub-divided into major clocks and finally into local clocks. This approach helps to reduce the skew within the synchronous blocks themselves (Restle et al 2001). The basic distribution of the global clock into smaller independent local clocks is shown in Figure 4.1.

Figure 4.1 Global clock distribution

The H-tree structure has low latency, less wiring and low power consumption. Trees work well if the clock loading is uniform across the chip area; unfortunately, most microprocessors have widely varying clock loads. This is the main reason why the H-tree clock distribution scheme is used for asynchronous circuit design. The advantages of the H-tree structure are that the duty cycle stays close to the ideal 50%, reflections are prevented, and rise and fall times are preserved. The load capacitance is also reduced, which results in a lower fan-out for our design.

In buffer-driven clock distribution, the buffer acts as a driver to the integrated circuit. Buffers amplify the clock and reduce the skew between the intermediate clock domains (Olsson et al 2000).

Using a PLL to generate the clock might pose a significant problem when it comes to a pausable clock, so we use a ring oscillator to generate the clock and distribute it through the chip area. A PLL also produces more noise than a ring oscillator. A crystal or ring oscillator is used for clock generation. The local clock generator is constructed from an inverter and a delay line, similar to an inverter ring oscillator (Nilisson et al 1996). The problem with using inverters alone as a delay line is that it is difficult to accurately tune the clock period as process variations and changes in

temperature affect the delay. Hence, accurate delay lines have been developed which are capable of maintaining a stable clock frequency. These use a global reference clock for calibration. Such delay lines can use either standard cells or full-custom blocks for the tunable delay and have been shown to maintain a frequency within 1% of the chosen value. Alternatively, a Pierce crystal oscillator with a stable frequency is used for clock generation.

4.2 GALS LOCAL CLOCK CIRCUITRY DESIGN

Local clock generation is carried out using on-chip clock generators. The design is therefore governed by multiple clocks, and synchronization between these clocks is a major problem, as failures are more likely in this configuration. Globally asynchronous locally synchronous (GALS) systems combine the benefits of synchronous and asynchronous systems (Chapiro et al 1984). Modules can be designed like modules in a globally synchronous design (Muttersback et al 2000), using the same tools and methodologies. Each block is independently clocked, which helps to alleviate clock skew. Connections between the synchronous blocks are asynchronous.

When data enters a synchronous system from an asynchronous environment, registers at the input are prone to metastability. To avoid this, the arrival of data is indicated by an asynchronous handshaking protocol. When data arrives, the locally generated clock is paused: in practice, the rising edge of the clock is delayed. Once data has safely arrived, the clock can be released so that data is latched with zero probability of metastability on the data path. Early work on GALS systems (Yen et al 1998) introduced clock stretching or pausing. Synchronization is done using one of two methods: the stretchable clock or the pausable clock. In stretchable clocking

scheme, the clock period is unknown during the stretch time needed to attain synchronization, but in the pausable clocking scheme, the period after the pause is known and many synchronization failures can be averted. Because of this inconsistency of the stretchable clock, we adopt the pausable clocking scheme for our synchronization methodology. We have proposed a new asynchronous communication methodology for the communication between any two synchronous blocks. Methods to achieve asynchronous communication:

- Handshake circuits (clocking versus handshaking)
- Muller C elements
- Stretchable clock/pausable clock

4.2.1 Handshake circuits

The handshake circuits generate signals in response to their input signals (from the environment, or so-called dummy signals), which in combination form a specific protocol to implement asynchronous data communication between two modules. The design of the handshake circuits depends on the specific data communication protocol, the structure of the modules and the organization of the system. The asynchronous handshake is shown in Figure 4.2.

Two-phase or four-phase protocols and various data encoding schemes are used for asynchronous data communication. The main disadvantage of the 2-phase handshake protocol is that the state of the handshake wires after a handshake differs from their state before it. A 4-phase protocol uses two more transitions to return to the initial state. This can be seen in Figures 4.3 and 4.4. In comparison with the 2-phase handshake protocol, a 4-phase protocol is slower and consumes more power. In general, however, its circuits are simpler and less

expensive. Because of the complexity of the control circuits for implementing 2-phase protocols, and the large overhead of dual-rail data transfer for generating the completion signal, four-phase bundled-data communication is commonly employed in most fully asynchronous systems.

Other types of handshaking are also possible. Some proposed circuits use only one wire, with active circuits at both ends to pull the wire up or down. The protocols introduced above all assume that the sender is the active party that initiates the data transfer over the channel. This is known as a push channel. The opposite case, in which the receiver asks for new data, is also possible and is called a pull channel. The directions of the request and acknowledge signals are then reversed, and the validity of the data is indicated by the acknowledge signal from the sender to the receiver.

Figure 4.2 Asynchronous handshake

Figure 4.3 Two phase circuit

Figure 4.4 Four phase circuit
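The difference between the two protocols can be illustrated with a small behavioural model. The following Python sketch is only an illustration under simple assumptions: the request and acknowledge wires are modelled as single bits, and the function names `four_phase_transfer` and `two_phase_transfer` are hypothetical and not taken from the design described in this chapter.

```python
def four_phase_transfer(req, ack):
    """Trace the (req, ack) states of one 4-phase (return-to-zero) transfer."""
    assert (req, ack) == (0, 0), "a 4-phase transfer starts from the idle state"
    trace = []
    req = 1
    trace.append((req, ack))  # Req+ : sender announces valid data
    ack = 1
    trace.append((req, ack))  # Ack+ : receiver has taken the data
    req = 0
    trace.append((req, ack))  # Req- : sender withdraws the request
    ack = 0
    trace.append((req, ack))  # Ack- : both wires are back at their initial level
    return trace


def two_phase_transfer(req, ack):
    """Trace the (req, ack) states of one 2-phase transfer (every transition is an event)."""
    assert req == ack, "a 2-phase transfer starts with both wires at the same level"
    trace = []
    req ^= 1
    trace.append((req, ack))  # request event: toggle req
    ack ^= 1
    trace.append((req, ack))  # acknowledge event: toggle ack
    return trace
```

In this model a 4-phase transfer always ends in the state it started from, at the cost of four transitions, whereas a 2-phase transfer needs only two transitions but leaves both wires at the opposite level.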

4.2.2 Muller C-element

The Muller C element is an important state-holding component of asynchronous circuits. When both inputs are 0, the output of the Muller C element is set to 0, and when both are 1, the output changes to 1; otherwise the output holds its previous value. Since handshaking involves cyclic transitions between 0 and 1, the Muller C element is a fundamental component: it acts as the AND function for two events. The fundamental circuit implementing the four-phase handshaking protocol between asynchronous blocks can be composed of Muller C elements, as shown in Figure 4.5. The Muller C element truth table is shown in Table 4.1.

For a successful data transfer, the handshake circuits go through the following signal transitions, as viewed from outside the interface circuits: Rin+, Aout+, Rout+, Ain+, Rin-, Aout-, Rout- and Ain-. Through this sequence, data communication in the whole system is coupled with the behaviour of the subsequent modules.

Figure 4.5 Muller C element circuit diagram
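The state-holding behaviour described above can be captured in a few lines. This is a minimal behavioural sketch in Python, not a gate-level description; the class name `MullerC` and its `update` method are hypothetical names chosen for illustration.

```python
class MullerC:
    """Behavioural model of a two-input Muller C element."""

    def __init__(self, initial=0):
        self.y = initial  # stored output value

    def update(self, a, b):
        # The output follows the inputs only when they agree;
        # when they differ, the previous output is held.
        if a == b:
            self.y = a
        return self.y
```

Driving the model with the input sequence (0, 1), (1, 1), (1, 0), (0, 0) shows the output staying low until both inputs are high and then staying high until both are low again.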

Table 4.1 Truth table for Muller C element

A    B    Y
0    0    0
0    1    No change
1    0    No change
1    1    1

4.2.3 Stretchable clock controller

The main motivation behind machines with stretchable clocks has been to avoid metastability problems. A stretchable clock can stretch a clock phase for an unbounded period of time. In the meantime, inputs and outputs become valid (Moore et al 2000). Stretchable clocks are therefore suitable for interaction with the globally asynchronous characteristics of a GALS system. The stretchable clock consists of a ring oscillator and a Muller C element, as shown in Figure 4.6. The Muller C element is used for the safety and reliability of the clock. If Stretch is not asserted (low), the output and inputs of the C element follow the signal transitions of Figure 4.7. If Stretch is asserted (high), the input Xa is set low; the output of the C element could be either low or high at that moment, but it will eventually be held at a low level. Stretch+ thus postpones the next rising edge of lclk. An OR gate is used to combine multiple requests for stretching the clock.

Figure 4.6 Implementation of a stretchable clock

Figure 4.7 Stretchable clock signal transition graph

4.2.4 Pausable clock controller

To make the clock pausable, a mutual exclusion (ME) element is added to the ring, as shown in Figure 4.8. This arbitrates between the rising edge of the clock and an incoming request (Yen et al 1996). Hence, the clock is prevented from

rising while the input registers are being enabled by the request, and metastability is prevented. For each bundle of data, a port controller, a request signal and an ME element are required. The rising clock edge is permitted to occur only when all of the ME elements have been locked out by the clock.

Figure 4.8 Clock pausing scheme

A problem with clock pausing is that the clock is delayed as it is distributed across the clock domain, but the clock must be paused at its source (Yen et al 1999). When the clock releases the ME elements, there may still be a clock edge in the buffer tree. Hence, it is possible that registers will be enabled as the clock rises, as shown in Figure 4.9. However, while the source clock is high, the ME elements remain locked, so no requests are permitted during this phase of the cycle. For this reason, we must ensure that the delay of the clock buffer tree is shorter than the duration of the high phase of the clock. Limiting this delay limits the size of the clock tree, and hence defines the size of GALS blocks.
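The timing condition above can be written as a simple check. The sketch below is an illustration only: it assumes a fixed duty cycle (50% by default), and the function names `high_phase_ns` and `buffer_delay_ok` as well as the example delays are hypothetical. The 64 MHz figure is used purely as an example frequency.

```python
def high_phase_ns(freq_mhz, duty_cycle=0.5):
    """Duration of the high phase of the clock in nanoseconds."""
    period_ns = 1000.0 / freq_mhz  # 1 / (f in MHz) gives microseconds; x1000 gives ns
    return period_ns * duty_cycle


def buffer_delay_ok(buffer_delay_ns, freq_mhz, duty_cycle=0.5):
    """True if the clock buffer tree delay is shorter than the high phase."""
    return buffer_delay_ns < high_phase_ns(freq_mhz, duty_cycle)


# Example: a 64 MHz local clock with a 50% duty cycle has a high phase of
# 7.8125 ns, so a 5 ns buffer tree is acceptable and a 9 ns tree is not.
print(high_phase_ns(64.0))
print(buffer_delay_ok(5.0, 64.0), buffer_delay_ok(9.0, 64.0))
```

Under these assumptions, a faster local clock or a smaller duty cycle tightens the bound on the buffer tree delay and therefore on the size of the block.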

Figure 4.9 Clock buffering problem

The standard components which can be placed around synchronous modules to provide the handshake signals, and so make them GALS modules, are called asynchronous wrappers (Carlsson et al 2002). In GALS systems, an interface circuit built from Muller C elements cannot be applied directly to data communication between two locally synchronous (LS) modules. The significant difference between GALS and fully asynchronous systems is that the activation of data input and output must be synchronized with the local clocks. Hence, for GALS systems to complete data transfers between the locally synchronous modules, it is necessary to use special ports (Shengxian et al 2002) that implement the handshaking and the stretching of the local clocks. LS modules with such an asynchronous wrapper can be connected with ease (Hanck et al 1994).

Port controllers are required to generate and accept handshaking signals at the inputs and outputs of modules. These port controllers are asynchronous state machines, which react to changes on their inputs rather than to a clock. To simplify the design of the asynchronous wrapper, but without loss of generality, we assume the handshake circuits in GALS work in the following mode:

a) The request for data communication is always activated by a data output interface circuit, namely the W-port, which is attached to a master LS module. The data input interface circuit, namely the R-port, which is attached to a slave LS module, is always passive and accepts the data.

b) When the W-port activates data output, it may stop its internal clock and wait for the acknowledgement from the corresponding R-port. Likewise, when the R-port initiates a data read, it must maintain its state until the W-port sends a request. This means that every activation of a port completes an effective data transmission.

c) Both the W-port and the R-port are independently enabled by the internal requests from their own LS modules.

This is the case for data communication in many GALS systems.

4.3 IMPLEMENTATION OF FFT USING GALS TECHNOLOGY

FFT processors are involved in a wide range of applications today. They are not only a very important block in broadband systems, digital TV, etc., but are also used in areas like radar, medical electronics and real-time systems (Torkelson et al 1996). The workload for FFT computations is high, and a better approach than a general-purpose processor is required to fulfil the requirements at a reasonable cost. For instance, application-specific processors, algorithm-specific processors or ASICs could be the solution to these problems. An ASIC is chosen here because of its lower power consumption and higher throughput.

In synchronous circuits, the clock signal switches at every clock cycle and drives a large capacitance. As a result, the clock signal is a major

source of dynamic power dissipation. For example, studies have attributed up to 30% of the total power dissipation in a general-purpose processor to the clock network (Hemani et al 1999). Globally asynchronous locally synchronous circuits are circuits in which the different blocks work asynchronously with respect to one another but synchronously internally. Asynchronous circuits activate only the components that are necessary to perform the given operation, by the use of local handshake protocols. Asynchronous circuits have many advantages over synchronous ones, such as:

- Performance of an asynchronous system depends on the average-case latency, not the worst-case latency as in synchronous circuits.
- Global clock timing problems are avoided.
- The power consumption can be lower in asynchronous circuits, despite the fact that they require more hardware.

4.4 FFT ALGORITHMS

The mixed-radix algorithm is used in this implementation. Any composite number N can be split into a product of prime factors. In each stage, a single prime factor is taken into consideration. For a prime number of points, the transform is computed directly as a DFT. Each stage therefore involves calculation of the prime-factor-point DFT, with the proper selection of points. This step is followed by proper twiddling of the output of the stage. The same procedure is repeated for the other prime factors. The net result of this process is the FFT of the N-point input. The reordering algorithm is used to obtain the FFT output in natural order. Thus, the final order of the output is determined by the

FFT algorithm, and by knowing this order the output is rearranged into natural order.

4.5 FFT ARCHITECTURE

There are various architectures available to implement the FFT, and they are listed below:

- Array architecture: used for short FFT computations; it requires extensive chip area.
- Column architecture: obtained by collapsing all the columns of the array architecture into one column.
- Pipelined architecture: useful for FFTs that require high data throughput.
- Reconfigurable architecture: useful for FFTs implemented on an FPGA; the implementation requires less area but gives less data throughput.
- New FFT architecture: GALS implementation to reduce power consumption.

The array architecture can only be used for very short FFTs, because of its extensive use of chip area. The column architecture requires less area on the chip than the array architecture, but it is still not small enough to be considered for long FFTs. The advantages of the pipelined architecture are high data throughput, relatively small area and a simple control unit. If the design uses synchronous operation, delays have to be added between each radix-2^2 stage and between

the two butterfly elements inside the radix-2^2 stage. This results in the need for more hardware and an increase in the latency between input and output frames. The latency is not a big problem, but the extra hardware increases the die size and the power consumption. The advantage of the reconfigurable architecture is that it can be implemented on an FPGA (which supports dynamic reconfiguration); its data throughput is lower, but so is its area compared with the other architectures.

The above problem could be solved in two ways: either by changing the control unit, or by creating a system consisting of locally synchronous blocks communicating asynchronously (GALS) (Jonas et al 2003). The first choice, keeping the FFT completely synchronous, would increase the complexity of the control unit, resulting in a system that is harder to understand. The second choice would have only a slightly different control structure, very similar to the original one. The blocks would also be more separated from each other functionally, which could be a good property when improving the design in the future. These pros and cons lead to the implementation of the FFT processor as a GALS system.

4.6 GALS IMPLEMENTATION

The implementation of the FFT using the GALS architecture consists of three basic building blocks:

- Input block
- Central block
- Output block

4.6.1 Features of architecture

The implemented processor is a 16-bit fixed-point processor (8 bits for the integer part and 8 bits for the fractional part) for both real and imaginary parts, to enable high range and resolution. The resolution of this processor is thus 2^-8 = 0.00390625. This FFT processor can handle from 2-point FFTs to 32-point FFTs. The processor makes use of a single multiplier, which is implemented using the modified Booth's algorithm. Only one block is active at a time in the processor, thus reducing power consumption. The input block, after getting the inputs, sends a request signal to the central block and places the data on the bus which goes to the central block. The same process is followed between the central block and the output block. To enable the features of asynchronous circuits, each block has dedicated request and acknowledge pins. The sine and cosine functions are implemented in this module; hence, the required twiddle factors are generated using these functions.
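The number format described above (8 integer bits and 8 fractional bits) can be illustrated with a short conversion sketch. The Python code below is an assumption-laden illustration, not the processor's implementation: it assumes a signed two's-complement representation with saturation, and the names `to_q8_8` and `from_q8_8` are hypothetical.

```python
FRAC_BITS = 8            # number of fractional bits
SCALE = 1 << FRAC_BITS   # 256 steps per unit, i.e. a resolution of 2**-8


def to_q8_8(x):
    """Quantise a real value to a raw 16-bit integer (signed and saturating, by assumption)."""
    raw = int(round(x * SCALE))
    return max(-32768, min(32767, raw))


def from_q8_8(raw):
    """Convert a raw 16-bit integer back to a real value."""
    return raw / SCALE


# One least-significant bit corresponds to the stated resolution.
print(from_q8_8(1))               # 0.00390625
print(from_q8_8(to_q8_8(1.5)))    # 1.5 is exactly representable
```

With this format, any value inside the representable range is stored with an error of at most half a step, that is 2^-9.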

4.6.2 Input block

The inputs to the processor are provided through this block. The input to the FFT processor is 32 points. If all 32 points were provided as input simultaneously, the processor would need 32 × 16 × 2 = 1024 pins for the data bits alone, which is not practically feasible. The input block consists of the following pins:

- 16-bit data lines
- 5-bit selection lines
- 5-bit selection1 lines
- Global Enable and Local Enable pins (active high)
- Clock and Start pins
- 64 × 16 output lines
- request and acknowledge handshaking signals to communicate with the next block

The internal structure of the input block is shown in Figure 4.10.

Figure 4.10 Internal structure of input block

The Global Enable pin should be held high throughout the process of FFT computation. The Local Enable pin is driven high only while inputs are being provided to the processor. The start signal is given after all the inputs have been fed to the processor for FFT computation. The input block then sends a request signal and places the data of 64 × 32 bits at the output pins. After getting the acknowledge signal from the central block, the input block turns the request signal low.

4.6.3 Central block

This block, also called the FFT computational block, is the heart of the processor. It receives its input from the input block and outputs the result to the output block. It is enabled by the input block, which sends a request handshaking signal. Following the completion of the computations, this block sends a request signal to the output block. The inputs to this block are the 64 × 16 data lines, global enable, clock, select1 lines and the handshaking signal pairs (reqa, acka) and (reqc, ackc). The block comprises three sets of 64 32-bit registers. It generates the twiddle factors by using the dedicated sine and cosine function generation components.

The central block's main operation is to select the inputs from the registers and to generate the corresponding twiddle factors, using the sine and cosine generating components, for computing the prime-factor-point DFT. This operation is performed by using three loops to reduce the size of the program. The first loop specifies the stage of the FFT computation. The second loop specifies the set within the stage selected by the first loop. The third loop specifies the point to be selected within the specified set.
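The three-loop structure (stage, set, point) can be illustrated in software. The following Python sketch is not the processor's mixed-radix algorithm: as a simplifying assumption it shows only the radix-2 special case (lengths that are powers of two), uses floating-point arithmetic, and the names `bit_reverse` and `fft_radix2` are hypothetical.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_radix2(x):
    """Iterative radix-2 decimation-in-time FFT organised as three nested loops."""
    n = len(x)
    bits = n.bit_length() - 1
    assert n == 1 << bits, "this sketch handles power-of-two lengths only"
    # Reorder the input so that the output appears in natural order.
    a = [complex(x[bit_reverse(i, bits)]) for i in range(n)]
    for stage in range(1, bits + 1):            # loop 1: stage of the FFT
        size = 1 << stage
        half = size // 2
        for start in range(0, n, size):         # loop 2: set within the stage
            for k in range(half):               # loop 3: point within the set
                w = cmath.exp(-2j * cmath.pi * k / size)  # twiddle factor
                top = a[start + k]
                bottom = a[start + k + half] * w
                a[start + k] = top + bottom
                a[start + k + half] = top - bottom
    return a
```

In a mixed-radix version, the butterfly inside the innermost loop would be replaced by a small DFT whose size is the prime factor handled by that stage.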

The FFT computation block then enables the output block by sending a request signal. The output block samples the data produced by the computational block after giving the acknowledge signal.

4.6.4 Output block

The positive edge of the acknowledge pulse is the instant at which the data from the central block is sampled and stored in the temporary pages. This block is used to display the output. It receives its input from the central block. It consists of input lines (64 × 16), output lines (1 × 16), global enable, select lines (5), select1 lines (5) and the clock signal. In both real and imaginary parts, 8 bits are allocated for the integer part and 8 bits for the fractional part. The internal structure of the output block is shown in Figure 4.11.

Figure 4.11 Internal structure of output block

4.7 IMPLEMENTATION OF GALS

The implementation makes use of the GALS technology to interface the input, central and output blocks. The input and output blocks are made to run at the same clock speed. The central block runs on a different clock, which makes the design compatible with the GALS approach. No global clock is used in asynchronous circuits; instead, some form of handshaking is used in the communication between systems. These systems are not completely asynchronous; they consist of synchronous sub-systems communicating asynchronously.

In Figure 4.12, the LS-system is a locally synchronous system, Req is short for request and Ack is short for acknowledge. Req and Ack perform the handshaking. In this project, a push communication channel is used, which means that the producer of data initiates the handshaking. A 4-phase handshaking cycle is performed in the following way (Req+ means Req goes high): Req+, Ack+, Req- and finally Ack-. Data should be valid between Req+ and Ack-, but is often sampled on the Ack+ edge.

Figure 4.12 GALS asynchronous communication

The GALS technology is implemented in the three blocks of the FFT processor as follows. The data transfer is always in a single direction, i.e. the data always flows from the input block to the central block and then

from the central block to the output block. The input block is therefore designed with a single pair of request and acknowledge signals. The central block has to communicate with the input block as well as the output block, and the data transfer is only one way. Hence, the central block is designed with two pairs of handshaking signals (request and acknowledge). The output block communicates only with the central block and the data transfer is only one way; for this reason, the output block is designed with a single pair of handshaking signals. The GALS FFT processor and the variable-point FFT architecture are shown in Figures 4.13 and 4.14 respectively.

Figure 4.13 GALS FFT processor implementation

Figure 4.14 Variable point FFT architecture

The input block and the central block communicate through the reqa (request) and acka (acknowledge) signals. The reqa signal is raised by the input block, and the central block responds by raising the acka signal and sampling the data from the data bus. The input block then turns the reqa signal low, and the central block follows by turning the acka signal low. The same process applies to the communication between the central block and the output block; the signals involved in that communication are the reqc and ackc handshaking pair.

4.8 SIMULATION RESULTS

All the design blocks are implemented in VHDL, and the simulation result is shown in Figure 4.15.

Figure 4.15 Output waveform of GALS implemented FFT processor

4.9 SYNTHESIZED REPORT

The FFT processor is synthesized for 8 points. Since this is a variable-point processor, the number of loops, i.e. the corresponding number of operations, varies with the number of points. The synthesized 8-point FFT result is shown in Figure 4.18. The tool used for synthesis is Leonardo Spectrum. The report obtained by synthesizing the 8-point FFT processor, taken as a snapshot from the system monitor, is shown in Figure 4.16.

Figure 4.16 Synthesized report for 8-point FFT processor

This synthesized report is obtained before the implementation of GALS in the processor. The LC (logic cell) utilization given in the report is 5010, which for the Altera APEX20KE technology corresponds to 60.22%. The clock frequency report generated by Leonardo Spectrum, also obtained as a snapshot, is shown in Table 4.2.

Table 4.2 Clock frequency report

The clock frequency report gives a clear picture of the frequency of operation of the blocks. The input block and the output block can operate at a maximum frequency of 64.0 MHz. The central block can operate at a maximum frequency of 13.7 MHz. Since the central block involves numerous computations, it operates slowly with respect to the others. When the time taken to compute the FFT for a set of eight points is considered, the frequency of operation is determined by the input and output blocks, since eight clock cycles of a 64 MHz clock (125 ns) take longer than a single clock period of a 13.7 MHz clock (about 73 ns).

The RTL schematic diagram obtained using the Leonardo Spectrum tool is shown in Figure 4.17.

Figure 4.17 RTL schematic of 8-point FFT processor

The technology schematic diagram is shown in Figure 4.18.

Figure 4.18 Technology schematic of 8-point FFT processor

4.10 CONCLUSIONS

The various modules are implemented using GALS technology, so each module can run at its own frequency. If the modules operate synchronously, the maximum operating frequency is further reduced, since the critical path has to account for the input block delay, the central block delay and the output block delay (total delay = 1/64 MHz + 1/64 MHz + 1/13.7 MHz ≈ 104.2 ns, which corresponds to about 9.6 MHz). The throughput will therefore be reduced.

If the module is operated at 64 MHz, the central block will take a few clock cycles to finish the operation, and the clock period of the whole block will be 15.625 ns. If the FFT design uses the GALS approach, the power required will be 13.7/64 ≈ 21% of that of the synchronous operation (P ∝ f, i.e. P1 ∝ 64 and P2 ∝ 13.7). The speed of each block is shown in Table 4.3.

Table 4.3 Speed of operation of various blocks

Sl. No.   Description                           Speed of operation
1         Input block, GALS implementation      64 MHz
2         Output block, GALS implementation     64 MHz
3         Central block, GALS implementation    13.7 MHz
4         Synchronous implementation            9.6 MHz

Another advantage is that there will not be any clock skew, since the design is divided into three blocks and each block operates at its own frequency.
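The figures quoted in these conclusions can be reproduced with a few lines of arithmetic. The sketch below assumes that the delay sum 1/64 + 1/64 + 1/13.7 is a sum of clock periods with the frequencies in MHz, and that dynamic power scales linearly with clock frequency; the variable and function names are hypothetical.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


f_io_mhz = 64.0        # input and output blocks (Table 4.3)
f_central_mhz = 13.7   # central block (Table 4.3)

# Synchronous case: the three block delays add up along the critical path.
total_delay_ns = 2 * period_ns(f_io_mhz) + period_ns(f_central_mhz)
sync_freq_mhz = 1000.0 / total_delay_ns

# Power ratio under the assumption P proportional to f.
power_ratio = f_central_mhz / f_io_mhz

print(round(total_delay_ns, 1))   # about 104.2 ns
print(round(sync_freq_mhz, 1))    # about 9.6 MHz
print(round(power_ratio, 3))      # about 0.214, i.e. roughly 21 %
```

The result of about 9.6 MHz matches the synchronous implementation entry in Table 4.3.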