Concepts of Parallelism in an Introductory Computer Architecture Course with FPGA Laboratories

Sally L. Wood 1, Chris Dick 2

Abstract - The introductory course in computer architecture or machine organization required of most electrical and computer engineering students has evolved substantially in recent years as technological advances have led to ever-increasing processor sophistication. In most cases the introductory course is still built around traditional instruction set architectures (ISAs) using actual or simulated processors. However, the future will require that working engineers be able to effectively use highly integrated distributed arrays of computational resources. With the widespread use of field programmable gate arrays (FPGAs) in student laboratories, it is now possible to introduce basic concepts of parallel structures, such as those used in special-purpose, high-performance graphics processing or digital signal processing, without confronting the complex communication and synchronization issues associated with arrays of processors. An introductory architecture course has been modified to include concepts of parallel structures as well as traditional ISAs. The same FPGAs that can be used to create a simple processor with a basic instruction set can also be used to implement simple structures for parallel computation. Although the design methodology and performance evaluations for these parallel designs are not as mature as those for traditional ISA-based design, it is still possible to introduce perspectives of parallel design in the introductory course. This paper describes the course and some of the laboratory experience.

Index Terms - digital system architecture, FPGA, parallel operations, digital filters

INTRODUCTION

A first course in machine organization is found in the lower division of many electrical and computer engineering curricula.
It usually follows a programming course and a logic design course, and it is intended to provide a basic introduction to the architecture and input/output interface of a simple general-purpose processor. With growing interest in mechatronics and instrumentation control, such a course may also be valuable to mechanical engineering students. A more advanced treatment of computer architecture is usually found in an upper division computer engineering course.

The content of the machine organization course has evolved as technology has changed the way general-purpose processors are designed and used. Twenty years ago the course would have focused on assembler and machine level instructions for a simple Intel or Motorola processor, and students would have used assembly language programming for microprocessor projects. Hardware multiplication, if available, would have required an additional mathematics coprocessor chip. As semiconductor technology advanced following Moore's law, processors and instruction sets grew in complexity, and compilers became more sophisticated. The focus of the course shifted away from proficiency in assembly language for the purpose of writing better code than a compiler. Understanding alternative architectures for general-purpose processors and knowing how higher level language instructions were interpreted for a traditional ISA became more important. The course typically included principles of design for data paths, register transfer and register files, design of computational components, and design of control units.

Today an introductory machine organization course is often the preparation for understanding and using a wide variety of processors, including the simplest of microcontrollers, embedded system applications, complex general-purpose processors, and processors tuned for specific applications such as digital signal processing.
The diversity of processors has grown as a result of optimizing the use of the ever-increasing capability of logic and memory for a wide range of applications. Each of the different architectural approaches to improving performance using traditional ISAs imposes its own corresponding layers of complexity. For example, registers can be added for pipelining to allow some parallel use of resources at the expense of restricted ordering of instructions. VLIW processors, such as the TMS320C6711 [9], include multiple processors with different ISAs optimized for different subsets of instructions. Efficient parallel execution on these processors is best achieved with a good compiler. At a higher level, use of multiple processors in parallel requires effective strategies for partitioning tasks and managing communication between processors. However, at a much lower level of the architecture, circuits can be designed for parallel implementations of computations traditionally implemented with a sequence of instructions. FPGAs, which were originally used primarily for glue logic, now have distributed memory and multiplication resources that are capable of handling computational tasks directly. This leads to an entirely new perspective on the tradeoffs in architectural approaches for computationally intensive applications in the context of current technology.

1 Sally L. Wood, Professor, Electrical Engineering, Santa Clara University, Santa Clara, California, 95053, swood@scu.edu
2 Chris Dick, Signal Processing Group, Xilinx, Inc., 2100 Logic Drive, San Jose, California, 95124, chris.dick@xilinx.com

NEED FOR BROADER PERSPECTIVE WITH HDLS AND FPGAS

FPGA-based computing offers the possibility of dramatically extending the capability of traditional von Neumann machines, which are based on the concept of stored instructions and a program counter to emulate a datapath. The von Neumann architecture has been very successful over the past 50 years and has been applied to several generations of computers, but it reflects a set of optimizations based on assumptions that become less valid as semiconductor technology progresses. For example, the functional units of the ENIAC [6] were physically separate, with electromagnetic memory components and vacuum tube switches for computation and control. The separation of the memory and the computation in the computing model, with regulated transfers between them, realistically reflected the physical implementation. Semiconductor components soon replaced other forms of memory and computational units, but for many years the memory and computational circuits were still separate components. Computation was still an expensive resource, often found in an optional separate component. Algorithms to minimize or avoid multiplication, such as the FFT [1] or Bresenham's algorithm [3], were useful.

Although we still use the von Neumann model today, the physical structure of computational hardware is fundamentally different. The functional units are inexpensive and highly integrated. For example, a current state-of-the-art FPGA like the Virtex-II Pro XC2VP125 [12] supplies 556 18-bit-by-18-bit embedded multipliers. Methods of realizing the full potential of massively parallel arrays of computing elements are needed, but this is rarely included in engineering curricula.
A major issue in the context of microprocessors is the difficulty of realizing mechanisms for efficiently utilizing the vast array of compute nodes when bounded by the operating principles of the von Neumann machine. The question of writing effective compilers for this combination of constraints also becomes an important issue. FPGAs remove the structural limitations of the ISA machine and allow designers the freedom to build custom datapaths to solve a specific problem rather than using a general-purpose datapath to emulate the desired datapath.

To add this option to the general set of design alternatives included in a university curriculum, a design methodology is needed for the emerging use of spatial computing. The lack of visibility in current curricula is partially due to the relatively short amount of time that reconfigurable devices have been used for mathematically intensive applications. A significant issue that must be addressed is the lack of FPGA design flows for computationally intensive applications such as signal processing and communications. Until recently, using an FPGA for these applications has required intimate knowledge of hardware description languages (HDLs), but a higher level design methodology is needed. In addition, a more visually intuitive design tool is desirable to emphasize the spatial or parallel structures of a design.

FPGA tool flows have changed in recent years, and practicing engineers and classroom laboratories now have access to design flows that allow the designer to work in the language of the application [2]. In the case of signal processing, Matlab with Simulink [5] from The Mathworks is a common framework for providing students with experience coding and simulating signal processing and communications concepts. This same environment can now be used to produce an implementation in an FPGA.
With this capability, we can include some basic alternatives to ISA-based processing in an introductory architecture course at an appropriate level of complexity.

CURRENT COURSE CONTENT

The "Introduction to Digital Systems Architecture" course is a sophomore level required course for electrical engineering students who have already taken a course in logic design and a course in C language programming. For many students who choose their electives in other areas, this is the last course they will take in the area of digital design, so it is important that the course include exposure to new design approaches that are likely to significantly impact future applications. This course has evolved from a course with a heavy emphasis on assembly language programming of a specific microprocessor into a more general course on principles of machine organization. Although the course emphasizes basic principles of datapath designs, register transfer, register files, arithmetic units, instruction decoding, and instruction cycle control logic, a specific assembly language, or a subset of one, is used to complement the generic instructions.

Several years ago this course was changed so that the examples of architectural design elements and instruction sequences were based on the implementation, but not design, of FIR and IIR digital filters. This approach had several advantages. Digital filtering provided a structured continuing example for use throughout the course to illustrate basic concepts. The general mathematical computations using vector dot products demonstrated addressing modes, data types, and use of queues and stacks. The use of audio inputs and outputs as the interface examples added interest for the students while demonstrating data type conversions and interrupt handling. This also helped demonstrate the implications of the architecture on real-time applications as well as alternative input and output methods [10].
A third advantage is that this course became more complementary to the second analog circuits course, usually taken at the same time, where the students analyzed frequency selective analog circuits.

Now the first 25% of the course covers traditional datapath and control logic topics. It is supplemented by laboratories in which students design an FPGA implementation of the instruction decoding and register transfers for a very simple single-cycle architecture with a very limited instruction set using only immediate and register addressing modes. Students use the same design tools and FPGAs that were used in the prerequisite logic design course, so these laboratories are an extension of previous activities. From this experience they gain a basic understanding of processor structure and a framework for using specific microcontrollers in future projects. Instruction sequences are created to compute the Fibonacci sequence or the sequence of squares of integers.

For the remainder of the course the TMS320C6711 has been used for the laboratories, replacing the earlier TMS320C30. The change to the VLIW processor created some problems due to the multiple processors and the complexity of parallel use. The compiler complexity of assigning different instructions to different processors is not explored in detail. However, the parallel instructions provide the opportunity to compare slightly different ISAs and explore the tradeoffs made to improve performance in one area at the expense of another.

To further improve the scope of this course, a new section comparing the use of traditional processors to parallel implementation alternatives using modern FPGAs was planned. This would move the parallel utilization of resources to a lower level that was more appropriate for a lower division course.

ADDITION OF PARALLEL OR SPATIAL COMPUTING CONCEPTS

There are many challenges in adding material on basic concepts of spatial computing to an early machine organization course, although in several respects this is a logical place to introduce them. Students in the first circuits and logic design courses are always thinking in terms of parallel operation of the elementary circuit components. In contrast, current efforts to parallelize computation using arrays of complex processors presume a mature understanding of the processors' capabilities. Complex issues of parallel processors include the general problems of partitioning a task among multiple processors, the structure of the communication and memory access between processors, and compiler design for multiple processors.
Normally this material is found in an advanced computer architecture course, and students in an introductory course would not have the background necessary to appreciate these issues.

While it is not really feasible to address these complex issues of parallel processor architecture in the machine organization class, it is important that the students understand that parallel usage of distributed computational or sensor resources will be an important design consideration in the near future. Not all aspects of parallel computing need be based on complex multiprocessor interaction, which is deferred to advanced courses. Since the sophomore level course lays the foundation for design of datapaths and control units, it is possible to add material that will lead to options other than a traditional processor with an associated register file and instruction set. Implementation of several elementary processes used in digital signal and image processing provides examples for alternative computational architectures. These topics will be added to the last 25% of the course, which typically focuses on recent advances in architectures or comparison of different general-purpose processors.

There is currently no widely accepted design methodology for implementation of parallel computing at this level, and examples should be chosen to illustrate the types of computation that might be used in a wide variety of applications. Use of an HDL allows the description of a design with parallel data paths and simultaneous computation, but the verbal level of abstraction and the sequential list of statements may obscure the parallel nature of the architecture from the student at this introductory level. More graphic tools are needed to emphasize the simultaneous utilization of a spatial array of computational resources and to assist in visualization of the data flow.
The last laboratory experiments for this course will use a visual block diagram based design tool such as VAB or Simulink [5] and Xilinx System Generator [11]. These laboratories will consider an alternate arithmetic approach and two applications that are usually thought of in terms of sequential processing, but could benefit from a parallel implementation.

CSD Arithmetic

The first new topic combines concepts of computation concurrency, customized datapaths, and alternative techniques for performing arithmetic. It implements a digital filter directly using canonic signed digit (CSD) arithmetic. Instruction set machines, as used earlier in the course, typically use conventional two's complement arithmetic or possibly floating-point arithmetic, and implement the filter using data queues. The very architecture of these devices restricts the designer from implementing computations with, say, finite field arithmetic or redundant arithmetic. These avenues can be explored for digital filtering applications in the spirit of producing a reduced complexity implementation that is optimized in some sense, for example, a minimum number of multiplications.

A CSD filter is a multiplier-free approach to the filter implementation. Coefficients are coded using a radix-2 number system where digit positions in a word are in the set {-1, 0, +1}. The reader is referred to [7] for a more complete description of filter implementation using CSD techniques. CSD filters are commonly used in VLSI signal processing devices, for example for implementing matched filters [8] and supplying x/sin(x) correction. These same techniques are highly applicable to FPGA implementations of these types of processing. Our specific example is based on the DVB demodulator CSD-based matched filter described in [8]. The objective is to implement the 9-tap FIR filter characterized in Table I.
In this example the student explores computation parallelism, filter architecture options, and computing using an alternative number system. All of this is achieved in the context of a very real example and can be implemented using FPGA technology in the laboratory. FIGURE 1 shows the implementation, which uses only shifts, addition, subtraction, and delay elements. This implementation uses a transposed structure that also exploits coefficient symmetry. The filter structure and coefficients are provided to the students, who will compare this implementation to a previous laboratory which implemented a real-time FIR filter with a DSP.
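The multiplier-free idea can be illustrated in software before it is built in hardware. The following Python sketch is an illustration only, not the laboratory design: the function names (`to_csd`, `csd_multiply`, `fir_shift_add`) and the constant `DVB_COEFFS` are hypothetical, the coefficient values (1, -2, -1, 8, 14, 8, -1, -2, 1) are those read from Table I, and the filter is written in direct form rather than the transposed, symmetry-exploiting structure of FIGURE 1. It recodes each coefficient into digits from {-1, 0, +1} and then forms every product using only shifts, additions, and subtractions.

```python
def to_csd(value, width):
    """Recode an integer into CSD digits (-1, 0, +1), most significant first.

    Uses the standard non-adjacent-form recoding: no two neighboring
    digits are both nonzero.
    """
    digits = []  # collected least significant digit first
    n = value
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 when n mod 4 == 1, -1 when n mod 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    if len(digits) > width:
        raise ValueError("value does not fit in the requested width")
    digits += [0] * (width - len(digits))
    return digits[::-1]


def csd_multiply(x, digits):
    """Multiply x by a CSD-coded constant using only shifts and add/subtract."""
    acc = 0
    for shift, d in enumerate(reversed(digits)):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc


# Coefficients h(-4) .. h(4) as read from Table I (an assumption of this sketch).
DVB_COEFFS = [1, -2, -1, 8, 14, 8, -1, -2, 1]


def fir_shift_add(samples, coeffs, width=5):
    """Direct-form FIR filter on integer samples with no multiplications."""
    codes = [to_csd(h, width) for h in coeffs]
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, code in enumerate(codes):
            if n - k >= 0:
                acc += csd_multiply(samples[n - k], code)
        out.append(acc)
    return out
```

Feeding a unit impulse through `fir_shift_add` returns the coefficient list itself, which is a quick check that the shift-and-add products agree with ordinary multiplication. In hardware, each nonzero CSD digit corresponds to one adder or subtractor and each shift is simply wiring, which is why the coefficient 14 costs a single subtraction (16 - 2) instead of the two additions (8 + 4 + 2) implied by its two's complement form.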

TABLE I: DETAILS OF DVB MATCHED FILTER.

Coefficient    Value    2's Complement    CSD Representation
h0              14      01110             1  0  0 -1  0
h-1 = h1         8      01000             0  1  0  0  0
h-2 = h2        -1      11111             0  0  0  0 -1
h-3 = h3        -2      11110             0  0  0 -1  0
h-4 = h4         1      00001             0  0  0  0  1

(CSD digits are listed most significant first; -1 denotes a negative digit.)

Image Processing Example

The multidimensional nature of image data makes image and video processing applications a natural candidate for computing architectures that supply concurrency at the data and functional unit level. This approach has been adopted by practitioners for many years, but until recently it would have been difficult to include in an undergraduate class. Certainly, few courses provide students with any practical experience in this area.

A first image processing laboratory might extend the one-dimensional filters used earlier to two-dimensional data with a simple 2D filter experiment. FIGURE 2 shows the top level implementation of a 5x5 filter. Moving from left to right we can observe the data source (read from the Matlab workspace), the line buffers, the 5x5 filter engine, and the output sink. In this example the processed image is written to the Matlab workspace. If FPGA hardware is available with a suitable video output device, the output can be displayed on a regular VGA monitor as would commonly be found in an undergraduate laboratory. The 5x5 filters can implement smoothing or sharpening operations, and real-time input from a PC camera may also be used.

The students are introduced to, and required to construct for themselves, parallel memories (line buffers) for storing the image data, in addition to implementing a concurrent structure for computing the 2D inner product. In the example shown in FIGURE 3 the convolver is resourced with five inner product engines. Each engine is tasked with computing a 5-tap filter. A total of 5 multipliers is used in this example. FIGURE 4 shows the input and output images for the simulation model. This example can be extended in many useful ways to reinforce different concepts.
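The structure of the 5x5 convolver described above can be modeled in a few lines of software. The Python sketch below is a sequential illustration under stated assumptions, not the System Generator design of FIGURE 2 and FIGURE 3: the names `row_engine` and `filter_5x5` are hypothetical, the image is a plain list of rows, only fully covered output pixels are computed, and the kernel is applied as a 2D inner product without flipping. Five rows are held at once to mimic the line buffers, and each output pixel is the sum of five 5-tap row inner products, one per engine.

```python
def row_engine(window_row, kernel_row):
    """One inner-product engine: a 5-tap dot product over part of one image row."""
    return sum(p * k for p, k in zip(window_row, kernel_row))


def filter_5x5(image, kernel):
    """Apply a 5x5 kernel as a 2D inner product over every full 5x5 window.

    image  : list of rows, each a list of numbers (at least 5x5)
    kernel : list of five rows of five coefficients
    Returns an output that is 4 rows and 4 columns smaller than the input.
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * (cols - 4) for _ in range(rows - 4)]
    for r in range(rows - 4):
        # Five consecutive image rows, playing the role of the line buffers.
        line_buffers = image[r:r + 5]
        for c in range(cols - 4):
            # In hardware the five engines work concurrently; here they are
            # evaluated one after another and their results are summed.
            partial = [
                row_engine(line_buffers[i][c:c + 5], kernel[i])
                for i in range(5)
            ]
            out[r][c] = sum(partial)
    return out
```

With a kernel that is zero everywhere except for a 1 at its center, `filter_5x5` returns the interior of the input image unchanged, which makes a convenient first test before trying smoothing or sharpening kernels. Each call to `row_engine` in this model performs five multiplications in sequence, corresponding to one multiplier being reused across the five taps of an engine.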
For example, the five time-shared filters could be redesigned by students to be fully parallel functional units, each module now using five FPGA embedded multipliers instead of the one embedded multiplier in our earlier example. The implementation would now use a total of 25 multipliers compared to the 5 used in the design of FIGURE 2.

The major thrust of the lab experiment is to provide students with hands-on experience using parallel hardware and, further, to expose them to the various tradeoffs that are available in state-of-the-art computing technologies like FPGAs. In this simple example, students gain exposure to and experience with a methodology for producing parallel hardware, making throughput-area tradeoffs, and constructing customized data paths.

ASSESSMENT

The main objective of the modifications to the introductory digital architecture course is the introduction of concepts of parallel and spatial computation into the earliest treatments of architecture. These modifications need to be assessed from several perspectives. The primary concerns will be the impact on student understanding of basic principles of computational architectures and the integration of the new material with existing course objectives and content. The effectiveness of the new laboratory experiences must also be carefully evaluated to determine how well they reinforce concepts and whether the level of abstraction is appropriate.

Initial feedback from students is that the laboratories are interesting and that the real-time processing of audio frequency signals from a signal generator, as well as audio input from a microphone or CD, provides more reward for their effort than the earlier laboratories in which they single-stepped through instructions and observed the register and memory contents. The image processing component also provides motivation as well as demonstrating processing options.
The block diagram specification of the process illustrates both the potential for parallel processing and a higher level design that is converted into machine instructions or FPGA bitmaps for real-time execution. While the course appears to achieve the objective of having students design real-time processing applications using the wide range of design capture for FPGAs, assembly language and C programming instructions, and block diagram specifications, the role of the processor will be examined further. The complexity of the VLIW processor is not specifically addressed in the course, and other processor options will be considered.

This course also achieved some success in terms of horizontal integration with other courses taken at the same time. Previously this course was a continuation of state machine design from the basic logic design course with an emphasis on specific assembly language programming skills that might later be used in a microprocessor or architecture course. Now it connects to the circuits course since the students create frequency selective circuits and implement them in real time. It also connects to the data structures course with a register transfer view of queues, stacks, and pointers.

CONCLUSIONS

This paper describes a strategy for the introduction of some basic concepts of parallel computation suitable for FPGA implementation into a lower division machine organization course. The "Introduction to Digital Systems Architecture" course had already been modified from a traditional machine organization course to incorporate implementation, but not design, of FIR and IIR digital filters to demonstrate the course concepts. The course has been further modified to add some exposure to concepts of spatial computation that are expected to play an increasingly important role as FPGAs are used in more applications traditionally handled by general or special purpose processors. It is expected that the new course design will better prepare students both for advanced digital systems courses and for signal processing and communications electives.

REFERENCES

[1] Blahut, R. E., Fast Algorithms for Signal Processing, Addison Wesley, 1985.
[2] Dick, Chris, Rediscovering Signal Processing: A Configurable Logic Based Approach, Proc. 37th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, California, November 9-12, 2003.
[3] Foley, J. D. and A. van Dam, Fundamentals of Interactive Computer Graphics, Addison Wesley, Reading, Massachusetts, 1982.
[4] Hennessy, J. L. and D. A. Patterson, Computer Architecture: A Quantitative Approach, 3rd edition, Morgan Kaufmann Publishers, San Francisco, CA, 1995.
[5] The Mathworks, Inc., Using Simulink, 2002.
[6] McCartney, S., ENIAC, The Triumphs and Tragedies of the World's First Computer, Walker and Company, New York, 1999.
[7] Parhi, K. K., VLSI Digital Signal Processing Systems, Wiley, 1999.
[8] Meyr, H., M. Moeneclaey, and S. Fechtel, Digital Communication Receivers: Synchronization, Channel Estimation, and Signal Processing, Wiley, 1998.
[9] Texas Instruments, http://dspvillage.ti.com/docs/dspvillagehome.jhtml
[10] Wood, Sally L., DSP Second: A Sophomore Level DSP Architecture Course in the Electrical Engineering Core, Proc. SPE 2000: First Signal Processing Education Workshop, Hunt, Texas, 15-18 October 2000.
[11] Xilinx Inc., System Generator for DSP, http://www.xilinx.com/xlnx/xil_prodcat_product.jsp?title=system_gener ato
[12] Xilinx Inc., Virtex-II Pro Platform FPGAs, http://www.xilinx.com/xlnx/xil_prodcat_landingpage.jsp?title=virtex II+Pro+FPGAs

[Figure 1 diagram: input x(n), unit delay elements, output y(n)]
FIGURE 1: CSD FILTER IMPLEMENTATION OF DVB RECEIVED MATCHED FILTER.

FIGURE 2: 5X5 CONVOLVER (TOP LEVEL) IMPLEMENTED IN SYSTEM GENERATOR.

FIGURE 3: SYSTEM GENERATOR IMPLEMENTATION OF A 5X5 FILTER.

[Figure 4 panels: Original Image; Image Filtered with Edge Operator]
FIGURE 4: EDGE DETECTOR INPUT AND OUTPUT IMAGES.