A Self-Reconfigurable Implementation of the JPEG Encoder


Antonino Tumeo, Matteo Monchiero, Gianluca Palermo, Fabrizio Ferrandi, Donatella Sciuto
Politecnico di Milano, Dipartimento di Elettronica e Informazione
Via Ponzio 34/5, Milano, Italy
{tumeo,monchier,gpalermo,ferrandi,sciuto}@elet.polimi.it

Abstract

Dynamic reconfiguration makes it possible to selectively substitute blocks of logic at run-time in order to improve the area efficiency of an FPGA design. This paper presents the design of a JPEG Encoder which exploits this feature. We propose a mixed HW/SW architecture, where the most compute-intensive components of the application are mapped to application-specific HW cores. These cores dynamically alternate on the FPGA. Our purpose is to describe a real-world application of reconfigurable computing, illustrating how this approach saves area with negligible performance overhead. We built a fully working prototype, which demonstrates that the reconfigurable JPEG encoder achieves 29.6% area saving, 1.5% performance loss, and negligible power overhead with respect to a solution which uses statically mapped HW cores.

1 Introduction

Reconfigurable platforms have emerged as an important alternative to ASIC design, trading relatively lower performance for flexibility [3,14]. Nevertheless, for medium-high volume production, the area cost of an FPGA device is still appreciable compared to ASIC. For ASIC, the cost is dominated by factors other than area (design and fixed costs). Although time to market may favor FPGA over ASIC for the whole project, area-dependent costs still exist for FPGAs. For these reasons, techniques to efficiently manage area in FPGAs are crucial to achieve a cost-effective design. Dynamic reconfiguration of some portion of logic is a way to reuse area, thus saving resources.
Nevertheless, some drawbacks exist, related in particular to reconfiguration time. Reconfiguring an FPGA consists of writing a specific bitstream with the information on how configurable logic blocks and switch matrices must be re-programmed. Since the research on reconfigurable architectures is quite wide, the community uses three main points to classify a reconfigurable approach: when, how and where the reconfiguration takes place.

When. The reconfiguration can be static or dynamic. Static reconfiguration is done while the device is still inactive, while dynamic reconfiguration takes place while the FPGA is running, and can be directed by the application itself.

How. The reconfiguration can be full or partial. Full reconfiguration concerns the complete FPGA, while partial reconfiguration concerns only a portion of the FPGA (the remaining logic maintains its functions).

Where. The reconfiguration can be external or internal. The reconfiguration is external when another device re-programs the FPGA, while it is internal when the FPGA itself loads the bitstream and reconfigures. In this case, a specific device (module) is needed inside the FPGA to perform these operations.

It is important to notice that, while the latency of external reconfiguration can prohibit dynamic reconfiguration (or make it significantly more complex), internal reconfiguration can be suitable for well-balanced adaptive systems. Internal reconfiguration, a.k.a. self-reconfiguration, has been explored only recently, specifically for the analysis of real-world applications.

The future of dynamic reconfiguration is still a bit uncertain. To our knowledge, Xilinx is the only FPGA vendor providing comprehensive support; other providers chose not to include such features due to reliability issues.
Unlike many previous works, which focus on technology issues of dynamic reconfiguration, we believe that the architecture/application-level study of these systems is fundamental to better understand the pros and cons of this technology. In this paper, we present a dynamic-partial-internal (self-reconfigurable) approach to the design of the JPEG Encoder. Our design uses a HW/SW computation model, where a software program off-loads part of the computation to the FPGA hardware. We propose a HW/SW architecture which dynamically alternates two hardware cores (RGB to YUV Color Space Conversion and 2D Discrete Cosine Transform) by reconfiguring a portion of the FPGA logic. We show that our architecture can save 29.6% area at the cost of 1.5% performance loss, with respect to a more aggressive solution featuring all hardware cores statically mapped on the FPGA. In addition, we evaluated the power consumption of our architecture by measuring current absorption, showing that the reconfiguration has a negligible energy and power overhead.

The paper is organized as follows. After introducing some related works (Section 2), we describe the design steps needed to achieve a working prototype (Section 3), and then we discuss the proposed architecture (Section 4). Finally, we present the experimental evaluation of our design (Section 5).

/07/$ IEEE

2 Related Work

JPEG is a typical benchmark application for many systems and methodologies. Examples include the paper by Narasimhan et al. [8] about HW/SW codesign, and the one by Shee et al. [11] about heterogeneous multiprocessor SoCs. Regarding reconfigurable architectures and platforms, the literature has focused on several different aspects. In the following we discuss the most important ones related to this paper.

Reconfiguration Technologies. Most works focus on technologies and design techniques to better support reconfiguration. Among the most recent ones, we mention the paper by Lysaght et al. [7]. The authors describe the architectural enhancements to the latest Xilinx FPGAs related to reconfiguration latencies, modular design (i.e., flexible reconfiguration areas), and static/dynamic region interfaces. Since achieving efficient modular design is key for many system designs, while reconfigurable regions must be statically determined, a few papers proposed design techniques to better deal with this problem. Sedcole et al. [10] discuss two dynamic reconfiguration techniques for 1D and 2D reconfiguration. Hubner et al. [6] propose on-demand partial reconfiguration approaches. Reliability is a well-known issue for partially reconfigurable hardware, since the bitstream download can cause electrical glitches. Paulsson et al.
[9] describe some fault detection techniques integrated in a self-reconfigurable system.

Design Methods. Appropriate frameworks and methodologies are needed to support efficient design. Burns et al. [2] present a run-time software system, which interacts with the FPGA to handle the reconfiguration. Donato et al. [4] focus on the design methodology, describing a design flow for embedded applications. Banerjee et al. [1] present a HW/SW codesign framework for partial (external) dynamic reconfiguration. They use integer linear programming and consider issues such as configuration prefetch for minimizing latencies.

Reconfigurable Computing Paradigms. Reconfigurable units have been proposed to extend the functions and improve the efficiency of conventional processor-based computing. Along this direction, Hauck et al. [5] propose a system (Chimaera) that integrates reconfigurable logic into a host processor with direct access to the host processor's register file. This enables the creation of multi-operand instructions and a speculative execution model. Vassiliadis et al. [15] present MOLEN, a polymorphic processor paradigm incorporating run-time programmable units. Our computation model may resemble MOLEN, since it follows the general idea of a HW/SW system off-loading tasks to the hardware, which can be considered a widely accepted guideline for designing these systems. Suri [12] proposes an architecture where a reconfigurable unit is coupled with the superscalar data-path. Hot traces of instructions are determined at run-time and then mapped onto the reconfigurable units.

These works are orthogonal to ours, since we follow a different philosophy. Our intent is not to propose reconfiguration techniques, a processor extension or a general methodology. Instead, we focus on a specific application, proposing an optimized architecture and a computation model which uses reconfiguration to improve its efficiency.
3 Implementing Partial Dynamic Reconfiguration

Implementing internal partial dynamic reconfiguration is not yet a fully automatic process. Xilinx offers initial support for its FPGA devices through specific updates to its toolchain. Both the Virtex-II and Virtex-4 families now support 2D module-based reconfiguration following a specific design and implementation flow. Nevertheless, there are some issues that need to be addressed, as explained below. The design flow to implement partial dynamic reconfigurable architectures can be roughly divided into three phases:

Initial budgeting. In the first phase, it is necessary to define the constraints of the design in terms of the area dedicated to each reconfigurable module. If two or more modules are determined to share the same location, the minimum area dedicated to them is the area of the largest module. Since the fixed and the re-programmable parts of the architecture need to communicate with each other, a fixed communication path is necessary as well. This is defined by placing specific bus macros on the boundary of the two regions, through which all the communications between the two parts will pass. This means that the fixed part and all re-programmable modules communicate through the same interface. Bus macros should also be placed carefully with timing constraints in mind, since concentrating all the communication in the same locations can create routing problems.

Implementation. The second phase is the implementation of the base design, which is the fixed part of the architecture, followed by the implementation of each reconfigurable module. During this phase the base design and the reconfigurable modules are mapped and routed independently from each other. It is necessary to check that the timing constraints of the different parts are met.

Assembly. The last phase of the flow is the assembly of the fixed and reconfigurable parts. The final bitstreams are generated: full bitstreams with the fixed part and one of the modules for each area, and partial bitstreams with just the modules to reprogram specific locations. Blanking bitstreams are also generated. After loading one of the full bitstreams, it is possible to reconfigure part of the device with a partial bitstream by means of external or internal reconfiguration interfaces.

4 Architecture

This section presents our base system (Section 4.1) and the complete architecture for the reconfigurable JPEG encoder (Section 4.2).

4.1 Base System

Our reconfigurable JPEG encoder is based on the MicroBlaze soft processor from Xilinx. The design was implemented with the Embedded Development Kit (EDK) 8.2 and the MicroBlaze version 5.00c. MicroBlaze 5.00c is a five-stage pipelined RISC processor with a standard Harvard architecture. The processor connects to a Local Memory Bus (LMB), an On-Chip Peripheral Bus (OPB), and a Fast Simplex Link (FSL). The latter is the point-to-point connection used to implement the hardware accelerators. Figure 1 illustrates the organization of the system. Our base architecture is implemented on a Virtex II-Pro XC2VP30-FF896, speed grade -7. The processor is connected to the OPB, together with the external memory of 256 MB (DDR RAM), the Microprocessor Debug Module (MDM), the timer peripheral (to debug and profile the application), and the controllers for the UART and the Compact Flash (Sysace). The Flash is used to store the input and output files and, initially, the bitstreams for reconfiguration.
The MicroBlaze has been configured with a 2 KB instruction cache and an 8 KB data cache for the external memory. The architecture also incorporates the wrapper for the Internal Configuration Access Port (ICAP) of the FPGA, which enables the processor to reconfigure the device at run-time.

Figure 1. Overview of the Reconfigurable JPEG Encoder architecture

4.2 Application Partitioning and Mapping

The software application implements the baseline JPEG compression algorithm with Huffman coding and is composed of six phases: (i) original image (.PPM format) reading, (ii) RGB to YUV color space conversion (for color images), (iii) expansion and downsampling, (iv) quantization tables setting, (v) two-dimensional Discrete Cosine Transform (2D-DCT), quantization and zig-zag reordering, (vi) entropy coding and file saving.¹

Accelerators. The RGB to YUV color space conversion and the 2D-DCT steps are the most computationally intensive phases of the compression algorithm, so we identified them as the kernels for hardware acceleration. We described in VHDL two specific IP cores to execute the RGB to YUV conversion and the 2D-DCT. The RGB to YUV IP core executes three multiplications, four additions and a shift for each component of a single pixel, as required by the standard integer conversion formulas. A major speed-up is obtained because the color space conversion is done in parallel for multiple pixels: our RGB to YUV hardware accelerator is filled with RGB data for 16 pixels and produces 16 YUV encoded pixels for each execution. The 2D-DCT implementation, on the other hand, is an innovative architecture optimized for the area-delay trade-off [13]. The core implements a fast 2D-DCT algorithm optimized to reduce the number of functional units.
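The arithmetic attributed to the RGB to YUV core earlier in this section (three multiplications, up to four additions and one shift per component, applied to 16 pixels per execution) can be illustrated with a minimal software sketch. The fixed-point coefficients below are an assumption: they are a common 8-bit approximation of the JFIF conversion weights, not necessarily the ones used in the VHDL core. The names `rgb_to_yuv`, `convert_block` and `clamp8`, as well as the explicit saturation step, are hypothetical and introduced only for illustration.

```python
# Minimal sketch of an integer RGB -> YUV (YCbCr) conversion.
# ASSUMPTION: coefficients are the JFIF weights scaled by 256 and rounded;
# the source only states the operation count, not the actual constants.

BLOCK_PIXELS = 16  # pixels handled per accelerator execution (from the text)


def clamp8(value):
    """Saturate an integer to the unsigned 8-bit range (illustrative step)."""
    return max(0, min(255, value))


def rgb_to_yuv(r, g, b):
    """Convert one 8-bit RGB pixel to 8-bit (Y, U, V).

    Each component uses three multiplications, two additions to sum the
    products, one addition for rounding (+128 before the shift) and one
    right shift by 8. U and V add a fourth term, the +128 offset;
    Y needs no offset.
    """
    y = (77 * r + 150 * g + 29 * b + 128) >> 8
    u = ((-43 * r - 85 * g + 128 * b + 128) >> 8) + 128
    v = ((128 * r - 107 * g - 21 * b + 128) >> 8) + 128
    return clamp8(y), clamp8(u), clamp8(v)


def convert_block(pixels):
    """Convert a block of 16 RGB pixels, mirroring one accelerator call."""
    if len(pixels) != BLOCK_PIXELS:
        raise ValueError("expected exactly 16 pixels per block")
    return [rgb_to_yuv(r, g, b) for (r, g, b) in pixels]


if __name__ == "__main__":
    # Sixteen pure-red pixels, converted in one block.
    block = [(255, 0, 0)] * BLOCK_PIXELS
    print(convert_block(block)[0])
```

In hardware the 16 conversions would run in parallel rather than in a loop; the list comprehension here only models the data granularity of one execution.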
Since the 2D-DCT can be decomposed into two 1D-DCT executions, one on the rows of the initial block of samples (8x8 pixels as required by the JPEG standard) and the other on the columns of the matrix resulting from the first transform, we reuse the same seven-stage pipeline with a special transpose memory. Area details of the two IP cores are reported in Table 1. Our target is a timing specification of 50 MHz, which is satisfied by both designs. The data show a high utilization of slice flip-flops for the 2D-DCT, since they are used to implement the transpose memory. In contrast, the number of functional units is limited, since one of the design targets for this accelerator is to obtain a good performance/occupation balance. The RGB to YUV core uses fewer memory units, but requires more functional units since computations for multiple pixels are done in parallel. The synthesizer reports that the estimated utilization in terms of Slices² of the 2D-DCT core is only 10% higher than that of the RGB to YUV core, so in the end the 2D-DCT is only slightly bigger than the RGB to YUV accelerator. Since the two phases of the JPEG encoding algorithm are executed at different times, the two cores are good candidates for sharing the same area in a reconfigurable architecture flow.

¹ The application accounts for 250 KB of compiled code, thus it is downloaded entirely to the external memory of the device.

² Xilinx defines a slice as a group of two slice flip-flops and two 4-input LUTs for the Virtex-II Pro devices. The number of required slices is an indication of the minimum effective area required by a design, although the placer can decide to group unrelated logic in the same slice.

Table 1. Resources utilization of the two hardware accelerators
Resource            RGB to YUV    2D-DCT
Slice Flip-Flops
Input LUTs
Slices

Integrating the Accelerators in the System. The RGB to YUV and the 2D-DCT IP cores are assigned to the same area, which is sized according to the occupation of the 2D-DCT hardware accelerator. All the other locations are free for placing and routing the fixed part. The fixed part has been slightly modified by hand with respect to the standard implementation obtained by EDK, in order to adhere to the rules for the placement of clock generators and buffers, and to permit the placement of the bus macros. These restrictions are imposed by the Xilinx reconfigurable flow. To connect the fixed part to the reconfigurable units, standard slice bus macros with enable signals managed by a specific wrapper have been used.

Execution Flow. The reconfigurable JPEG encoder is started by downloading to the FPGA the full bitstream incorporating the RGB to YUV accelerator. During a bootstrap phase, partial bitstreams are read from the compact flash memory and stored in the external DDR RAM, ready for reconfiguration. Then the JPEG encoding algorithm starts. The input image is read from the compact flash and the hardware RGB to YUV color space conversion is executed. Phase (iii), expansion and downsampling, and phase (iv), quantization tables setting, follow. At this point, the MicroBlaze launches a routine that dynamically reconfigures the architecture. The processor reads the 2D-DCT partial bitstream from the memory and writes it to the ICAP, thus executing internal partial dynamic reconfiguration. When the reconfiguration completes, the algorithm can perform the 2D-DCT using the hardware accelerator. The RGB to YUV core is then reconfigured in place of the 2D-DCT core to allow new iterations of the algorithm. Finally, the application executes Huffman coding and stores the resulting JPEG file to the compact flash.

Note that the sizes of full bitstreams depend on the physical size of the device, while the sizes of partial bitstreams depend only on the size of the reconfigurable area. Thus, the size of the partial bitstreams for different modules sharing the same area does not change even if they do not use all the resources in the area, since all the allocated space must always be configured to a known and safe state. Reconfiguration time is directly proportional to the size of the partial bitstream, so smaller areas for reconfigurable modules mean less reconfiguration overhead. Nevertheless, high utilization of a reconfigured module can justify longer reconfiguration times.

Table 2. Execution time breakdown for the three implemented architectures
Phase (kcycles)           Full Soft    Both IP    Rec
Reading
RGB to YUV
Downsample and Exp
Set Quant. Table
RGB to YUV → DCT
DCT and Quant
DCT → RGB to YUV
Ent. Coding and Saving
Total

5 Experimental Results

In this section we compare the proposed reconfigurable implementation of the JPEG encoder with two alternatives.
The first one, called Full Software, is a complete software version of the JPEG encoder running on the MicroBlaze. The second one, called Both IP Cores, is a software version of the JPEG encoder where both the RGB to YUV and the DCT phases are accelerated by hardware coprocessors. The proposed Reconfigurable architecture uses the same software code as the Both IP Cores architecture, apart from the functions used to implement the reconfiguration. The input dataset for the JPEG encoder is a 160x120 pixel PPM image (60 KB).

Figure 2. Area and Delay for the three architectures (Full Software, Both IP Cores, Reconfigurable): (a) Area [# of Slices]; (b) Delay, in terms of Execution Time [cycles].

Area. Figure 2(a) shows the area, in terms of number of slices, for the three architectures. The Both IP Cores architecture uses around 270% of the resources of the Full Software architecture. This is due to the hardware coprocessors. On the other hand, using the proposed Reconfigurable architecture, the area in terms of slices is reduced by 30% with respect to the Both IP Cores architecture. This reduction is due to the partial reconfiguration, which allows the two coprocessors to overlap in the same area.

Performance. Figure 2(b) shows the execution cycles. The advantage given by the two hardware coprocessors is evident: the Both IP Cores architecture shows a speed-up of 3.02 with respect to the Full Software implementation. The execution time of the proposed Reconfigurable architecture is only 1.5% worse than that of Both IP Cores. This increase is due to the reconfiguration. Table 2 shows the breakdown of execution cycles (in thousands of cycles) over all phases of the algorithm. The table clarifies the advantage of the hardware accelerators for both the RGB to YUV and DCT phases, and the overhead of the reconfiguration phases. Since the reported results refer to the n-th execution of the JPEG encoder, Table 2 shows the execution values for both reconfigurations (DCT → RGB to YUV and RGB to YUV → DCT).

Power. We collected the profiles of the current absorption for each architecture. We used an in-house measurement apparatus consisting of a 1 Ω resistor inserted in the power supply path of the board. We made sure that most of the board peripherals were deactivated and that the board was isolated from the environment, apart from the power supply. We collected 10 measurements for each current profile to exclude disturbances and noise from our considerations, and we averaged them to reduce the effects of time-uncorrelated noise. The power consumption of the board is 945 mA (1.14 W) with the FPGA idle.
Once our system is booted, it sinks an additional 135 mA. Figure 3 shows the current absorption for the whole JPEG execution. The main phases of the algorithm can be easily distinguished: File Reading, DCT, Entropy Coding/File Saving. Other phases cannot be easily noticed, since they are too short. In the File Reading phase, the power consumed by the accesses to the Flash (6 chunks of 10 KB) is clearly visible. The only noticeable HW phase is the DCT, while the RGB to YUV cannot be observed. During the DCT the processor is stopped by blocking reads, and the current profile is quite flat, since the memory system is not directly accessed. The only data movement is that of the data blocks from the DRAM (or the DCache) to the DCT core via FSL, but this happens only 300 times in 4 seconds (the related spikes are filtered by the board capacitors). The last phase is much more memory intensive, involving both the data and instruction caches, and many small (byte) writes to the Flash. This makes the current profile much more jagged.

The reconfiguration phases occur just before the DCT and soon after it. In Figure 4, we repeated each reconfiguration 10 times to reveal their current consumption, which is 1035 mA on average. The energy cost of the reconfiguration is therefore 90 mA × 0.13 s (excluding the baseline board consumption), i.e., 0.46% of the energy consumption of the whole JPEG run. Notice that the bitstreams have been cached in the DRAM before the execution.

6 Conclusions

In this paper we presented a self-reconfigurable implementation of the JPEG encoder. This consists of a mixed HW/SW architecture which alternates two computation-intensive hardware cores on the FPGA. We have shown that, from the area/performance/power point of view, this architecture represents the best trade-off compared with a fully software implementation or a standard hardware-accelerated version.
Although some issues regarding dynamic reconfiguration still exist, we think that our work represents a further step towards a better understanding of the potential of this technology. In particular, this paper contributes an analysis of a HW/SW implementation of the JPEG encoder, exemplifying the design of a reconfigurable multimedia application and showing the viability of this approach.

Figure 3. Current profile for the Reconfigurable architecture (Current [mA] vs. Time [s])

Figure 4. Current profile for the reconfiguration phases, iterated 10 times each (Current [mA] vs. Time [s])

References

[1] S. Banerjee, E. Bozorgzadeh, and N. D. Dutt. Integrating physical constraints in HW-SW partitioning for architectures with partial dynamic reconfiguration. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 14(11).
[2] J. Burns, A. Donlin, J. Hogg, S. Singh, and M. D. Wit. A dynamic reconfiguration run-time system. In FCCM '97: 5th IEEE Symposium on FPGA-Based Custom Computing Machines, page 66.
[3] K. Compton and S. Hauck. Reconfigurable computing: a survey of systems and software. ACM Comput. Surv., 34(2).
[4] A. Donato, F. Ferrandi, M. Redaelli, M. D. Santambrogio, and D. Sciuto. Caronte: A complete methodology for the implementation of partially dynamically self-reconfiguring systems on FPGA platforms. In FCCM '05: 13th IEEE Symposium on FPGA-Based Custom Computing Machines.
[5] S. Hauck, T. W. Fry, M. M. Hosler, and J. P. Kao. The Chimaera reconfigurable functional unit. In FCCM '97: 5th IEEE Symposium on FPGA-Based Custom Computing Machines, page 87.
[6] M. Hubner, C. Schuck, M. Kuhnle, and J. Becker. New 2-dimensional partial dynamic reconfiguration techniques for real-time adaptive microelectronic circuits. In ISVLSI '06: Annual Symposium on Emerging VLSI Technologies and Architectures, page 97.
[7] P. Lysaght, B. Blodget, J. M. J. Young, and B. Bridgford. Enhanced architectures, design methodologies and CAD tools for dynamic reconfiguration of Xilinx FPGAs. In FPL '06: International Conference on Field Programmable Logic and Applications.
[8] N. Narasimhan, V. Srinivasan, M. Vootukuru, J. Walrath, S. Govindarajan, and R. Vemuri. Rapid prototyping of reconfigurable coprocessors. In ASAP '96: International Conference on Application Specific Systems, Architectures and Processors, Aug.
[9] K. Paulsson, M. Hubner, M. Jung, and J. Becker. Methods for run-time failure recognition and recovery in dynamic and partial reconfigurable systems based on Xilinx Virtex-II Pro FPGAs. In ISVLSI '06: Annual Symposium on Emerging VLSI Technologies and Architectures, page 159.
[10] P. Sedcole, B. Blodget, J. Anderson, P. Lysaght, and T. Becker. Modular partial reconfigurable in Virtex FPGAs. In FPL '05: International Conference on Field Programmable Logic and Applications.
[11] S. L. Shee, A. Erdos, and S. Parameswaran. Heterogeneous multiprocessor implementations for JPEG: a case study. In CODES+ISSS '06: 4th International Conference on Hardware/Software Codesign and System Synthesis.
[12] T. Suri. Improving instruction level parallelism through reconfigurable units in superscalar processors. In RAAW '06: Reconfigurable and Adaptive Architecture Workshop.
[13] A. Tumeo, M. Monchiero, G. Palermo, F. Ferrandi, and D. Sciuto. A Pipelined Fast 2D-DCT Accelerator for FPGA-based SoCs. In ISVLSI '07: Annual Symposium on Emerging VLSI Technologies and Architectures.
[14] F. Vahid. The softening of hardware. Computer, 36(4):27-34.
[15] S. Vassiliadis, S. Wong, G. Gaydadjiev, K. Bertels, G. Kuzmanov, and E. Panainte. The MOLEN polymorphic processor. IEEE Transactions on Computers, 53(11), Nov.
