Flexibility, Speed and Accuracy in VLIW Architectures Simulation and Modeling


IVANO BARBIERI, MASSIMO BARIANI, ALBERTO CABITTO, MARCO RAGGIO
Department of Biophysical and Electronic Engineering
University of Genoa
Via Opera Pia 11A
ITALY

Abstract: - This document discusses Instruction Set Architecture (ISA) simulation and shows the typical tradeoffs between flexibility, speed and accuracy. Based on hypotheses about the architecture approach (VLIW) and the applications of interest (DSP and multimedia), the article presents a solution that represents a challenging compromise in ISA simulation. A fast, accurate and flexible ISA simulation environment has been implemented using a Simulation Cache, an innovative three-dimensional pipeline status model, and a simulation-oriented hardware description. Results on two architecture case studies, the TI TMS320C62x and the ST200, are reported to validate the described approach.

Key Words: VLIW, ILP, Hw-Sw Co-design, DSP, Multimedia, Development Tools, Instruction Simulation, Simulation Speed, Simulation Accuracy.

1 Introduction

The growing computational power requirements of DSP and multimedia applications, together with the need for easy-to-program development environments, have driven recent programmable devices toward VLIW (Very Long Instruction Word) [1] architectures and Hw-Sw co-design environments [2]. VLIW architectures allow optimized machine code to be generated from high-level languages by exploiting Instruction Level Parallelism (ILP) [3]. Furthermore, application requirements and time-to-market constraints are growing enormously, transferring functionality from hardware to software implementation [4] and moving developers toward System on Chip programmable devices. Application-driven System on Chip design [4] seems to be the answer to the complex requirements of DSP (multimedia, telecommunication) applications.
2 VLIW Architectures and DSP Applications

In the previous chapter we introduced ILP to describe how VLIW architectures optimize code execution. ILP can be considered from two points of view: the ILP available in a region of code and the ILP achievable on a given architecture (Hw-ILP) [5]. The software developer tries to write code whose ILP is as close as possible to the Hw-ILP. Conversely, the architecture designer analyzes typical application code in order to match its ILP with the Hw-ILP during the hardware design process. Hw-ILP solutions are:

- Multiple functional units executing at the same time
- Multiple copies of functional units accessing different register files
- Pipelining for functional units with latency longer than one cycle

Multimedia and, in general, digital signal processing applications typically have large available ILP [2], specific data access requirements, repetitive numeric calculation, numeric fidelity, high memory bandwidth and real-time processing requirements [6]. A number of general-purpose processors are suitable for DSP tasks. Nevertheless, DSP processors outperform general-purpose processors in cost-performance ratio and power consumption [7] [8]. One of the purposes of architecture-exploration tools is to allow designers to find the best match between available and achievable ILP through the interaction between code-development tools and architecture-parameter tuning tools (Application-driven Architectures). Examples of this interaction are Instruction Set or Long Instruction Issue modifications [9].
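The gap between available ILP and Hw-ILP can be made concrete with a small sketch. It is illustrative only and rests on assumptions that are not made in this paper: available ILP is taken as the number of operations divided by the length of the longest dependency chain, every operation has unit latency, and the hardware is reduced to a single hypothetical parameter, the issue width.

```python
from typing import Dict, List

# Hypothetical code region: operation name -> operations it depends on.
REGION: Dict[str, List[str]] = {
    "a": [], "b": [], "c": [], "d": [],
    "e": ["a", "b"], "f": ["c", "d"],
    "g": ["e", "f"],
}

def critical_path(deps: Dict[str, List[str]]) -> int:
    """Length of the longest dependency chain, assuming unit latency."""
    depth: Dict[str, int] = {}

    def visit(op: str) -> int:
        if op not in depth:
            depth[op] = 1 + max((visit(d) for d in deps[op]), default=0)
        return depth[op]

    return max(visit(op) for op in deps)

def schedule_cycles(deps: Dict[str, List[str]], width: int) -> int:
    """Cycles needed when at most `width` ready operations issue per cycle."""
    done_cycle: Dict[str, int] = {}
    remaining = list(deps)
    cycle = 0
    while remaining:
        cycle += 1
        # An operation is ready once all its inputs finished in an earlier cycle.
        ready = [op for op in remaining
                 if all(d in done_cycle and done_cycle[d] < cycle for d in deps[op])]
        issued = ready[:width]
        if not issued:
            raise ValueError("dependency cycle in region")
        for op in issued:
            done_cycle[op] = cycle
        remaining = [op for op in remaining if op not in done_cycle]
    return cycle

available_ilp = len(REGION) / critical_path(REGION)
achieved_ilp_width2 = len(REGION) / schedule_cycles(REGION, width=2)
achieved_ilp_width4 = len(REGION) / schedule_cycles(REGION, width=4)
```

In this toy region a width of 4 reaches the available ILP, while a width of 2 does not, which is the kind of mismatch an architecture-exploration tool is meant to expose.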

The approach described in chapter 4 has been used to simulate state-of-the-art VLIW architectures [10][11][12].

3 Instruction Simulation Tradeoffs: Speed, Accuracy and Flexibility

Instruction simulators are nowadays widely used in application-driven architecture design. The architecture design process is strongly influenced by the tradeoff between simulation speed and accuracy: cycle-accurate or close to cycle-accurate simulators usually have low performance [13] [14]. Better simulation performance can be obtained through vertical software optimization for a given target architecture, but in that case the tradeoff to take into account is between speed and flexibility. An interpretive, re-configurable Instruction Set Simulator (ISS) could efficiently model a System on Chip (SoC) and the run-time interactions between heterogeneous SoC parts (core, co-processors, etc.). The VLIW-SIM ISS is intended to be a software application design support and architecture exploration tool.

4 VLIW-SIM: An Innovative Simulation Approach

The VLIW-SIM ISA simulation environment is composed of a set of modules implementing pipeline, memory, register file, I/D cache, instruction and system I/O modeling. The implementation requirements were an interpretive simulation approach, efficient host memory allocation, instruction set flexibility, step-by-step pipeline status tracking, simulation speed and accuracy. Further details on the VLIW-SIM environment can be found in [9] [15]. The following chapters (4.1 to 4.4) describe the main VLIW-SIM modeling approaches.

4.1 Pipeline Modeling

The pipeline status is represented in the simulation as a three-dimensional space (phase, operation, time) (Figure 1). The Phase axis represents the pipeline phase. The Operation axis represents the position of the instruction in the Long Instruction Word. The Time axis represents the time stamp.
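The three axes can be pictured as a sequence of two-dimensional snapshots, one per time stamp, each indexed by phase and operation slot. The following is a minimal sketch under stated assumptions, not the VLIW-SIM implementation: the phase labels F, D, R, W are borrowed from Figure 1, the slot count `WIDTH` is hypothetical, an operation is reduced to a mnemonic string, and `advance` simply shifts every row to the next phase without doing any per-phase processing.

```python
from typing import List, Optional

PHASES = ["F", "D", "R", "W"]  # phase labels borrowed from Figure 1
WIDTH = 4                      # hypothetical number of operation slots per word

Slot = Optional[str]           # an operation is reduced to its mnemonic here
Snapshot = List[List[Slot]]    # indexed as [phase][operation]

def empty_snapshot() -> Snapshot:
    return [[None] * WIDTH for _ in PHASES]

def advance(current: Snapshot, fetched: List[Slot]) -> Snapshot:
    """Build the snapshot for the next time stamp from the current one."""
    following = empty_snapshot()
    following[0] = list(fetched)             # first row: last-fetched long instruction word
    for p in range(len(PHASES) - 1):
        following[p + 1] = list(current[p])  # a real simulator would process each operation here
    return following

# The time axis is the sequence of snapshots.
timeline: List[Snapshot] = [empty_snapshot()]
for word in (["add", "ld", None, "mul"], ["sub", None, None, "br"]):
    timeline.append(advance(timeline[-1], word))

# Status element at (phase D, operation slot 3, time stamp 2).
element = timeline[2][PHASES.index("D")][3]
```

Keeping only two snapshots at a time (current and following) is enough to simulate; the full `timeline` list is kept here only to make the time axis visible.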
[Figure 1 - Pipeline 3D Status for ST200. Axes: Phase (F, D, R, W), Operation (op1 to op8), Time (t-1, t, t+1).]

The element of the pipeline status is the internal representation of an operation. The simulation process is based on two-dimensional arrays representing the pipeline status at a given time stamp. Status evolution takes place using two pipeline status arrays representing the current and the following status (Figure 2). The first row of the array contains the last-fetched Long Instruction Word. The pipeline status progression is based on the following algorithm: the instructions in a given pipeline phase of the current pipeline status are processed and transferred to the next pipeline phase of the following pipeline status. How the instructions are processed depends on the instruction type and on the phase they are in. After all the phases have been updated, the following pipeline status becomes the new current status.

[Figure 2 - Pipeline Status Arrays for TI C62x. Two arrays, Current Status and Following Status, with one row per pipeline phase (labels shown: PG, PS, PW, PR, DP, DC, E3, E4, E5) and slots n1 to n8 and m1 to m8; fetch packet n is loaded into the first row.]

4.2 Simulation Cache

Performance tests and code profiling have been performed on the simulator. The code profiling has shown that a major overhead is instruction simulation, and in particular that decode and dispatch are the two most frequently executed modules. Moreover, typical VLIW applications repeatedly execute small pieces of code. This feature, together with VLIW static scheduling, allows the concept of Simulation Locality to be introduced: large parts of the code are simulated iteratively, so some of the simulation internal data (e.g. decoded and dispatched instructions) can be re-used, saving simulation time. This idea results in the implementation of a Simulation Cache: a fast-access memory based on spatial and temporal locality that contains the last N fetch packets together with their already-processed simulation data. The cache management algorithm detects hits and misses on cached fetch packets and replaces dismissible packets following a common block replacement algorithm. The simulation cache size and the cache block size are critical parameters for simulator performance. To significantly improve the simulated instructions per second (sips) rate, the simulation cache must be large enough to hold the code of critical loops. Measurements on a wide set of multimedia applications have been performed to verify locality and to find the best-fitting cache and block (cache line) sizes. The VLIW application locality hypothesis and the measurements on typical multimedia applications (Figure 3) lead us to expect significant improvements from introducing the simulation cache in the simulation environment.

[Figure 3 - H.263+ Code Temporal and Spatial Locality. Series: tot acc, temp loc; horizontal axis: code addresses 422200 to 423300; vertical axis: 0 to 1200000.]

4.3 Simulation-Oriented Hardware Description

In order to implement a set of tools capable of simulating a generic VLIW processor, the simulation environment should process a hardware description as input. The main purpose of this description is simulation accuracy; it should therefore focus only on those hardware aspects that are relevant to simulation: Decode Architecture (instruction decoding masks, position of the instruction fields in the codeword, field size and meaning), Branch Architecture, Pipeline phase description (duration, size, mnemonic phase name) and VLIW Parameterization.
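The Simulation Cache of chapter 4.2 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the VLIW-SIM code: the cache is keyed by fetch packet address, the "common block replacement algorithm" is taken to be LRU, and `fake_decode` is a hypothetical stand-in for the expensive decode and dispatch work. The addresses are illustrative values in the range shown in Figure 3.

```python
from collections import OrderedDict
from typing import Callable

class SimulationCache:
    """Keeps the decoded form of the last N fetch packets (LRU replacement assumed)."""

    def __init__(self, capacity: int, decode: Callable[[int], str]) -> None:
        self.capacity = capacity
        self.decode = decode
        self.entries: "OrderedDict[int, str]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, address: int) -> str:
        if address in self.entries:
            self.hits += 1
            self.entries.move_to_end(address)   # mark as most recently used
            return self.entries[address]
        self.misses += 1
        decoded = self.decode(address)          # the costly step the cache avoids on a hit
        self.entries[address] = decoded
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict the least recently used packet
        return decoded

def fake_decode(address: int) -> str:
    # Hypothetical placeholder for decode and dispatch of one fetch packet.
    return f"decoded@{address:#x}"

# A loop of three fetch packets, simulated 100 times.
loop = [0x422200, 0x422220, 0x422240]

large = SimulationCache(capacity=4, decode=fake_decode)
small = SimulationCache(capacity=2, decode=fake_decode)
for _ in range(100):
    for addr in loop:
        large.lookup(addr)
        small.lookup(addr)
```

With room for the whole loop, `large` misses only on the first iteration. `small` cannot hold the loop body and, under LRU, misses on every access, which illustrates why the cache size should allow critical loop code placement.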
Decode Architecture: From the description, a set of C-like macros is automatically created for each Instruction Mask in order to extract the instruction fields. Moreover, each macro has to be associated with the proper entry in the Instruction Field Table (IFT). The IFT is a table containing all the possible meanings (operand, operation code, destination, flag, etc.) of an instruction field in a VLIW instruction, independently of its position or size in the mask.

Branch Architecture: The description supplies the pipeline phases in which the branch condition is evaluated and the target address is computed.

Pipeline Description: Number of pipeline phases and functions per phase (Fetch phases, Decode phases, Execute phases).

I/D Cache Parameterization: Cache size, line (block) size, number of ways (1 is direct mapping), Write on Fetch option, block replacement algorithm.

VLIW Parameterization: Long Instruction Word size, Register File size and organization (number of register banks and bank size), Control Register File size and organization.

4.4 Instruction Set Dynamic Generation

Instruction Set flexibility is a major feature for Application-Driven Architecture exploration and architectural design evaluation [5][16]. The Instruction Set Dynamic Generation tool allows a behavioral description of the Instruction Set to be supplied to the simulator. A language to describe general VLIW instructions has been identified. For each instruction, the mnemonic name, instruction class, type of operand(s), destination, latency, operation code and the expression defining the instruction are supplied. If the instruction uses a control register, the description contains a field specifying the register (or the parts of the register) used. Other characteristics can be specified depending on the instruction class. Instructions are divided into three classes, each identifying a specific instruction type: memory operations, arithmetic-logical operations and branches.
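As an illustration of such a description, the sketch below models a few instruction entries with some of the attributes listed above (mnemonic, class, operation code, latency, expression). It is a minimal sketch, not the VLIW-SIM description language: the entries, opcodes and latencies are hypothetical, the 32-bit wrap-around is an assumption, and Python callables stand in for the expression that defines each instruction.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Tuple

class InstrClass(Enum):
    MEMORY = "memory"
    ALU = "arithmetic-logical"
    BRANCH = "branch"

@dataclass(frozen=True)
class InstrDescription:
    mnemonic: str
    instr_class: InstrClass
    opcode: int                             # hypothetical encoding
    latency: int                            # cycles before the result is available
    expression: Callable[[int, int], int]   # stands in for the behavioral expression

MASK32 = 0xFFFFFFFF  # assumed 32-bit registers

# Hypothetical user-supplied instruction set description.
DESCRIPTIONS: List[InstrDescription] = [
    InstrDescription("add", InstrClass.ALU, 0x01, 1, lambda a, b: (a + b) & MASK32),
    InstrDescription("mpy", InstrClass.ALU, 0x02, 2, lambda a, b: (a * b) & MASK32),
]

def build_instruction_set(descriptions: List[InstrDescription]) -> Dict[int, InstrDescription]:
    """Turn the description into a dispatch table indexed by operation code."""
    table: Dict[int, InstrDescription] = {}
    for d in descriptions:
        if d.opcode in table:
            raise ValueError(f"duplicate opcode {d.opcode:#x} for {d.mnemonic}")
        table[d.opcode] = d
    return table

INSTRUCTION_SET = build_instruction_set(DESCRIPTIONS)

def execute(opcode: int, src1: int, src2: int) -> Tuple[int, int]:
    """Return (result, latency) for one operation."""
    d = INSTRUCTION_SET[opcode]
    return d.expression(src1, src2), d.latency
```

Adding or changing an instruction then only means editing `DESCRIPTIONS`; the simulator core that calls `execute` stays untouched, which is the kind of flexibility the dynamic generation approach aims for.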
The description language allows the user to completely define an instruction's behavior through the expression field. In the expression field, all the relations between operands can be specified in a C-like notation. The user-defined instruction set is taken as input by the Instruction Set Dynamical Generator (ISDG) parser. The parser analyzes the description and produces an intermediate instruction representation, which is used to produce the Instruction Set modules.

5 The Simulation Environment Performance on Two Case Studies

In this chapter, tests on two VLIW target architectures are reported in order to validate the VLIW-SIM simulation environment. The platform used for the tests is a Pentium II 400 MHz with 128 MB RAM running Windows NT 4.0. Two different benchmarks have been used: an H.263+ [17] coder and a G.723.1 [18] [19] encoder and decoder, both implemented in C.

Tests on the H.263+ encoder have been performed with the following parameters:

- Test sequence: foreman.yuv
- Number of encoded frames: 10 (from frame 0 (Intra) to frame 9)
- Video input format: QCIF (176 x 144 pixels)
- Quantization index for P-frames: 10
- Quantization index for I-frames: 10
- Motion estimation search algorithm: Improved Gradient Descent Search
- Motion estimation search window: 15
- Half-pixel motion estimation type: subset half pixels

Tests on the G.723.1 codec have been performed with the following parameters:

- Test sequence: ITU standard
- Number of encoded and decoded frames: 20 (from frame 0 to frame 19)
- Audio input format: 8 kHz, 16 bit
- Output rate: 5.3 kbit/s

5.1 Texas Instruments TMS320C62x

In the following test the target architecture is the TI TMS320C62x [10]. VLIW-SIM performance and accuracy have been compared with the TI state-of-the-art simulator (Texas Instruments Fast Simulator, TIFS). The register files and the memory of the two simulators matched exactly at the end of the tests.
TIFS        341    39,630,083
VLIW-SIM    164    39,630,080

Table 1 - H.263+: encoding 10 QCIF frames

TIFS        241    10,516,855
VLIW-SIM     70    10,516,923

Table 2 - G.723.1 Coding-Decoding 20 frames

5.2 ST-Microelectronics ST200

In the following test the target architecture is the ST Microelectronics ST200 [12]. VLIW-SIM performance and accuracy have been compared with the state-of-the-art ISS simulator (the HP-ST Lx Instruction Set Simulator). The register files and the memory of the two simulators matched exactly at the end of the tests.

ISS          95    30,627,631
VLIW-SIM     97    30,365,220

Table 3 - H.263+ encoding 10 QCIF frames

ISS          40    12,324,010
VLIW-SIM     42    12,311,062

Table 4 - G.723.1 Coding-Decoding 20 frames

5.3 VLIW-SIM performance

The following table summarizes the VLIW-SIM performance in terms of Simulated Instructions per Second (sips) in the described case studies for the selected target architectures.

            VLIW-SIM (ST200)    VLIW-SIM (TI-C62xx)
G.723.1     783,752 sips        580,041 sips
H.263+      724,237 sips        328,731 sips

Table 5 - VLIW-SIM performance in simulated instructions per second
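The pairwise comparisons in Tables 1 to 4 can be reduced to two numbers per table with a few lines of arithmetic. The sketch below copies the table values as they appear; since the column headings are not given, it refers to them only as the first and second value of each row. Reading the first value as a simulation time and the second as a count used to judge accuracy is an assumption, as are the dictionary keys and function names.

```python
from typing import Dict, Tuple

Row = Tuple[int, int]  # (first value, second value) as printed in the tables

# Values copied from Tables 1 to 4; "reference" is TIFS or ISS.
TABLES: Dict[str, Dict[str, Row]] = {
    "Table 1 (C62x, H.263+)":   {"reference": (341, 39_630_083), "vliw_sim": (164, 39_630_080)},
    "Table 2 (C62x, G.723.1)":  {"reference": (241, 10_516_855), "vliw_sim": (70, 10_516_923)},
    "Table 3 (ST200, H.263+)":  {"reference": (95, 30_627_631), "vliw_sim": (97, 30_365_220)},
    "Table 4 (ST200, G.723.1)": {"reference": (40, 12_324_010), "vliw_sim": (42, 12_311_062)},
}

def compare(reference: Row, vliw_sim: Row) -> Tuple[float, int, float]:
    """Ratio of the first values, absolute and relative difference of the second values."""
    first_ratio = reference[0] / vliw_sim[0]
    second_diff = abs(reference[1] - vliw_sim[1])
    second_rel = second_diff / reference[1]
    return first_ratio, second_diff, second_rel

RESULTS = {name: compare(rows["reference"], rows["vliw_sim"]) for name, rows in TABLES.items()}

for name, (ratio, diff, rel) in RESULTS.items():
    print(f"{name}: first-value ratio {ratio:.2f}, "
          f"second-value difference {diff} ({rel:.4%})")
```

A first-value ratio above 1 means the reference simulator's first value is larger than that of VLIW-SIM; the second-value difference stays below one percent of the reference value in all four tables.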

It should be noted that the ST200 ISS is optimised for a single architecture.

6 Conclusions

In this paper, instruction simulation issues for Hardware-Software Co-Design have been discussed. The VLIW-SIM simulation environment has been presented as a solution for VLIW architecture and multimedia application design support tools. Simulation Locality and a simulation-oriented hardware description were introduced to achieve flexibility, speed and accuracy in ISA simulation. Two case studies have been presented to validate the described approach.

Acknowledgments

This research is part of the M 2 EDYA project, in collaboration with ST-Microelectronics and Hewlett Packard.

References

[1] J.A. Fisher, Very long instruction word architectures and the ELI-512, Proceedings of the 10th Annual International Symposium on Computer Architecture, Stockholm, Sweden, June 1983.
[2] V. Bhaskaran, K. Konstantinides, Image and Video Compression Standards: Algorithms and Architecture, Second Edition, Kluwer Academic Publishers, 1998.
[3] B.R. Rau, J.A. Fisher, Instruction Level Parallelism, The Journal of Supercomputing, 7, May 1993.
[4] A. Hoffmann, T. Kogel, A. Nohl, G. Braun, O. Schliebusch, O. Wahlen, A. Wieferink, H. Meyr, A Novel Methodology for the Design of Application-Specific Instruction-Set Processors (ASIPs) Using a Machine Description Language, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 20, N. 11, November 2001.
[5] P. Faraboschi, G. Desoli, J.A. Fisher, The Latest Word in Digital and Media Processing, IEEE Signal Processing Magazine, March 1998.
[6] R.B. Lee, M.D. Smith, Media Processing: A New Design Target, IEEE Micro, August 1996.
[7] P. Lapsley, J. Bier, A. Shoham, E.A. Lee, DSP Processor Fundamentals: Architectures and Features, IEEE Press Series on Signal Processing, 1996.
[8] Berkeley Design Technology Inc, VLIW Architectures for DSP, DSP World/ICSPAT, Orlando, Florida, November 1999.
[9] I. Barbieri, M. Bariani, M. Raggio, C6XSIM: A VLIW Architecture Simulation Innovative Approach, DCIS 99, Palma de Mallorca, Spain, November 1999.
[10] TI, TMS320C62x/C67x CPU and Instruction Set, Reference Guide, 1998.
[11] TI, TMS320C64x Technical Overview, September 2000.
[12] P. Faraboschi, J. Fisher, G. Brown, G. Desoli, F. Homewood, Lx: A Technology Platform for Customizable VLIW Embedded Processing, ISCA, Vancouver, Canada, June 2000.
[13] K. Olukotun, M. Heinrich, D. Ofelt, Digital system simulation: Methodologies and examples, Proc. Design Automation Conf., June 1998, pp. 658-663.
[14] J. Rowson, Hardware/Software co-simulation, Proc. Design Automation Conference, June 1994, pp. 439-440.
[15] I. Barbieri, M. Bariani, M. Raggio, A VLIW architecture simulator innovative approach for HW-SW co-design, ICME 2000 - International Conference on Multimedia and Expo, July 2000, New York City.
[16] R.K. Gupta, G. De Micheli, Hardware-Software cosynthesis for digital systems, IEEE Design and Test of Computers, September 1993.
[17] ITU-T Recommendation H.263, Video coding for low bitrate communication, Feb. 1998.
[18] ITU-T Recommendation G.723.1, Dual rate speech coder for multimedia communication transmitting at 5.3 and 6.3 kbit/s, October 1995.
[19] S.M. Mishra, A. Balaram, Efficient Hardware-Software Co-Design for the G.723.1 Algorithm Targeted At VoIP Applications, ICME, New York City, US, August 2000.