High-Throughput Low-Energy Content-Addressable Memory Based on Self-Timed Overlapped Search Mechanism

18th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), May 7-9, 2012, Copenhagen

High-Throughput Low-Energy Content-Addressable Memory Based on Self-Timed Overlapped Search Mechanism

Dr. Naoya Onizawa, Postdoctoral Fellow, McGill University
naoya.onizawa@mail.mcgill.ca
Collaborators: S. Matsunaga, V. Gaudet, and T. Hanyu

Acknowledgements: This research was supported by the Japan Science and Technology Agency (JST), "Development of Dependable Network-on-Chip Platform" in Core Research for Evolutional Science and Technology (CREST).

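Content-Addressable Memory (CAM)
- Associative memory
  - Parallel searching
- Applications
  - Cache, virus checking
  - Packet forwarding (40 Gbps, 100 Gbps)
[Figure: a search word is compared against all stored entries (match / mismatch) and the location (address) of the matching entry is output. Array dimensions: number of entries by word length.]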

Hardware-Implementation Issue
Speed restriction
- Packet length: 32 bits (IPv4); 128 or 144 bits (IPv6)
- A 144-bit search word -> large matching delay
Pipelined approach (K. Pagiamtzis et al., JSSC '04, vol. 39, no. 9)
- -> Large area and power dissipation
Goal: high-throughput, low-overhead CAM

Concept
- Operate at a speed comparable to a small CAM
- Hide the large matching delay to improve throughput
- In most cases, search words mismatch
[Figure: a search gets high-speed access to a small CAM; only if it matches does the slow access to the large CAM follow. Timeline for Word 1, Word 2, and Word 3: the large delays overlap, leaving a small matching delay time.]

Approach
- Assign search words to unused blocks
  - After the first few bits are searched, most blocks have mismatched.
  - -> If unused blocks are found, a new word can be searched without waiting for the current search to complete.
- Pre-computation block: finds unused blocks before a search word is sent
- Partitioning: new words are searched after only a few bits have been searched
5.57x higher throughput at 8% cost of area
[Figure: partitioned word blocks labeled Unused, In use, Unused, Unused.]

Table of Contents
- Introduction to content-addressable memory
- Overlapped search mechanism
  - Word overlapped search
  - Phase overlapped processing
- Hardware implementation
- Evaluation
- Conclusion and future prospect

CAM
Operation
1. Search all words in parallel
2. Find a matched word block
3. Output a matched location (address)
Word-parallel search in a single cycle
[Figure: CAM block diagram. The input controller drives the n-bit search word (128-144 bits for IPv6) onto the search lines; the word blocks hold stored word 0 to stored word w-1; the match lines feed an encoder that outputs the match location (log2 w bits).]

CAM Word Block
- Series of CAM cells (NAND-type CAM)
- Throughput is determined by word length in a conventional CAM
Long word length degrades throughput
[Figure: an n-bit search word is compared with an n-bit stored word through a series of CAM cells followed by a sense amplifier. Conditions shown: match, mismatch.]

CAM Characteristics
The matching probability of a word block after a k-bit search is

$$p_{\mathrm{matched}} = \left(\frac{1}{2}\right)^{k}$$

- Most word blocks are not used after the k-bit search (k = 8 in the hardware implementation)
- -> Use unused blocks to improve throughput
Most word blocks are unused (mismatched).
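
As a quick numerical illustration of the probability above, the following sketch evaluates it for k = 8 and a 256-word array (the array size used later in the evaluation). The function names are hypothetical, and the model assumes independent, uniformly random bits, which is what the (1/2)^k expression implies.

```python
def match_probability(k: int) -> float:
    """Probability that a word block still matches after a k-bit search.

    Assumes independent, uniformly random bits (an assumption of this sketch).
    """
    return 0.5 ** k


def expected_unused_blocks(num_words: int, k: int) -> float:
    """Expected number of word blocks that have mismatched after k bits."""
    return num_words * (1.0 - match_probability(k))


p = match_probability(8)
unused = expected_unused_blocks(256, 8)
print(f"p_matched = {p}")
print(f"expected unused blocks out of 256 = {unused}")
```

Under these assumptions almost every block is free after the first stage, which is the observation the overlapped search relies on.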

Word Overlapped Search (WOS)
CAM architecture based on the segmentation method
1. Partition each word block with a segmentation block into: a) a small k-bit block and b) a large (n-k)-bit block
2. The segmentation block stores its k-bit matched result
3. The latter block operates when the first block matches
Word block partitioned by segmentation block
[Figure: CAM block (self-timed circuit) with k-bit 1st-stage sub-word blocks (ML1_0 to ML1_w-1), segmentation circuits (MLS1_0 to MLS1_w-1), and (n-k)-bit 2nd-stage sub-word blocks (ML2_0, ML2_1, ...); the input controller (synchronous circuit) drives SL1 and SL2.]
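
The two-stage partitioning can be sketched behaviourally as follows. This is only a software model under stated assumptions, not the circuit: words are represented as bit strings, and the function name `segmented_search` is hypothetical.

```python
def segmented_search(stored_words, search_word, k):
    """Behavioural model of a two-stage (segmented) CAM search.

    Stage 1 compares only the first k bits of every stored word.
    Stage 2 compares the remaining bits, but only in blocks whose
    first stage matched. Returns (matched_index_or_None, stage1_survivors).
    """
    first = search_word[:k]
    rest = search_word[k:]
    # Stage 1: blocks that are still in use after the k-bit search.
    survivors = [i for i, w in enumerate(stored_words) if w[:k] == first]
    # Stage 2: only the surviving blocks evaluate the latter (n-k) bits.
    for i in survivors:
        if stored_words[i][k:] == rest:
            return i, survivors
    return None, survivors


stored = [
    "0000000011110000",
    "0000000100001111",
    "0000000011111111",
]
index, in_use = segmented_search(stored, "0000000011111111", k=8)
print(index, in_use)
```

Every block not listed in `in_use` is free after the first stage and could accept a new search word.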

WOS operation
1. First k-bit sub-word search (search sub-word 1)
2. Some sub-word block matches
The input connects to all word blocks

WOS operation
- The block that matched search sub-word 1 goes on to match the latter bits
- Search sub-word 2 is assigned to an unused block
After the k-bit search, a new search starts

WOS operation
- The block matched earlier still operates
- Latter bits matching (word 2)
- Search sub-word 3 is assigned to an unused block
- A new search should not affect the old matched block
How to assign search words to unused blocks?

Categorize Word Blocks
- Group A: blocks with the same stored k-bit data (00000000, 00000000)
- Group B: 00000001
Categorize based on the first k bits of the stored word

Pre-Computation
- Compare m consecutive k-bit search words
  - If they are different, they are in different groups (Category 1: fast mode)
  - Otherwise, they are in the same group (Category 2: slow mode)
Categorize search words using a comparator
[Figure: input controller (m = 1). The search data feeds search line registers for SL1 (k bits) and SL2 (n-k bits); a comparator and a mode controller generate the enable, ctrl, and mode signals.]
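
A minimal software sketch of this categorization is given below. It assumes one particular reading of the slide: the first k bits of each incoming search word are compared with those of the previous m search words. The function name `classify_modes` and the bit-string representation are hypothetical.

```python
from collections import deque


def classify_modes(search_words, k, m=1):
    """Label each search word 'fast' or 'slow'.

    A word is 'slow' when its first k bits equal the first k bits of any of
    the previous m search words (same group, so it must wait); otherwise it
    is 'fast' (different group, so it can go to an unused block).
    """
    recent = deque(maxlen=m)  # k-bit prefixes of the last m search words
    modes = []
    for word in search_words:
        prefix = word[:k]
        modes.append("slow" if prefix in recent else "fast")
        recent.append(prefix)
    return modes


words = [
    "0000000011111111",
    "0000000100000000",
    "0000000111111111",  # same first 8 bits as the previous word
    "0000000000000000",
]
print(classify_modes(words, k=8, m=1))
```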

Timing Diagram (fast mode)
- Send search words based on the short delay T_1st
- Consecutive words are assigned to unused blocks
High-speed searching based on T_1st
[Figure: timing diagram. Ctrl; SL1 carries Data1 to Data4 while SL2 carries Data0 to Data3; ML1_0, MLS1_0, ML2_0 and ML1_1, MLS1_1, ML2_1 show match and hold intervals.]

Timing Diagram (slow mode)
- Two consecutive words use the same word block
- Wait until the current search is complete
[Figure: timing diagram. Ctrl; SL1 carries Data1 to Data3 while SL2 carries Data0 to Data2; the mode signal switches between slow and fast; ML1_0, MLS1_0, and ML2_0 show match and hold intervals.]

Average Search Delay
- Category 1 (fast mode): send search words based on the first k-bit delay (T_1st)
- Category 2 (slow mode): send a new word after the current n-bit search is complete (T_slow)

$$T_{SA} = T_{1st}\left(1 - m\left(\frac{1}{2}\right)^{k}\right) + T_{slow}\, m\left(\frac{1}{2}\right)^{k}$$
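
The weighted average above can be evaluated numerically as in the sketch below. T_1st = 259 ps, k = 8, and m = 1 are taken from other slides in this talk; the slides do not give T_slow, so the 727 ps used here is purely a placeholder assumption (half of the 1454 ps conventional cycle delay). The function name is hypothetical.

```python
def average_search_delay(t_1st: float, t_slow: float, k: int, m: int = 1) -> float:
    """Average search delay: fast-mode delay weighted by (1 - m/2^k),
    slow-mode delay weighted by m/2^k."""
    p_slow = m * 0.5 ** k  # probability of falling into slow mode
    if not 0.0 <= p_slow <= 1.0:
        raise ValueError("m * (1/2)^k must lie in [0, 1]")
    return t_1st * (1.0 - p_slow) + t_slow * p_slow


T_1ST_PS = 259.0   # from the simulated waveforms slide
T_SLOW_PS = 727.0  # placeholder assumption, not given in the slides
t_sa = average_search_delay(T_1ST_PS, T_SLOW_PS, k=8, m=1)
print(f"T_SA = {t_sa:.2f} ps")
```

Because the slow-mode weight is only 1/256 for k = 8 and m = 1, the average stays close to T_1st for any T_slow of comparable magnitude.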

Word Circuit (precharge)
NAND-type word circuit
- Dynamic logic
- Series of pass transistors
NAND-type CAM cell
- Match: ON; mismatch: OFF
Charge capacitance on match line
[Figure: NAND-type word circuit with CAM cells (C) in series along the match line ML, and a NAND-type CAM cell with WL, SL, and BL signals.]

Word Circuit (evaluate)
- Match operation: the capacitance on the match line is discharged -> output goes high
- Mismatch operation: no discharge -> output remains low
Match line remains high in mismatched case
[Figure: match and mismatch examples showing the search word bits, node levels (high/low), and the discharge path.]

Synchronous Control (conventional)
- All word circuits are controlled by a global clock signal (Clk)
- 2 phases (precharge and evaluate) are required for every search
[Figure: search data and Clk alternating between evaluate and precharge phases for match lines ML_0, ML_1, and ML_2.]

Phase Overlapped Processing (POP)
Each circuit is independently controlled using local control signals (Lclk_0, Lclk_1, Lclk_2)
- Matched word circuit: moves on to the precharge phase
- Mismatched word circuit: stays in the evaluate phase
- -> Lowers the switching activity of the pre-charging signals
Mismatched blocks can always process a new word
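
To illustrate why local control lowers precharge switching activity, the following sketch counts precharge events in a simplified model: under global synchronous control every word circuit is precharged on every search, whereas under POP only circuits that matched (and therefore discharged) are precharged. The model and the name `precharge_events` are assumptions made for illustration, not a description of the actual control circuit.

```python
def precharge_events(match_matrix):
    """Count precharge events under two control schemes.

    match_matrix[s][i] is True if word circuit i matched during search s.
    Returns (synchronous_count, pop_count).
    """
    # Global clock: all circuits precharge before every search.
    synchronous = sum(len(row) for row in match_matrix)
    # Local control: only matched circuits move on to the precharge phase.
    pop = sum(1 for row in match_matrix for hit in row if hit)
    return synchronous, pop


searches = [
    [True, False, False],
    [False, False, False],
    [False, True, False],
]
print(precharge_events(searches))
```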

WOS based POP
- The input can be changed at every phase
- Inputs are assigned to unused blocks (WOS scheme)
- An unused block can process a word without waiting for the precharge phase
Searching a word requires just 1 phase
[Figure: local control signals Lctrl_0, Lctrl_1, and Lctrl_2 with evaluate and local precharge phases.]

Throughput Ratio
Conventional:

$$T_{CS} = 2\,T_{SS} = 2\,(T_{reg} + T_{1st} + T_{2nd})$$

Proposed:

$$T_{CA} = T_{SA} = T_{1st}\left(1 - m\left(\frac{1}{2}\right)^{k}\right) + T_{slow}\, m\left(\frac{1}{2}\right)^{k} \approx T_{1st}$$

Throughput ratio:

$$\frac{T_{CS}}{T_{CA}} = \frac{2\,(T_{reg} + T_{1st} + T_{2nd})}{T_{1st}}$$

T_SS: synchronous search delay (evaluate phase)
T_SA: asynchronous search delay
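
The throughput ratio on this slide uses T_1st alone as the proposed cycle time. A short worked step, using k = 8 and m = 1 as stated elsewhere in the talk, shows why the slow-mode term can be neglected; the proviso about the size of T_slow is an assumption added here.

```latex
\begin{align*}
m\left(\tfrac{1}{2}\right)^{k} &= 1 \cdot \left(\tfrac{1}{2}\right)^{8} = \tfrac{1}{256} \approx 0.0039,\\
T_{SA} &= \tfrac{255}{256}\,T_{1st} + \tfrac{1}{256}\,T_{slow} \approx T_{1st}
  \quad \text{(provided } T_{slow} \text{ is not orders of magnitude larger than } T_{1st}\text{)},\\
\frac{T_{CS}}{T_{CA}} &\approx \frac{2\,(T_{reg} + T_{1st} + T_{2nd})}{T_{1st}}.
\end{align*}
```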

Circuit Implementation
- 144-bit CAM word block with self-precharge circuit
- The self-precharge circuit pre-charges after the 2nd stage is complete.
- Hierarchical 2nd-stage block (17 local match circuits and 1 global match circuit)
- The local match circuit stores the local matched result, which isn't affected by input changes
- The self-precharge circuit controls its own word circuit
[Figure: 8-bit 1st-stage sub-word circuit built from NAND-type cells with a segment circuit (ML1_0); local match circuits (8 cells each, with a weak feedback transistor) producing LML0_0 to LML16_0; self-precharge circuit generating prec; global match circuit producing ML2_0.]

Simulated Waveforms
- The CAM operates based on T_1st (259 ps); T_CA = 261 ps
- The global match circuit uses only the local matched results.
- After a search is complete, self pre-charging is done locally.
HSPICE simulation under a 90 nm CMOS technology
[Figure: simulated voltage waveforms (0 to 1.0 V) of Ctrl, ML1_0, LML1_0, ML2_0, prec_0, ML1_1, LML1_1, ML2_1, and prec_1 versus time (1.75 to 3.25 ns).]

Performance Comparison
256-word 144-bit binary CAM @ 90 nm CMOS

                              Conventional   Proposed
Cycle delay [ps]              1454           261
Energy [fJ/bit/search]
  Match                       0.0003         0.0006
  Search                      0.160          0.160
  Control                     0.103          0.001
  Total                       0.263          0.162
Area [Trs.]                   372K           408K

- Independent control reduces the switching activity of pre-charging
- 5.57x throughput and 38% energy saving
[Figure: bar chart of average cycle time [ns] (0 to 1.5), split into precharge and evaluate, for Synchronous, WOS, and WOS + POP; bar annotations: -64.1%, -82.0%, -100%.]
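
The headline figures can be reproduced from the cycle delay and total energy values on this slide. The sketch below does only that arithmetic; the function names are hypothetical.

```python
def throughput_ratio(conventional_cycle_ps: float, proposed_cycle_ps: float) -> float:
    """Throughput gain, taken as the ratio of cycle delays."""
    return conventional_cycle_ps / proposed_cycle_ps


def energy_saving(conventional_fj: float, proposed_fj: float) -> float:
    """Fractional energy saving relative to the conventional design."""
    return 1.0 - proposed_fj / conventional_fj


ratio = throughput_ratio(1454.0, 261.0)   # cycle delays in ps
saving = energy_saving(0.263, 0.162)      # total energy in fJ/bit/search
print(f"throughput ratio = {ratio:.2f}x")
print(f"energy saving = {saving * 100:.0f}%")
```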

Performance Comparison
Better energy-delay product
[Figure: cycle time [ns] (0 to 2.5) versus energy dissipation [fJ/bit/search] (0 to 0.35). Labels: V_DD = 0.6 V, 0.7 V, 0.8 V, 0.9 V, 1.0 V; Conventional; 65 nm (JSSC vol. 46, no. 2); Hybrid (100 nm) (JSSC vol. 40, no. 1).]

Conclusion
High-throughput low-energy CAM
- Word overlapped search
  - Use unused word blocks
  - Assign based on pre-computation
- Phase overlapped search
  - Independent control of each word block
  - Search without waiting for precharge
-> 5.57x throughput, 38% energy saving, 8% cost of area

Future Prospects
- Circuit design considerations
  - Number of partitions
  - Timing robustness
- Extend to ternary CAM (TCAM)
  - Redesign input controller
- Application-specific design
  - Cache (TLB), virus checker