Digital Integrated Circuits (83-313)
Lecture 8: Memory Peripherals
Semester B, 2016-17
Lecturer: Dr. Adam Teman
TAs: Itamar Levi, Robert Giterman
20 May 2017

Disclaimer: This course was prepared, in its entirety, by Adam Teman. Many materials were copied from sources freely available on the internet. When possible, these sources have been cited; however, some references may have been cited incorrectly or overlooked. If you feel that a picture, graph, or code example has been copied from you and either needs to be cited or removed, please feel free to email adam.teman@biu.ac.il and I will address this as soon as possible.
Lecture Content
Memory Peripherals Overview
Memory Architecture
- Memory size: W words of C bits = W x C bits
- Address bus: A bits, so $W = 2^A$
- Number of words in a row: $2^M$
- Multiplexing factor: M
- Number of rows: $2^{A-M}$
- Number of columns: $C \times 2^M$
- Row decoder: $A-M \rightarrow 2^{A-M}$
- Column decoder: $M \rightarrow 2^M$
[Figure: array of storage cells on word lines and bit lines. The row decoder takes ADD_{A-1}:ADD_M, the column decoder takes ADD_{M-1}:ADD_0, and the sense amplifiers/drivers connect to the C-bit input/output.]
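The relations above can be checked with a short sketch. The function name `array_geometry` and the example values (A = 10, C = 32, M = 2) are assumptions chosen for illustration; they do not come from the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayGeometry:
    rows: int
    columns: int
    row_address_bits: int
    column_address_bits: int


def array_geometry(address_bits: int, word_bits: int, mux_bits: int) -> ArrayGeometry:
    """Apply the slide's relations with A=address_bits, C=word_bits, M=mux_bits."""
    if not 0 <= mux_bits <= address_bits:
        raise ValueError("mux_bits must be between 0 and address_bits")
    return ArrayGeometry(
        rows=2 ** (address_bits - mux_bits),      # 2^(A-M)
        columns=word_bits * 2 ** mux_bits,        # C x 2^M
        row_address_bits=address_bits - mux_bits,  # row decoder inputs
        column_address_bits=mux_bits,              # column decoder inputs
    )


# Hypothetical example: 1024 words of 32 bits, 4 words per row.
geometry = array_geometry(address_bits=10, word_bits=32, mux_bits=2)
print(geometry)
```

Whatever M is chosen, rows x columns stays equal to W x C, so M only changes the aspect ratio of the array.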
Memory Timing: Definitions
Major Peripheral Circuits
- Row Decoder
- Column Multiplexer
- Sense Amplifier
- Write Driver
- Precharge Circuit
[Figure: memory array with storage cells, word lines, and bit lines. The row decoder takes A_{W-1}:A_M, the column decoder takes A_{M-1}:A_0, and the sense amplifiers/drivers connect to the C-bit input/output.]
Row Decoder Design
Row Decoders
A decoder reduces the number of select signals from N to $\log_2 N$.
- Number of rows: N
- Number of row address bits: $\log_2 N$
Row Decoders
Standard Decoder Design:
- Each output row is driven by an AND gate with $k = \log_2 N$ inputs.
- Each gate has a unique combination of address inputs (or their inverted values).
- For example, an 8-bit row address has 256 8-input AND gates, such as:
$$WL_0 = \bar{A}_7 \bar{A}_6 \bar{A}_5 \bar{A}_4 \bar{A}_3 \bar{A}_2 \bar{A}_1 \bar{A}_0$$
$$WL_{255} = A_7 A_6 A_5 A_4 A_3 A_2 A_1 A_0$$
NOR Decoder:
- DeMorgan will provide us with a NOR decoder.
- In the previous example, we'll get 256 8-input NOR gates:
$$WL_0 = \overline{A_7 + A_6 + A_5 + A_4 + A_3 + A_2 + A_1 + A_0}$$
$$WL_{255} = \overline{\bar{A}_7 + \bar{A}_6 + \bar{A}_5 + \bar{A}_4 + \bar{A}_3 + \bar{A}_2 + \bar{A}_1 + \bar{A}_0}$$
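As a behavioral illustration of the two forms, the sketch below models both decoders in software and lets you confirm they agree. The function names are hypothetical, and the model treats $A_0$ as the least significant address bit, which is an assumption.

```python
from typing import List


def and_decoder(address: int, address_bits: int = 8) -> List[int]:
    """WL_row is the AND of the address literals (true or complemented) for that row."""
    bits = [(address >> k) & 1 for k in range(address_bits)]
    word_lines = []
    for row in range(2 ** address_bits):
        # Use A_k where the row index has a 1, and the complement of A_k where it has a 0.
        literals = [bits[k] if (row >> k) & 1 else 1 - bits[k] for k in range(address_bits)]
        word_lines.append(int(all(literals)))
    return word_lines


def nor_decoder(address: int, address_bits: int = 8) -> List[int]:
    """De Morgan form: WL_row is the NOR of the complemented literals."""
    bits = [(address >> k) & 1 for k in range(address_bits)]
    word_lines = []
    for row in range(2 ** address_bits):
        inverted = [1 - bits[k] if (row >> k) & 1 else bits[k] for k in range(address_bits)]
        word_lines.append(int(not any(inverted)))
    return word_lines


# Exactly one word line is high, and both forms select the same row.
print(and_decoder(255).index(1), nor_decoder(255).index(1))
```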
How should we build it?
Let's build a row decoder for a 256x256 SRAM array.
- We need 256 8-input AND gates.
- Each gate drives 256 bitcells.
We have various options. Which one is best?
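Reminder: Logical Effort
$$t_{pd,i} = t_{pinv}\,(p_i + EF_i)$$
$$EF_i = LE_i \cdot f_i = LE_i \cdot b_i \cdot \frac{C_{in,i+1}}{C_{in,i}}$$
$$PE = F \cdot \prod LE_i \cdot B = \frac{C_L}{C_{in,1}} \cdot \prod LE_i \cdot \prod b_i$$
$$EF_{opt} = \sqrt[N]{PE} = \sqrt[N]{F \cdot \prod LE_i \cdot \prod b_i}$$
$$N_{opt} = \log_{EF_{opt}} PE = \log_{EF_{opt}} \left( F \cdot \prod LE_i \cdot B \right)$$
$$t_{pd} = \sum_i t_{pinv}\,(p_i + EF_i) = t_{pinv} \left( \sum_i p_i + N \cdot \sqrt[N]{PE} \right)$$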
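A minimal sketch of these formulas follows, with delay expressed in units of $t_{pinv}$. The function names and the three-inverter example are assumptions for illustration only.

```python
import math
from typing import Sequence


def path_effort(electrical_effort: float,
                logical_efforts: Sequence[float],
                branching_efforts: Sequence[float]) -> float:
    """PE = F * product(LE_i) * product(b_i)."""
    return electrical_effort * math.prod(logical_efforts) * math.prod(branching_efforts)


def minimum_path_delay(pe: float, parasitics: Sequence[float]) -> float:
    """Delay in units of t_pinv when every stage carries EF = PE^(1/N)."""
    stages = len(parasitics)
    stage_effort = pe ** (1.0 / stages)
    return sum(parasitics) + stages * stage_effort


# Hypothetical example: three inverters (LE = 1, p = 1) driving a load 64x the input cap.
pe = path_effort(64.0, [1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
print(pe, minimum_path_delay(pe, [1.0, 1.0, 1.0]))
```

With three stages and PE = 64, each stage carries an effort of about 4, so the path delay is roughly 3 + 3 x 4 = 15 inverter delays.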
Problem Setup
For the LE calculation we need to start with:
- Output load ($C_L$)
- Input capacitance ($C_{in}$)
- Branching ($B$)
What is the load capacitance?
- There are 256 bitcells on each word line:
$$C_{WL} = 256\,C_{Cell} + C_{Wire}$$
- Let's ignore the wire for now.
What is the input capacitance?
- Let's assume our address drivers can drive a bit more than a bitcell, so:
$$C_{in,addr\_driver} = 4\,C_{Cell}$$
Problem Setup
What is the branching effort?
Let's take another look at the Boolean expressions:
$$WL_0 = \bar{A}_7 \bar{A}_6 \bar{A}_5 \bar{A}_4 \bar{A}_3 \bar{A}_2 \bar{A}_1 \bar{A}_0$$
$$WL_{255} = A_7 A_6 A_5 A_4 A_3 A_2 A_1 A_0$$
We see that half of the signals use $A_i$ and half use $\bar{A}_i$. So each address driver drives 128 8-input AND gates, but only one is on the selected WL path.
$$C_{on\text{-}path} = C_{nand}; \quad C_{off\text{-}path} = 127\,C_{nand}$$
$$B_{add\_driver} = \frac{C_{on\text{-}path} + C_{off\text{-}path}}{C_{on\text{-}path}} = \frac{C_{nand} + 127\,C_{nand}}{C_{nand}} = 128$$
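Number of Stages
Altogether, the path effort is:
$$PE = \prod LE \cdot B \cdot F = \prod LE_i \cdot \prod b_i \cdot \frac{C_{WL}}{C_{address}} = \prod LE \cdot 128 \cdot \frac{256\,C_{Cell}}{4\,C_{Cell}} = 8\text{k} \cdot \prod LE = 2^{13} \cdot \prod LE$$
The best-case logical effort is $\prod LE = 1$, so the minimum number of stages for optimal delay is:
$$N_{opt} = \log_{3.6} PE = \log_{3.6} 2^{13} \approx 7$$
That's a lot of stages!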
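So which implementation should we use?
The one with the minimum logical effort:
- 8-input NAND followed by an inverter: $LE = \frac{10}{3} \cdot 1 = \frac{10}{3}$; $p = 8 + 1 = 9$
- 4-input NAND followed by a 2-input NOR: $LE = 2 \cdot \frac{5}{3} = \frac{10}{3}$; $p = 4 + 2 = 6$
- NAND2, NOR2, NAND2, inverter: $LE = \frac{4}{3} \cdot \frac{5}{3} \cdot \frac{4}{3} \cdot 1 = \frac{80}{27}$; $p = 2 + 2 + 2 + 1 = 7$
- Three NAND2 gates, each followed by an inverter: $LE = \left(\frac{4}{3}\right)^3 = 2.37$; $p = 2 \cdot 3 + 1 \cdot 3 = 9$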
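The comparison can be reproduced with exact fractions. In the sketch below, the gate chains assigned to each option are inferred from the LE and p values on the slide (the schematics did not survive), and the option names are hypothetical labels.

```python
from fractions import Fraction
from math import prod

# Standard logical effort and parasitic delay values used on the slide.
LOGICAL_EFFORT = {
    "INV": Fraction(1),
    "NAND2": Fraction(4, 3),
    "NOR2": Fraction(5, 3),
    "NAND4": Fraction(6, 3),
    "NAND8": Fraction(10, 3),
}
PARASITIC = {"INV": 1, "NAND2": 2, "NOR2": 2, "NAND4": 4, "NAND8": 8}

# Assumed gate chains for the four options (inferred, not taken from a schematic).
OPTIONS = {
    "nand8_inv": ["NAND8", "INV"],
    "nand4_nor2": ["NAND4", "NOR2"],
    "nand2_nor2_nand2_inv": ["NAND2", "NOR2", "NAND2", "INV"],
    "three_nand2_inv": ["NAND2", "INV", "NAND2", "INV", "NAND2", "INV"],
}


def path_logical_effort(gates):
    """Product of the logical efforts along the path."""
    return prod((LOGICAL_EFFORT[g] for g in gates), start=Fraction(1))


def path_parasitic(gates):
    """Sum of the parasitic delays along the path."""
    return sum(PARASITIC[g] for g in gates)


for name, gates in OPTIONS.items():
    print(name, path_logical_effort(gates), path_parasitic(gates))

best = min(OPTIONS, key=lambda name: path_logical_effort(OPTIONS[name]))
print("minimum logical effort:", best)
```

Note that the option with the lowest logical effort is not the one with the lowest parasitic delay; the slide selects on logical effort because the path effort dominates for such a large load.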
New optimal number of stages
So now we can calculate the actual path effort:
$$PE = F \cdot \prod b_i \cdot \prod LE_i = 2^{13} \cdot 2.37 = 19{,}418$$
$$N_{opt} = \log_{3.6} PE = 7.7$$
We could add another inverter or two to get closer to the optimal number of stages.
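The arithmetic for both stage-count estimates (the ideal case with $\prod LE = 1$ and the chosen implementation with $\prod LE = (4/3)^3$) can be sketched as follows. The function name `optimal_stage_count` is an assumption; the target stage effort of 3.6 is the value used in the slides.

```python
import math


def optimal_stage_count(path_effort: float, stage_effort: float = 3.6) -> float:
    """N_opt = log base stage_effort of the path effort."""
    return math.log(path_effort) / math.log(stage_effort)


branching = 128             # each address driver feeds 128 gates, one on the path
electrical = 256 / 4        # C_WL / C_in = 256 C_cell / 4 C_cell
pe_ideal = branching * electrical * 1.0             # product of LE assumed to be 1
pe_actual = branching * electrical * (4 / 3) ** 3   # three NAND2 gates on the path

print(pe_ideal, round(optimal_stage_count(pe_ideal), 1))
print(round(pe_actual), round(optimal_stage_count(pe_actual), 1))
```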
Implementation Problems
Address Line Capacitance:
- Our assumption was that $C_{in,addr\_driver} = 4\,C_{cell}$.
- But each address drives 128 gates. That's a really long wire with high capacitance.
- This means that we will need to buffer the address lines.
- This will probably ruin our whole analysis...
Bit-cell Pitch:
- Each signal drives one row of bitcells.
- How will we fit 8 address signals into this pitch?
Predecoding - Concept
Solution:
- Let's look at two decoder paths: $WL_{254}$ and $WL_{255}$.
- We see that there are many shared gates.
- So why not share them?
- For instance, we can use the purple output for both gates.
Predecoding - Method
How do we do this?
- If we look at the final Boolean expression, it has combinations of groups of inputs.
- By grouping together a few inputs, we actually create a small decoder.
- Then we just AND the outputs of all the predecoders.
- For example, two 4:16 predecoders:
$$D = dec(A_0, A_1, A_2, A_3); \quad E = dec(A_4, A_5, A_6, A_7)$$
$$WL_0 = D_0 E_0; \quad WL_{255} = D_{15} E_{15}; \quad WL_{254} = D_{15} E_{14}$$
Predecoding - Example
Let's look at our example:
$$D = dec(A_0, A_1, A_2, A_3)$$
$$E = dec(A_4, A_5, A_6, A_7)$$
$$WL_0 = D_0 E_0$$
$$WL_{255} = D_{15} E_{15}$$
$$WL_{254} = D_{15} E_{14}$$
What is our new branching effort?
- As before, each address drives half the lines of the small decoder.
- Each predecoder output drives 256/16 post-decoder gates.
- Altogether, the branching effort is:
$$B = b_{addr\_driver} \cdot b_{predecoder} = \frac{16}{2} \cdot \frac{256}{16} = 128$$
Same as before!
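A behavioral sketch of the two-predecoder scheme is shown below. It assumes the row index is formed as 16 x (D index) + (E index), which is chosen to match $WL_{254} = D_{15} E_{14}$ above; the function names are hypothetical.

```python
from typing import List


def one_hot(value: int, width: int) -> List[int]:
    """Small decoder: 2^width outputs, exactly one of them high."""
    return [int(i == value) for i in range(2 ** width)]


def predecoded_word_lines(address: int) -> List[int]:
    """Two 4:16 predecoders followed by 256 two-input AND gates."""
    d = one_hot(address >> 4, 4)    # D: assumed to select the group of 16 rows
    e = one_hot(address & 0xF, 4)   # E: assumed to select the row within the group
    return [d[row >> 4] & e[row & 0xF] for row in range(256)]


def branching_effort(rows: int = 256, predecoder_outputs: int = 16) -> int:
    """B = (half the predecoder gates per address line) x (post-decoder gates per output)."""
    b_addr_driver = predecoder_outputs // 2
    b_predecoder = rows // predecoder_outputs
    return b_addr_driver * b_predecoder


print(predecoded_word_lines(254).index(1), branching_effort())
```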
Predecoding - Solution
Why is this a better solution?
- Each address driver is only driving four gates, so it sees less capacitance.
- We saved a ton of area by sharing gates.
- We can pitch-fit 2-input NAND gates.
Another Predecoding Example
We can try using four 2-input predecoders. This will require us to use 256 4-input NAND gates.
How do we choose a configuration?
- Pitch fitting: 2-input NANDs vs. 4-input NANDs.
- Switching capacitance: how many wires switch at each transition?
- Stages before the large cap: distribution of the load along the delay.
Conclusion: Usually do as much predecoding as possible!
[Figure: two predecoding configurations for address bits A0 to A7, using 2:4 and 4:16 predecoders, driving the word lines.]
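To compare configurations side by side, the sketch below counts the wires and gate inputs for a given grouping of address bits. The function and field names are assumptions; `group_bits=1` models the case with no predecoding, where only the true and complement address lines exist.

```python
def predecode_configuration(address_bits: int, group_bits: int) -> dict:
    """Count structures when the address is split into equal groups of group_bits."""
    if address_bits % group_bits != 0:
        raise ValueError("address_bits must be a multiple of group_bits")
    groups = address_bits // group_bits
    rows = 2 ** address_bits
    return {
        "predecoders": groups,
        "predecoder_outputs": groups * 2 ** group_bits,    # wires routed along the array
        "final_gate_inputs": groups,                        # fan-in of each word-line gate
        "final_gates": rows,
        "loads_per_predecoder_output": rows // 2 ** group_bits,
    }


for group_bits in (1, 2, 4):
    print(group_bits, predecode_configuration(8, group_bits))
```

For the 8-bit example, more predecoding shrinks the final gate from 8 inputs to 2 and reduces the number of gates hanging on each routed wire from 128 to 16, at the cost of more routed wires.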
Alternative Solution: Dynamic Decoders
[Figures: 2-input NOR decoder; 2-input NAND decoder]
Column Multiplexer
Column Multiplexer
First option: PTL mux with decoder
- Fast: only 1 transistor in the signal path.
- Large transistor count.
[Figure: 4:1 pass-transistor mux. A decoder on A1, A0 selects one of the inputs B0 to B3 onto the output Y.]
Column Multiplexer
Second option: Tree Decoder
- For a $2^k$:1 mux, it uses k series transistors.
- Delay increases quadratically with the number of series transistors.
- No external decode logic, so a big area reduction.
[Figure: 4-to-1 tree decoder]
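The trade-off between the two mux styles can be sketched with a simple count and a first-order Elmore estimate. This is a rough model under stated assumptions: every pass transistor has the same resistance `r`, every internal node has the same capacitance `c`, and off-path loading is ignored. All function names are hypothetical.

```python
def decoded_mux_cost(select_bits: int) -> dict:
    """PTL mux with an external decoder: one pass transistor per input."""
    inputs = 2 ** select_bits
    return {"pass_transistors": inputs, "series_transistors": 1, "decoder_outputs": inputs}


def tree_mux_cost(select_bits: int) -> dict:
    """Tree mux: a binary tree of pass transistors steered directly by the address bits."""
    return {
        "pass_transistors": 2 ** (select_bits + 1) - 2,  # 2^k + 2^(k-1) + ... + 2
        "series_transistors": select_bits,
        "decoder_outputs": 0,
    }


def elmore_chain_delay(series_transistors: int, r: float = 1.0, c: float = 1.0) -> float:
    """Elmore delay of a uniform RC chain: node i sees i*r upstream, so the sum is k(k+1)/2 * r*c."""
    return sum(i * r * c for i in range(1, series_transistors + 1))


for k in (1, 2, 3, 4):
    print(k, tree_mux_cost(k), elmore_chain_delay(k))
```

Doubling the number of series transistors roughly quadruples the chain delay in this model, which is the quadratic growth noted above.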
Combining the Two
Precharge and Sense Amp
Precharge Circuitry
- Precharge bitlines high before reads.
- Equalize bitlines to minimize the voltage difference when using sense amplifiers.
[Figures: precharge and equalization circuits on the bit and bit_b lines]
Sense Amplifiers
$$t_p = \frac{C \cdot \Delta V}{I_{av}}$$
- C is large and $I_{av}$ is small, so make $\Delta V$ as small as possible.
- Idea: use a sense amplifier, which takes a small transition at its input and produces the output.
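To see why a small bitline swing matters, the sketch below evaluates $t_p = C \cdot \Delta V / I_{av}$ for a full swing and for a reduced swing. The capacitance, current, and voltage values are hypothetical round numbers, not figures from the lecture.

```python
def bitline_delay(capacitance_f: float, delta_v: float, average_current_a: float) -> float:
    """t_p = C * dV / I_av, in seconds."""
    return capacitance_f * delta_v / average_current_a


# Hypothetical values: 100 fF bitline, 20 uA average cell current.
C_BITLINE = 100e-15
I_CELL = 20e-6

full_swing = bitline_delay(C_BITLINE, 1.0, I_CELL)    # wait for a 1 V swing
small_swing = bitline_delay(C_BITLINE, 0.1, I_CELL)   # sense after a 100 mV swing

print(full_swing, small_swing, full_swing / small_swing)
```

Since the delay is linear in the swing, sensing at one tenth of the swing cuts the bitline delay by a factor of ten in this model.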
Differential Sense Amplifier
- A non-clocked sense amp has high static power.
- A clocked sense amp saves power.
- It requires sense_clk after enough bitline swing.
- Isolation transistors cut off the large bitline capacitance.
Further Reading
- Rabaey et al., Digital Integrated Circuits (2nd Edition)
- Elad Alon, Berkeley EE141 (online)
- Weste and Harris, CMOS VLSI Design (4th Edition)