How to Make a True 3D-TSV IC Application
3D-TSV IC technologies are spreading, but major applications have not followed
Kanji Otsuka, Meisei University Collaborative Research Center
SEMICON Taiwan 2011

We still have not found a major application for the TSV interconnection structure.
• As we understand it, the main figure of merit of the TSV structure is that 3-D interconnections free us from the 2-D restriction.
• Is this figure of merit correct?
• We should re-examine the concept behind this main figure of merit in order to create major applications.

1. TSV diameter: still very large for an interconnection.
Active area and 2D wiring area are wasted, even if we choose a TSV of 2 um diameter.
[Figure: 2D interconnection, TSV, Si substrate]

The TSV cannot be scaled down to the limits of wiring dimensions. Its advantage lies rather in the 3D structure.
A TSV can provide the equivalent of approximately 2 more wiring layers while preventing wiring length from growing.
Current technology: 6 or 10 metal layers.
[Figure: TSV, Si substrate]

2. Trade-off between TSV aspect ratio and the intrinsic gettering (IG) layer
In the via-last case, the intrinsic gettering layer is lost when the wafer thickness is 50 um or less.
[Figure: Si substrate, TSV, thinning edge, IG layer]

3. The known-good-die issue is difficult to solve in wafer-to-wafer (W2W) stacking, so redundancy must be implemented.
[Figure: failure die]

4. Thermal issues are difficult in structures with many stacked layers, so power saving is required.
[Figure: stacked Si substrates with TSVs; integrated thermal energy]

5. Effective function must overcome the cost issue.
6. Many other restrictions in process and design technology: complexity is increasing.

Summary of 3D-TSV restrictions
(Tasks shown in red characters on the slide are the current focus.)
1. Restriction and problem: lower area efficiency because active area and 2D wiring area are wasted.
   Task: find function and performance that outweigh the TSV area penalty.
2. Restriction and problem: trade-off between TSV aspect ratio and loss of the IG layer.
   Task: process improvements are now coming into view.
3. Restriction and problem: difficulty with known-good die.
   Task: introduce W2C or C2C for production, or build in redundancy.
4. Restriction and problem: thermal limitation, the most important issue for 3D.
   Task: choose power-saving circuits and systems; a fundamental approach is needed.
5. Restriction and problem: cost limitation.
   Task: effective function and performance that overcome the cost.
6. Restriction and problem: complicated process and design.
   Task: simple process and an easy design methodology and algorithm.

Several solutions have been announced, but the trend still seems insufficient.
(1) Tile or small-block arrays connected through TSVs are good for memory or image sensor systems, with wide-band interconnection through several thousand TSVs.
(2) Cache DRAM facing the CPU provides a large cache while saving area.
(3) Stacked closed function blocks, including FPGA and core, make a scalable system with redundancy.
(4) A silicon interposer with TSVs achieves higher performance from 2D wiring.
(5) A stacked memory module and a stacked module of many small cores are connected through a diagnosis-restoration and dynamic-reconfiguration wiring module. This is something of an ideal system, but nothing concrete has been specified yet.
[Figure labels: redundant memory; core CPU or bus controller; memory, FPGA, core; memory, diagnostic-restoration, many core, FPGA, Si interposer]

A small number of TSVs in each tile or small block would make the most effective structure. However, tiles with different functions would have different sizes and different connection requirements, so they could not be stacked and interconnected efficiently.
Naturally, this suggests the idea of one unified circuit across the whole system. Then we can build the tile structure efficiently.
A neuron in our brain is a unified function that combines logical processing and memory. Can we make such a circuit from CMOS unit gates?
[Figure: neuron and axon network]

Dynamic reconfiguration algorithm using a unified function block
• Array of mats: logic surrounded by cache.
• Cache grows and shrinks depending on the cache hit ratio.
• Cache is added around newly generated logic.
• When the job capacity increases, the logic expands.
• Multiple tasks run with a shared cache.
• Neighboring blocks communicate efficiently, with high bandwidth and a high processing rate.

Unified circuit!
It is easy to build with the following configuration. SRAM can take on any function, even wiring connections.
[Figure: SRAM used for memory or for logic, changed by a mode selector]

A unified-like approach already exists in FPGAs.
FPGA
○ Logic block: LUT (SRAM) and simple logic with a relatively small driver
○ Switching block: FF + switch
○ Connecting block: wiring
The above is not a true unified block, because it is composed of primitive logic and additional memory (both are hard structures).
Toward a unified circuit (previous slide)
○ Logic block: SRAM with mode selector
○ Memory block: SRAM with mode selector
○ Switching block: SRAM
○ Basic cell connection (wiring): SRAM
[Figure: SRAM bits controlling the switches, 0: off, 1: on]
Unified!
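The idea that one SRAM can serve as memory or as logic, selected by a mode selector, can be illustrated with a minimal sketch. It assumes the 256-word × 8-bit SRAM size given on the basic cell slide; the class name `BasicCell`, its methods, and the choice of splitting the 8-bit address into two 4-bit operands are hypothetical and are not taken from the presentation.

```python
class BasicCell:
    """Hypothetical model of an SRAM cell usable as memory or as a LUT."""

    WORDS = 256  # 256 words x 8 bits, as on the basic cell slide

    def __init__(self):
        self.sram = [0] * self.WORDS
        self.mode = "memory"  # assumed initial mode for this sketch

    def write(self, addr, data):
        # Plain SRAM write; in logic mode this is how the LUT is reprogrammed.
        self.sram[addr & 0xFF] = data & 0xFF

    def read(self, addr):
        return self.sram[addr & 0xFF]

    def load_lut(self, func):
        # Fill the SRAM so that address (a << 4 | b) holds func(a, b),
        # then switch the mode selector to logic.
        for a in range(16):
            for b in range(16):
                self.write((a << 4) | b, func(a, b))
        self.mode = "logic"

    def evaluate(self, a, b):
        # In logic mode the address lines are the logic inputs.
        if self.mode != "logic":
            raise RuntimeError("cell is not in logic mode")
        return self.read(((a & 0xF) << 4) | (b & 0xF))


cell = BasicCell()
cell.load_lut(lambda a, b: a + b)   # cell now behaves as a 4-bit adder
print(cell.evaluate(9, 8))          # prints 17
cell.load_lut(lambda a, b: a ^ b)   # rewriting the contents changes the function
print(cell.evaluate(9, 8))          # prints 1
```

Rewriting the stored words is the only step needed to change the function, which is the sense in which reconfiguration is a memory write in this sketch.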
However, switching blocks and wiring built from SRAM have poor efficiency.
Therefore, choose an optimum basic cell size and cluster size:
○ Logic block: SRAM with mode selector and a relatively small driver
○ Memory block: SRAM with mode selector and a relatively small driver
○ Cluster connection: bus with driver (through TSV)
[Figure: FPGA's basic cell (switching block, connecting block, logic block) and the LUT architecture of Xilinx Virtex-5 (6-LUT built from two 5-LUTs, MUX and FF)]

Now I introduce our memory-logic conjugate system.

SRAM-based 8-bit Processor
An application of the Memory-Logic Conjugate System (MLCS) in its smallest model
Meisei University: Yoichi Sato, Kanji Otsuka
Hitachi ULSI Systems: Masahiro Yoshida

The Outlook of the Memory-Logic Conjugate System (MLCS)
1. Solves the problem of low bandwidth between memories and logic (because the memory is the logic itself).
2. Effective architecture: dynamic reconfiguration can be done just by rewriting registers (because the memory is the logic itself).
3. High-speed operation: the miscellaneous registers in a basic cell can be used through dynamic reconfiguration (the basic cell itself is programmable).
4. Suitable for 3D-TSV assembly, and scalable because it is built from small blocks.
5. Low power: no I/O circuits are needed between logic circuits and SRAMs, and access paths can be saved.

Structure of Basic Cell
Simple operations are programmable using the rich internal registers.
Bus wiring can be routed over the memory area (about 70%), which saves area.
[Figure: basic cell block diagram. Input control circuit (mode change control and channel control); SRAM (LUT), 256W × 8bit, with ADD, DIN, D, CK, CE and R/W; output control circuit (register, switch, etc. control; 4bit REG × 8); channel set register; mode set register; control bus (CY etc.); sub control bus (8bit); reconfiguration bus (4bit each); control signals (1bit each); address and data (4bit each); write command bus; outputs of the route configuration register or mode register]

Operation mode of basic cell (memory-logic conjugate cell)
Rich operation modes allow flexible and variable systems to be constructed.
[Figure: operation mode tree. Through access mode (= initial mode), S/R = "L" (reset mode); system mode, S/R = "H"; memory mode; logic mode; external memory mode; route configuration register mode (making LUT; making LUT for dynamic reconfiguration); arithmetic operation mode; logic library mode (macro-cell); combinational circuit mode; for dynamic reconfiguration: internal memory mode, information update mode for route configuration register, route configuration mode by mode register]

Outlook of MLCS structure
The cluster size is allocated to match the operation and the logic density.
[Figure: basic cell arrays of n columns × m rows with control circuit, bus I/F and decoders (CX, CY); cluster memory address space made of a q-bit extension address and an 8-bit memory address; multiple bus carrying Clk + control signals and data (8 bit × n) to other systems (including cluster memory)]
Memory-Logic Conjugate System (MLCS): the total system, including several cluster memories of basic cells (B.C.).

Actual design of a four-basic-cell configuration
[Figure: area for TSVs; four basic cells; memory (SRAM) for testing, 256W × 8bit × 4 cells]

MLCS memory space
The memory space is adjustable for the dynamic reconfiguration function.
[Figure: memory space of the LSI basic cells mapped to the memory space of MLCS. Cluster memories 1 to n, each built from 256w basic cells in logic mode or memory mode; channel set register; bus switch for memory space]

Cluster memory layout example with a single 8-bit ALU
● Area is about 330 × 330 um^2 in a 90 nm process (one cluster)
[Figure: basic cell array addressed by X and Y (00, 01, 10, 11) through the basic cell decoder. Blocks: PC; adder and 8bit ALUs (one resource shared, decoder control); logical judgment circuit; shifter (8bit); instruction decoder; reserve part; program memory (512w × 8b)]
(Note)
(1) Program counter: 16bit
  . 2-cycle operation in case of overflow in the address operation
  . 1-cycle operation (without overflow), using the 8bit ALU
(2) Structure of 8bit ALU
  . To enable 2-cycle 16bit addition, a new type of adder with carry code input is introduced (it uses 4 basic cells).

Performance comparison between pure logic and MLCS

Operation speed in processor mode
                 Pure logic**   MLCS (8bit)                     MLCS (32bit)
Band frequency   (8/32bit)      Non-parallel   Four parallel*   Non-parallel   Four parallel*
Maximum          4GHz           1GHz           4GHz             1GHz           4GHz
Mean rate        ?              (1GHz)         (4GHz)           (1GHz)         (3GHz)
Note: * In the case of 50% independence between the four threads.
      ** One thread in pure logic, which is superior to the SRAM-based MLCS.
[Figure: program command + data; four multi-thread processing; rearrangement]

Power consumption for the same logic with one thread
Power            Pure logic   MLCS   FPGA
Relative ratio   1            2      20

Area consumption for the same logic with different peripheral circuits α, β, γ
Area    Pure logic   MLCS    FPGA
Ratio   1 + α        7 + β   30 + γ
α, γ: constant size, designed with some allowance
β: dynamic size, with minimum design

Pure logic would be the best for processing; however, MLCS can also operate in dynamic reconfiguration mode and as memory.
Configuring from cluster to mat structure, controlled by a synchronous clock
[Figure: a mat (unit processor element) built from basic cell arrays (= clusters), each with its control circuit and decoders; cluster memory; position of clock supply; space for wiring and TSVs connecting the clusters in a mat]

Clock timing image for synchronous and asynchronous sub-processors
• Master clock: asynchronous between mats.
• Mat-to-mat access is dynamic, using an asynchronous clock together with dynamic reconfiguration.
• A hit signal from a neighboring mat is given by the header of a packet.
• We call a clock-synchronous cube a mat.

Dynamic reconfiguration algorithm
• Array of mats: logic surrounded by cache.
• Cache grows and shrinks depending on the cache hit ratio.
• Cache is added around newly generated logic.
• When the job capacity increases, the logic expands.
• Multiple tasks run with a shared cache.
• Adjacent addressing keeps the latency within 1 clock inside a synchronous cube.
Of course, a mat itself can dynamically set its number of registers depending on the requirement. A mat can also include penetrated caches inside.

Other approaches in technical papers
Memory-structured LUT presented by Masayuki Sato, RECONF Symposium 2006.9.
One idea introduced there is a memory-based logic circuit with half-quadrate interconnection in a random array; however, memories are still consumed for interconnection and switching. Rearrangement of the unit tile is now being developed by Mr. Sato and Prof. Hironaka of Hiroshima City University.

The next significant issue is power saving. Is there a drastic power-saving method? Yes, we have one idea.
[Figure: a body moving from start to stop, with momentum I = mv and kinetic energy K = (1/2) m v^2; on stopping, the energy is radiated as heat]

Physics of power consumption
Power consumption of a unit circuit (RC delay circuit)
[Figure: driver circuit with C_T, R_on, R_I, C_I and C_L; voltage and current waveforms showing on current, off current and i_max; mechanical analogy from start to stop with radiation of heat]
The current is wasted; we should recover it.

i_max = Vdd / R_on
Q [C] = (C_T + C_L + C_I) · Vdd
P [W] = (1/2) (C_T + C_L + C_I) · Vdd^2
v_r = Vdd (1 − exp(−t / (R_on C_sum))),  v_f = Vdd exp(−t / (R_on C_sum))
i_ch = i_max exp(−t / (R_on C_sum)),  i_dis = i_max (1 − exp(−t / (R_on C_sum)))
where C_sum = C_T + C_L + C_I.

One solution can be found in the operation of an electric car.
[Figure: sports EV and battery; charge by braking, discharge; I = mv, K = (1/2) m v^2]
However, a transistor cannot recover the energy of its active carriers, or so we would all think. Is that true?
[Figure: N-type and P-type MOS cross sections (G, S, D). Active carriers on the conduction band; with the gate at 0V, vacancy (depletion) layer and association; carriers diffuse and shift to the valence band, generating heat]

Huge power!!
[Figure: power supply building]
K computer, performance: 10 PFLOPS, the largest computer in the world at present.

Recovering signal energy method: active carriers reused in a differential CMOS circuit
[Figure: differential MOS transistors (source, gate, drain) in the same well; input characteristic impedance Z0 = 100Ω; output characteristic impedance Z0 = 100Ω]
The key structure is that the differential MOS transistors are positioned in the same well.
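The RC relations on the "Physics of power consumption" slide can be checked numerically. The sketch below is only an illustration: the capacitance and resistance values are hypothetical, the function name `rc_switching` is made up, and the (1/2)(C_T + C_L + C_I)·Vdd^2 term is treated here as the energy dissipated per charging event. It integrates i^2·R_on for the charging current i_ch = i_max·exp(−t / (R_on·C_sum)) and compares the result with that term.

```python
import math


def rc_switching(c_t, c_l, c_i, vdd, r_on, steps=100_000):
    """Return (charge, half_cv2, heat) for one charging event of an RC stage."""
    c_sum = c_t + c_l + c_i          # total capacitance, as C_sum on the slide
    tau = r_on * c_sum               # RC time constant
    charge = c_sum * vdd             # Q = (C_T + C_L + C_I) * Vdd
    half_cv2 = 0.5 * c_sum * vdd ** 2
    i_max = vdd / r_on               # peak current at the start of charging

    # Midpoint-rule integration of i(t)^2 * R_on from 0 to 20 time constants.
    dt = 20.0 * tau / steps
    heat = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        i = i_max * math.exp(-t / tau)
        heat += i * i * r_on * dt
    return charge, half_cv2, heat


# Hypothetical values: three 0.5 pF capacitances, Vdd = 1.8 V, R_on = 100 ohm.
q, e, heat = rc_switching(0.5e-12, 0.5e-12, 0.5e-12, 1.8, 100.0)
print(f"Q = {q:.3e} C, (1/2)CV^2 = {e:.3e} J, heat in R_on = {heat:.3e} J")
```

Under these assumptions the heat dissipated in R_on matches (1/2)·C_sum·Vdd^2 regardless of the value of R_on, which is the wasted portion that the carrier-reuse idea aims to recover.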
[Figure: layout dimensions of the differential pair: space 1 um; 4.3 um, 2 um, 11.5 um, 7.2 um, 5 um]

Recovering signal energy method: active carriers reused in a differential CMOS I/O driver
The differential transistors are arranged in the same well.
[Figure: driver schematic with VDD, VRF, INP, INN, OUTN and OUTP; input ESD, inverter, current control and output ESD; cross section showing the IN-Negative and IN-Positive transistors in the N-well and P-well on P_SUB]

Unit cell layout configuration
[Figure: ESD, inverter, ESD]

Paired switch in the same well
[Figure: carrier states at initial, transient and after inversion. Carriers are forcibly released by the capacitance change; free carriers move to the other capacitance through the voltage sink; inductance limits the discharge when carriers are rejected through the source or drain]
Assume a hole mobility of 4×10^2 cm^2/Vs at 300 K for a carrier density of 10^14 to 10^15 cm^-3, and Vdd = 1.8 V. This gives D = 7.2×10^2 cm^2/s (mobility × Vdd). For a carrier travel length of 10 μm, 0.001 cm = √(D·t) = √(7.2×10^2 · t), so t = 1.3×10^-9 s = 1.3 ns, which is long compared with our target pulse rise time of 100 ps (3 GHz equivalent). The electron travel time, however, is 130 ps, which is of the order of our rise time.

Carrier reuse driver chip

[Figure: measurement setup. IC chip in a conventional CMOS process ("0.18um node"), flip chip bonded; differential input, Z0 = 100 ohm, 0.25 mm length, Cin = 0.45 pF; Cip = 0.47 pF; Cwel = 1.56 pF; substrate wiring length for the differential output 8 mm, Z0 = 100Ω; 100 ohm terminator; differential probing, Z0 = 100 ohm; R for current measurement]
The supply current is measured from the voltage drop across a 4.7 ohm series resistance.
We can save power with the carrier-reuse circuit.
[Figure: differential inverter current versus frequency. Axes: current [mA], frequency [GHz] from 0.001 to 10. Curves: current at Vdd 1.8V; calculated current by capacitance; ohmic current; DC current from the current control transistors and the clamping drivers. Annotations: "Reduction!!"; depressed swing height region]

A random-pulse eye pattern shows high speed even at the 0.18 um process node.
[Figure: eye patterns at 8, 9, 10, 11 and 12 Gbps. Conditions: FR-4 substrate, transmission line = 100Ω, 4 mm to termination, probe point, ESD Z = 50Ω, VCC = 1.8V, termination = 100Ω, input swing 1.8V]

A more effective carrier-reuse circuit structure is the double-gate fin type.
[Figure: drain 2, gate 2, source 2; insulating layer; drain 1, gate 1, source 1]

Power saving image for each device using the carrier-reuse transistor circuit
[Figure column: relative power consumption level, initial versus carrier reuse]
Device           Function                                        Power saving ratio
(1) Pure logic   ALU, peripheral, I/O                            15 to 30 %
(2) DRAM         Memory mat, addressing, I/O                     10 to 30 %
(3) SRAM         Memory mat, addressing, I/O                     25 to 45 %
(4) MLCS         M/L mat with small cell, addressing, I/O        30 to 50 %
MLCS: power level lower than SRAM because of the small cell.
Applicable to all differential circuits.

Summary for a solution
1. Previously listed task: find function and performance that outweigh the TSV area penalty.
   Solution: tile or small-block array structure through TSV interconnection.
3. Previously listed task: build in redundancy.
   Solution: unified circuit such as the memory-logic conjugate system.
4. Previously listed task: choose power-saving circuits and systems.
   Solution: carrier-reuse transistor circuit.
5. Previously listed task: effective function and performance that outweigh the cost.
   Solution: unified circuit such as the memory-logic conjugate system.
6. Previously listed task: easy design algorithm.
   Solution: unified circuit such as the memory-logic conjugate system.
As in the examples in my presentation, more fundamental physics and algorithm concepts should be developed for 3D structures with TSVs.
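As a worked check of the hole travel-time estimate given for the paired switch in the same well (hole mobility 4×10^2 cm^2/Vs, Vdd = 1.8 V, 10 μm path), the following minimal sketch reproduces the arithmetic. It follows the presentation's own relation, D = mobility × Vdd and length = √(D·t); the function names are hypothetical, and the second function, which inverts the relation for a given rise time, is an addition for illustration only.

```python
import math


def transit_time(mobility_cm2_per_vs, vdd, length_cm):
    """Travel time t from length = sqrt(D * t), with D = mobility * Vdd."""
    d = mobility_cm2_per_vs * vdd        # 4e2 * 1.8 = 7.2e2 cm^2/s
    return length_cm ** 2 / d


def reach_length(mobility_cm2_per_vs, vdd, time_s):
    """Length covered in time_s under the same relation (illustrative inverse)."""
    d = mobility_cm2_per_vs * vdd
    return math.sqrt(d * time_s)


t_hole = transit_time(4e2, 1.8, 1e-3)        # 10 um = 0.001 cm
print(f"hole travel time over 10 um: {t_hole * 1e9:.2f} ns")

reach = reach_length(4e2, 1.8, 100e-12)      # 100 ps target rise time
print(f"hole travel length in 100 ps: {reach * 1e4:.2f} um")
```

With these inputs the travel time comes out at roughly 1.4 ns, in line with the 1.3 ns order quoted in the estimate, and about an order of magnitude longer than the 100 ps rise time.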