How to make a true 3D-TSV IC application
-- 3D-TSV IC technologies are spreading, but major applications have not followed
Meisei University
Collaborative Research Center
Kanji Otsuka
We have still not found a major application for the TSV interconnection structure.
• As we understand it, the main figure of merit of the TSV structure is that 3-D interconnections remove the 2-D wiring restriction.
• Is this figure of merit correct?
• We should re-examine this main figure of merit in order to create major applications.
SEMICON Taiwan 2011
Kanji Otsuka, Meisei University
1. TSV diameter: still very large for an interconnection.
Active area and 2D wiring area are wasted, even if we choose a TSV diameter of 2 um.
(Figure labels: 2D interconnection, TSV, Si substrate)
TSVs will not scale down to the wiring dimensions; their advantage lies rather in the 3D structure.
TSVs can provide the equivalent of approximately 2 more wiring layers while preventing the wiring length from growing.
Current technology: 6 or 10 metal layers
(Figure labels: Si substrate, TSV)
2. Trade-off between TSV aspect ratio and the intrinsic gettering (IG) layer
In the via-last case, the intrinsic gettering layer is lost once the wafer is thinned to 50 um or less.
(Figure labels: Si substrate, TSV, thinning edge, IG layer)
3. The known-good-die (KGD) issue is difficult to solve in wafer-to-wafer (W2W) stacking, so redundancy must be implemented.
(Figure label: failure die)
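Why W2W stacking makes the known-good-die issue hard can be seen with simple yield arithmetic. The sketch below is illustrative only: it assumes every die fails independently with the same yield, and the function names and the spare-layer redundancy model are hypothetical, not taken from the slides.

```python
from math import comb


def stack_yield(die_yield: float, layers: int) -> float:
    """Yield of a W2W stack when dies cannot be pre-selected.

    Assumption: every layer must be good and dies fail independently.
    """
    return die_yield ** layers


def stack_yield_with_spares(die_yield: float, layers: int, spares: int) -> float:
    """Probability that at least `layers` of `layers + spares` dies are good.

    Hypothetical redundancy model: any good die can replace a failed one.
    """
    total = layers + spares
    return sum(
        comb(total, good) * die_yield**good * (1.0 - die_yield) ** (total - good)
        for good in range(layers, total + 1)
    )


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, "layers:", round(stack_yield(0.9, n), 4))
    print("4 layers + 1 spare:", round(stack_yield_with_spares(0.9, 4, 1), 4))
```

Under these assumptions the stack yield falls geometrically with the number of layers, which is the motivation for redundancy or for W2C/C2C assembly with tested dies.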
4. Thermal issues are difficult in structures with many stacked layers, so power saving is required.
(Figure: stacked Si substrates connected by TSVs; integrated thermal energy)
5. Cost issue: effective function must overcome the cost.
6. Many other restrictions in process and design technology: complexity is increasing.
Summary of 3D-TSV restrictions
(In the original slide, red characters mark the tasks in focus now.)
1. Restriction and problem: lower area efficiency, because active area and 2D wiring area are wasted
   Task: find function and performance that outweigh the TSV area penalty
2. Restriction and problem: trade-off between TSV aspect ratio and loss of the IG layer
   Task: an improved process has now come into view
3. Restriction and problem: difficulty of known-good-die
   Task: introduce W2C or C2C for production, or build in redundancy
4. Restriction and problem: thermal limitation; the most important issue for 3D
   Task: choose power-saving circuits and systems; a fundamental approach is needed
5. Restriction and problem: cost limitation
   Task: effective function and performance that overcome the cost
6. Restriction and problem: complicated process and design methodology
   Task: simple process and easy design algorithm
Several solutions have been announced, but the trend still seems insufficient.
(1) Tile or small-block arrays connected through TSVs are good for memory or image sensor systems that need wide-band interconnection through several thousand TSVs.
(Figure labels: redundant memory; core CPU or bus controller)
(2) Cache DRAM facing the CPU provides a large cache while saving area.
(3) Stacking closed function blocks, including FPGA and core, makes a scalable system with redundancy.
(Figure labels: memory, FPGA, core)
(4) A silicon interposer with TSVs gives higher-performance 2D wiring.
(Figure labels: memory, diagnostic-restoration, many core, FPGA, Si interposer)
(5) A stacked memory module and a stacked module of many small cores connect to a diagnosis-restoration and dynamic-reconfiguration wiring module. This is a kind of ideal system; however, none has been specified yet.
A small number of TSVs in each tile or small block would make the most effective structure. However, tiles with different functions have different sizes and different connection requirements, so they cannot be stacked and interconnected efficiently.
Naturally, this leads to the idea of a unified circuit for the whole system. Then we can build the tile structure efficiently.
A neuron in our brain is a unified function that combines logical processing and memory. Can we make such a circuit from CMOS unit gates?
(Figure: neuron and axon network)
Dynamic reconfiguration algorithm by unified function block
(Figure: array of mats, with the following annotations)
• The cache increases and decreases depending on the cache hit ratio.
• Neighboring blocks communicate efficiently, with high bandwidth and a high processing rate.
• Cache is added by newly generated logic.
• The cache surrounds the logic.
• When the job capacity increases, the logic expands and the cache surrounds the expanded logic.
• Multiple tasks run with a shared cache.
Unified circuit! It is easy to make with the following configuration.
SRAM can be changed to any function, even a wiring connection.
(Figure: the same SRAM configured for memory or for logic, changed by a mode selector)
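The idea that one SRAM can act as memory or as logic can be illustrated with a small software model. This is only a sketch under assumptions: the class name `BasicCellModel`, its methods, and the choice of a 4-bit adder as the logic function are hypothetical; only the 256-word by 8-bit size follows the basic cell described in this presentation.

```python
class BasicCellModel:
    """Toy model of one SRAM block used either as memory or as a lookup table."""

    WORDS = 256        # 256 words x 8 bits
    WORD_MASK = 0xFF

    def __init__(self):
        self.sram = [0] * self.WORDS
        self.mode = "memory"  # hypothetical mode selector: "memory" or "logic"

    def write(self, address, value):
        """Memory mode: store one 8-bit word (address wraps at 256)."""
        if self.mode != "memory":
            raise RuntimeError("write is only allowed in memory mode")
        self.sram[address % self.WORDS] = value & self.WORD_MASK

    def read(self, address):
        """Read one word; the same access path serves both modes."""
        return self.sram[address % self.WORDS]

    def load_logic(self, func):
        """Fill the SRAM with the truth table of func(a, b) for 4-bit a and b,
        then switch the cell to logic mode."""
        for a in range(16):
            for b in range(16):
                self.sram[(a << 4) | b] = func(a, b) & self.WORD_MASK
        self.mode = "logic"

    def evaluate(self, a, b):
        """Logic mode: the two 4-bit operands form the SRAM address."""
        if self.mode != "logic":
            raise RuntimeError("evaluate is only allowed in logic mode")
        return self.read(((a & 0xF) << 4) | (b & 0xF))


if __name__ == "__main__":
    cell = BasicCellModel()
    cell.write(3, 0xAB)                     # used as plain memory
    cell.load_logic(lambda a, b: a + b)     # rewritten as a 4-bit adder
    print(cell.evaluate(9, 8))
```

In this model, reconfiguring the cell is nothing more than rewriting its contents and flipping the mode flag.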
A unified-like approach is already common in FPGAs.
FPGA
○ Logic block: LUT (SRAM) and simple logic with a relatively small driver
○ Switching block: FF + switch
○ Connecting block: wiring
This is not a true unified block, because it is composed of primitive logic and additional memory (both are hard structures).
Toward a unified circuit (previous slide)
○ Logic block: SRAM with mode selector
○ Memory block: SRAM with mode selector
○ Switching block: SRAM
○ Basic cell connection (wiring): SRAM
Unified! However, the switching block and the wiring are inefficient when built from SRAM.
Then, arrange the optimum basic cell size and cluster size
○ Logic block: SRAM with mode selector and a relatively small driver
○ Memory block: SRAM with mode selector and a relatively small driver
○ Cluster connection: bus with driver (through TSV)
(Figure: FPGA's basic cell, with logic block, switching block (FF-controlled switches, 0: off, 1: on), connecting block, and I/O. LUT architecture of Xilinx Virtex-5; labels: 6-LUT, two 5-LUTs, MUX, FF, inputs B1 to B6, BX, CIN, outputs B, BMUX, BQ, COUT)
Now I introduce our memory-logic conjugate system
SRAM-based 8-bit processor
An application of the Memory-Logic Conjugate System (MLCS) in its smallest model
Meisei University: Yoichi Sato, Kanji Otsuka
Hitachi ULSI Systems: Masahiro Yoshida
The Outlook of the Memory-Logic Conjugate System (MLCS)
1. Solves the problem of low bandwidth between memories and logic (because the memory is the logic itself).
2. Effective architecture: dynamic reconfiguration can be done only by rewriting registers (because the memory is the logic itself).
3. High-speed operation: the miscellaneous registers in a basic cell can be used for dynamic reconfiguration (the basic cell itself is programmable).
4. Suitable for 3D-TSV assembly, and scalable thanks to the small-block configuration.
5. Low power: no I/O circuits are needed between the logic circuits and the SRAMs, and access paths can be saved.
Structure of Basic Cell
Simple operations can be programmed by using the rich set of internal registers.
Bus wiring can be routed over the memory area (about 70%), which saves area.
(Figure: block diagram of the basic cell. Recoverable labels: SRAM (LUT), 256W×8bit, with ADD, ADD (Write), DIN, D, CK, CE, and R/W terminals; input control circuit (mode change control and channel control); output control circuit (register, switch, etc. control; 4bit REG ×8); configuration register or mode register; Ch. set register; mode set register; control bus (CY etc.); sub control bus (8bit); buses of 4bit×4 and 4bit×2. Legend: outputs of route; reconfiguration bus (4bit each); control signal (1bit each); address, data (4bit each); write command bus.)
Operation mode of basic cell (Memory-logic conjugate cell)
Rich operation modes allow flexible and variable systems to be constructed.
(Figure: tree of operation modes; entries in the order they appear on the slide)
Through Access mode (= initial mode), S/R="L" (reset mode)
System mode, S/R="H"
Memory mode
Logic mode
External memory mode
Route Configuration Register mode (making LUT)
External memory mode
Route Configuration Register mode (making LUT for dynamic reconfiguration)
Arithmetic operation mode
Logic library mode (macro-cell)
Combinational circuit mode
For dynamic reconfiguration:
Internal memory mode
Information Update mode for Route Configuration Register
Route Configuration mode by Mode Register
Outlook of MLCS structure
The cluster size can be allocated to match the operation and the logic density.
(Figure: Memory-Logic Conjugate System (MLCS), the total system including some cluster memories. Recoverable labels: other systems (including cluster memory); multiple bus with Clk + control signal and data (8 bit × n); control circuit + bus I/F; CX and CY decoders; basic cell arrays of n columns × m rows; cluster memory; address space of cluster memory, made of a q-bit extension address and the 8-bit memory address of a basic cell (B.C.).)
Actual design of a four-basic-cell configuration
(Figure labels: area for TSVs; four basic cells; memory (SRAM) for testing, 256W × 8bit × 4 cells)
MLCS memory space
The memory space is adjustable by the dynamic reconfiguration function.
(Figure: memory space of LSI and memory space of MLCS. Recoverable labels: cluster memories 1, 2, 3, ..., n, each made of 256-word (256w) basic cells; cells marked as logic mode or memory mode; channel set register; bus switch; cells assigned "for memory space".)
Cluster memory layout example in a single 8-bit ALU
● Area is about 330 × 330 um² @ 90nm process (one cluster)
(Note)
(1) Program counter: 16bit
. 2-cycle operation in case of overflow in address operation
. 1-cycle operation without overflow (by using the 8bit ALU)
(2) Structure of the 8bit ALU
. To enable 2-cycle 16bit addition, a new type of adder with carry code input is introduced (which uses 4 basic cells).
(Figure: cluster layout. Recoverable labels: PC adder & 8bit ALUs (one resource shared); decoder control; logical judgment circuit; basic cell; decoder; shifter (8bit); instruction decoder; basic cell array; reserve part; program memory (512w×8b); X and Y decoder codes 00, 01, 10, 11.)
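The note on the 16-bit program counter (one cycle normally, two cycles when the address operation overflows) can be sketched in software. This is a minimal model under assumptions: the function names are hypothetical, the offset is taken as an unsigned 8-bit value, and the "carry code input" adder of the slide is replaced here by an ordinary carry bit.

```python
def alu8_add(a, b, carry_in=0):
    """8-bit add; returns (8-bit sum, carry out)."""
    total = (a & 0xFF) + (b & 0xFF) + carry_in
    return total & 0xFF, total >> 8


def pc_add(pc, offset):
    """Advance a 16-bit program counter with an 8-bit ALU.

    Returns (new_pc, cycles). The low byte is always added in the first
    cycle; the high byte is touched in a second cycle only when the low
    byte overflows.
    """
    low, carry = alu8_add(pc & 0xFF, offset)
    high = (pc >> 8) & 0xFF
    cycles = 1
    if carry:
        high, _ = alu8_add(high, 0, carry)  # second cycle: propagate the carry
        cycles = 2
    return (high << 8) | low, cycles


if __name__ == "__main__":
    print(pc_add(0x00FE, 1))
    print(pc_add(0x00FF, 1))
```

Most sequential fetches stay within the low byte, so in this model the two-cycle case is the exception rather than the rule.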
Performance comparison between pure logic and MLCS
Operation speed in processor mode
Band frequency | Pure logic** (8/32bit) | MLCS (8bit), non-parallel | MLCS (8bit), four parallel* | MLCS (32bit), non-parallel | MLCS (32bit), four parallel*
Maximum | 4GHz | 1GHz | 4GHz | 1GHz | 4GHz
Mean rate | ? | (1GHz) | (4GHz) | (1GHz) | (3GHz)
Note: *In the case of 50% independency between the four threads
**One thread in pure logic, which is superior to the SRAM-based MLCS
(Figure labels: program command + data; four multi-thread processing; rearrangement)
Power consumption for the same logic with one thread
Power | Pure logic | MLCS | FPGA
Relative ratio | 1 | 2 | 20
Area consumption for the same logic with different peripheral circuits α, β, γ
Area | Pure logic | MLCS | FPGA
Ratio | 1 + α | 7 + β | 30 + γ
α, γ: constant size with some allowance design
β: dynamic size with minimum design
Pure logic would be the best for processing; however, MLCS can operate in dynamic reconfiguration mode and provide a memory function.
Meisei University Confidential
Configuring from cluster to mat structure controlled by synchronous clock
(Figure: a mat (unit processor element) built from four clusters (basic cell arrays), each with its decoders and control circuit, together with cluster memory. Annotations: position of clock supply; space for wiring and TSVs connecting the clusters in a mat.)
Clock timing image for synchronous and asynchronous operation
(Figure labels: sub-processor; cluster; master clock, asynchronous from mat to mat)
• Dynamic access from mat to mat uses an asynchronous clock with dynamic reconfiguration.
• A hit signal comes from the neighboring mat through the header of a packet.
• The clock-synchronous cube is what we call a mat.
Dynamic reconfiguration algorithm
(Figure: array of mats, with the following annotations)
• The cache increases and decreases depending on the cache hit ratio.
• Cache is added by newly generated logic.
• The cache surrounds the logic.
• When the job capacity increases, the logic expands and the cache surrounds the expanded logic.
• Multiple tasks run with a shared cache.
• Adjacent addressing keeps the latency within 1 clock inside a synchronous cube.
• Of course, a mat itself can dynamically set its number of registers depending on the requirement.
• A mat can also include penetrated caches inside.
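The statement that the cache grows and shrinks with the cache hit ratio can be pictured with a toy policy. The slides do not give the actual algorithm, so everything here is an assumption for illustration: the function name `rebalance`, the thresholds 0.90 and 0.98, and the one-mat-per-step adjustment are all hypothetical.

```python
def rebalance(cache_mats, free_mats, hit_ratio, low=0.90, high=0.98):
    """One adjustment step of a hypothetical hit-ratio policy.

    Below `low`, one free mat is switched to memory mode and joins the cache.
    Above `high`, one cache mat is released for use as logic.
    Returns the new (cache_mats, free_mats) pair.
    """
    if hit_ratio < low and free_mats > 0:
        return cache_mats + 1, free_mats - 1
    if hit_ratio > high and cache_mats > 1:
        return cache_mats - 1, free_mats + 1
    return cache_mats, free_mats


if __name__ == "__main__":
    state = (2, 3)
    for observed in (0.80, 0.85, 0.95, 0.99):
        state = rebalance(*state, observed)
        print(observed, state)
```

Because every mat is the same unified block, such a policy only has to change mode registers; no data path needs to be rewired.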
Other approaches in technical papers
Memory-structured LUT presented by Masayuki Sato, RECONF Symposium, September 2006
One idea was introduced as a half-quadrate-interconnection, memory-based logic circuit in a random array; however, memories are still consumed for interconnection and switching. Rearrangement of the unit tile is now being developed by Mr. Sato and Prof. Hironaka of Hiroshima City University.
The next significant issue is power saving.
Is there a drastic power saving method?
Yes, we have one idea.
$I = mv$, $K = \frac{1}{2}mv^{2}$
(Figure: motion from start to stop; radiation of heat at the stop)
Physics of power consumption
$I = mv$, $K = \frac{1}{2}mv^{2}$ (Figure: start, stop, radiation of heat)
Power consumption on a unit circuit (RC delay circuit)
(Figure labels: voltage, current, on current, off current, $C_T$, $R_{on}$, $R_I$, $C_I$, $C_L$, $i_{max}$)
Current to waste: $i_{max} = \frac{V_{dd}}{R_{on}}$. We should recover it.
$Q\,[\mathrm{C}] = (C_T + C_L + C_I) \cdot V_{dd}$
$P\,[\mathrm{W}] = \frac{1}{2}(C_T + C_L + C_I) \cdot V_{dd}^{2}$
$v_r = V_{dd}\left(1 - \exp\left(-\frac{t}{R_{on} C_{sum}}\right)\right)$, $v_f = V_{dd}\exp\left(-\frac{t}{R_{on} C_{sum}}\right)$
$i_{ch} = i_{max}\exp\left(-\frac{t}{R_{on} C_{sum}}\right)$, $i_{dis} = i_{max}\left(1 - \exp\left(-\frac{t}{R_{on} C_{sum}}\right)\right)$
Here $C_{sum}$ denotes the summed capacitance $C_T + C_L + C_I$.
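The RC relations of this slide can be evaluated numerically with a short sketch. The component values below (10 fF per capacitance, 1 kΩ on-resistance) are hypothetical placeholders, not figures from the presentation, and the function names are illustrative. Note that the quantity the slide labels $P$ is computed here as $\frac{1}{2} C_{sum} V_{dd}^{2}$, which dimensionally is an energy in joules.

```python
import math


def rc_quantities(c_t, c_l, c_i, vdd, r_on):
    """Charge, half-C-V-squared term, peak current and time constant."""
    c_sum = c_t + c_l + c_i
    return {
        "charge": c_sum * vdd,                 # Q = C_sum * Vdd
        "half_cv2": 0.5 * c_sum * vdd ** 2,    # (1/2) * C_sum * Vdd^2
        "i_max": vdd / r_on,                   # i_max = Vdd / R_on
        "tau": r_on * c_sum,                   # RC time constant
    }


def v_rise(t, vdd, tau):
    """Rising edge: Vdd * (1 - exp(-t / tau))."""
    return vdd * (1.0 - math.exp(-t / tau))


def v_fall(t, vdd, tau):
    """Falling edge: Vdd * exp(-t / tau)."""
    return vdd * math.exp(-t / tau)


if __name__ == "__main__":
    # Hypothetical values: three 10 fF capacitances, Vdd = 1.8 V, R_on = 1 kOhm.
    q = rc_quantities(10e-15, 10e-15, 10e-15, 1.8, 1e3)
    print(q)
    print(v_rise(q["tau"], 1.8, q["tau"]))
```

At one time constant the rising edge reaches about 63% of $V_{dd}$, and the rising and falling waveforms always sum to $V_{dd}$.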
One solution can be found in the operation of an electric car.
$I = mv$, $K = \frac{1}{2}mv^{2}$
(Figure: sports EV and battery; charge by brake, discharge)
However, we would all think that a transistor cannot recover the energy of its active carriers. Is that true?
(Figure: MOS transistor cross-sections with gate (G), source (S), and drain (D), N-type regions in P-type material. Annotations: active carriers in the conduction band; gate at 0V; vacancy layer; association; carriers diffusing and shifting to the valence band; generating heat.)
Huge power!!
(Figure: power supply building)
K computer, performance: 10 PFLOPS, the largest computer in the world at present
Recovering signal energy method: active carriers reused in a differential CMOS circuit
The key structure is that the differential MOS transistors are positioned in the same well.
(Figure: layout of the differential pair. Recoverable labels: output characteristic impedance Z0=100Ω; input characteristic impedance Z0=100Ω; differential MOS's in the same well; source, gate, drain; differential pair; space 1um; dimensions 4.3um, 2um, 11.5um, 7.2um, 5um.)
Recovering signal energy method: active carriers reused in a differential CMOS I/O driver
The differential transistors are arranged in the same well.
(Figure: driver schematic and cross-section. Recoverable labels: VDD; VRF; INP, INN, OUTP, OUTN; IN-Positive, IN-Negative; input ESD; inverter; current control; output ESD; N-well, P-well, and P_SUB with n+ and p+ regions.)
Unit cell layout configuration
(Figure labels: ESD, inverter, ESD)
(Figure: carrier movement in three states: initial, transient, after inversion. Annotations: paired switch in the same well; carriers forcibly released by the capacitance change; free carriers moving to the other capacitance through the voltage sink; inductance limiting the discharge when carriers are rejected through the source or drain.)
The conditions are a hole mobility of $4\times10^{2}\ \mathrm{cm^{2}/Vs}$ at 300 K for a carrier density of $10^{14}$ to $10^{15}\ \mathrm{cm^{-3}}$, and $V_{dd} = 1.8$ V. This gives a drift speed $D = 7.2\times10^{2}\ \mathrm{cm^{2}/s}$. For a carrier travel length of 10 μm, $0.001\ \mathrm{cm} = \sqrt{Dt} = \sqrt{7.2\times10^{2}\cdot t}$, so $t = 1.3\times10^{-9}$ s = 1.3 ns, which is long compared with our target pulse rise time of 100 ps (3 GHz equivalent). The electron travel time, however, is 130 ps, which is of the order of our rise time.
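The travel-time estimate can be checked with a few lines of arithmetic. The sketch follows the slide's own relations ($D = \mu V_{dd}$ and $L = \sqrt{Dt}$); the function name is hypothetical. With the hole figures it gives about 1.4 ns, close to the 1.3 ns quoted on the slide.

```python
def travel_time(mobility_cm2_per_vs, vdd_volts, length_cm):
    """Travel time t from L = sqrt(D * t) with D = mobility * Vdd.

    Units: mobility in cm^2/Vs, Vdd in V, length in cm, result in seconds.
    """
    d = mobility_cm2_per_vs * vdd_volts   # D in cm^2/s
    return length_cm ** 2 / d


if __name__ == "__main__":
    # Hole case from the slide: 4e2 cm^2/Vs, 1.8 V, 10 um = 1e-3 cm.
    t_hole = travel_time(4e2, 1.8, 1e-3)
    print(f"D = {4e2 * 1.8:.0f} cm^2/s, t = {t_hole * 1e9:.2f} ns")
```

Since $t$ scales inversely with mobility, a carrier with ten times the mobility crosses the same length in one tenth of the time.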
Carrier reuse driver chip
(Figure: measurement setup. Recoverable labels: IC chip in a "0.18um node" conventional CMOS process; flip chip bonding; differential input; Cin=0.45pF (two places); Cip=0.47pF (two places); Cwel=1.56pF; Z0=100ohm line of 0.25mm length; substrate wiring length for differential output 8mm, Z0=100Ω; terminator 100ohm; differential probing, Z0=100ohm; R for current measurement.)
The power supply current is measured from the voltage drop across a 4.7 ohm series resistance.
We can save power with the carrier reuse circuit.
(Plot: differential inverter current depending on frequency. Vertical axis: current [mA], 0 to 14. Horizontal axis: frequency [GHz], 0.001 to 10. Annotations: "Reduction!!"; current at Vdd 1.8V; calculation current by cap.; ohmic current; DC current by current control transistors and clamping drivers on others; depressed swing height region.)
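The two currents involved in this measurement can be estimated with simple formulas. The sketch is hedged in two ways: the slides do not say how the "calculation current by cap." curve was obtained, so the common first-order estimate $I = C \cdot V_{dd} \cdot f$ is assumed here, and using the well capacitance of 1.56 pF alone as the switched capacitance is an illustrative choice. The function names are hypothetical.

```python
def capacitive_current(capacitance_f, vdd_volts, frequency_hz):
    """First-order average charging current, I = C * Vdd * f (assumed model)."""
    return capacitance_f * vdd_volts * frequency_hz


def current_from_voltage_drop(v_drop_volts, r_series_ohm=4.7):
    """Supply current from the drop across the series resistor (Ohm's law)."""
    return v_drop_volts / r_series_ohm


if __name__ == "__main__":
    for f_ghz in (0.001, 0.01, 0.1, 1.0, 10.0):
        i = capacitive_current(1.56e-12, 1.8, f_ghz * 1e9)
        print(f"{f_ghz:>6} GHz: {i * 1e3:.4f} mA")
    print(f"47 mV drop: {current_from_voltage_drop(0.047) * 1e3:.1f} mA")
```

Under this model the capacitive current rises linearly with frequency, which gives a reference against which a measured reduction can be judged.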
The random pulse eye pattern shows high speed even in the 0.18um process node.
Conditions: FR-4 substrate, transmission line=100Ω, ESDZ=50Ω, VCC=1.8V, termination=100Ω, input swing 1.8V
(Figure: test board with 4mm line, termination, and probe point; eye patterns at 8Gbps, 9Gbps, 10Gbps, 11Gbps, and 12Gbps)
A more effective carrier reuse circuit structure is the double-gate fin type.
(Figure labels: drain 1, gate 1, source 1; drain 2, gate 2, source 2; insulating layer; drain, gate, source)
Power saving image for each device when the carrier reuse transistor circuit is used
(Bar chart: relative power consumption level, initial vs. carrier reuse)
Device | Function | Power saving ratio
(1) Pure logic | ALU, peripheral, I/O | 15 to 30 %
(2) DRAM | Memory mat, addressing, I/O | 10 to 30 %
(3) SRAM | Memory mat, addressing, I/O | 25 to 45 %
(4) MLCS | M/L mat with small cell, addressing, I/O (less than SRAM due to small cell) | 30 to 50 %
Applicable to all differential circuits
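As a rough illustration, the power saving ratios can be combined with the relative power figures from the earlier comparison of pure logic, MLCS, and FPGA (1 : 2 : 20). This combination is not made in the slides; it assumes the saving ratio applies uniformly to the whole relative power of a device, and the names below are hypothetical.

```python
# Relative power for the same logic with one thread (from the comparison slide).
RELATIVE_POWER = {"Pure logic": 1.0, "MLCS": 2.0, "FPGA": 20.0}

# Power saving ratio ranges with carrier reuse (from the power saving slide).
SAVING_RANGE = {"Pure logic": (0.15, 0.30), "MLCS": (0.30, 0.50)}


def power_after_reuse(device):
    """Return (best case, worst case) relative power after carrier reuse.

    Assumption: the saving ratio scales the device's whole relative power.
    """
    base = RELATIVE_POWER[device]
    low_saving, high_saving = SAVING_RANGE[device]
    return base * (1.0 - high_saving), base * (1.0 - low_saving)


if __name__ == "__main__":
    for name in SAVING_RANGE:
        best, worst = power_after_reuse(name)
        print(f"{name}: {best:.2f} to {worst:.2f}")
```

Under that assumption, an MLCS with carrier reuse would land between roughly 1.0 and 1.4 on the same relative scale, near pure logic without carrier reuse.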
Summary for a solution
1. Previously listed task: find function and performance that outweigh the TSV area penalty
   Solution: tile or small block array structure through TSV interconnection
3. Previously listed task: build in redundancy
   Solution: unified circuit such as the memory-logic conjugate system
4. Previously listed task: choose power-saving circuits and systems
   Solution: carrier reuse transistor circuit
5. Previously listed task: effective function and performance that overcome the cost
   Solution: unified circuit such as the memory-logic conjugate system
6. Previously listed task: easy design algorithm
   Solution: unified circuit such as the memory-logic conjugate system
As in the examples in this presentation, more fundamental physics and algorithm concepts should be developed for 3D structures with TSVs.