Rotating Your Face Using Multi-Task Deep Neural Network
(CVPR 2015)
Junmo Kim
Joint work with Junho Yim, Heechul Jung, ByungIn Yoo, Changkyu Choi, Dusik Park
School of Electrical Engineering, KAIST
Samsung Advanced Institute of Technology, SAIT
2015. 04. 24
Introduction
Pose- and Illumination-invariant Face Recognition
[Figure: example face images with varying pose and with varying illumination]
DeepFace (CVPR 2014)
• Face recognition pipeline:
• Detect → align → represent → classify
• Alignment: employed explicit 3D face modeling
• Representation: a 9-layer deep neural network
• More than 120 million parameters
• Locally connected layers without weight sharing
• Trained with 4 million labeled face images
• 97.35% verification accuracy on LFW
Taigman et al. DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR’14
DeepID2 (NIPS 2014)
• 99.15% face verification accuracy on LFW
• Trained with the CelebFaces+ dataset
• 202,599 face images of 10,177 identities (celebrities)
Sun et al. Deep Learning Face Representation by Joint Identification-Verification, NIPS 2014
Identity-Preserving Face Transform (ICCV 2013)
• A deep network extracts face identity-preserving (FIP) features.
• A frontal face image is synthesized from the FIP features.
Zhu et al. Deep Learning Identity-Preserving Face Space, ICCV 2013
Proposed Model (Naïve Version)
• Pose- and illumination-invariant feature
• Input: an image with arbitrary pose under arbitrary illumination
• Output: generated images at the desired poses under frontal illumination
[Figure: input images and generated output images at poses −60°, −45°, 0°, 15°, 30°]
• Remote Code
• Represents the desired pose information
• Example: 0 0 1 0 0 means "rotate to 0°"
• Encoded with a simple repetition code
• Training
• Network: DNN with all fully connected layers (v → hidden1 → hidden2 → hidden3 → output)
• Dataset: Multi-PIE
• Input: 1. an arbitrary-pose image under arbitrary illumination, 2. the Remote Code
• Output (label): the desired-pose image under frontal illumination
[Figure: the input image and the Remote Code 0 0 1 0 0 enter the network, which produces the output image]
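The input construction above can be sketched in NumPy. This is only an illustration, not the authors' implementation: the 32×32 image size, the repetition count, the layer widths, the sigmoid activations and the random weights are all assumptions chosen for the example.

```python
import numpy as np

def remote_code(pose_index, num_poses=5, repeat=10):
    """One-hot pose code repeated `repeat` times (a simple repetition code)."""
    one_hot = np.zeros(num_poses)
    one_hot[pose_index] = 1.0
    return np.tile(one_hot, repeat)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(image, code, weights):
    """Fully connected pass: v -> hidden1 -> hidden2 -> hidden3 -> output."""
    v = np.concatenate([image.ravel(), code])  # visible layer: pixels + Remote Code
    for W, b in weights:
        v = sigmoid(W @ v + b)
    return v

rng = np.random.default_rng(0)
image = rng.random((32, 32))          # hypothetical 32x32 grayscale input
code = remote_code(pose_index=2)      # 0 0 1 0 0, i.e. "rotate to 0 degrees"

# Assumed layer widths; the slides do not give the real ones.
sizes = [image.size + code.size, 512, 512, 512, image.size]
weights = [(rng.normal(0.0, 0.01, (n_out, n_in)), np.zeros(n_out))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

output = forward(image, code, weights).reshape(image.shape)
```

Training would then compare `output` with the desired-pose image under frontal illumination.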
• Remote Code: results
[Figure: input images paired with Remote Codes 01000 and 00100, and the resulting output images]
Generating Multi-View Representation (NIPS 2014)
• Z. Zhu et al. "Deep Learning Multi-View Representation for Face Recognition", NIPS 2014
• Can generate multi-view images from a single-view face image
• Several candidate face images are generated and the best fit is selected
[Figure: the inputs (first column) and the multi-view outputs (remaining columns) of two identities]
[Figure: model diagram whose variables denote the input image, the output image, the view label of the output image, and the random binary hidden neurons]
Proposed Model (Naïve Version)
• Remote Code: results
[Figure: outputs generated from one input for Remote Codes 10000, 01000, 00100, 00010, 00001]
• Problem
• Identity is not well preserved → introduce one more task
• May overfit because all layers are fully connected → use locally connected layers
Proposed Model (Refined Version)
Multi-task Learning
• General Multi-task Learning
• Some layers are shared to learn common features
• The remaining layers are split into task-specific branches
S. Li, Z.-Q. Liu, and A. B. Chan. Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pages 488–495. IEEE, 2014.
• Proposed model
[Figure: proposed network. Layers: v, hidden1, hidden2, hidden3, hidden4, output. Inputs: input image and Remote Code (0 0 1 0 0). Outputs: the output image and the input image]
• Proposed idea
• Preserve identity by adding a second task: reconstructing the input image
Proposed Model (Refined Version)
• Proposed model
• Part 1 : Feature extraction part
• Part 2 : Feature rotation part
• Part 3 : Imaging part
• Part 4 : Reconstruction part
• Interesting points
• Locally connected layer
• Uses fewer weights than a fully connected layer
• More suitable than a convolutional layer
• Early fully connected layer
• Changes the features to contain the target pose information
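To make the weight-count claim concrete, the sketch below counts weights for the three layer types on the same input. The 60×60 single-channel input, 5×5 kernel, stride 1, no padding and 32 output maps are assumed values, not the sizes used in the talk, and biases are ignored.

```python
def fully_connected_weights(in_h, in_w, out_h, out_w, out_maps):
    # Every output unit is connected to every input pixel.
    return (in_h * in_w) * (out_h * out_w * out_maps)

def locally_connected_weights(out_h, out_w, out_maps, k):
    # Every output unit has its own k x k filter (no weight sharing).
    return out_h * out_w * out_maps * k * k

def convolutional_weights(out_maps, k):
    # One k x k filter per output map, shared across all positions.
    return out_maps * k * k

in_h = in_w = 60                 # assumed input size
k = 5                            # assumed kernel size
out_maps = 32                    # assumed number of output maps
out_h = out_w = in_h - k + 1     # 56 with stride 1 and no padding

fc = fully_connected_weights(in_h, in_w, out_h, out_w, out_maps)   # 361267200
lc = locally_connected_weights(out_h, out_w, out_maps, k)          # 2508800
conv = convolutional_weights(out_maps, k)                          # 800
```

The locally connected layer sits between the two extremes: far fewer weights than the fully connected layer, while still keeping position-specific filters, which a convolutional layer gives up by sharing weights.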
• Remote Code
• Remote code: [Equation: definition of the Remote Code]
• Experimentally, the way the code is embedded in the image is not important
• Recon Code
• The Recon code contains a pose part and an illumination part
• Multi-task Learning
• Cost function: squared L2 norm
• Main task: generate the desired-pose image
• Auxiliary task: reconstruct the input image
• Total cost function: [Equation: combination of the main-task and auxiliary-task costs]
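A minimal sketch of how the two squared L2 costs might be combined. The weighting factors `lambda_main` and `lambda_aux` and the plain weighted sum are assumptions made for illustration; the exact formula is not recoverable from the slide.

```python
import numpy as np

def squared_l2(prediction, target):
    """Squared L2 norm of the difference between two images."""
    diff = np.asarray(prediction, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sum(diff ** 2))

def total_cost(generated, target_pose_image, reconstructed, input_image,
               lambda_main=1.0, lambda_aux=1.0):
    """Weighted sum of the main-task and auxiliary-task costs (assumed form)."""
    main = squared_l2(generated, target_pose_image)   # main task: desired-pose image
    aux = squared_l2(reconstructed, input_image)      # auxiliary task: reconstruct the input
    return lambda_main * main + lambda_aux * aux

ones = np.ones((2, 2))
zeros = np.zeros((2, 2))
cost = total_cost(ones, zeros, ones, ones)   # main = 4.0, aux = 0.0
```

With a weighted sum like this, the relative weight of the auxiliary term would control how strongly identity preservation is enforced during training.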
Experimental Result
Experimental setting
• Multi-PIE dataset
• 337 subjects, 15 different poses, 20 illumination conditions
• Setting 1
• 7 different poses (−45° to 45°) under 20 different illuminations
• Training: 100 subjects (14000 images)
• Testing: 149 subjects (16986 images, excluding 0° pose and frontal illumination)
• Gallery image: each subject's frontal image under frontal illumination
• Setting 2
• 9 different poses (−60° to 60°) under 20 different illuminations
• Training: 200 subjects (36000 images)
• Testing: 137 subjects (24660 images)
• Gallery image: each subject's frontal image under frontal illumination
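The image counts follow from simple multiplication. The check below assumes one image per subject, pose and illumination combination, and reads the Setting 1 test exclusion as dropping every image that has a 0° pose or frontal illumination; both readings are assumptions, but they reproduce the stated counts.

```python
def image_count(subjects, poses, illuminations):
    """Number of images, assuming one image per subject/pose/illumination."""
    return subjects * poses * illuminations

# Setting 1
setting1_train = image_count(100, 7, 20)           # 14000
# Test set without the 0-degree pose and without frontal illumination:
# 6 remaining poses x 19 remaining illuminations per subject.
setting1_test = image_count(149, 7 - 1, 20 - 1)    # 16986

# Setting 2
setting2_train = image_count(200, 9, 20)           # 36000
setting2_test = image_count(137, 9, 20)            # 24660
```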
• Face Recognition
• Find the nearest neighbor (NN) of each synthesized image among the gallery images using the L2 norm
[Figure: synthesized images of a test subject are compared with gallery images n001 to n149 (for example: person 1, frontal, illumination 7; person 2, frontal, illumination 7); in the example shown the result is n001]
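The nearest-neighbor matching step can be sketched as follows. The array shapes, the random stand-in data and the function name `recognize` are hypothetical; the same code would apply whether the compared vectors are images or features.

```python
import numpy as np

def recognize(synthesized, gallery, gallery_ids):
    """Return, for each synthesized image, the id of the closest gallery image (L2 norm)."""
    probes = synthesized.reshape(len(synthesized), -1)
    refs = gallery.reshape(len(gallery), -1)
    # Pairwise L2 distances, shape (num_probes, num_gallery).
    dists = np.linalg.norm(probes[:, None, :] - refs[None, :, :], axis=2)
    return [gallery_ids[i] for i in dists.argmin(axis=1)]

rng = np.random.default_rng(0)
gallery_ids = ["n001", "n002", "n003"]
gallery = rng.random((3, 8, 8))      # hypothetical frontal gallery images
# Slightly perturbed copies stand in for synthesized frontal images.
probes = gallery[[0, 2]] + 0.01 * rng.random((2, 8, 8))

predicted = recognize(probes, gallery, gallery_ids)
```

A test image counts as correctly recognized when the predicted id equals the subject's true id.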
Feature space
[Figure: feature space for target poses 0°, 15°, 30°, 45°, 60° and test images at poses −60°, −45°, −30°, −15°]
[Figure: feature spaces of CPF and CPI]
Comparison with the state-of-the-art
• Face Recognition : Setting 1
• 7 poses, 20 illuminations
• 100 subjects training, 149 subjects test (16986 images)
• Recognition rates (%) for the various illuminations
Illumination  0     1     2     3     4     5     6     8     9     10
Z.Zhu[1]      72.8  75.8  75.8  75.7  75.7  75.7  75.7  75.7  75.7  75.7
CPI           66.0  62.6  69.6  73.0  79.1  84.5  86.6  86.5  84.2  80.2
CPF           59.7  70.6  76.3  79.1  85.1  89.4  91.3  92.3  90.6  86.5

Illumination  11    12    13    14    15    16    17    18    19    Avg.
Z.Zhu[1]      75.7  75.7  75.7  73.4  73.4  73.4  73.4  72.9  72.9  74.7
CPI           76.0  70.8  65.7  76.1  78.2  80.7  79.4  77.3  65.4  75.9
CPF           81.2  77.5  72.8  82.3  84.2  86.5  85.9  82.9  59.2  80.7
[1] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning identity preserving face space. In ICCV, 2013.
• Face Recognition : Setting 1
• Recognition rates (%) for the various poses (0° is excluded from the test set)
Pose       -45    -30    -15    0    15     30     45     Avg.
Z.Zhu[1]   67.1   74.6   86.1   -    83.3   75.3   61.8   74.7
CPI        66.6   78.0   87.3   -    85.5   75.8   62.3   75.9
CPF        73.0   81.7   89.4   -    89.5   80.4   70.3   80.7
• Face Recognition : Setting 2
• 9 poses, 20 illuminations
• 200 subjects training, 137 subjects test (24660 images)
• Recognition rates (%) for the various poses
Pose       -60    -45    -30    -15    0      15     30     45     60     Avg.
Z.Zhu[1]   44.6   63.6   77.5   90.5   94.3   89.8   80.0   56.5   38.9   70.8
Z.Zhu[2]   60.2   75.2   83.4   93.3   95.7   92.2   83.9   70.6   60.0   79.3
CPI        55.8   71.8   80.0   90.1   98.4   90.2   82.7   71.0   52.9   77.0
CPF        63.2   80.4   88.1   94.5   99.5   95.4   88.9   79.4   60.6   83.3
[2] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning multi-view representation for face recognition. In NIPS, 2014.
Comparison with the Single task
• Superiority of the Multi-task learning
[Figure: network trained with multi-task learning versus network trained with single-task learning]
• Face Recognition : Setting 1
• Recognition rates (%) for the various illuminations
Illumination  0     1     2     3     4     5     6     8     9     10
Single        45.4  64.3  72.9  74.9  82.0  86.9  89.8  89.7  87.9  81.7
Multi         59.7  70.6  76.3  79.1  85.1  89.4  91.3  92.3  90.6  86.5

Illumination  11    12    13    14    15    16    17    18    19    Avg.
Single        76.5  72.2  66.7  76.9  80.9  82.7  79.9  76.5  47.1  75.5
Multi         81.2  77.5  72.8  82.3  84.2  86.5  85.9  82.9  59.2  80.7
Comparison with the Late fully connected layer
• Superiority of the Early fully connected layer
• Face Recognition : Setting 2
Early fully connected layer: 83.3%
Late fully connected layer: 77.0%
Comparison with the general multi-task learning
• Superiority of the proposed Multi-task learning
[Figure: network diagrams of the proposed model and of three general multi-task variants (General 1, General 2, General 3)]
• Face Recognition : Setting 2
• Recognition rates (%) for the various poses
Pose       -60    -45    -30    -15    0      15     30     45     60     Avg.
Proposed   63.2   80.4   88.1   94.5   99.5   95.4   88.9   79.4   60.6   83.3
General1   26.9   61.7   73.2   87.4   99.1   87.5   72.4   59.2   28.1   66.3
General2   34.7   64.5   74.1   80.8   96.9   85.3   75.2   67.5   32.9   68.0
General3   56.4   75.9   85.0   92.5   98.4   92.7   83.7   76.0   56.6   79.7
• Generated Images
[Figure: examples of generated images]
Conclusion
• We propose a novel type of network that synthesizes the desired pose image according to a user-supplied Remote Code
• We propose a novel type of multi-task network that preserves identity better
• The model outperforms the previous state-of-the-art models by about 4 to 6 percentage points in average recognition rate
Rotating My Face with Multi-Task DNN
[Figure: original image, aligned face, and generated faces at -45°, -30°, -15°, 0°, 15°, 30°, 45°]
Appendix 1
• Z. Zhu et al. "Deep Learning Multi-View Representation for Face Recognition", NIPS 2014
• Model variables: input image, output image, view label of the output image, random binary hidden neurons
• The hidden neurons are sampled from a distribution
• Training: the model is learned by maximizing the data log-likelihood
Appendix 2
• Percentage of CPFs that contribute to the final results
Appendix 3
• The feature space of 6000 features.
• Each pale dot color represents a different Remote Code.
• Dots of the same deep color represent features from a single identity.
Appendix 4
• Pose estimation
• LR : Linear Regression
• SVR : Support Vector Regression
Pose estimation error. Columns: Z.Zhu[2], LR, SVR; rows: Setting1, Setting2; values as extracted: 5.92, 9.79, 5.45, 5.56, 4.29
[2] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning multi-view representation for face recognition. In NIPS 2014