DL Hacks paper reading
How transferable are features in deep neural networks?
Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson
NIPS 2014
Background
•  In a deep neural network, first-layer features are general: they resemble Gabor filters and color blobs and appear regardless of the dataset or task
•  Last-layer features are specific: they depend strongly on the chosen dataset and task
•  Somewhere between the first and the last layer, features must transition from general to specific
•  Feature visualizations (general lower layers, specific higher layers):
   http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/CVPR2012-Tutorial_lee.pdf
Questions the paper asks:
1. Can we quantify the degree to which a particular layer of a DNN is general or specific?
2. Does the transition from general to specific occur suddenly at a single layer, or is it spread out over several layers?
3. Where does this transition take place: near the first, middle, or last layer of the network?
•  If early-layer features are general, they can be learned on a base task and reused on a different target task
•  Reusing features across tasks in this way is transfer learning [Pan et al. 2010]
•  In the taxonomy of [Pan et al. 2010], this setting (labeled data on a target task that differs from the base task) is inductive transfer learning
•  After copying the first layers into the target network, there are two options for them:
   1. fine-tune: keep training the copied layers with backprop on the target task
   2. frozen: leave the copied layers fixed and train only the remaining layers
•  Which option works better typically depends on the target data: fine-tuning can overfit a small target dataset, while a large target dataset benefits from it
•  The paper studies transfer between two tasks, A and B
Experimental setup
•  Split the classes into halves A and B and train two 8-layer base networks, baseA and baseB
•  Pick a layer n from {1, 2, ..., 7}: the first n layers are copied from a trained base network, the remaining 8 − n layers are initialized randomly, and the new network is trained on task B
•  Selffer network BnB (e.g. B3B): the first n (= 3) layers come from baseB and are frozen; only the upper layers are retrained on B. Since the task does not change, this is a control
•  Transfer network AnB (e.g. A3B): the first n layers come from baseA and are frozen; the upper layers are retrained on B
•  frozen vs. fine-tune: in the "+" variants BnB+ and AnB+ the copied layers are not frozen but fine-tuned along with the rest during training on B
•  A code sketch of this construction is given below
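The construction of the treatments can be summarized in code. This is a minimal sketch assuming PyTorch and per-network lists of 8 layer modules; the authors' actual implementation uses Caffe (code at http://yosinski.com/transfer), so every name below is illustrative only.

```python
# Minimal sketch (PyTorch stand-in, NOT the authors' Caffe code): build a
# BnB / AnB style network by copying the first n trained blocks from a donor
# network, re-initializing the rest, and optionally freezing the copied part.
import copy
import torch
import torch.nn as nn

def make_treatment_net(donor_blocks, fresh_blocks, n, frozen=True):
    """donor_blocks: trained layer modules from baseA (-> AnB) or baseB (-> BnB).
    fresh_blocks:   randomly initialized modules of the same architecture.
    n:              number of leading layers to copy (1..7).
    frozen=True  -> copied layers stay fixed (AnB / BnB).
    frozen=False -> copied layers are fine-tuned as well (AnB+ / BnB+)."""
    copied = [copy.deepcopy(b) for b in donor_blocks[:n]]
    rest = [copy.deepcopy(b) for b in fresh_blocks[n:]]
    net = nn.Sequential(*(copied + rest))
    if frozen:
        for block in copied:
            for p in block.parameters():
                p.requires_grad = False   # keep transferred features fixed
    return net

# Example usage (baseA_blocks / baseB_blocks assumed trained beforehand):
# b3b  = make_treatment_net(baseB_blocks, fresh_blocks, n=3, frozen=True)   # selffer B3B
# a3bp = make_treatment_net(baseA_blocks, fresh_blocks, n=3, frozen=False)  # transfer A3B+
# When training with frozen layers, give the optimizer only trainable parameters:
# opt = torch.optim.SGD([p for p in b3b.parameters() if p.requires_grad], lr=0.01)
```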
[Figure 1: Overview of the experimental treatments and controls. Top two rows: the base networks are trained using standard supervised backprop on only half of the ImageNet dataset (first row: A half, second row: B half). The labeled rectangles (e.g. WA1) represent the weight vector learned for that layer.]
•  How to read the results: if A3B performs as well as baseB on task B, the first 3 layers are general (they work just as well for B even though they were learned on A)
•  If A3B performs worse than baseB, the first 3 layers are specific to task A
Dataset
•  ImageNet [Deng et al., 2009]: 1000 classes, 1,281,167 training images, 50,000 validation images
•  Random A/B split: the 1000 classes are randomly split into two groups of 500 classes, each with about 645,000 training examples
•  ImageNet contains clusters of similar classes (e.g. 13 felid classes); under a random split, on average about 6 or 7 of those 13 classes land in each half, so A and B remain fairly similar
•  Dissimilar split: all man-made classes (551) go to one half and all natural classes (449) to the other
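A minimal sketch of the random A/B class split (plain Python; the helper name and seed are illustrative assumptions, not the paper's code):

```python
import random

def random_ab_split(class_ids, seed=0):
    """Shuffle the 1000 ImageNet class ids and assign 500 to each half."""
    ids = list(class_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return set(ids[:half]), set(ids[half:])   # classes for dataset A, dataset B

# classes_A, classes_B = random_ab_split(range(1000))
# The dissimilar split instead assigns the 551 man-made classes to one dataset
# and the 449 natural classes to the other, following the WordNet hierarchy.
```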
Implementation
•  Networks are trained with Caffe [Jia et al., 2014]
•  Code to reproduce the results, including an IPython notebook, is available at http://yosinski.com/transfer
Results (Figure 2)
[Figure 2 plot: top-1 accuracy (higher is better) vs. layer n at which the network is chopped and retrained; series: baseB, selffer BnB, selffer BnB+, transfer AnB, transfer AnB+]
Figure 2: The results from this paper's main experiment. Top: Each marker in the figure represents the average accuracy over the validation set for a trained network. The white circles above n = 0 represent the accuracy of baseB. There are eight points, because we tested on four separate random A/B splits. Each dark blue dot represents a BnB network. Light blue points represent BnB+ networks, or fine-tuned versions of BnB. Dark red diamonds are AnB networks, and light red diamonds are the fine-tuned AnB+ versions. Points are shifted slightly left or right for visual clarity. Bottom: Lines connecting the means of each treatment. Numbered descriptions above each line refer to which interpretation from Section 4.1 applies.
•  Experiments are repeated on four random A/B splits
•  Because the splits are random, the roles of A and B are symmetric (the selffer control AnA corresponds to BnB); the plot shows baseB, BnB, BnB+, AnB, and AnB+ for n = 1..7
Observation 1: baseB
•  The baseB networks classify among 500 classes and reach a top-1 error of about 37.5%
•  This is better than the roughly 42.5% top-1 error typical of comparable networks trained on all 1000 ImageNet classes, presumably because discriminating 500 classes is an easier task
Observation 2: BnB (frozen)
•  Surprisingly, performance drops even without changing the task: BnB accuracy falls below baseB in the middle layers, especially around n = 4 and 5
•  Interpretation: adjacent layers of the original network learn fragile, co-adapted features; when the network is chopped in the middle and the lower layers are frozen, the layers retrained on top cannot relearn those co-adapted interactions
•  Toward n = 6 and 7 there is less left to relearn above the frozen part, so performance recovers to near baseB
Observation 3: BnB+ (fine-tuned)
•  When the copied layers are fine-tuned instead of frozen, BnB+ performs on par with baseB at every layer
•  Interpretation: fine-tuning lets the network recover the co-adapted interactions that freezing had destroyed
Observation 4: AnB (frozen transfer)
•  Layers 1 and 2 transfer almost perfectly: AnB matches baseB, so these features are general
•  Around layers 3 and 4 performance drops, mainly due to the same fragile co-adaptation seen with BnB (interpretation 2)
•  From about layer 5 through 7 performance drops further, mainly because the higher-layer features are specific to task A (interpretation 4)
Observation 5: AnB+ (transfer + fine-tune)
•  AnB+ networks generalize better than baseB: transferring features and then fine-tuning them boosts accuracy on the target task
•  The boost survives even though the target dataset is large and the networks are trained on it for a long time; fine-tuning does not wash out the effect of the transferred initialization
•  The boost grows when more layers are transferred:

Table 1: Performance boost of AnB+ over controls, averaged over different layers.
layers aggregated | mean boost over baseB | mean boost over selffer BnB+
1-7               | 1.6%                  | 1.4%
3-7               | 1.8%                  | 1.4%
5-7               | 2.1%                  | 1.7%
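Presumably "aggregated over layers k-7" is just the average accuracy gap over the listed values of n (with averaging over the random splits implicit); for the baseB column this reading is:

```latex
\text{mean boost over baseB (layers } k\text{--}7)
  = \frac{1}{8-k}\sum_{n=k}^{7}\left[\operatorname{acc}(\mathrm{A}n\mathrm{B}^{+}) - \operatorname{acc}(\mathrm{baseB})\right],
  \qquad k \in \{1, 3, 5\}
```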
4.2 Dissimilar Datasets: Splitting Man-made and Natural Classes Into Separate Datasets
As mentioned previously, the effectiveness of feature transfer is expected to decline as the base and target tasks become less similar. We test this hypothesis by comparing transfer performance on similar datasets (the random A/B splits discussed above) to that on dissimilar datasets, created by assigning man-made object classes to A and natural object classes to B. This man-made/natural split creates datasets as dissimilar as possible within the ImageNet dataset.
The upper-left subplot of Figure 3 shows the accuracy of a baseA and baseB network (white circles) and BnA and AnB networks (orange hexagons). Lines join common target tasks. The upper of the two lines contains those networks trained toward the target task containing natural categories (baseB and AnB). These networks perform better than those trained toward the man-made categories, which may be due to having only 449 classes instead of 551, or simply being an easier task, or both.
4.3 Random Weights
We also compare to random, untrained weights because Jarrett et al. (2009) showed, quite strikingly, that the combination of random convolutional filters, rectification, pooling, and normalization can work almost as well as learned features. They reported this result on relatively small networks of two or three learned layers and on the smaller Caltech-101 dataset. It is natural to ask whether or not the nearly optimal performance of random filters they report carries over to a deeper network trained on a larger dataset.
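A minimal sketch of the random-weights control, under the same PyTorch stand-in assumptions as the earlier snippet (the paper's experiments use Caffe):

```python
import torch.nn as nn

def make_random_frozen_net(fresh_blocks, n):
    """First n layers keep their random initialization and are never trained;
    only the remaining layers are trained on the target task."""
    net = nn.Sequential(*fresh_blocks)
    for block in fresh_blocks[:n]:
        for p in block.parameters():
            p.requires_grad = False   # random, untrained, frozen filters
    return net
```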
[Figure 2, bottom panel: lines connecting the means of each treatment (top-1 accuracy vs. layer n at which the network is chopped and retrained), annotated with the interpretations from Section 4.1: (2) performance drops due to fragile co-adaptation; (3) fine-tuning recovers co-adapted interactions; (4) performance drops due to representation specificity; (5) transfer + fine-tuning improves generalization.]
Dissimilar split: results
[Figure 3, upper-left panel: "Man-made/Natural split", top-1 accuracy vs. layer n; series: baseA, baseB, BnA, AnB]
•  baseA is trained on the man-made half (551 classes), baseB on the natural half (449 classes)
•  Networks whose target task is the natural half (baseB and AnB) perform better than those whose target is the man-made half (baseA and BnA)
•  This may simply be because the natural half has fewer classes (449 vs. 551), or because it is an easier task
Random weights: background
•  [Jarrett et al. 2009] reported that random, untrained filters combined with rectification, pooling, and normalization can work almost as well as learned features
•  However, that result was obtained on small networks with only 2-3 learned layers and on the small Caltech-101 dataset
•  Does it still hold for a deep (8-layer) network trained on a large dataset?
[Figure 3, upper-right panel: "Random, untrained filters", top-1 accuracy vs. layer n]
•  With random frozen filters in layers 1 and 2, the network still reaches moderate accuracy
•  From layer 3 onward, accuracy with random filters falls off quickly
•  So, unlike the shallow networks of [Jarrett et al. 2009], random filters are clearly worse than learned (transferred) features in this deeper network
[Figure 3, bottom panel: relative top-1 accuracy (higher is better) vs. layer n at which the network is chopped and retrained; series: reference, mean AnB (random splits), mean AnB (man-made/natural split), random features]
Figure 3: Performance degradation vs. layer. Top left: Degradation when transferring between dissimilar tasks (from man-made classes of ImageNet to natural classes or vice versa). The upper line connects networks trained to the "natural" target task, and the lower line connects those trained toward the "man-made" target task. Top right: Performance when the first n layers consist of random, untrained weights. Bottom: The top two plots compared to the random A/B split from Section 4.1 (red diamonds), all normalized by subtracting their base level performance.
A possible explanation for the difference from Jarrett et al. (2009) is that their networks were trained on the much smaller Caltech-101 dataset, which is easier to overfit, making their random filters perform better by comparison. In the supplementary material, we provide an extra experiment indicating the extent to which our networks are overfit.
5 Conclusions
We have demonstrated a method for quantifying the transferability of features from each layer of a neural network, which reveals their generality or specificity. We showed how transferability is negatively affected by two distinct issues: optimization difficulties related to splitting networks in the middle of fragilely co-adapted layers and the specialization of higher layer features to the original task at the expense of performance on the target task.

Summary
•  Lower-layer features of a deep NN are general and transfer well; even with frozen transferred layers, performance stays near the baseline up to about layer 2
•  Transferability is hurt by two separate effects: fragile co-adaptation when a network is split in the middle, and the specificity of higher-layer features to the original task
•  Fine-tuning the transferred layers repairs both problems; moreover, transferring features and then fine-tuning gives better generalization on the target task than training from scratch, even with a large target dataset
•  Transferability declines as the base and target tasks become more dissimilar, but even features transferred from a distant task are better than random filters
•  Note: the experiments are compute-heavy (about 9.5 days of GPU time per trained network)
References
•  R. Caruana, "Multitask learning," Mach. Learn., vol. 28, no. 1, pp. 41–75, 1997.
•  S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, 2010.
Deep Learning
•  Y. Bengio, "Deep Learning of Representations for Unsupervised and Transfer Learning," JMLR Workshop and Conference Proceedings, vol. 7, pp. 1–20, 2011.
•  Code and networks for this paper: http://yosinski.com/transfer
•  CVPR 2012 Tutorial: Deep Learning Methods for Vision
   http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/CVPR2012-Tutorial_lee.pdf