DL Hacks
How transferable are features in deep neural networks?
Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson
NIPS 2014

• The features learned by a DNN move from general to specific: layer 1 is general, the last layer is specific.
  (Illustration: http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/CVPR2012-Tutorial_lee.pdf)
• Questions: how general or specific is each layer, and where in the DNN does the switch from general to specific happen?
• If features learned on a base task are general, they can be reused for a target task. This is transfer learning (inductive transfer learning in the terms of [Pan et al. 2010]).
• There are 2 ways to handle the transferred layers:
  1. fine-tune: keep training them on the target task
  2. frozen: keep them fixed

Experimental setup
• Split the dataset into two halves, A and B, and train an 8-layer network on each: baseA and baseB.
• For each n in {1, 2, …, 7}, copy the first n layers from a base network and train the remaining layers on B from random initialization. Example with n = 3:
  • B3B: first 3 layers copied from baseB and frozen (selffer network)
  • A3B: first 3 layers copied from baseA and frozen (transfer network)
  • B3B+ and A3B+: the same as B3B and A3B, but the copied layers are fine-tuned instead of frozen
• If A3B performs as well as baseB, the first 3 layers are general, at least with respect to B. If A3B performs worse than baseB, the first 3 layers are specific to A.

Figure 1: Overview of the experimental treatments and controls. Top two rows: The base networks are trained using standard supervised backprop on only half of the ImageNet dataset (first row: A half, second row: B half). The labeled rectangles (e.g. W_A1) represent the weight vector learned for …

Dataset and implementation
• ImageNet [Deng et al., 2009]: 1000 classes, 1,281,167 training images, 50,000 validation images.
• Random split: A and B get 500 classes each, about 645,000 images each.
• ImageNet contains clusters of similar classes. For a cluster of 13 closely related classes, a random split puts about 6 or 7 of them in each half, so A and B are similar tasks.
• Dissimilar split: man-made (551 classes) versus natural (449 classes).
• Implemented in Caffe [Jia et al., 2014]. Caffe files and an ipython notebook: http://yosinski.com/transfer
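The four treatments described above (BnB, AnB and their fine-tuned "+" versions) can be sketched as a small data structure. This is only an illustration of the naming scheme and of which layers are copied, frozen, or retrained. The names `Treatment`, `make_treatments` and `layer_plan` are hypothetical and are not taken from the paper's Caffe code.

```python
from dataclasses import dataclass

NUM_LAYERS = 8  # the base networks in the slides have 8 layers


@dataclass(frozen=True)
class Treatment:
    """One experimental treatment (hypothetical helper, not from the paper)."""
    name: str        # e.g. "A3B+"
    source: str      # base network the first n layers are copied from: "A" or "B"
    n: int           # number of copied layers
    fine_tune: bool  # True: copied layers keep training; False: they are frozen

    def layer_plan(self):
        """Return one (initialization, trainable) pair per layer, layers 1..8."""
        plan = []
        for layer in range(1, NUM_LAYERS + 1):
            if layer <= self.n:
                # Copied layers: trainable only in the fine-tuned "+" variants.
                plan.append((f"copy from base{self.source}", self.fine_tune))
            else:
                # Upper layers: always re-initialized and trained on B.
                plan.append(("random init", True))
        return plan


def make_treatments(n):
    """Build the selffer and transfer treatments for a given cut layer n."""
    if not 1 <= n <= NUM_LAYERS - 1:
        raise ValueError("n must be in {1, ..., 7}")
    return [
        Treatment(f"B{n}B", "B", n, fine_tune=False),   # selffer, frozen
        Treatment(f"B{n}B+", "B", n, fine_tune=True),   # selffer, fine-tuned
        Treatment(f"A{n}B", "A", n, fine_tune=False),   # transfer, frozen
        Treatment(f"A{n}B+", "A", n, fine_tune=True),   # transfer, fine-tuned
    ]


if __name__ == "__main__":
    for treatment in make_treatments(3):
        print(treatment.name, treatment.layer_plan())
```

In every treatment the target task is B. Only the origin of the first n layers and whether they stay trainable change, which is what lets the BnB runs act as a control for the AnB runs.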
Results (random A/B splits)

[Figure 2, top panel. y-axis: top-1 accuracy (higher is better). x-axis: layer n at which the network is chopped and retrained. Legend: baseB, selffer BnB, selffer BnB+, transfer AnB, transfer AnB+.]

Figure 2: The results from this paper's main experiment. Top: Each marker in the figure represents the average accuracy over the validation set for a trained network. The white circles above n = 0 represent the accuracy of baseB. There are eight points, because we tested on four separate random A/B splits. Each dark blue dot represents a BnB network. Light blue points represent BnB+ networks, or fine-tuned versions of BnB. Dark red diamonds are AnB networks, and light red diamonds are the fine-tuned AnB+ versions. Points are shifted slightly left or right for visual clarity. Bottom: Lines connecting the means of each treatment. Numbered descriptions above each line refer to which interpretation from Section 4.1 applies.

• AnA and BnB are equivalent (both are trained on 500 random classes), so both are reported as BnB.

1. baseB
• On the 500-class task, top-1 error is 37.5%.
• This is lower than the 42.5% top-1 error of the network trained on all 1000 classes.

2. BnB
• Performance drops at layers 4 and 5.
• Neighboring layers of the NN are co-adapted. When the lower layers copied from baseB are frozen, the upper layers cannot relearn this co-adaptation.
• At layers 6 and 7 performance recovers.
3. BnB+
• Accuracy is the same as baseB: fine-tuning removes the drop seen for BnB.

4. AnB
• Layers 1 and 2 transfer with almost no loss, so the features are general up to layer 2.
• From layer 3 on, performance drops.
• Layers 3 and 4: the drop comes from lost co-adaptation.
• Layers 5 to 7: the drop comes from co-adaptation and from the features being specific.
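One rough way to read the BnB and AnB curves together is to split the AnB drop at a given layer n into two parts, using BnB as the control. This is an illustrative decomposition under the assumption that the two effects simply add. The notation acc(·) for mean top-1 accuracy is introduced here and is not a formula from the paper.

```latex
\underbrace{\mathrm{acc}(\text{baseB}) - \mathrm{acc}(\text{A}n\text{B})}_{\text{total drop at layer } n}
=
\underbrace{\mathrm{acc}(\text{baseB}) - \mathrm{acc}(\text{B}n\text{B})}_{\text{lost co-adaptation}}
+
\underbrace{\mathrm{acc}(\text{B}n\text{B}) - \mathrm{acc}(\text{A}n\text{B})}_{\text{feature specificity}}
```

Because BnB keeps features learned on B itself, any drop it shows cannot be due to specificity. Whatever extra drop AnB shows beyond BnB is then attributed to features that are specific to A.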
5. AnB+
• Transferring features and then fine-tuning them gives better accuracy than baseB, and also better than the BnB+ control.

Table 1: Performance boost of AnB+ over controls, averaged over different …

layers aggregated   mean boost over baseB   mean boost over selffer BnB+
1-7                 1.6%                    1.4%
3-7                 1.8%                    1.4%
5-7                 2.1%                    1.7%

4.2 Dissimilar Datasets: Splitting Man-made and Natural Classes
The effectiveness of feature transfer is expected to decline as the base and target tasks become less similar. The paper tests this by comparing transfer between similar datasets (the random A/B splits discussed above) with transfer between dissimilar datasets, created by assigning man-made object classes to A and natural object classes to B. This creates datasets as dissimilar as possible within the ImageNet dataset. The upper-left subplot of Figure 3 shows the accuracy of baseA and baseB networks and of BnA and AnB networks (orange hexagons). Lines join common target tasks. The networks trained toward the target task containing natural classes perform better than those trained toward the man-made task, which may be due to having only 449 classes instead of 551, or simply to being an easier task.

4.3 Random Weights
The paper also compares to random, untrained weights, because Jarrett et al. (2009) showed that the combination of random convolutional filters, rectification, pooling, and normalization can work almost as well as learned features.
They reported this result on networks of two or three learned layers and on the smaller Caltech-101 dataset. It is natural to ask whether the nearly optimal performance of random filters carries over to a deeper network trained on a larger dataset.

[Figure 2, bottom panel. Lines connecting the means of each treatment, annotated with the numbered interpretations:
 2: Performance drops due to fragile co-adaptation
 3: Fine-tuning recovers co-adapted interactions
 4: Performance drops due to representation specificity
 5: Transfer + fine-tuning improves generalization
 y-axis: top-1 accuracy (higher is better). x-axis: layer n at which the network is chopped and retrained.]

Man-made/Natural split
[Figure 3, top left. y-axis: top-1 accuracy. Plotted: baseA, baseB, BnA, AnB.]
• baseA: man-made, baseB: natural
• Networks trained toward natural perform better than those trained toward man-made.
• natural has 449 classes, man-made has 551.

Random, untrained filters
[Figure 3, top right.]
• [Jarrett et al. 2009]: random filters with rectification worked almost as well as learned features, for NNs with 2 or 3 layers on Caltech-101.
• Here accuracy falls quickly at layers 1 and 2 and is close to chance from layer 3 on, so the result of [Jarrett et al. 2009] does not carry over directly to this deeper NN.

[Figure 3, bottom. y-axis: relative top-1 accuracy (higher is better). x-axis: layer n at which the network is chopped and retrained. Legend: reference; mean AnB, random splits; mean AnB, m/n split; random features.]

Figure 3: Performance degradation vs. layer. Top left: Degradation when transferring between dissimilar tasks (from man-made classes of ImageNet to natural classes or vice versa).
The upper line connects networks trained to the "natural" target task, and the lower line connects those trained toward the "man-made" target task. Top right: Performance when the first n layers consist of random, untrained weights. Bottom: The top two plots compared to the random A/B split from Section 4.1 (red diamonds), all normalized by subtracting their base level performance.

• [Jarrett et al. 2009] used Caltech-101. From the paper: "… dataset, making their random filters perform better by comparison. In the supplementary material, we provide an extra experiment indicating the extent to which our networks are overfit."

5 Conclusions
We have demonstrated a method for quantifying the transferability of features from each layer of a neural network, which reveals their generality or specificity. We showed how transferability is negatively affected by two distinct issues: optimization difficulties related to splitting networks in the middle of fragilely co-adapted layers and the specialization of higher layer features to the original …

Summary
• The method shows how general or specific each layer of an NN is.
• With frozen layers, 2 issues reduce performance: lost co-adaptation and specific features.
• Fine-tuning recovers the performance, even when the transferred features are specific.
• DL experiments are expensive: 9.5 days on a GPU.

References
• R. Caruana, "Multitask learning," Mach. Learn., vol. 28, no. 1, pp. 41–75, 1997.
• S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, 2010.
• Y. Bengio, "Deep Learning of Representations for Unsupervised and Transfer Learning," JMLR Work. Conf. Proc., vol. 7, pp. 1–20, 2011.
• http://yosinski.com/transfer
• CVPR 2012 Tutorial: Deep Learning Methods for Vision, http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/CVPR2012-Tutorial_lee.pdf