Poster - 3D ShapeNets

3D ShapeNets: A Deep Representation for Volumetric Shapes
Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, Jianxiong Xiao
Motivation: 3D Shape Representation
Desirable Properties of a Good 3D Shape Representation:

• Data-driven: learn from data rather than predefined shape routines.
• Generic: any complex shapes rather than simple shape primitives.
• Compositional: compose complex shapes by assembling simple ones.
• Versatile: applicable to various vision tasks.

[Figure: a map of prior shape representations over RGB, depth, and full 3D (Biederman 1987; Binford et al. 1989; Dalal & Triggs 2005; Rothganger et al. 2006; Philbin et al. 2007; Felzenszwalb et al. 2010; Girshick et al. 2014), annotated as in the original: only in theory; limited to instance-level matching; hand-craft design; no state-of-the-art methods use 3D shapes.]
3D ShapeNets

• 3D ShapeNets (a Convolutional Deep Belief Network) learns the joint distribution of generic 3D shapes across object categories.
• Meaningful volumetric shapes can be sampled from the model.

Configuration of 3D ShapeNets

Layer   Type                  Configuration
input   3D voxel input        30 x 30 x 30 occupancy grid
L1      convolutional RBM     48 filters of size 6, stride 2 (output 13^3)
L2      convolutional RBM     160 filters of size 5, stride 2 (output 5^3)
L3      convolutional RBM     512 filters of size 4, stride 1 (output 2^3)
L4      fully connected RBM   1200 hidden units
L5      fully connected RBM   4000 hidden units; multinomial object label + Bernoulli features

The top-layer feature forms an associative memory between shapes and their labels.
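For concreteness, here is a minimal sketch (assuming PyTorch; not the authors' released code) of the discriminative 3D CNN implied by this table, i.e. the network produced in step 1 of the 3D Deep Learning recipe below. The class name, ReLU activations, and final classifier layer are illustrative assumptions; the actual model is a Convolutional DBN trained generatively.

```python
# A sketch of the 3D CNN counterpart of the configuration table (assumed PyTorch).
import torch
import torch.nn as nn

class ShapeNets3DCNN(nn.Module):
    def __init__(self, n_classes: int = 40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 48, kernel_size=6, stride=2),    # 30^3 -> 48 x 13^3
            nn.ReLU(),
            nn.Conv3d(48, 160, kernel_size=5, stride=2),  # 13^3 -> 160 x 5^3
            nn.ReLU(),
            nn.Conv3d(160, 512, kernel_size=4, stride=1), # 5^3 -> 512 x 2^3
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 2 ** 3, 1200),  # L4: 1200 hidden units
            nn.ReLU(),
            nn.Linear(1200, 4000),          # L5: 4000 hidden units
            nn.ReLU(),
            nn.Linear(4000, n_classes),     # multinomial object label
        )

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        # voxels: (batch, 1, 30, 30, 30) binary occupancy grid
        return self.classifier(self.features(voxels))
```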
Learning

Layers L1-L3 are pre-trained with Contrastive Divergence; the top layer is trained with Fast Persistent Contrastive Divergence.
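As a sketch of the pre-training rule named above, here is one CD-1 update for a single binary RBM layer in numpy (illustrative, not the authors' code; FPCD additionally keeps a persistent negative chain with fast weights, omitted here):

```python
# One Contrastive Divergence (CD-1) update for a binary RBM layer.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, lr=0.01):
    """One CD-1 update from a batch of visible vectors v0, shape (batch, n_vis)."""
    # Positive phase: sample hidden units given the data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(v0.dtype)
    # Negative phase: one Gibbs step (reconstruct visibles, re-infer hiddens).
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # Gradient approximation: data correlations minus model correlations.
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
```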
3D Shape Generation

[Figure: examples of chairs sampled from the model.]

3D Deep Learning

1. Convert 3D ShapeNets to a 3D CNN
2. Feature learning by back-propagation
3. Test for classification and retrieval (a retrieval sketch follows below)
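A minimal sketch of the retrieval side of step 3, assuming shapes are embedded with the learned feature and ranked by cosine similarity (the similarity measure is our assumption; the poster only specifies that the finetuned 5th-layer feature is used for the retrieval curves below):

```python
# Rank a shape database against a query by cosine similarity of features.
import numpy as np

def retrieve(query_feat: np.ndarray, db_feats: np.ndarray, top_k: int = 5):
    """query_feat: (d,) feature; db_feats: (n, d) features; returns top_k indices."""
    q = query_feat / np.linalg.norm(query_feat)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    sims = db @ q                       # cosine similarity to the query
    return np.argsort(-sims)[:top_k]    # most similar shapes first
```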
Princeton ModelNet Dataset
A large-scale CAD dataset of more than 150,000 models in 660 object categories. The 10-class benchmark covers: bathtub, bed, chair, desk, dresser, monitor, nightstand, sofa, table, toilet.
Table 1: Shape Classification and Retrieval Results.

10 classes:
            classification   retrieval AUC   retrieval MAP
SPH [18]    79.79%           45.97%          44.05%
LFD [8]     79.87%           51.70%          49.82%
Ours        83.54%           69.28%          68.26%

40 classes:
            classification   retrieval AUC   retrieval MAP
SPH [18]    68.23%           34.47%          33.26%
LFD [8]     75.47%           42.04%          40.91%
Ours        77.32%           49.94%          49.23%
[Figure: precision-recall curves for the 10-class and 40-class retrieval results, comparing Spherical Harmonic, Light Field, and our finetuned 5th-layer feature.]
3D ShapeNets for 2.5D Depth Recognition

Tasks: recognition, shape completion, and next-best-view prediction. Training only uses CAD models!

Inference process (example: a depth map seen from the back of a sofa):
1. Convert the depth map into a volumetric representation (a conversion sketch follows this list).
2. Identify free space, observed surface, and unknown space.
3. Infer the unknown space and the object label jointly by Gibbs sampling: recognition and completion ("It is a chair!").
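A minimal sketch of steps 1-2, under illustrative assumptions (an orthographic camera looking along +z, one pixel per voxel column, depth already in voxel units); the real pipeline projects voxels through the actual camera pose:

```python
# Label a voxel grid as free / observed surface / unknown from a depth map.
import numpy as np

FREE, SURFACE, UNKNOWN = 0, 1, 2

def depth_to_voxels(depth: np.ndarray, grid_size: int = 30) -> np.ndarray:
    """depth: (grid_size, grid_size) array of hit distances in voxel units."""
    grid = np.full((grid_size, grid_size, grid_size), UNKNOWN, dtype=np.uint8)
    for x in range(grid_size):
        for y in range(grid_size):
            z_hit = int(depth[x, y])
            grid[x, y, :min(z_hit, grid_size)] = FREE   # space the ray crossed
            if z_hit < grid_size:
                grid[x, y, z_hit] = SURFACE             # observed surface
            # voxels behind the surface stay UNKNOWN
    return grid
```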
Recognition and Reconstruction on NYU depth

Table 2: Accuracy for view-based 2.5D recognition on the NYU dataset.

             [29] Depth   NN      ICP     3D ShapeNets   3D ShapeNets finetuned   [29] RGB   [29] RGBD
bathtub      0.000        0.429   0.571   0.142          0.857                    0.142      0.000
bed          0.729        0.446   0.608   0.500          0.703                    0.743      0.743
chair        0.806        0.395   0.194   0.685          0.919                    0.766      0.693
desk         0.100        0.176   0.375   0.100          0.300                    0.150      0.175
dresser      0.466        0.467   0.733   0.366          0.500                    0.266      0.466
monitor      0.222        0.333   0.389   0.500          0.500                    0.166      0.388
nightstand   0.343        0.188   0.438   0.719          0.625                    0.218      0.468
sofa         0.481        0.458   0.349   0.277          0.735                    0.313      0.602
table        0.415        0.455   0.052   0.377          0.247                    0.376      0.441
toilet       0.200        0.400   1.000   0.700          0.400                    0.200      0.500
all          0.376        0.374   0.471   0.437          0.579                    0.334      0.448

Shape Completion

[Figure: completion examples on depth maps: observed points (free space, observed surface, unknown space) to completed surface.]
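A minimal sketch of completion by Gibbs sampling, shrunk to a single RBM layer for brevity (an assumption: in the full model the chain runs through the deep network and samples the label jointly, as in the inference process above); observed voxels stay clamped while unknown voxels are resampled:

```python
# Fill in unknown voxels by clamped Gibbs sampling on one RBM layer.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def complete_shape(v_obs, known_mask, W, b_vis, b_hid, n_iters=50):
    """v_obs: flattened binary voxel vector; known_mask: True for free/surface voxels."""
    v = v_obs.copy()
    for _ in range(n_iters):
        p_h = sigmoid(v @ W + b_hid)                   # infer hidden units
        h = (rng.random(p_h.shape) < p_h).astype(v.dtype)
        p_v = sigmoid(h @ W.T + b_vis)                 # propose a reconstruction
        v = (rng.random(p_v.shape) < p_v).astype(v.dtype)
        v[known_mask] = v_obs[known_mask]              # clamp what the camera saw
    return v                                           # unknown voxels filled in
```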
Next-Best-View: Deep View Planning

[Figure: an observed surface and three different next-view candidates; for each candidate, the potentially visible voxels in the next view, sampled possible shapes ("bathtub? dresser? sofa?"), and the predicted new free space and visible surface.]

The recognition uncertainty after the first view is the entropy of the label $y$ given the observed voxels:
$H = H\big(p(y \mid \mathbf{x}_o = \tilde{\mathbf{x}}_o)\big)$.

Given a fixed set of candidate next views $\{V_i\}$, the conditional entropy for view $V_i$ with potentially visible voxels $\mathbf{x}_i^n$ is
$H_i = \sum_{\mathbf{x}_i^n} p(\mathbf{x}_i^n \mid \mathbf{x}_o = \tilde{\mathbf{x}}_o)\, H\big(p(y \mid \mathbf{x}_i^n, \mathbf{x}_o = \tilde{\mathbf{x}}_o)\big)$.

The reduction of entropy $H - H_i$ is the mutual information between the newly visible voxels and the label, $I(y; \mathbf{x}_i^n \mid \mathbf{x}_o = \tilde{\mathbf{x}}_o)$. We choose the view with the maximum mutual information:
$V^* = \arg\max_i I(y; \mathbf{x}_i^n \mid \mathbf{x}_o = \tilde{\mathbf{x}}_o)$.
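A Monte Carlo sketch of this selection rule. The helpers label_posterior, sample_completion, and render_view are hypothetical stand-ins for $p(y \mid \cdot)$, the model's Gibbs sampler, and a depth renderer; only the entropy bookkeeping is concrete:

```python
# Choose the next view that maximizes mutual information with the label.
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def choose_next_view(x_obs, candidate_views, label_posterior,
                     sample_completion, render_view, n_samples=20):
    # label_posterior(x_obs, x_new) -> p(y | observations), a probability vector
    # sample_completion(x_obs)      -> one hypothesis shape from the model
    # render_view(shape, view)      -> voxels newly visible from that view
    h_now = entropy(label_posterior(x_obs, None))   # H before the new view
    gains = []
    for view in candidate_views:
        h_cond = 0.0
        for _ in range(n_samples):                  # Monte Carlo over x_i^n
            shape = sample_completion(x_obs)        # hypothesis completion
            x_new = render_view(shape, view)        # newly visible voxels
            h_cond += entropy(label_posterior(x_obs, x_new)) / n_samples
        gains.append(h_now - h_cond)                # = I(y; x_i^n | x_o)
    return int(np.argmax(gains))
```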
Experiments on synthetic depth for two views:
             Ours   Max Visibility   Furthest Away   Random Selection
bathtub      0.80   0.85             0.65            0.60
bed          1.00   0.85             0.85            0.80
chair        0.85   0.85             0.75            0.75
desk         0.50   0.50             0.55            0.50
dresser      0.45   0.45             0.25            0.45
monitor      0.85   0.85             0.85            0.90
nightstand   0.75   0.75             0.65            0.70
sofa         0.85   0.85             0.50            0.65
table        0.95   0.90             1.00            0.90
toilet       1.00   0.95             0.85            0.90
all          0.80   0.78             0.69            0.72
Table 3: Comparison of Different Next-Best-View Selections Based on Recognition Accuracy from Two Views. Based on an algorithm's choice, we obtain the actual depth map for the next view and recognize the object using those two views in our 3D ShapeNets representation.

Code and Data: http://3DShapeNets.cs.princeton.edu