
Research Statement in Computer Vision
Ying Xiong
School of Engineering and Applied Sciences
Harvard University
[email protected]
April 17th, 2015
Low-level Vision by Consensus in a Spatial Hierarchy of Regions [1, 2]
Figure 1: Consensus framework for low-level vision, using binocular stereo as an example. (a) Window-based stereo matching with a slanted-plane model reduces ambiguity, but it requires guessing the correct window shapes and sizes throughout the scene. Consensus addresses this by explicitly considering all regions at all locations (depicted as a 2D cartoon (b) and in 1D organized by scale (c)). It reasons simultaneously about which regions are inliers to the slanted-plane model and the correct slanted plane for each inlying region. The regional slanted planes must agree where they overlap, and in the objective this implies high-order factors that link the variables of thousands of regions that overlap each pixel (blue in (b) and (c)). When regions are organized hierarchically (red/pink in (b)), optimization becomes parallel and efficient. (d) The result is a distributed architecture, with computational units that iteratively perform the same set of computations and share information sparsely between parents and children. The framework can be applied to a variety of low-level tasks using a variety of regional models.
We introduce a multi-scale framework for low-level vision, where the goal is estimating physical scene values from image data, such as depth from stereo image pairs. The framework uses a dense, overlapping set of image regions at multiple scales and a “local model,” such as a slanted-plane model for stereo disparity, that is expected to be valid piecewise across the visual field. Estimation is cast as optimization over a dichotomous mixture of variables, simultaneously determining which regions are inliers with respect to the local model (binary variables) and the correct coordinates in the local model space for each inlying region (continuous variables). When the regions are organized into a multi-scale hierarchy, optimization can occur in an efficient and parallel architecture, where distributed computational units iteratively perform calculations and share information through sparse connections between parents and children. The framework performs well on a standard benchmark for binocular stereo, and it produces a distributional scene representation that is appropriate for combining with higher-level reasoning and other low-level cues.
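As a rough illustration of the alternating consensus idea (this is not the optimization of [1, 2]; names such as fit_plane and the residual threshold tau are illustrative), the following sketch fits a slanted plane to each region's current disparity values, keeps only regions whose fit residual is small, and lets every pixel average the predictions of the inlying regions that cover it.

    import numpy as np

    def fit_plane(x, y, d):
        # Least-squares fit of d ~ a*x + b*y + c over one region's pixels.
        A = np.stack([x, y, np.ones_like(x)], axis=1)
        coef, *_ = np.linalg.lstsq(A, d, rcond=None)
        return coef

    def consensus_step(disp, regions, tau=1.0):
        # One alternating update: fit a slanted plane per region, keep regions whose
        # fit residual is below tau (inliers), and let each pixel average the
        # predictions of the inlying regions that cover it.
        acc, cnt = np.zeros_like(disp), np.zeros_like(disp)
        for ys, xs in regions:  # each region is a pair of pixel-index arrays
            x, y = xs.astype(float), ys.astype(float)
            coef = fit_plane(x, y, disp[ys, xs])
            pred = coef[0] * x + coef[1] * y + coef[2]
            if np.sqrt(np.mean((pred - disp[ys, xs]) ** 2)) < tau:
                acc[ys, xs] += pred
                cnt[ys, xs] += 1
        return np.where(cnt > 0, acc / np.maximum(cnt, 1), disp)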
From Shading to Local Shape [9, 10]
Figure 2: We infer from a Lambertian image patch a concise representation for the distribution of quadratic surfaces that are likely to have produced it. These distributions naturally encode different amounts of shape information based on what is locally available in the patch, and can be unimodal (rows 2 & 4), multi-modal (row 3), or near-uniform (row 1). This inference is done across multiple scales.
We develop a framework for extracting a concise representation of the shape information available from diffuse shading in a small image patch. This produces a mid-level scene descriptor, composed of local shape distributions that are inferred separately at every image patch across multiple scales. The framework is based on a quadratic representation of local shape that, in the absence of noise, has guarantees on recovering accurate local shape and lighting. When noise is present, the inferred local shape distributions provide useful shape information without over-committing to any particular image explanation. These local shape distributions naturally encode the fact that some smooth diffuse regions are more informative than others, and they enable efficient and robust reconstruction of object-scale shape. Experimental results show that this approach to surface reconstruction compares well against the state of the art on both synthetic images and captured photographs.
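As a minimal, hypothetical illustration of the quadratic-patch representation (this is not the inference machinery of [9, 10]; the single directional light and Gaussian noise model are simplifying assumptions), the sketch below renders the Lambertian shading of a quadratic depth patch and scores one candidate shape against an observed patch; evaluating the score over a grid of candidates gives a local shape distribution.

    import numpy as np

    def render_quadratic_patch(coefs, light, size=15):
        # Lambertian shading of z(x, y) = a*x^2 + b*y^2 + c*x*y + d*x + e*y.
        a, b, c, d, e = coefs
        x, y = np.meshgrid(np.linspace(-1, 1, size), np.linspace(-1, 1, size))
        zx = 2 * a * x + c * y + d              # surface gradient
        zy = 2 * b * y + c * x + e
        n = np.stack([-zx, -zy, np.ones_like(zx)], axis=-1)
        n /= np.linalg.norm(n, axis=-1, keepdims=True)
        return np.clip(n @ light, 0, None)      # diffuse shading, unit albedo

    def log_likelihood(observed, coefs, light, sigma=0.05):
        # Gaussian image-formation likelihood of one candidate quadratic shape.
        rendered = render_quadratic_patch(coefs, light, size=observed.shape[0])
        return -np.sum((observed - rendered) ** 2) / (2 * sigma ** 2)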
Consumer Cameras as Radiometric Devices [14, 3, 4]
Figure 3: Clusters of RAW measurements that each map to a single JPEG color value (indicated in parentheses) in a digital SLR camera (Canon EOS 40D). Close-ups of the clusters emphasize the variations in cluster size and orientation. This structured uncertainty cannot be avoided when inverting the tone-mapping process.
To produce images that are suitable for display, tone-mapping is widely used in digital cameras to map linear color
measurements into narrow gamuts with limited dynamic range. This introduces non-linear distortion that must be
undone, through a radiometric calibration process, before computer vision systems can analyze such photographs
radiometrically. This paper considers the inherent uncertainty of undoing the effects of tone-mapping. We observe that
this uncertainty varies substantially across color space, making some pixels more reliable than others. We introduce
a model for this uncertainty and a method for fitting it to a given camera or imaging pipeline. Once fit, the model
provides for each pixel in a tone-mapped digital photograph a probability distribution over linear scene colors that
could have induced it. We demonstrate how these distributions can be useful for visual inference by incorporating
them into estimation algorithms for a representative set of vision tasks.
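The fitted uncertainty model itself is described in [3, 4]. The toy sketch below only illustrates the underlying observation, using a made-up gamma-plus-quantization curve as a stand-in for a real camera pipeline: many linear RAW values collapse onto one rendered 8-bit level, so inverting the map yields a distribution rather than a point, and the spread of that distribution varies strongly with the level.

    import numpy as np

    def toy_tone_map(raw, gamma=2.2):
        # Stand-in tone map: gamma compression followed by 8-bit quantization.
        return np.round(255 * np.clip(raw, 0, 1) ** (1 / gamma)).astype(int)

    def derender_distribution(level, n_samples=100000, gamma=2.2):
        # Empirical distribution over linear RAW values that render to `level`.
        raw = np.random.rand(n_samples)
        matches = raw[toy_tone_map(raw, gamma) == level]
        return matches.mean(), matches.std()

    # Bright levels carry far more linear-domain uncertainty than dark ones.
    print(derender_distribution(30), derender_distribution(250))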
Blind Estimation and Low-rate Sampling of Sparse MIMO Systems [11]
Figure 4: A MIMO system in a multipath environment, where I sources and L sensors are linked through a collection of LI channels {h_{i,ℓ}(t)}.
We present a blind estimation algorithm for multi-input and multi-output (MIMO) systems with sparse common support. Key to the proposed algorithm is a matrix generalization of the classical annihilating filter technique, which
allows us to estimate the nonlinear parameters of the channels through an efficient and noniterative procedure. An
attractive property of the proposed algorithm is that it only needs the sensor measurements at a narrow frequency
band. By exploiting this feature, we can derive efficient sub-Nyquist sampling schemes which significantly reduce the
number of samples that need to be retained at each sensor. Numerical simulations verify the accuracy of the proposed
estimation algorithm and its robustness in the presence of noise.
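For intuition, the sketch below shows the classical single-channel annihilating filter that the algorithm generalizes to matrices (this is not the MIMO procedure of [11], and the variable names are illustrative): given uniform Fourier samples of a stream of K Diracs, the filter is the null vector of a Toeplitz system, and the angles of its roots encode the delays.

    import numpy as np

    def estimate_delays(x, K, tau):
        # Recover K delays from uniform Fourier samples
        # x[m] = sum_k c_k * exp(-2j*pi*m*t_k/tau), m = 0, ..., M-1.
        M = len(x)
        # Toeplitz system: row m is [x[m], x[m-1], ..., x[m-K]], m = K, ..., M-1.
        A = np.array([x[m - K:m + 1][::-1] for m in range(K, M)])
        h = np.linalg.svd(A)[2][-1].conj()      # annihilating filter = null vector
        roots = np.roots(h)                     # roots are u_k = exp(-2j*pi*t_k/tau)
        return np.sort(np.mod(-np.angle(roots) * tau / (2 * np.pi), tau))

    # Synthetic check with two delays.
    tau, t_true, c = 1.0, np.array([0.21, 0.64]), np.array([1.0, 0.7])
    m = np.arange(10)
    x = (c * np.exp(-2j * np.pi * np.outer(m, t_true) / tau)).sum(axis=1)
    print(estimate_delays(x, K=2, tau=tau))     # approximately [0.21, 0.64]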
Fabricating BRDFs at High Spatial Resolution Using Wave Optics [5]
Figure 5: First column: a wafer fabricated using photolithography displaying spatially varying BRDFs at 220dpi. Second column: designed pattern, color coded according to the 5 reflectance functions used. Dithering is exaggerated for better visualization. Rightmost columns: fabricated pattern as imaged under two different illumination directions. The designed pattern includes anisotropic reflectance functions at two opposite orientations. Hence, the image is inverted when light moves from the horizontal to the vertical direction.
Recent attempts to fabricate surfaces with custom reflectance functions boast impressive angular resolution, yet their
spatial resolution is limited. In this paper we present a method to construct spatially varying reflectance at a high
resolution of up to 220dpi, orders of magnitude greater than previous attempts, albeit with a lower angular resolution.
The resolution of previous approaches is limited by the machining, but more fundamentally, by the geometric optics
model on which they are built. Beyond a certain scale geometric optics models break down and wave effects must
be taken into account. We present an analysis of incoherent reflectance based on wave optics and gain important
insights into reflectance design. We further suggest and demonstrate a practical method, which takes into account
the limitations of existing micro-fabrication techniques such as photolithography to design and fabricate a range of
reflection effects, based on wave interference.
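As a heavily simplified illustration of why sub-wavelength structure requires a wave model (a scalar Fraunhofer sketch, not the incoherent analysis of [5]; the grating parameters below are arbitrary), the far-field intensity of a coherently illuminated height profile is the squared magnitude of the Fourier transform of its reflection phase function, so the angular spread of reflected light is set by the spatial frequency content of the pattern rather than by facet normals.

    import numpy as np

    def far_field_intensity(height, pitch, wavelength=0.5e-6):
        # Scalar Fraunhofer sketch: 1D far-field intensity of light reflected at
        # normal incidence from a height profile sampled every `pitch` meters.
        phase = np.exp(1j * 4 * np.pi * height / wavelength)  # reflection doubles the path
        field = np.fft.fftshift(np.fft.fft(phase))
        freq = np.fft.fftshift(np.fft.fftfreq(len(height), d=pitch))
        return wavelength * freq, np.abs(field) ** 2          # (sine of angle, intensity)

    # A binary phase grating: finer structure spreads energy to larger angles.
    h = 0.125e-6 * (np.arange(4096) // 8 % 2)                 # 8-sample-wide steps
    sin_theta, intensity = far_field_intensity(h, pitch=0.5e-6)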
Sparse Representation for Face Recognition [8, 7]
Sparse Representation based Classification (SRC) has emerged as a new paradigm for solving recognition problems. This paper presents a constraint sampling feature extraction method that combines texture and shape features to significantly improve the SRC recognition rate. Tests show that combining constraint sampling with face alignment achieves very high recognition accuracy on both the AR face database (99.52%) and the CAS-PEAL face database (99.54%).
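For context, the basic SRC decision rule is sketched below (the constraint-sampling features of [7] are not included; scikit-learn and an L1 solve via Lasso are assumptions): a test face is coded as a sparse combination of all training faces and assigned to the class whose coefficients reconstruct it with the smallest residual.

    import numpy as np
    from sklearn.linear_model import Lasso

    def src_classify(A, labels, y, alpha=0.01):
        # A: (d, n) matrix of n normalized training faces, labels: length-n class
        # labels, y: length-d test face.  Code y sparsely over all training faces,
        # then pick the class whose coefficients reconstruct y best.
        x = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(A, y).coef_
        residuals = {c: np.linalg.norm(y - A @ np.where(labels == c, x, 0.0))
                     for c in np.unique(labels)}
        return min(residuals, key=residuals.get)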
SIFT Features for 3D Object Recognition [13, 12]
!"#$#%&'
(#)
Figure 6: Clustering results of a plane sampled on the Gaussian sphere.
Because of its invariance to image scaling, rotation, noise, and illumination changes, the SIFT algorithm is now widely used across image analysis. The SIFT feature is a vector based on local gradients that is robust to many image variations, such as stretching, compression, and rotation, and therefore meets the practical requirements of 3D object recognition. Applying SIFT features to view-space partitioning, segmenting the object from its background, and pattern matching can effectively enhance the robustness of the recognition system and improve its speed and efficiency. Through theoretical analysis and experimental verification, we demonstrate the feasibility and advantages of applying SIFT features in a 3D object recognition system.
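The view-space partitioning and segmentation steps are specific to [13, 12]; the sketch below shows only the generic SIFT matching they build on, assuming OpenCV (cv2) is available, with Lowe's ratio test to reject ambiguous matches.

    import cv2

    def match_sift(img1, img2, ratio=0.75):
        # Detect SIFT keypoints in two grayscale images and keep ratio-test matches.
        sift = cv2.SIFT_create()
        kp1, des1 = sift.detectAndCompute(img1, None)
        kp2, des2 = sift.detectAndCompute(img2, None)
        matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
        good = [m for m, n in matches if m.distance < ratio * n.distance]
        return kp1, kp2, good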
Learning Object Color Models from Multi-view Constraints [6]
Figure 7: Top row: an object in three different environments with distinct and unknown illuminants. Bottom rows: five local regions from the same object, extracted from five uncontrolled images like those above, demonstrating the extent of the variation in observed color. Our goal is to jointly infer the object's true colors and the unknown illuminant in each image.
Color is known to be highly discriminative for many object recognition tasks, but is difficult to infer from uncontrolled images in which the illuminant is not known. Traditional methods for color constancy can improve surface reflectance estimates from such uncalibrated images, but their output depends significantly on the background scene. In many recognition and retrieval applications, we have access to image sets that contain multiple views of the same object in different environments; we show in this paper that correspondences between these images provide important constraints that can improve color constancy. We introduce the multi-view color constancy problem, and present a method to recover estimates of underlying surface reflectance based on joint estimation of these surface properties and the illuminants present in multiple images. The method can exploit image correspondences obtained by various alignment techniques, and we show examples based on matching local region features. Our results show that multi-view constraints can significantly improve estimates of both scene illuminants and object color (surface reflectance) when compared to a baseline single-view method.
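As a toy version of the joint estimation (this is not the method of [6]), the sketch below assumes a diagonal (von Kries) illuminant model, observed color equal to illuminant times reflectance channel by channel, and alternates least-squares updates of per-image illuminants and shared region reflectances given region correspondences.

    import numpy as np

    def multiview_constancy(obs, n_iter=50):
        # obs[k, r, c]: color channel c of corresponding region r seen in image k.
        # Alternating least squares for obs ~= L[k, c] * S[r, c].
        K, R, C = obs.shape
        L = np.ones((K, C))
        for _ in range(n_iter):
            S = (L[:, None, :] * obs).sum(0) / (L ** 2).sum(0)   # shared reflectances
            L = (S[None, :, :] * obs).sum(1) / (S ** 2).sum(0)   # per-image illuminants
            L /= L.mean(axis=0, keepdims=True)                   # fix the global scale
        return L, (L[:, None, :] * obs).sum(0) / (L ** 2).sum(0)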
References
[1] Ayan Chakrabarti, Ying Xiong, Steven J Gortler, and Todd Zickler. Low-level vision by consensus in a spatial
hierarchy of regions. arXiv preprint arXiv:1411.4894, 2014.
[2] Ayan Chakrabarti, Ying Xiong, Steven J. Gortler, and Todd Zickler. Low-level vision by consensus in a spatial
hierarchy of regions. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE,
2015.
[3] Ayan Chakrabarti, Ying Xiong, Baochen Sun, Trevor Darrell, Daniel Scharstein, Todd Zickler, and Kate Saenko.
Modeling radiometric uncertainty for vision with tone-mapped color images. arXiv preprint arXiv:1311.6887,
2013.
[4] Ayan Chakrabarti, Ying Xiong, Baochen Sun, Trevor Darrell, Daniel Scharstein, Todd Zickler, and Kate Saenko.
Modeling radiometric uncertainty for vision with tone-mapped color images. Pattern Analysis and Machine
Intelligence, IEEE Transactions on, 36(11):2185–2198, November 2014.
[5] Anat Levin, Daniel Glasner, Ying Xiong, Frédo Durand, William Freeman, Wojciech Matusik, and Todd Zickler. Fabricating BRDFs at high spatial resolution using wave optics. ACM Transactions on Graphics (TOG), 32(4):144, 2013.
[6] Trevor Owens, Kate Saenko, Ayan Chakrabarti, Ying Xiong, Todd Zickler, and Trevor Darrell. Learning object
color models from multi-view constraints. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE
Conference on, pages 169–176. IEEE, 2011.
[7] Jing Wang, Guangda Su, Ying Xiong, Jiansheng Chen, Yan Shang, Jiongxin Liu, and Xiaolong Ren. Sparse
representation for face recognition based on constraint sampling and face alignment. Tsinghua Science and
Technology, 1:011, 2013.
[8] Ying Xiong. Face recognition algorithm based on lasso. Undergraduate thesis at Tsinghua University, 2010.
[9] Ying Xiong, Ayan Chakrabarti, Ronen Basri, Steven J Gortler, David W Jacobs, and Todd Zickler. From shading
to local shape. arXiv preprint arXiv:1310.2916, 2013.
[10] Ying Xiong, Ayan Chakrabarti, Ronen Basri, Steven J Gortler, David W Jacobs, and Todd Zickler. From shading
to local shape. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2014 (to appear).
[11] Ying Xiong and Yue M Lu. Blind estimation and low-rate sampling of sparse MIMO systems with common support. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 3893–3896. IEEE, 2012.
[12] Ying Xiong and Huimin Ma. Extraction and application of 3D object SIFT feature. Journal of Image and Graphics, 5:018, 2010.
[13] Ying Xiong and Huimin Ma. Application of SIFT feature in 3D object recognition. In Conference on Image and Graphics Technology and Applications (IGTA), 2009.
[14] Ying Xiong, Kate Saenko, Trevor Darrell, and Todd Zickler. From pixels to physics: Probabilistic color de-rendering. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 358–365. IEEE, 2012.