
Character retrieval and annotation in
multimedia - or "How to find Buffy"
Andrew Zisserman
(work with Josef Sivic and Mark Everingham)
Department of Engineering Science
University of Oxford, UK
http://www.robots.ox.ac.uk/~vgg/
The objective (I)
Retrieve all shots/clips in a video, e.g. a feature length film,
containing a particular person
Visually defined search – on faces
“Pretty Woman”
[Marshall, 1990]
Matching faces in video
“Pretty Woman” (Marshall, 1990)
Are these faces of the same person?
The objective (II)
Automatically annotate characters in video with their identity
“Pretty Woman”
[Marshall, 1990]
Edward Lewis
(Richard Gere)
Vivian
(Julia Roberts)
Requires additional information
Why this is difficult: uncontrolled viewing conditions
Image variations due to:
• pose/scale
• lighting
• partial occlusion
• expression
cf. standard face databases
Matching Faces
Are these images of the same person?
Can be difficult for individual examples…
Matching Faces
Are these images of the same person?
Easier for sets of faces
The benefits of video
Automatically associate expression exemplars
Approach
a. Minimize variations due to pose, lighting and partial occlusions
by choice of feature vector
Focus on near-frontal faces (use a frontal face detector)
b. Multiple faces to represent expressions
Set of faces associated by tracking
c. Each identity represented by a distribution over a face set
Efficient matching or classification for sets of faces
...
Outline
1. Obtaining and matching face sets
• aim is to obtain large face sets
a. face detection and representation
b. obtaining face sets by tracking within shots
c. matching face sets within and between shots
2. Application I: Retrieval of shots containing a particular person
• Video Google - faces
3. Application II: Annotation of person identity
• Use of text and speaker detection as weak supervision – multimedia
4. Extensions to human matching
a. Face detection and representation
Face detection - Viola & Jones [2001]
1. Detect face in a rectangular window of fixed size
• Learn a face/non-face classifier from training data
2. Scan the window detector over each pixel in the image
(translation) and over multiple sizes (scale)
Classifier is learnt from labelled data
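A minimal sketch of this scan, using OpenCV's pretrained Viola-Jones (Haar cascade) frontal face detector; the frame path is a placeholder, and the scale-step and neighbour parameters are illustrative:

```python
# Multi-scale sliding-window face detection with a Viola-Jones cascade.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("frame.png")                 # placeholder input frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# detectMultiScale slides the fixed-size window over every image position
# (translation) and rescans at multiple sizes (scaleFactor sets the step).
for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1,
                                             minNeighbors=5):
    print(f"face at ({x}, {y}), size {w}x{h}")
```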
Training Data
• 5,000 faces
  – all near frontal
• 10^8 non-faces
• faces are normalized
  – scale, translation
Face detection
Operate at a high-precision (90%) operating point – few false positives
Matching faces
Easier if faces aligned to remove pose variation
face detector → eyes/nose/mouth
Eyes/nose/mouth detectors
Training data: ~5,000 images with hand-marked facial features
• Scale determined by face detector
• Fixed-size patches extracted around feature points
Constellation-like Appearance/Shape Model
Model shape X (2-D points) and appearance A (patches at points in X)
• Appearance and shape are assumed independent
Appearance of a feature is modelled as a mixture of Gaussians (GMM)
Joint position of all features is modelled as a (mixture of) Gaussians
• Full covariance (positions of all features interact)
[Figure: GMM clusters over feature position x_i and appearance a_j]
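A minimal sketch of such a model, assuming stacked training arrays: X holds the joint 2-D positions of all 5 features (N x 10) and A holds 72-dim appearance descriptors for one feature (N x 72); the file names and component counts are illustrative assumptions:

```python
# Independent shape (position) and appearance models, fit as Gaussians.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.load("feature_positions.npy")    # hypothetical N x 10 array
A = np.load("feature_appearance.npy")   # hypothetical N x 72 array

# Joint position of all features: (mixture of) Gaussians with full
# covariance, so the positions of all features interact.
shape_model = GaussianMixture(n_components=2, covariance_type="full").fit(X)

# Appearance of one feature: a GMM over its descriptors.
appearance_model = GaussianMixture(n_components=5).fit(A)

# Appearance and shape are assumed independent, so a candidate
# configuration is scored by summing the two log-likelihoods.
def score(positions, appearance):
    return (shape_model.score_samples(positions.reshape(1, -1))
            + appearance_model.score_samples(appearance.reshape(1, -1)))[0]
```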
Detect face features for rectification
Video with detected features
close-up
rectified face
Face normalization
• affine transform face using detected features
original detection
rectified
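A minimal sketch of the affine rectification, assuming five detected feature points (eyes, nose, mouth corners) and hand-chosen canonical positions; the point coordinates, image path, and output size here are illustrative only:

```python
# Affine-transform a face to a canonical frame using detected features.
import numpy as np
import cv2

detected = np.float32([[52, 60], [98, 61], [75, 92], [60, 115], [92, 114]])
canonical = np.float32([[30, 40], [70, 40], [50, 65], [36, 85], [64, 85]])

# Least-squares fit of a 2x3 affine transform mapping detected -> canonical.
M, _ = cv2.estimateAffine2D(detected, canonical)

face = cv2.imread("face.png")                    # placeholder image
rectified = cv2.warpAffine(face, M, (100, 100))  # canonical 100x100 face
```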
Representing faces
Matching improved if faces aligned to remove pose variation
face detector → eyes/nose/mouth
Compare faces by measuring distance between vectors
SIFT descriptor [Lowe 1999]
rectified face
Create array of orientation histograms
8 orientations x 3x3 spatial bins = 72 dim.
Face feature vector - summary
Benefits of local SIFT descriptors:
• SIFT is unaffected by small localization errors in the eyes/nose/mouth detector
• centre weighting de-emphasizes the background (no foreground segmentation needed)
• per-SIFT illumination normalization allows lighting to vary across the face
Multiple overlapping SIFTs: one SIFT per facial feature, i.e. a 5 x 72 = 360-vector for the entire face
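A minimal sketch of the 72-dim per-feature descriptor and the 360-dim face vector, assuming square grayscale patches; the Gaussian centre weighting mentioned above is omitted for brevity, and the per-descriptor L2 normalization stands in for the per-SIFT illumination normalization:

```python
# 8 orientation bins x 3x3 spatial cells = 72 dims per facial feature.
import numpy as np

def sift_like_descriptor(patch):
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    bins = np.minimum((ori / (2 * np.pi) * 8).astype(int), 7)
    h, w = patch.shape
    desc = np.zeros((3, 3, 8))
    for i in range(h):
        for j in range(w):
            # Accumulate gradient magnitude into (cell_row, cell_col, bin).
            desc[min(3 * i // h, 2), min(3 * j // w, 2), bins[i, j]] += mag[i, j]
    desc = desc.ravel()
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc   # per-descriptor normalization

def face_vector(patches):
    """Concatenate 5 feature descriptors: 5 x 72 = 360-vector."""
    return np.concatenate([sift_like_descriptor(p) for p in patches])
```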
b. Obtaining sets of faces using
tracking within shots
Face detection
Operate at a high-precision (90%) operating point – few false positives
Need to associate detections with the same identity
Example – tracked regions
affine covariant regions
Affine-Harris
Maximally stable extremal regions
Tracking covariant regions – two stages
Goal: obtain very long, good-quality tracks
• Stage I – match regions detected in neighbouring frames
(problem: missing detections)
• Stage II – repair tracks by region propagation
Region tubes
Connecting face detections temporally
Goal: associate face detections of each character within a shot
Approach: Agglomeratively merge face detections based on connecting ‘tubes’
require a minimum number of region tubes to overlap face detections
Alternatives: KLT, Avidan CVPR 01, Williams et al ICCV 03
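A hedged sketch of the association rule above, assuming a detection is a (frame_index, bounding_box) pair and a region tube is a dict mapping frame index to region centre; the names and min_overlap value are illustrative assumptions:

```python
# Link two face detections if enough region tubes pass through both.
def inside(pt, box):
    x, y = pt
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def connected(det_a, det_b, region_tubes, min_overlap=3):
    """det = (frame_index, bounding_box); tube = {frame_index: (x, y)}."""
    n = sum(1 for tube in region_tubes
            if det_a[0] in tube and det_b[0] in tube
            and inside(tube[det_a[0]], det_a[1])
            and inside(tube[det_b[0]], det_b[1]))
    return n >= min_overlap   # agglomerative merging links such pairs
```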
Example: Buffy the Vampire Slayer
Breakfast Scene
raw face
detections
face tubes
c. Matching face sets
Matching face sets
Representation of face set
• represent tube by set of 360-vectors
“face space”
Matching face sets within a shot
min-min distance: dist(A, B) = min_{a ∈ A, b ∈ B} dist(a, b)
where A, B are sets of face descriptors (360-vectors)
Matching face sets within shot
Goal: Match face tubes of a particular person within a shot
(to overcome occlusions, self-occlusions)
Approach: Agglomeratively merge face sets using the min-min
distance with exclusion constraints.
Exclusion principle: The same character cannot appear twice
in the same frame
[Figure: face tubes from tracking only vs. after intra-shot matching]
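A minimal sketch of min-min matching with the exclusion constraint; representing a tube as a (descriptors, frame_set) pair is an assumption for illustration:

```python
# Min-min distance between face sets, plus the exclusion check.
import numpy as np

def min_min_distance(A, B):
    """dist(A, B) = min over a in A, b in B of ||a - b||."""
    A, B = np.asarray(A), np.asarray(B)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2).min()

def can_merge(tube_a, tube_b):
    # Exclusion principle: the same character cannot appear twice in the
    # same frame, so temporally overlapping tubes must be different people.
    return not (tube_a[1] & tube_b[1])

# Agglomerative merging: repeatedly merge the closest pair of tubes that
# satisfies can_merge, until the min-min distance exceeds a threshold.
```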
Application I: Video Google – faces
The objective
Retrieve all shots/clips in a video, e.g. a feature length film,
containing a particular person
Visually defined search – on faces
“Pretty Woman”
[Marshall, 1990]
Applications:
• intelligent fast forward on characters
• pull out all videos of “x” from 1000s of digital camera mpegs
Preliminaries
Film statistics for Pretty Woman
• 170,000 frames
• 1151 shots
Pre-processing
• Track local regions through every shot
• Detect faces in every frame using a 'frontal' face detector
(38,457 face detections)
• Obtain face tubes by tracking (659 face tubes)
• After intra-shot matching ('plumbing'): 611 face tubes
Face tube – representation as single vector
Descriptor for each face → compact representation of the entire face set
Representation of face set – II
• obtain compact representation for the entire face tube
• treat face descriptors as samples from an underlying unknown pdf
• represent face tube as a histogram over face exemplars
(non-parametric model of pdf)
Making the search efficient (Google like retrieval)
Represent video by histogram over facial feature exemplars for each face tube
[Figure: counts p_k over 42 facial feature exemplars for 5 face tubes; each column is a normalized histogram over the exemplars]
cf. words vs. documents (e.g. web pages) in text retrieval
Employ text-retrieval techniques e.g.
• Ranking (here on chi-squared)
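A hedged sketch of the histogram representation and chi-squared ranking; how the exemplars themselves are chosen is assumed given:

```python
# Quantize face descriptors to exemplars; rank tubes by chi-squared.
import numpy as np

def tube_histogram(descriptors, exemplars):
    """Assign each face descriptor to its nearest exemplar; return a
    normalized histogram (a non-parametric model of the face-set pdf)."""
    d = np.linalg.norm(np.asarray(descriptors)[:, None, :]
                       - np.asarray(exemplars)[None, :, :], axis=2)
    counts = np.bincount(d.argmin(axis=1), minlength=len(exemplars))
    return counts / counts.sum()

def chi_squared(p, q, eps=1e-10):
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

# Rank all face tubes against a query tube, smallest distance first:
# ranking = sorted(tube_hists, key=lambda h: chi_squared(query_hist, h))
```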
Demo
“Pretty Woman”
“Groundhog Day”
“Casablanca”
“Charade”
Retrieval example I – Pretty Woman
Query sequence
Select a face
Detail I
Query sequence
Retrieved sequences (shown by first detection)
Example sequence
Retrieval example II – Pretty Woman
Query sequence
Detail II
Query sequence
Retrieved sequences (shown by first detection)
Examples of recognized faces
Retrieval example III – Groundhog Day
Query sequence
Retrieval example IV – Casablanca
Query sequence
Example: Matching across movies
Bill Murray
“Lost in Translation” [Coppola, 2003]
“Groundhog Day” [Ramis, 1993]
“Lost in Translation” - query
Query shot
Example face detection
192 associated
face detections
Find Bill Murray in ‘Groundhog Day’
Face detections from the first 36 retrieved face tracks:
First false positive ranked 42nd; 15 false positives in the first 100 retrieved face tracks
(out of a total of 596 face tracks)
Application II: Video Annotation
"Names and Faces in the News",
Berg et al, CVPR, 2004
Also, Alex Hauptmann’s slides
• weak supervision from text
• correspondence problem
Aims
• Automatically label appearances of characters in
TV/movies (“Buffy the Vampire Slayer”) with their names
Willow
Buffy
• Need to learn appearance and names
• No supervision from the user; use only
readily-available annotation
Textual Annotation: Subtitles/Closed-captions
• DVD contains timed subtitles as bitmaps
• Automatically convert to text using simple OCR
• What is said, and when, but not who says it
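A hedged sketch of the OCR step: the subtitles are bitmaps, so any simple OCR engine suffices; pytesseract is used here for illustration, not necessarily the authors' tool, and the file name is a placeholder:

```python
# Convert one timed subtitle bitmap to text.
from PIL import Image
import pytesseract

bitmap = Image.open("subtitle_0423.png")    # placeholder subtitle bitmap
text = pytesseract.image_to_string(bitmap)
print(text.strip())   # what is said; timing comes from the subtitle stream
```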
Textual Annotation: Script
• Many fan websites publish transcripts
• Automatically extract text from HTML
• What is said, and who says it, but not when
Alignment by Dynamic Time Warping
• Script has no timing but sequence is preserved
• Efficient alignment by dynamic programming
[Figure: grid of subtitle words vs. script words, with the path maximizing the number of matching words]
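A minimal DP sketch of the alignment, assuming both inputs are plain word lists; it maximizes the number of matching words with an LCS-style recurrence, which stands in for the slides' "dynamic time warping":

```python
# Dynamic-programming alignment of subtitle words to script words.
def alignment_score(subtitle_words, script_words):
    n, m = len(subtitle_words), len(script_words)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = subtitle_words[i - 1].lower() == script_words[j - 1].lower()
            score[i][j] = max(score[i - 1][j - 1] + match,
                              score[i - 1][j],
                              score[i][j - 1])
    return score[n][m]   # backtracking through `score` recovers the path
```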
Subtitle/Script Alignment
• Alignment of what allows subtitles to be tagged with
identity, giving who and when
• “Dynamic Time Warping” algorithm
00:18:55,453 --> 00:18:56,086
Get out!
HARMONY
Get out.
00:18:56,093 --> 00:19:00,044
- But, babe, this is where I belong.
- Out! I mean it.
SPIKE
But, baby... This is where I belong.
00:19:00,133 --> 00:19:03,808
I've been doing a lot of reading,
and I'm in control of my own power now,...
00:19:03,893 --> 00:19:05,884
..so we're through.
HARMONY
Out! I mean it. I've done a lot of
reading, and, and I'm in control
of my own power now. So we're
through.
Ambiguity
• Knowledge of the speaker is a weak cue that the
character is visible
[Figure: ambiguous cases – multiple characters in frame (Tara? Buffy?), speaker not detected, speaker not visible]
• Ambiguities will be resolved using vision-based
speaker detection
Face Association
• Measure “connectedness” of a pair of faces by the point tracks
intersecting both
• Doesn’t require contiguous detections
• Independent evidence – no drift
KLT tracker
Tracking faces in spatio-temporal video volume
Automatically associated facial
exemplars
Example Face Tracks
Facial Feature Localization
• Stabilize representation by localizing features
• Pose of face varies and the face detector is noisy
• Extended “pictorial structure” model
• Joint model of feature appearance and position
Speaker Detection
• Subtitles/script gives the speaker’s name
• Identify who (if anyone) in the video is speaking
Speaking?
• In this frame, the subtitles/script says Willow is
speaking. If this person is speaking, it must be Willow.
Speaker Detection
• Measure the amount of motion of the mouth
• Search across frames around detected mouth points
[Plot: mouth-region SSD per frame, thresholded into Speaking (high SSD), “Don’t Know”, and Not Speaking (low SSD)]
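A hedged sketch of SSD-based speaker detection: if the mouth deforms (speaking), even the best-matching patch in the next frame has a high SSD; the search radius and thresholds below are illustrative, not the paper's values:

```python
# Min-SSD mouth motion and a three-way speaking label.
import numpy as np

def mouth_motion(mouth_patch, next_frame, cx, cy, r=3):
    """Min SSD over a (2r+1)^2 search around the mouth point (cx, cy)."""
    s = mouth_patch.shape[0] // 2
    best = np.inf
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            cand = next_frame[cy + dy - s:cy + dy + s,
                              cx + dx - s:cx + dx + s].astype(float)
            if cand.shape == mouth_patch.shape:
                best = min(best, np.mean((mouth_patch - cand) ** 2))
    return best

def speaking_label(ssd, lo=0.004, hi=0.008):   # illustrative thresholds
    if ssd < lo:
        return "not speaking"
    return "speaking" if ssd > hi else "don't know"
```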
Resolved Ambiguity
• When the speaker (if any) is identified, the
ambiguity in the textual annotation is resolved
[Figure: of five candidate face tracks (Tara? / Buffy?), the single track detected as speaking is labelled Tara; the rest are non-speaking]
Exemplar Extraction
• Face tracks detected as speaking and with a single
proposed name give exemplars
Buffy: 2,300 faces; Willow: 1,222 faces; Xander: 425 faces
• Assign names to unlabelled faces by classification
based on the extracted exemplars
Classification by Exemplar Sets
• Classify tracks by nearest exemplar
• Estimate probability of class from distance ratios
• Refuse to predict names for uncertain tracks
[Figure: a confident prediction (Willow, near the Willow exemplars) vs. an uncertain one (between the Buffy and Xander exemplars)]
“Refusal to predict”
• Want to be able to leave some uncertain face tracks
unlabelled
• An approximate posterior probability P(λ|F) of the label λ for a
face track F is defined from the distance ratios to each name’s exemplars
• Only tracks for which the posterior exceeds a threshold are
labelled (this also gives a ranking)
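A hedged sketch of refusal to predict: the slide's exact formula is not shown, so a softmax over negative nearest-exemplar distances stands in for the paper's distance-ratio posterior; the threshold is illustrative:

```python
# Nearest-exemplar classification with refusal to predict.
import numpy as np

def classify_with_refusal(nearest_dists, names, threshold=0.8):
    """nearest_dists[k]: distance from the track to the closest exemplar
    of names[k]. Returns (name or None, posterior of the best name)."""
    p = np.exp(-np.asarray(nearest_dists, dtype=float))
    p /= p.sum()                       # approximate posterior over names
    k = int(p.argmax())
    return (names[k] if p[k] >= threshold else None), float(p[k])
```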
Experiments
• Tested on two episodes
• 130,000 frames, 50,000 faces, 1,000 tracks
• Methods
  – Proposed method
  – Prior: label all faces with the name occurring most
frequently in the script (Buffy)
  – Subtitles only: label any track having proposed names
with one of those names (no speaking detection), breaking ties by the prior
Example Results
• No user involvement, just hit “go”…
[Figure: labelled faces – Buffy, Joyce, Dawn, Willow, Giles, Tara, Other]
Example Results
• Wide range of pose, expression, lighting…
[Figure: labelled faces – Tara, Dawn, Willow, Buffy, Harmony, Xander, Anya, Riley, Spike, Other]
Precision/Recall
• Recall is the proportion of face tracks assigned a name
• Precision is the proportion of correct names
[Plots: precision vs. recall for Episode 05-02 and Episode 05-05, comparing Prior (Buffy), Subtitles only, and the Proposed method]
Example Video
• Labelling at 100% recall (all faces labelled)
• 1,900 frames, 2 errors (1 non-face, 1 wrong name)
Quantitative Results
• Speaker detection: ~25% of tracks labelled, with 90% accuracy
• Naming: ~80% precision at 80% recall; ~70% precision at 100% recall
• Subtitles only (no vision): ~45% precision

Precision (%) at a given recall:

                  Episode 05-02             Episode 05-05
Recall:          60%   80%   90%  100%     60%   80%   90%  100%
Proposed method  87.5  78.6  72.9  68.2    88.5  80.1  75.6  69.2
Subtitles only                     45.2                       45.5
Prior (buffy)                      21.3                       36.9
Using an SVM classifier
• Large margin, smooth discriminant
• Explicit robustness to outliers
[Plots: precision vs. recall for Episode 05-02 and Episode 05-05, comparing NN-Auto, NN-Manual, and SVM]
Summary
• Quite accurate labelling using only readily-available
textual annotation
• No user involvement, just hit “go”…
• Vision-based speaker detection is essential
  – cues from subtitles alone are very weak
• Further work
  – generalization across episodes/movies
  – extension to non-frontal faces
  – better models of appearance and fusion
  – learning recognition of actions and interactions
4. Extensions to human matching
Improving Coverage – beyond frontal faces
Use of a frontal face detector limits the data which can be
labelled
Non-frontal views
• Multi-view face detection [Klaeser, Schmid, Dalal & Triggs]
• Facial feature localization in profile
• Speaker detection in profile
• Transfer of labels from speaker detection and
classification between frontal/profile views by tracking
Feature Localization & Speaker Detection
Pictorial structure and speaker detection methods adapted
successfully to profile views
Profile Speaker Detection
Transfer of frontal/profile speaker detections expands
available annotation for both views
Extensions to human matching (II)
Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs, CVPR 2005
Extensions to human matching (III)
Strike a Pose: Tracking People by Finding Stylized Poses
Deva Ramanan, David Forsyth and Andrew Zisserman, CVPR 2005
[Pipeline: edges → walking-pose pictorial structure → efficient matching]
Build model: detect the stylized walking pose, find discriminative features
(torso vs. background), and learn limb classifiers
(limb pixels alone are a poor model).
Build model & detect: the learned limb classifiers label pixels
(torso, arm, leg, head), enabling detection at small scale, in unusual
poses, and in general poses via a pictorial structure.
Running example: how well do the classifiers generalize?
References
Sivic, J. , Everingham, M. and Zisserman, A.
Person spotting: video shot retrieval for face sets
International Conference on Image and Video Retrieval (CIVR),
Singapore (2005)
http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic05a.pdf
Everingham, M., Sivic, J. and Zisserman, A.
Hello! My name is... Buffy - Automatic Naming of Characters in TV
Video
British Machine Vision Conference (2006)
http://www.robots.ox.ac.uk/~vgg/publications/papers/everingham06a.pdf
Demo for “Charade”:
http://www.robots.ox.ac.uk/~vgg/research/fgoogle/
END