Detecting people and tracking them - Lihi Zelnik

Pedestrian Detection and Tracking
Lihi Zelnik-Manor
Video Analysis Course
Summer 2008
Credits
• Some slides were adapted from Payam Sabzmeydani and Greg Mori
• Some slides were adapted from Deva Ramanan
Problem
Classify a window as pedestrian or non-pedestrian
Exhaustively search the image over all positions and scales
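A minimal sketch of the exhaustive search, assuming a fixed-size window slid over a coarse image pyramid. The function name, stride, scale step, and nearest-neighbour downsampling are illustrative assumptions; the 20x15 window follows the pedestrian box size given in the dataset slide, read here as 20 rows by 15 columns.

```python
import numpy as np

def sliding_windows(image, win_h=20, win_w=15, stride=4, scale_step=1.25):
    """Yield (scale, row, col, patch) for every window position at every scale."""
    scale = 1.0
    while True:
        new_h = int(image.shape[0] / scale)
        new_w = int(image.shape[1] / scale)
        if new_h < win_h or new_w < win_w:
            break  # the window no longer fits at this scale
        # Nearest-neighbour downsampling of the original image (an assumption;
        # a real detector would typically smooth before subsampling).
        rows = np.minimum((np.arange(new_h) * scale).astype(int), image.shape[0] - 1)
        cols = np.minimum((np.arange(new_w) * scale).astype(int), image.shape[1] - 1)
        img = image[np.ix_(rows, cols)]
        for r in range(0, new_h - win_h + 1, stride):
            for c in range(0, new_w - win_w + 1, stride):
                yield scale, r, c, img[r:r + win_h, c:c + win_w]
        scale *= scale_step
```

Each yielded patch would then be passed to the pedestrian / non-pedestrian classifier; a window found at row r, column c and scale s corresponds roughly to position (r·s, c·s) in the original frame.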
Viola, Jones & Snow
• www.merl.com/projects/pedestrian/
Recall: Viola & Jones features
Motion features
Shift operators
Differences between shifted images:
[Figure: the five motion images ∆, U, L, R, D]
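A minimal sketch of the five motion images. The formula on the slide did not survive, so this assumes the definitions used in the Viola, Jones & Snow paper: ∆ is the absolute difference between consecutive frames, and U, L, R, D are the absolute differences between frame t and frame t+1 shifted by one pixel up, left, right, or down. The function names and the edge padding are illustrative assumptions.

```python
import numpy as np

def shift(img, dr, dc):
    """Shift img by dr rows and dc columns (each in {-1, 0, 1}), padding with edge values."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    return padded[1 - dr:1 - dr + h, 1 - dc:1 - dc + w]

def motion_images(frame_t, frame_t1):
    """Return the images delta, U, L, R, D for two consecutive grayscale frames."""
    a = frame_t.astype(np.float64)
    b = frame_t1.astype(np.float64)
    return {
        "delta": np.abs(a - b),
        "U": np.abs(a - shift(b, -1, 0)),  # frame t+1 shifted up
        "L": np.abs(a - shift(b, 0, -1)),  # shifted left
        "R": np.abs(a - shift(b, 0, 1)),   # shifted right
        "D": np.abs(a - shift(b, 1, 0)),   # shifted down
    }
```

If an object moves one pixel down between the two frames, shifting frame t+1 up re-aligns it with frame t, so U is close to zero over that region while ∆ is not.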
Features type 3
Sums within a filter.
Captures:
Motion magnitude
f_k = r_k(S), S ∈ {U, L, R, D}, where r_k is a rectangle sum
Features type 1
Sum differences across filters
Captures:
Likelihood of the region moving in the U, L, R, or D direction
f_i = r_i(∆) − r_i(S), S ∈ {U, L, R, D}
Features type 2
Differences within a filter.
Captures:
Motion shear.
f_j = φ_j(S), S ∈ {U, L, R, D}
Features type 4 (appearance)
Differences on input frame
Captures:
Appearance
f_m = φ_m(I_t), where I_t is the input frame
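A minimal sketch of how the rectangle sums r and rectangle differences φ can be evaluated in constant time with an integral image, as in the Viola & Jones features recalled earlier. The function names and the particular two-rectangle (left minus right) layout are illustrative assumptions, not the exact filters from the paper.

```python
import numpy as np

def integral_image(img):
    """Integral image with a zero first row and column, so ii[r, c] = sum of img[:r, :c]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, r, c, h, w):
    """Sum of the h-by-w rectangle whose top-left pixel is (r, c)."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def two_rect_feature(ii, r, c, h, w):
    """A difference filter: left half of the rectangle minus its right half."""
    half = w // 2
    return rect_sum(ii, r, c, h, half) - rect_sum(ii, r, c + half, h, half)

def direction_feature(ii_delta, ii_s, rect):
    """Sum over a rectangle of the delta image minus the same sum on a shifted-difference image."""
    return rect_sum(ii_delta, *rect) - rect_sum(ii_s, *rect)
```

Applied to the integral image of a motion image, rect_sum gives a motion-magnitude feature and two_rect_feature a shear-like feature; applied to the input frame, two_rect_feature gives an appearance feature.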
Features & Classifiers
Features:
Weak classifiers:
Parameters are learned via AdaBoost
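A minimal sketch of boosting with threshold weak classifiers, assuming each weak classifier compares a single feature value against a threshold with a polarity. This is textbook discrete AdaBoost, not necessarily the exact variant used in the paper; the function names and the brute-force threshold search are illustrative assumptions.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    """Weak classifier: +1 if pol * feature_j >= pol * thr, else -1."""
    return np.where(pol * X[:, j] >= pol * thr, 1, -1)

def best_stump(X, y, w):
    """Exhaustively pick the (feature, threshold, polarity) with the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = w[stump_predict(X, j, thr, pol) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, rounds=5):
    """Train on features X (n x d) and labels y in {-1, +1}; return a list of weighted stumps."""
    w = np.full(len(y), 1.0 / len(y))
    model = []
    for _ in range(rounds):
        err, j, thr, pol = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)    # vote of this weak classifier
        w = w * np.exp(-alpha * y * stump_predict(X, j, thr, pol))
        w = w / w.sum()                          # misclassified examples gain weight
        model.append((alpha, j, thr, pol))
    return model

def predict(model, X):
    """Strong classifier: sign of the weighted vote of all stumps."""
    score = sum(a * stump_predict(X, j, t, p) for a, j, t, p in model)
    return np.where(score >= 0, 1, -1)
```

Because each round selects one feature, the boosting procedure also acts as feature selection over the large pool of motion and appearance filters.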
Details
Cascade classifier for speed-up
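A minimal sketch of why a cascade speeds up detection: cheap early stages reject most windows, so the costly later stages run only on the few windows that survive. The stage representation (a score function plus a threshold) and the toy stages below are illustrative assumptions.

```python
def cascade_classify(stages, x):
    """Return True only if x passes every stage; stop at the first rejection."""
    for score_fn, threshold in stages:
        if score_fn(x) < threshold:
            return False  # rejected early, later stages are never evaluated
    return True

# Toy usage: record which stages actually run for each input.
calls = []

def make_stage(name, fn):
    def score(x):
        calls.append(name)
        return fn(x)
    return score

stages = [
    (make_stage("cheap", lambda x: x[0]), 0.5),
    (make_stage("costly", lambda x: sum(x)), 2.0),
]
```

In a detector, each stage would itself be a boosted classifier with a threshold tuned to keep almost all true pedestrians while discarding a large share of background windows.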
Dataset and training
8 video sequences
~2000 frames each
6 for training
2 for testing
2250 positive examples
2250 negative examples
Pedestrian = 20x15 box
Top filters
The first 5 filters learned for the dynamic pedestrian detector.
Top appearance filters
The first 5 filters learned using appearance only.
Results
Comparing dynamic classifier and appearance classifier
Test sequence 1
Test sequence 2
Appearance based detection
Motion + appearance based detection
Using a single frame
Different cues
Wavelet coefficients (Mohan et al., PAMI 2001)
Oriented gradients (Dalal and Triggs, CVPR 2005)
SIFT features (Leibe et al., CVPR 2005)
Edgelet features (Wu and Nevatia, ICCV 2005)
“Shapelet features” (Sabzmeydani and Mori, CVPR 2007)
Datasets
• MIT: standing pose, simple background, no occlusion
• INRIA: standing pose, complex background, partial occlusions
Dalal & Triggs, CVPR’05
Concatenated histograms of local gradients
SVM
Dense sampling
Claim: “none of the keypoint detectors that we are aware of detect human body structures reliably”
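A minimal sketch of "concatenated histograms of local gradients" computed on a dense grid. It assumes 8x8-pixel cells and 9 unsigned orientation bins, and it replaces the overlapping block normalisation of the actual HOG descriptor with a single global L2 normalisation, so it is a simplification rather than the published descriptor. The function name is an illustrative assumption.

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    """Concatenate per-cell gradient-orientation histograms, weighted by gradient magnitude."""
    img = img.astype(np.float64)
    gy, gx = np.gradient(img)                       # derivatives along rows, columns
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0    # unsigned orientation in [0, 180)
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    n_r, n_c = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((n_r, n_c, bins))
    for r in range(n_r):
        for c in range(n_c):
            sl = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(), minlength=bins)
    desc = hist.ravel()
    return desc / (np.linalg.norm(desc) + 1e-12)    # global L2 normalisation (simplification)
```

The resulting vector for each detection window would be fed to a linear SVM, whose weights indicate which cells and orientations vote for or against "pedestrian".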
Dalal & Triggs, CVPR’05
[Figure panels: average gradient image; maximum positive SVM weight; maximum negative SVM weight; HOG descriptor; HOG descriptor weighted by positive weights; HOG descriptor weighted by negative weights]
Wu & Nevatia, ICCV’05
• Edgelet features: short line and curve segments
• AdaBoost
Sabzmeydani and Mori, CVPR’07
• Shapelet features: combinations of short line and curve segments
• AdaBoost
Start from smoothed gradient responses in different directions
Shapelet features
Shapelet final classifier
Results on INRIA dataset
People detection in video
• Tracking:
• Background Subtraction
• Condensation
• Explicit Motion Models
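A minimal sketch of the simplest option in the list above, background subtraction, assuming a static camera and a running-average background model. The function name, learning rate, and threshold are illustrative assumptions; practical systems use more robust per-pixel models.

```python
import numpy as np

def foreground_masks(frames, alpha=0.05, threshold=25.0):
    """Return one boolean foreground mask per frame after the first."""
    background = frames[0].astype(np.float64)   # assume the first frame is empty
    masks = []
    for frame in frames[1:]:
        f = frame.astype(np.float64)
        mask = np.abs(f - background) > threshold
        # Update the background only where the pixel is judged static,
        # so moving people are not absorbed into the model.
        background = np.where(mask, background, (1 - alpha) * background + alpha * f)
        masks.append(mask)
    return masks
```

This fails as soon as the camera moves, which is one motivation for the appearance-based tracking approach that follows.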
Ramanan & Forsyth
Tracking People by Learning their Appearance
Look for candidate torsos
Using a template rectangle detector
Cluster torsos
Final torso detector
Find arms and legs
Using a template rectangle detector near detected torsos
+
Clustering (as for torso)
Results
Weaknesses
• The clustering step only works for sequences where:
• limbs are reliably found by low-level detectors
• limbs look different from the background.
• If the algorithm produces bad clusters, the resulting appearance models will produce poor tracks.
Detect whole person
Stylized pose person detector
Person model
Build Model & Detect
Detection
Detect people by sampling from a one-leg, one-arm pictorial structure
Results
Ramanan et al. summary
• Strengths
• Works in spite of camera motion
• Robust to drift
• Auto-initializing
• Weaknesses
• Not applicable to real-time applications
• Makes use of many heuristics (too many?)
• May have problems dealing with lighting changes