
RECOGNISING INTERACTION
ACTIVITY
WEI-SHI ZHENG
郑伟诗
http://sist.sysu.edu.cn/~zhwshi/
SUN YAT-SEN University
1
Introduction to My Research
2
OUTLINE
1
Introduction
2
Human-Object Interaction
3
Collective Activity
4
Summary
3
1. Background

Why Learn Interaction Activities?
A woman drinks using a cup
Human + Object + Interactions
A woman spikes a volleyball
4
1. Introduction

Can you guess what is happening to
them?
5
1. Introduction

Are you right?
6
1. Introduction

Why Learn Interaction Activities?
We interact with others every day
7
1. Introduction

The Challenges
Can you detect
the ball
accurately?
8
1. Introduction

The Challenges
Different Poses
9
1. Introduction

The Challenges
Queuing?? Talking??
Local does not mean global:
Individual actions are similar (≈), but the collective activities differ (≠)
Talking!! Queuing!!
10
1. Introduction

Our work in this presentation
Human-Object Interaction
(ICCV 2013; IEEE TCSVT 2015; CVPR 2015)
Collective Activities: Fewer People (Less Occlusion), More Interactive
(IEEE TIP 2015)
11
OUTLINE
1
Introduction
2
Human-Object Interaction
Jian-Fang Hu, Wei-Shi Zheng*, Jian-huang Lai, Shaogang Gong,
and Tao Xiang. Exemplar-based Recognition of Human-Object
Interactions. IEEE Transactions on Circuits and Systems for Video
Technology (DOI: 10.1109/TCSVT.2015.2397200), 2015. (ICCV 2013)
3
Collective Activity
4
Summary
12
2. Human-Object Interaction: Exemplar
Related work: Human-Object
- Mining object patterns from many images.
- Detecting humans and computing their relationship to the objects.

[Prest & Schmid, TPAMI 2012]
13
2. Human-Object Interaction: Exemplar
Related work: Mutual Context

Using a mutual
context model to
jointly model
interactions
between objects
and human
poses.
[Yao & Fei-Fei,
TPAMI 2012]
14
2. Human-Object Interaction: Exemplar
Related work: Visual Phrases

Using complex visual composites and object
models to make independent predictions.
[Sadeghi, CVPR 2011]
15
2. Human-Object Interaction: Exemplar

Summary: Two popular models
- Geometric relationship (distance and angle)
- Binary patterns
Both rely heavily on accurate and robust object detection and pose estimation.

Our proposed model
- Pose-Object: a probabilistic framework
- Includes semantic information
- Overcomes inaccurate detection and pose estimation
16
2. Human-Object Interaction: Exemplar
Our Observations

Motivation:


- For a specific pose, the manipulated object tends to appear at similar relative positions.
- Each exemplar is modelled as a Gaussian function.
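A minimal sketch of the Gaussian exemplar idea, assuming each (atomic pose, object) exemplar is summarised by the mean and covariance of the object's position relative to the person. The function names, the normalisation of the offsets, and the toy numbers are hypothetical and are not taken from the talk.

```python
import numpy as np

def fit_exemplar(rel_positions):
    """Fit a 2-D Gaussian to object positions expressed relative to the person.

    rel_positions: (n, 2) array of object-centre offsets (assumed here to be
    divided by the person's bounding-box height).
    """
    rel_positions = np.asarray(rel_positions, dtype=float)
    mean = rel_positions.mean(axis=0)
    # A small ridge keeps the covariance invertible for tiny sample sets.
    cov = np.cov(rel_positions, rowvar=False) + 1e-6 * np.eye(2)
    return mean, cov

def exemplar_response(offset, mean, cov):
    """Unnormalised Gaussian response in [0, 1]; equals 1 at the mean."""
    diff = np.asarray(offset, dtype=float) - mean
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)))

# Toy data: a cup held near the face for a "drinking" pose (made-up numbers).
train = [[0.10, -0.40], [0.12, -0.38], [0.08, -0.42], [0.11, -0.41]]
mean, cov = fit_exemplar(train)
near = exemplar_response([0.10, -0.40], mean, cov)  # typical cup position
far = exemplar_response([0.60, 0.30], mean, cov)    # implausible position
```

An object detected where the exemplar expects it receives a high response, while a detection far from the expected relative position is suppressed, which is one way to tolerate noisy detections.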
17
2. Human-Object Interaction: Exemplar
Learned Exemplars
Observations:
- An atomic pose can interact with two or more objects.
- An object can interact with multiple atomic poses.
- Each pose-object pair corresponds to one exemplar.
18
2. Human-Object Interaction: Exemplar

Spatial Pose-Object Interaction Using Exemplars
object
atomic pose
19
2. Human-Object Interaction: Exemplar
Exemplar-based representation for Video
20
2. Human-Object Interaction: Exemplar

Visualisation
21
2. Human-Object Interaction: Exemplar

Exemplar-based HOI Representation
The appearance interaction response
Object Detection Score (O)
Pose Appearance (P)
Spatial Pose-Object
Interaction (I)

Contextual Information (C)
Action-specific Ranking Matching Model
Assign higher score to
the ground truth label
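The goal of assigning a higher score to the ground-truth label can be illustrated with a generic margin-based ranking loss. This is only a sketch under stated assumptions: the linear per-action weights over the four cues (O, P, I, C), the margin of 1, and all numbers are hypothetical, and the talk's actual matching model is not reproduced here.

```python
import numpy as np

def ranking_hinge_loss(scores, true_label, margin=1.0):
    """Sum of margin violations: every wrong label should score at least
    `margin` below the ground-truth label."""
    scores = np.asarray(scores, dtype=float)
    violations = np.maximum(0.0, margin + scores - scores[true_label])
    violations[true_label] = 0.0  # the true label is not compared with itself
    return float(violations.sum())

# Hypothetical per-action weights over the four cues (O, P, I, C).
weights = np.array([[0.5, 0.2, 1.0, 0.3],   # action 0
                    [0.1, 0.9, 0.2, 0.4],   # action 1
                    [0.3, 0.3, 0.3, 0.3]])  # action 2
cues = np.array([0.8, 0.1, 0.9, 0.5])       # made-up O, P, I, C responses
scores = weights @ cues                     # one score per action
loss = ranking_hinge_loss(scores, true_label=0)
predicted = int(np.argmax(scores))
```

The loss is zero only when the ground-truth action outranks every other action by the margin, so minimising it pushes the correct label to the top of the ranking.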
22
2. Human-Object Interaction: Exemplar

Action-specific Ranking Matching Model
23
2. Human-Object Interaction: Exemplar
Evaluation
- Sports: 6 classes, 300 images; 30 images for training and 20 for testing per class
- PPMI: 24 classes; 100 images for training and 100 for testing per class
- Gupta: 6 classes, 60 videos
- SYSU Action dataset (new collection, http://sist.sysu.edu.cn/~zhwshi/students/jianfang/HomePage.htm): 6 classes, 119 videos, sitting and standing
24
2. Human-Object Interaction: Exemplar
Samples of SYSU Set
25
2. Human-Object Interaction: Exemplar
Evaluation of HOI descriptor

Observations:



- Our spatial exemplar-based HOI feature has an important impact, especially on the SYSU Action dataset.
- The scene context is important in the two still-image datasets (Sports and PPMI).
- The object context is important in the Gupta set.
26
2. Human-Object Interaction: Exemplar

Sports

PPMI

Gupta & SYSU Action
27
2. Human-Object Interaction: Exemplar
28
OUTLINE
1
Introduction
2
Human-Object Interaction
Jian-Fang Hu, Wei-Shi Zheng*, Jian-huang Lai, Jianguo Zhang,
"Jointly Learning Heterogeneous Features for RGB-D Activity
Recognition," IEEE Conf. on Computer Vision and Pattern
Recognition, June 2015.
3
Collective Activity
4
Summary
29
2. Human-Object Interaction: RGBD

Depth-based Modelling
- insensitive to illumination variations
- invariant to color and texture changes
- reliable for estimating body silhouette and skeleton (human posture)
Kinect Device

Example sequence captured by Kinect
RGB
Depth
Skeleton
30
2. Human-Object Interaction: RGBD
Related work: Depth-based Descriptor

Discriminative Depth Descriptors
- HON4D [Oreifej, CVPR 2013]
- Super normal [Yang, CVPR 2014]
- Actionlet [Wang, TPAMI 2014]
- Range Sample [Lu, CVPR 2014]
- ……
Comment: These do not use color information to facilitate recognition.
31
2. Human-Object Interaction: RGBD
Related work: RGB-D based

Fusion of RGB features and Depth features
- Deep Learning [Shao, IJCAI 2013]
- Random Forests [Zhu, CVPRW 2013]
- Structured sparsity model [Shahroudy, ISCCSP 2014]
- ……
Comment: These ignore the connections between the RGB and Depth channels.
32
2. Human-Object Interaction: RGBD
Our Observations
The RGB and Depth channels do share some structures.
Heterogeneous features extracted from the Depth and RGB channels should be
jointly learned, with both shared and specific structures considered.
33
2. Human-Object Interaction: RGBD
Our Joint learning model
Fig. A graphic illustration of our joint learning framework
34
2. Human-Object Interaction: RGBD
Model Formulation (JOULE)
--- joint heterogeneous features learning
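Prediction loss + reconstruction loss + regularization terms:

$$
\min_{W_0,\{W_i\},\{\Theta_i\}} \; \sum_{i=1}^{S} \Big( \big\| (W_0 + W_i)^{T} \Theta_i^{T} X_i - Y \big\|_F^2 + \lambda_1 \big\| X_i - \Theta_i \Theta_i^{T} X_i \big\|_F^2 + \lambda_2 \| W_i \|_F^2 \Big) + \lambda_3 \| W_0 \|_F^2
$$

Orthogonality constraints:

$$
\text{s.t. } \Theta_i^{T} \Theta_i = I, \quad i = 1, 2, \ldots, S
$$

where $\lambda_1, \lambda_2, \lambda_3 \ge 0$ denote the trade-off weights of the three penalty terms.
- Jointly learning the shared and specific structures among different features.
- Jointly learning the shared structure across different activity classes.
- Learning structures guided by recognition.
35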
2. Human-Object Interaction: RGBD
Model Optimization

Propose a three-step iterative optimization
algorithm.

Until
convergence!
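The slide does not spell out the three steps, so the following is only a simplified sketch of alternating minimisation on an objective built from the terms named on the previous slide (prediction loss, reconstruction loss, and Frobenius regularisers). It is not the authors' algorithm: the projections are fixed to a PCA basis, which satisfies the orthogonality constraint but is never updated, and only the shared and feature-specific weights are alternated. All names (`fit`, `lam_rec`, `lam_spec`, `lam_shared`) and the toy data are hypothetical.

```python
import numpy as np

def objective(Xs, Y, W0, Ws, Thetas, lam_rec, lam_spec, lam_shared):
    """Prediction + reconstruction + regularisation, summed over feature types."""
    total = lam_shared * np.sum(W0 ** 2)
    for X, W, T in zip(Xs, Ws, Thetas):
        Z = T.T @ X                                   # projected features (d, n)
        total += np.sum(((W0 + W).T @ Z - Y) ** 2)    # prediction loss
        total += lam_rec * np.sum((X - T @ Z) ** 2)   # reconstruction loss
        total += lam_spec * np.sum(W ** 2)            # specific-weight penalty
    return total

def pca_basis(X, d):
    """Orthonormal basis (D, d) from the top-d left singular vectors of X."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :d]

def fit(Xs, Y, d=3, lam_rec=1.0, lam_spec=1.0, lam_shared=1.0, iters=20):
    Thetas = [pca_basis(X, d) for X in Xs]            # fixed in this sketch
    Zs = [T.T @ X for T, X in zip(Thetas, Xs)]
    c = Y.shape[0]
    W0 = np.zeros((d, c))
    Ws = [np.zeros((d, c)) for _ in Xs]
    history = []
    for _ in range(iters):
        # Step A: closed-form ridge update of the shared weights W0.
        A = sum(Z @ Z.T for Z in Zs) + lam_shared * np.eye(d)
        B = sum(Z @ (Y - W.T @ Z).T for Z, W in zip(Zs, Ws))
        W0 = np.linalg.solve(A, B)
        # Step B: closed-form ridge update of each feature-specific W_i.
        Ws = [np.linalg.solve(Z @ Z.T + lam_spec * np.eye(d),
                              Z @ (Y - W0.T @ Z).T) for Z in Zs]
        history.append(objective(Xs, Y, W0, Ws, Thetas,
                                 lam_rec, lam_spec, lam_shared))
    return W0, Ws, Thetas, history

# Toy data: two feature types observed on the same 12 samples, 3 classes.
rng = np.random.default_rng(0)
Xs = [rng.normal(size=(5, 12)), rng.normal(size=(7, 12))]
Y = np.eye(3)[:, rng.integers(0, 3, size=12)]         # one-hot labels (3, 12)
W0, Ws, Thetas, history = fit(Xs, Y)
```

Because each step minimises the objective exactly over one block of variables while the others are held fixed, the recorded objective values never increase, which is the property that makes a "repeat until convergence" loop well behaved.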


36
2. Human-Object Interaction: RGBD
Experiments

- MSR Daily: 16 activities, 320 video clips, 10 participants
- CAD 60: 14 activities, 68 video clips, 4 participants
- SYSU 3D HOI (new collection, will release soon): 12 HOI activities, 480 video clips, 40 participants
37
2. Human-Object Interaction: RGBD

Samples of SYSU Set
More objects to manipulate; more participants; more similar motions
38
2. Human-Object Interaction: RGBD
Results on MSR Daily Set
39
2. Human-Object Interaction: RGBD
Results on CAD60
40
2. Human-Object Interaction: RGBD
Results on SYSU 3D HOI Set

Setting-1: half of the samples of each activity class are selected for
training and the rest for testing.
Setting-2: the cross-subject setting.
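The two evaluation settings can be sketched as follows. The helper names and the toy labels are hypothetical, and which participants go into the training set for the cross-subject setting is not stated on the slide, so it is passed in as a parameter.

```python
import numpy as np

def half_split_per_class(labels, seed=0):
    """Setting-1 style: within every class, a random half goes to training."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train = np.zeros(len(labels), dtype=bool)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        train[idx[: len(idx) // 2]] = True
    return train, ~train

def cross_subject_split(subjects, train_subjects):
    """Setting-2 style: no participant appears in both training and testing."""
    train = np.isin(subjects, list(train_subjects))
    return train, ~train

# Toy metadata: 4 clips per class, recorded by 4 participants (made-up values).
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
subjects = np.array([1, 2, 3, 4, 1, 2, 3, 4])
tr1, te1 = half_split_per_class(labels)
tr2, te2 = cross_subject_split(subjects, train_subjects={1, 2})
```

The cross-subject split is the stricter test, since the model never sees the test participants during training.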
41
2. Human-Object Interaction: RGBD
Parameter Evaluation
Fig. Effect of the subspace dimension parameter on
system performance.
42
OUTLINE
1
Introduction
2
Human-Object Interaction
3
Collective Activity
Xiaobin Chang, Wei-Shi Zheng*, and Jianguo Zhang. Learning
Person-Person Interaction in Collective Activity Recognition. IEEE
Transactions on Image Processing, vol. 24, no. 6, pp. 1905-1918,
2015.
4
Summary
43
3. Learning Person–Person Interaction

Related Work: Spatial Temporal Model
Choi et al., ICCVW 2009
1. Capturing the spatial distribution of collective activity.
2. Capturing the temporal variation of the spatial distribution.
44
3. Learning Person–Person Interaction

Related Work: Spatial Temporal Model
Choi et al., CVPR 2011
Capturing spatio-temporal information and also selecting the most
discriminative parts of it for collective activity recognition.
45
3. Learning Person–Person Interaction

Related Work: Hierarchical Model
Lan et al., TPAMI 2012
1. Collective activity is based on the action of each person.
2. The connections among people can be inferred as latent variables.
46
3. Learning Person–Person Interaction

Related Work: Hierarchical Model
Amer et al., ICCV 2013
1. Three layers are used: object level, individual action level, and
collective activity level, from bottom to top.
2. An And-Or graph is used to model these three layers.
47
3. Learning Person–Person Interaction

Related Work: Interactive Phrase
Kong et al., TPAMI 2013
1. Describes person-person interaction by capturing interaction patterns
through the motion relationships between body parts.
2. The interaction is captured only implicitly by the model.
48
3. Learning Person–Person Interaction

Related Work: Interactive Descriptor
Tran et al., Pattern Recognition Letters 2014
A descriptor called LGA is used to capture the interactions among people
for collective activity recognition.
49
3. Learning Person–Person Interaction

Related Work: Combine Model
Choi et al., TPAMI 2014
1. This model combines different tasks: collective activity recognition,
interaction recognition, individual action recognition, and multi-person tracking.
2. It assumes that the different tasks can benefit from each other during
learning.
3. Hard to optimise and requires many manual labels.
50
3. Learning Person–Person Interaction

A Complete Learning Approach
Short Video Clips (~15 frames)
Spatial-Temporal Feature of Each Person's Action
Focused on Modeling Person-Person Interaction
Two connected atomic activities in one collective activity are either:
1) quite similar and spatially close to each other, forming a meaningful
collective activity (e.g. two people walking together); or
2) not very similar but strongly interacting with each other
(e.g. two people facing each other when talking or fighting).
51
3. Learning Person–Person Interaction

Inference
person-person interaction
under the collective activity m
Interaction Response:
Summarising all the
person-person
interactions
Inference:
The Interaction Response
should be maximised under
the ground-truth collective activity
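A minimal sketch of this inference rule, assuming the interaction between persons i and j under activity m is a bilinear form x_i^T Ω_m x_j on their feature vectors and that the interaction response is the sum over all ordered pairs. The bilinear form, the function names, and the toy matrices are assumptions for illustration; the slide's own formula did not survive as text.

```python
import numpy as np

def interaction_response(features, omega):
    """Sum of x_i^T Omega x_j over all ordered pairs of distinct people."""
    X = np.asarray(features, dtype=float)      # (num_people, D)
    pairwise = X @ omega @ X.T                 # entry (i, j) = x_i^T Omega x_j
    return float(pairwise.sum() - np.trace(pairwise))

def predict_activity(features, omegas):
    """Pick the collective activity whose interaction response is largest."""
    responses = [interaction_response(features, om) for om in omegas]
    return int(np.argmax(responses)), responses

# Hypothetical per-activity matrices: activity 0 rewards similar features,
# activity 1 rewards opposite features.
omegas = [np.eye(2), -np.eye(2)]
aligned = [[1.0, 0.0], [1.0, 0.0]]             # two people, same feature
opposed = [[1.0, 0.0], [-1.0, 0.0]]            # two people, opposite features
label_aligned, resp_aligned = predict_activity(aligned, omegas)
label_opposed, resp_opposed = predict_activity(opposed, omegas)
```

In the toy example the same two-person scene is assigned to different activities depending only on how the two people relate to each other, not on either person alone.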
52
3. Learning Person–Person Interaction

Learning
Matrix Factorisation:
Advantages:
1. More effective
2. Low-rank representation
-log det regularisation term: keeps Lm of full rank
(avoids redundancy)
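A small sketch of why a -log det term discourages a redundant factor, assuming the interaction matrix is factorised as Ω = LᵀL with L of shape (d, D), d ≤ D, and the penalty is -log det(L Lᵀ). Both the exact factorisation and the exact form of the penalty are assumptions here, and the matrices are toy values.

```python
import numpy as np

def omega_from_factor(L):
    """Low-rank interaction matrix Omega = L^T L, with L of shape (d, D)."""
    return L.T @ L

def neg_logdet_penalty(L):
    """-log det(L L^T): large when the rows of L are nearly linearly
    dependent, infinite when they are exactly dependent."""
    sign, logdet = np.linalg.slogdet(L @ L.T)
    return np.inf if sign <= 0 else -logdet

L_good = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])           # orthonormal rows
L_near = np.array([[1.0, 0.0, 0.0],
                   [1.0, 1e-3, 0.0]])          # almost duplicated rows
L_dup = np.array([[1.0, 0.0, 0.0],
                  [2.0, 0.0, 0.0]])            # second row is redundant

penalties = [neg_logdet_penalty(L) for L in (L_good, L_near, L_dup)]
rank_good = np.linalg.matrix_rank(omega_from_factor(L_good))
```

The penalty grows as the rows of the factor become redundant, so minimising it alongside the data term keeps all d rows of L informative while Ω itself stays at rank d.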
53
3. Learning Person–Person Interaction

Multi-task Extension
Different collective activities are different but related.
Class-Specific: Global interaction is different.
Shared Aspects: Local interaction is sometimes similar:
person actions (standing, walking), spatial distribution, etc.
54
3. Learning Person–Person Interaction

Multi-task Extension
α controls the balance between the
shared parameter and the class-specific parameter in Ωm
for modelling Class-Specific Information
for modelling Shared Information
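One simple way to picture the role of α is a convex combination of a shared matrix and a class-specific matrix. This is purely an illustrative assumption: the slide's formula did not survive as text, so both the convex-combination form and the choice that α weights the shared part (rather than the class-specific part) are hypothetical, as are the names and toy matrices.

```python
import numpy as np

def combine(shared, specific, alpha):
    """Assumed form: alpha weights the shared part, (1 - alpha) the
    class-specific part."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * shared + (1.0 - alpha) * specific

shared = np.eye(2)                              # toy shared parameter
specific = np.array([[0.0, 1.0], [1.0, 0.0]])   # toy class-specific parameter

# Sweep alpha over 0.1, 0.2, ..., 0.9.
alphas = np.round(np.arange(0.1, 1.0, 0.1), 1)
omegas = {float(a): combine(shared, specific, float(a)) for a in alphas}
```

Under this assumed form, a larger α makes all classes behave more alike, while a smaller α lets each class keep its own interaction pattern.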
55
3. Learning Person–Person Interaction

Multi-task Extension
Gradient Descent &
Iterative Optimization
56
3. Learning Person–Person Interaction

Two Benchmark Datasets
1. Collective Activity Dataset (CAD)
- 44 video sequences; 5 activities (crossing, waiting, queuing, walking, talking)
- Exp. setting: a random 1/4 of the dataset is used for testing and the rest for training.
2. Choi's Dataset
- 32 video sequences; 6 activities (gathering, talking, dismissal, walking together, chasing, and queuing)
- Exp. setting: the standard experimental protocol of 3-fold cross validation.
57
3. Learning Person–Person Interaction

Visualization of the Learned Ω
CAD dataset:
Choi’s dataset:
58
3. Learning Person–Person Interaction

Results On Two Benchmark Datasets
CAD:
Choi’s
Dataset:
59
3. Learning Person–Person Interaction

Parameter Evaluations:
The impacts of α and d on CAD
The impacts of α and d on Choi’s Dataset
α varies from 0.1~0.9; d = {32,64,96,128,256,384}
60
3. Learning Person–Person Interaction

Effect of logdet regularization
The impact of β on both datasets
Performance drops markedly
without the -log det regularization (β = 0).
Performance becomes stable
when β = 0.3.
61
OUTLINE
1
Introduction
2
Human-Object Interaction
3
Collective Activity
4
Summary
62
Summary

Two types of interaction analysis are discussed:
- Human-Object Interaction
- Collective Activity

Three learning models are presented:
- Action-Specific Ranking
- Heterogeneous Fusion
- Person-Person Interaction Learning

Two new databases are contributed
63
Summary

References
Xiaobin Chang (student), Wei-Shi Zheng*, and Jianguo Zhang.
Learning Person-Person Interaction in Collective Activity Recognition.
IEEE Transactions on Image Processing, vol. 24, no. 6, pp. 1905-1918,
2015.
Jian-Fang Hu (student), Wei-Shi Zheng*, Jian-huang Lai, Shaogang
Gong, and Tao Xiang. Exemplar-based Recognition of Human-Object
Interactions. IEEE Transactions on Circuits and Systems for Video
Technology (DOI: 10.1109/TCSVT.2015.2397200), to appear, 2015.
Jian-Fang Hu (student), Wei-Shi Zheng*, Jian-huang Lai, Jianguo
Zhang, "Jointly Learning Heterogeneous Features for RGB-D Activity
Recognition," IEEE Conf. on Computer Vision and Pattern
Recognition, June 2015.
64
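Acknowledgements

Thanks to my students for their active participation in this research:
Jian-Fang Hu (胡建芳)
Xiaobin Chang (常晓斌)
Special thanks to Dong Le (董乐), and happy birthday!
Thanks to VALSE ONLINE for this opportunity!
Thanks to everyone for joining this talk during the Easter holiday!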

Q&A
65