Dynamic Human Shape Description and Characterization

Z. Cheng*†, S. Mosher†, Jeanne Smith†, H. Cheng‡, and K. Robinette‡

† Infoscitex Corporation, Dayton, Ohio, USA
‡ 711th Human Performance Wing, Air Force Research Laboratory, Dayton, Ohio, USA
*Corresponding author. Email: [email protected]

Abstract

Dynamic human shape description and characterization are investigated in this paper. The dynamic shapes of a subject in four activities (jogging, limping, shooting, and walking) were generated via 3-D motion replication. The Paquet Shape Descriptor (PSD) was used to describe the shape of the subject in each frame. The unique features of dynamic human shapes were revealed by examining 3-D plots of the PSDs over time. Principal component analysis (PCA) was performed on the calculated PSDs, and the resulting principal components (PCs) were used to characterize them. The PSD was then reasonably approximated by its first few projections in the eigenspace formed by the PCs and represented by the corresponding projection coefficients. As such, the dynamic human shapes for each activity were described by these projection coefficients. Based on the projection coefficients, data mining technology was employed for activity classification. Case studies were performed to validate the methodology developed.

Keywords: Human Modeling, Dynamic Shape, Shape Descriptor, Principal Component Analysis, Activity Recognition

1. Introduction

While a human is moving or performing an action, the body shape changes dynamically. In other words, shape change and motion are tied together during a human action (activity). However, human shape and motion are often treated separately in activity recognition. Shape dynamics describe the spatial-temporal shape deformation of an object during its movement and thus provide important information about the identity of a subject and the motions performed by the subject (Jin and Mokhtarian, 2006).

A few researchers have utilized shape dynamics for human activity recognition. In (Kilner et al., 2009), the authors addressed the problem of human action matching in outdoor sports broadcast environments by analyzing 3-D data from a recorded human activity and retrieving the most appropriate proxy action from a motion capture library. In (Niebles and Li, 2007), a video sequence was represented as a collection of spatial and spatial-temporal features obtained by extracting static and dynamic interest points; a hierarchical model was then proposed that can be characterized as a constellation of bags-of-features, both spatial and temporal. In (Jin and Mokhtarian, 2006), a system was proposed for recognizing object motions based on their shape dynamics, with the spatial-temporal shape deformation in motions captured by hidden Markov models. In (Blank et al., 2005), human actions in video sequences were seen as silhouettes of a moving torso and protruding limbs undergoing articulated motion, and were regarded as three-dimensional shapes induced by the silhouettes in the space-time volume.

Dynamic human shapes can be described by a dynamic 3-D human shape model which, in turn, can be extracted from 2-D video imagery or 3-D sensor data, or created by 3-D replication/animation. A dynamic 3-D shape model usually contains tens of thousands of graphic elements (vertices or polygons). In order to use the information coded in dynamic shapes for human identification and activity recognition, it is necessary to find an effective method for dynamic shape description and characterization.

2. Dynamic Shape Creation
Since the technologies capable of capturing the 3-D dynamic shapes of a subject during motion are still very limited in maturity and availability, very little dynamic human shape data are available at this time. However, as a motion capture system can be used to capture human motion and a laser scanner can be used to capture human body shape, various techniques have been developed to replicate/animate human motion in 3-D space, thus generating the dynamic shapes of a subject in an action.

In this paper, Blender (http://www.blender.org), an open-source software tool, was used to animate the motion of a human subject in 3-D space during four different activities: walking, jogging, limping, and shooting. The data used as the basis for the animation, including scan data and motion capture (MoCap) data, were acquired in the Human Signatures Laboratory of the US Air Force. The human subject, with markers attached, was scanned using the Cyberware whole-body scanner. Motion capture data were acquired for the same subject with the same markers attached. The markers allowed the joint centers to be determined for both the scan and the MoCap data. The scan was imported into Blender, and the joint centers were used to define the skeleton in the BVH (Bio-vision Hierarchical) file format. Euler angles for the different body segments were computed from the joint centers and other markers and used to set up BVH files for the four different activities. The BVH files were imported into Blender and used to animate the whole-body scan of the subject. Figure 1 shows images captured from the animation within Blender for the four activities. From the Blender animation, a 3-D mesh can be output at each frame of motion, as shown in Fig. 2 for limping, which represents the 3-D dynamic body shape of the subject at that instant of the motion. Thus, the 3-D mesh output at each frame can be used as simulation data of dynamic human shapes for training the algorithms developed for activity recognition.

Figure 1. Replication of a subject in four activities: limping, jogging, shooting, and walking.

Figure 2. Dynamic shapes of a subject during limping.

3. Dynamic Shape Description and Characterization

The dynamic shapes shown in Figs. 1 and 2 are represented by 3-D meshes. Each mesh may contain as many as tens of thousands of graphical elements (vertices or polygons). It is not feasible to use the vertices or polygons directly for the analysis of human shape dynamics. One way to effectively describe dynamic shapes and to enable further analysis is to use a shape descriptor (Cohen and Li, 2003; Chu and Cohen, 2005). In this paper, the Paquet Shape Descriptor (PSD) (Paquet et al., 2000; Robinette, 2003), with certain modifications, is used to describe dynamic shapes and to analyze shape dynamics. As illustrated in Fig. 3, the PSD uses 120 bins (discrete parameters) to characterize shape variation: 40 bins are related to the radius r, 40 to the first angle (cos θ), and 40 to the second angle (cos δ). The details of the PSD calculation are omitted here.

Figure 3. Paquet shape descriptor and its coordinate system.
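Although the full PSD computation is beyond the scope of this paper, the following minimal sketch conveys the idea of a cord-based, 120-bin histogram descriptor. It is an illustration only: the use of mesh vertices rather than triangle centroids, the normalization, the bin ranges, and the choice of principal directions as the angular reference are assumptions of this sketch and may differ from the exact definition of Paquet et al. (2000).

```python
import numpy as np

def psd_descriptor(vertices, n_bins=40):
    """A 120-bin, cord-based shape histogram in the spirit of the Paquet
    Shape Descriptor: 40 bins for the normalized cord length r and 40 each
    for the cosines of two cord angles (cos theta, cos delta).

    vertices: (N, 3) array of mesh vertex coordinates. Using vertices in
    place of triangle centroids is a simplification of this sketch.
    """
    center = vertices.mean(axis=0)           # center of mass, uniform weights
    cords = vertices - center                # cords from the center to the surface
    length = np.linalg.norm(cords, axis=1)
    r = length / length.max()                # radius normalized to [0, 1]

    # Reference axes from an SVD of the cord cloud; taking the first two
    # principal directions as the angular reference is an assumption here.
    _, _, vt = np.linalg.svd(cords, full_matrices=False)
    unit = cords / length[:, None]
    cos_theta = unit @ vt[0]                 # cosine of the first cord angle
    cos_delta = unit @ vt[1]                 # cosine of the second cord angle

    h_r, _ = np.histogram(r, bins=n_bins, range=(0.0, 1.0), density=True)
    h_t, _ = np.histogram(cos_theta, bins=n_bins, range=(-1.0, 1.0), density=True)
    h_d, _ = np.histogram(cos_delta, bins=n_bins, range=(-1.0, 1.0), density=True)
    return np.concatenate([h_r, h_t, h_d])   # one 120-bin vector per frame
```

Applied to the mesh exported at every animation frame, such a function yields one 120-bin vector per frame; stacking these vectors column-wise gives the descriptor matrix analyzed below.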
The 3-D plots of the time histories of the PSD for the four activities are illustrated in Fig. 4, where the first 40 bins corresponding to the radius are shown at the top, the second 40 bins of cos θ in the middle, and the last 40 bins of cos δ at the bottom. By visually inspecting these plots, one can find that:

- The variation of each bin over time is different: the variations of some bins over time are large and significant, while others are not.
- Periodic features are exhibited by the plots for the activities of jogging, limping, and walking.
- The 3-D plot for each activity is unique; there are visible and significant differences among the plots for the four activities.

These observations reveal some unique features of shape dynamics. However, directly using the PSD to analyze shape dynamics is still not feasible, since its 120 bins (variables) form a 120-dimensional space. Further treatment is necessary to characterize the shape descriptor and to reduce the dimension of the problem space.

Figure 4. The time histories of the 120 bins of the PSD for four activities.

Therefore, principal component analysis (PCA) is used to characterize the high-dimensional space defined by the PSD. Denote

$\mathbf{p}_{ijk} = \{p_1 \; p_2 \; \cdots \; p_{120}\}^{T}_{ijk}$,    (1)

as the PSD shape descriptor for the i-th subject in the j-th activity at the k-th frame. For the data collected, denote

$\mathbf{P} = \{\mathbf{p}_{ijk}\}, \quad i = 1,\ldots,I;\ j = 1,\ldots,J;\ k = 1,\ldots,K$,    (2)

where I represents the number of subjects, J is the number of actions, and K is the number of frames for each action. Note, however, that the number of activities each subject performs can differ, and the number of frames for each activity can differ as well. By performing PCA on P, one can find the principal components that characterize the space defined by the shape descriptor.

In this paper, dynamic shapes were created for the four activities at a frame interval of 0.02 s, with 85 frames for jogging, 352 frames for limping, 554 frames for shooting, and 227 frames for walking. The percentage of variance of each principal component (PC) is shown in Fig. 5, and the first four PCs are shown in Fig. 6. The original PSD vector can be projected onto the space (eigenspace) formed by the PCs, that is, it can be expanded in terms of the PCs. As shown in Fig. 5, among all 120 PCs, only the first 10–20 are significant. This means that the original PSD can be reasonably approximated by its first few projections in the eigenspace and represented by the projection coefficients corresponding to these significant PCs. Figure 7 illustrates the time histories of the first and second projection coefficients for the four activities.

Figure 5. Percentage of variance of each principal component.

Figure 6. First four principal components of PSD.
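As a concrete illustration of this PCA step, the sketch below computes the per-component variance percentages of Fig. 5 from a column-wise stack of PSD vectors. The random matrix is only a placeholder for the real 120 x 1218 descriptor data, and centering each bin before the decomposition is an assumption of this sketch.

```python
import numpy as np

# P: 120 x N matrix of PSD vectors, one column per frame (N = 1218 here).
# The random matrix below is a placeholder for the real descriptor data.
rng = np.random.default_rng(0)
P = rng.random((120, 1218))

# Center each bin (row) and obtain the principal components via SVD;
# the columns of U are the eigenvectors v_m of the bin covariance.
Pc = P - P.mean(axis=1, keepdims=True)
U, s, _ = np.linalg.svd(Pc, full_matrices=False)

# Percentage of variance carried by each principal component (cf. Fig. 5).
var_pct = 100.0 * s**2 / np.sum(s**2)
print("variance captured by the first 10 PCs (%):", np.round(var_pct[:10], 2))
```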
Figure 7. Time histories of the first and second projection coefficients for four activities.

Denote

$\mathbf{W}_M = \{\mathbf{v}_1 \; \mathbf{v}_2 \; \cdots \; \mathbf{v}_M\}$,    (3)

where $\mathbf{v}_m$ is the m-th principal component (eigenvector) of $\mathbf{P}$. The original observations (data) can be projected onto the space defined by $\mathbf{W}_M$, that is,

$\mathbf{Y}_M^T = \mathbf{P}^T \mathbf{W}_M$,    (4)

where $\mathbf{Y}_M: Y[M, N]$ is the matrix of projection coefficients, each column of which corresponds to one original record, M is the dimension of the shape descriptor (M = 120 for the PSD), and N is the total number of shapes observed (N = 1218 for the case in this paper). From Fig. 5 we can see that, among the total of 120 principal components, fewer than 20 are significant. This means that instead of using the full space of dimension M, one can construct a new space with only the significant principal components, that is,

$\mathbf{W}_L = \{\mathbf{v}_1 \; \mathbf{v}_2 \; \cdots \; \mathbf{v}_L\}, \quad L \ll M$,    (5)

which substantially reduces the dimension of the space. For the case investigated in this paper, L = 20, which is much less than M = 120. The projection onto this reduced space is then given by

$\mathbf{Y}_L^T = \mathbf{P}^T \mathbf{W}_L$,    (6)

where $\mathbf{Y}_L: Y[L, N]$. Each original record can be either fully reconstructed by Eq. (4) or partially reconstructed (approximated) by Eq. (6). Usually an original record can be well approximated by its partial reconstruction with the significant principal components. This means that the original data of dimension M can be represented by its projection coefficients of dimension L (L ≪ M). In the space of reduced dimension, the problem becomes tractable, as the number of variables becomes much smaller. In fact, for the case in this paper, the two projection coefficients corresponding to the first two most significant principal components are sufficient to represent the shape dynamics for action recognition. The sequence of a projection coefficient over the frames for a particular subject in a particular action constitutes a time series, as shown in Fig. 7. The time histories of the first and second coefficients are unique with respect to each action and can therefore be used as discriminators for activity recognition.

4. Activity Recognition Based on Shape Dynamics

The shape dynamics of a subject during motion, as described in Section 3, can be used for activity recognition. In this paper, a data mining tool was employed to classify the four activities (jog, limp, shoot, walk) based on 85 frames from each activity. Note that in the classification, each frame was treated independently rather than being placed in sequence as a time series.
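The per-frame coefficients that serve as classification features below come directly from the projection of Eqs. (4)-(6). A standalone numpy sketch of that step follows; the random matrix again stands in for the real 120 x 1218 PSD data, and the row-centering is an added assumption (the equations above project P directly).

```python
import numpy as np

# Placeholder for the real 120 x 1218 PSD matrix (one column per frame).
rng = np.random.default_rng(0)
P = rng.random((120, 1218))
Pc = P - P.mean(axis=1, keepdims=True)   # centering is an assumption here

U, s, _ = np.linalg.svd(Pc, full_matrices=False)

L = 20                                   # number of significant PCs retained
W_L = U[:, :L]                           # Eq. (5): W_L = {v_1 ... v_L}
Y_L = W_L.T @ Pc                         # Eq. (6): Y_L is L x N

# Partial reconstruction of every record from only L coefficients.
P_hat = W_L @ Y_L
rel_err = np.linalg.norm(Pc - P_hat) / np.linalg.norm(Pc)
print(f"relative reconstruction error with L = {L}: {rel_err:.3f}")

# The first two rows of Y_L are the PC1/PC2 time series of Fig. 7 and
# supply the per-frame attributes used for classification in Section 4.
pc1_series, pc2_series = Y_L[0], Y_L[1]
```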
Five attributes were used in the classification: (a) Pelvis_Velocity, the resultant velocity at the mid-pelvis location; (b) PC1, the first projection coefficient; (c) PC2, the second projection coefficient; (d) PC1_Velocity, the time derivative of PC1; and (e) PC2_Velocity, the time derivative of PC2. The significance of each attribute can be assessed in terms of gain ratio, as given in Table 1. While Pelvis_Velocity is the most significant, all five attributes were selected for classification. Various classification methods are available, such as those provided by Weka (http://www.cs.waikato.ac.nz/ml/weka/). Among them, the five conventional methods listed in Table 2 were chosen for the case study. All of them achieved classification accuracy greater than 95%, as shown in Table 2.

Table 1. Attribute ranking results.

Table 2. Classification accuracy.
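To make this classification setup concrete, the sketch below mirrors it with scikit-learn in place of Weka, using an entropy-based decision tree (roughly analogous to Weka's J48/C4.5; the five methods actually used are those listed in Table 2, which this sketch does not reproduce). All data are random placeholders, and the helper name frame_features, the 0.02 s frame interval used for the derivatives, and the 10-fold cross-validation are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def frame_features(pelvis_speed, pc1, pc2, dt=0.02):
    """Per-frame attribute vector: [Pelvis_Velocity, PC1, PC2,
    PC1_Velocity, PC2_Velocity], with the velocities taken as finite
    differences of the coefficient series (dt = 0.02 s frame interval)."""
    pc1_vel = np.gradient(pc1, dt)
    pc2_vel = np.gradient(pc2, dt)
    return np.column_stack([pelvis_speed, pc1, pc2, pc1_vel, pc2_vel])

# Placeholder data: 85 frames for each of the four activities.
rng = np.random.default_rng(0)
X = np.vstack([frame_features(rng.random(85), rng.random(85), rng.random(85))
               for _ in range(4)])
y = np.repeat(["jog", "limp", "shoot", "walk"], 85)

# Entropy-based decision tree; each frame is an independent sample,
# matching the paper's setup of not using the frames as a time series.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
print("10-fold CV accuracy:", cross_val_score(clf, X, y, cv=10).mean())
```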
5. Conclusion

Based on the study in this paper, the following conclusions are in order:

- Shape dynamics contain information about both body motion and shape changes and have great potential for human identification and activity recognition.
- Shape dynamics can be well captured by a shape descriptor and further characterized by principal components.
- Human motion/action in 3-D space can be replicated or animated with high biofidelity, which can be used to generate data for training a model or for evaluating the performance of a tool.
- Using a dynamic 3-D human shape model for human activity recognition is plausible. This approach is unique in that it differs from conventional techniques based on 2-D imagery or models, and it is effective in that it can overcome the shortcomings inherent in 2-D methods.
- As a shape descriptor, the PSD is not reversible. This means that while it can be used for analysis, as in this paper, it cannot be used for shape reconstruction. Also, spatial information may not be uniquely represented in the original definition of the PSD, which can be remedied by certain treatments or modifications.

It should be pointed out that the dynamic shape models used in this study were created from 3-D surface scan data and motion capture data using OpenSim and Blender. While these models provide a highly biofidelic description of body shape during motion, the body surface deformation may not be fully or accurately represented by them. However, since the body shape variation induced by the articulated motion is much larger than the surface deformation, most observations and results of this paper can reasonably be postulated to hold even if the surface deformation were represented more precisely. Further investigation is needed to validate this assumption.

Acknowledgement

This study was carried out with the support of SBIR Phase I funding (FA8650-10-M-6092) provided by the US Air Force.

References

Blank M, Gorelick L, Shechtman E, Irani M, and Basri R, 2005. Actions as Space-Time Shapes. In: Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV'05).

Chu C-W and Cohen I, 2005. Posture and Gesture Recognition using 3-D Body Shapes Decomposition. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

Cohen I and Li H, 2003. Inference of Human Posture by Classification of 3-D Human Body Shape. In: IEEE International Workshop on Analysis and Modeling of Faces and Gestures, ICCV 2003.

Jin N and Mokhtarian F, 2006. A Non-Parametric HMM Learning Method for Shape Dynamics with Application to Human Motion Recognition. In: The 18th International Conference on Pattern Recognition (ICPR'06).

Kilner J, Guillemaut J-Y, and Hilton A, 2009. 3-D Action Matching with Key-Pose Detection. In: 2009 IEEE 12th International Conference on Computer Vision Workshops.

Niebles J-C and Li F-F, 2007. A Hierarchical Model of Shape and Appearance for Human Action Classification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2007).

Paquet E, Rioux M, Murching A, Naveen T, and Tabatabai A, 2000. Description of shape information for 2-D and 3-D objects. Signal Processing: Image Communication 16, pp. 103-122.

Robinette K, 2003. An Investigation of 3-D Anthropometric Shape Descriptors for Database Mining. Ph.D. Thesis, University of Cincinnati.