The impact of Orthogonal variation in Chemometrics:
- Review of 15 years of method development and applications
Johan Trygg
Hans Stenlund, Erik Johansson, Max Bylesjö, Svante Wold
Computational life science cluster (CLiC)
Department of Chemistry,
Umeå University, Sweden
What is Orthogonal Variation?
The concept of Orthogonal variation
Defining properties of Orthogonal variation (X matrix)
• Systematic variation in X
• Orthogonal to Y (considering noise level in data)
• Belongs to the X space (i.e. you can use it to predict new samples)
Orthogonal variation is important for understanding a complex system
– Gender, Drift, Side reactions, Unknown interferents, Sampling,
Experimental problems, Non-linearities, biological variation
The new set of ‘O’-methods (OPLS, OPLS-DA, K-OPLS, O2PLS, OnPLS and
other related methods) divides the systematic X-variation into two parts:
– What in X is related to Y: Predictive variation
– What in X is uncorrelated to Y: Orthogonal variation
Orthogonal variation – schematic view
[Figure: X variables 1 to 5; each measured variable is split into Y-predictive variation, orthogonal variation and noise (% of variation)]
Measured signal is the sum of many contributing factors
– Pharmaceutical tablet formulation (e.g. binders, fillers, active
drug, lubricant)
– Human urine sample (e.g. genetics, diet, gender, age, stress,
disease)
– Plant biotech / Pulp & paper (e.g. wood species, cellulose &
lignin content, water, age)
– QSAR the molecular descriptors are a function of their
chemical and biological property/activity/function
Lots of unknown systematic variation
– mostly due to poor knowledge…
– strong dietary, environmental, hormonal variations, etc…
– Experimental variation, sampling, instrumental variation
– Input material varies with supplier
Orthogonal variation is included in the measured
data values, and forms an integral part of data
variability (multi-component)
Example: Two component system
[Figure: X matrix built from two components, the spectral profile of the predictive component (y1) and the spectral profile of the orthogonal component (y2); analysed with PLS]
X = y_1 x_1^T + y_2 x_2^T + E
Constraint: y_1 ⊥ y_2
2011-06-15
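A two-component system of this kind can be simulated in a few lines. The sketch below is illustrative only: the matrix sizes, the Gaussian profile shapes and the noise level are assumptions, not values taken from the slide. It builds X as the sum of a predictive and an orthogonal contribution, with y1 and y2 made exactly orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_vars = 50, 120          # arbitrary sizes chosen for illustration

# Two concentration-like vectors; Gram-Schmidt enforces the constraint y1 ⊥ y2.
y1 = rng.normal(size=n_samples)
y2 = rng.normal(size=n_samples)
y2 -= (y2 @ y1) / (y1 @ y1) * y1

# Two overlapping "spectral profiles" (hypothetical Gaussian peaks).
axis = np.arange(n_vars)
x1 = np.exp(-0.5 * ((axis - 50) / 10.0) ** 2)   # profile of the predictive component
x2 = np.exp(-0.5 * ((axis - 65) / 10.0) ** 2)   # profile of the orthogonal component

# X = y1 x1^T + y2 x2^T + E, with a small random residual E.
E = 0.01 * rng.normal(size=(n_samples, n_vars))
X = np.outer(y1, x1) + np.outer(y2, x2) + E
```

Because the two profiles overlap, a model of y1 from X has to deal with the y2 contribution even though y2 carries no information about y1.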
PLS results
Example: Two component system
Observed vs Predicted
One PLS component model
Observed vs Predicted
Two PLS component model
What about interpretation of PLS model
Example: Single-Y, two component system
[Figure: score plot and loading plot (w*1c1 vs w*2c2, with y), loading profiles p1 and p2 with weights w*1 and w*2, and regression coefficients b. R2X(1) = 94% variation, R2X(2) = 3.7% variation]
Target rotation – Olav Kvalheim
Kvalheim O M, Karstang T V. Interpretation of Latent-Variable Regression Models. Chemometrics Intell. Lab. Syst., 1989; 7: 39-51
• Although a PLS model for a single y-variable may have several components,
– only a single Y-related component exists
[Figure: target rotation diagram relating X, y, yhat, the regression coefficients b' and the loading py']
Orthogonal Signal Correction (OSC)
- Presented in 1997 in Lahti by S.Wold
• Basic idea, X → Y:
Remove structured noise (i.e. systematic variation) from X that is not correlated to Y
(i.e. Y^T t_o = 0)
– X = t_o p_o^T + X_E
Indirect approach
1.) Choose a starting vector t_o and use it as the y-variable
2.) Fit a PCA/PLS/PCR regression model with t_o as the
y-vector to find the orthogonal component
3.) Iterate if necessary
References:
OSC Wold (1998), OSC Sjöblom (1998)
DO Andersson (1999), DOSC Westerhuis
(2001), POSC Trygg (2002)
Direct approach
1.) Calculate the covariance matrix W = X^T Y
2.) Any vector in the row space of X that is orthogonal to W
gives an orthogonal component.
References:
OSC Fearn (2000),
OSC Höskuldsson (2001)
OPLS Trygg (2002)
Reviews: Svensson (2002), Goicoechea (2001)
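The direct approach can be sketched with plain linear algebra. The function below is a hypothetical illustration, not the algorithm of any one of the cited papers: it projects the variable space onto the complement of W = X^T Y and takes the largest-variance direction that remains (a PCA step, which is an assumption made here to pick one component among the many that qualify).

```python
import numpy as np

def direct_orthogonal_component(X, Y):
    """Return (t_o, p_o): one Y-orthogonal score vector and its loading."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float).reshape(len(Y), -1)
    W = X.T @ Y                                  # step 1: covariance direction(s)
    # Projector onto the complement of span(W) in variable space.
    Q = np.eye(X.shape[1]) - W @ np.linalg.pinv(W)
    # Step 2: choose the direction of largest variance inside that complement.
    _, _, Vt = np.linalg.svd(X @ Q, full_matrices=False)
    z = Q @ Vt[0]
    z /= np.linalg.norm(z)
    t_o = X @ z                                  # score; Y^T t_o = W^T z = 0
    p_o = X.T @ t_o / (t_o @ t_o)                # loading
    return t_o, p_o
```

The filtered matrix would then be X minus the outer product of t_o and p_o. Orthogonality of t_o to Y follows directly from z being orthogonal to W.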
How OSC method (Wold et al. 1998) finds
one Orthogonal component
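[Figure: OSC flow. A PCA component of X gives a score vector t; t is made orthogonal to y; a multi-component PLS model of X against the orthogonalized score gives the weights wT (w = regression coefficient vector) and the OSC score tOSC]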
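The iteration in the figure can be written out as a rough sketch. The code below follows the steps as labelled on the slide (PCA start, orthogonalize against y, multi-component PLS, new score) for a single y-vector; the helper names, the number of PLS components, the fixed iteration count and the rescaling of t are assumptions made for the illustration, not details from Wold et al.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """Regression vector b of a NIPALS PLS1 model, so that X @ b approximates y."""
    Xr, yr = X.copy(), y.copy()
    W, P, c = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)
        t = Xr @ w
        tt = t @ t
        p = Xr.T @ t / tt
        q = yr @ t / tt
        Xr = Xr - np.outer(t, p)     # deflate X
        yr = yr - q * t              # deflate y
        W.append(w); P.append(p); c.append(q)
    W, P, c = np.array(W).T, np.array(P).T, np.array(c)
    return W @ np.linalg.solve(P.T @ W, c)

def osc_one_component(X, y, n_pls=2, n_iter=20):
    """One OSC component: returns score t, loading p and the filtered X."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    t = U[:, 0] * s[0]                          # start from the first PCA score
    for _ in range(n_iter):
        t_star = t - y * (y @ t) / (y @ y)      # make the score orthogonal to y
        w = pls1_coefficients(X, t_star, n_pls) # weights so that X w ~ t_star
        t = X @ w
        t /= np.linalg.norm(t)                  # rescaling only; direction is kept
    p = X.T @ t / (t @ t)
    return t, p, X - np.outer(t, p)
```

Note that the final score t = Xw is only approximately orthogonal to y, since the PLS model does not reproduce t_star exactly.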
OSC method…
• problems with overfitting when estimating the OSC component(s)
• failure to achieve Y-orthogonality
• unclear objectives, which can result in a more complex model
• it does not consider the prediction model (e.g. PLS)
• two-step process (OSC + PLS)
Orthogonal projections to latent structures (OPLS)
- Prediction model with integrated filter (Orthogonal+Predictive)
Trygg J, Wold S. J. Chemometr., 2002; 16: 119-128
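X = t_p p_p^T + T_o P_o^T + E
y = u_p c_p^T + f
Only a single Y-related component.
[Figure: OPLS model diagram. X is split into one predictive component (score tp1, loading pp1', weight wp1*), orthogonal components To = [to2 to3 to4] with loadings Po' = [po2'; po3'; po4'], and the residual XE; y is modelled through up and cp']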
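For a single y-variable, the OPLS decomposition can be sketched as a short NIPALS-style routine. This is a minimal illustration under stated assumptions: X and y are taken to be column-centred, the function name and return layout are hypothetical, and y is regressed on the predictive score t_p (the slide writes the y-model in terms of u_p).

```python
import numpy as np

def opls_single_y(X, y, n_orth=1):
    """Single-y OPLS sketch: X = t_p p_p^T + T_o P_o^T + E."""
    X = np.array(X, dtype=float)          # working copy, deflated below
    w = X.T @ y
    w /= np.linalg.norm(w)                # predictive weight (unit length)
    T_o, P_o = [], []
    for _ in range(n_orth):
        t = X @ w
        p = X.T @ t / (t @ t)
        w_o = p - (w @ p) * w             # equals p - w, since w'p = 1 for unit w
        w_o /= np.linalg.norm(w_o)
        t_o = X @ w_o                     # Y-orthogonal score: y' t_o = 0
        p_o = X.T @ t_o / (t_o @ t_o)
        X -= np.outer(t_o, p_o)           # remove the orthogonal component
        T_o.append(t_o); P_o.append(p_o)
    t_p = X @ w                           # the single predictive score
    p_p = X.T @ t_p / (t_p @ t_p)
    c_p = (y @ t_p) / (t_p @ t_p)         # y is approximated by t_p * c_p
    E = X - np.outer(t_p, p_p)
    return t_p, p_p, c_p, np.array(T_o).T, np.array(P_o).T, E
```

Removing an orthogonal component leaves X^T y unchanged, which is why the predictive weight w is computed once and reused.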
Some theoretical properties of OPLS
For single-Y variable OPLS model:
w_o = p - w
Alternative methods
• POSC (Trygg, 2002)
• PLS-PCP (Langsrud, 2003)
– Focus on prediction
• PLS-CCA (Yu, 2004)
• PLS-ST (Ergon, 2005, 2007)
• XTP (Kvalheim, 2008)
Additional theoretical aspects of OPLS
– Verron (2004)
– Kemsley (2009)
PLS vs OPLS model
Example: Single-Y, two component system
[Figure: PLS model vs OPLS model. PLS: loading profiles p1 (94% variation) and p2 (3.7% variation), with weights w1, w*1 and w*2. OPLS: predictive profile (49% variation) and orthogonal profile (49% variation), with orthogonal score to[1]]
Understanding Orthogonal variation is important
[Figure: loading plots for two data sets in which the profiles are separated by 90° and by 45°, modelled with OPLS (loadings p1p and p1o, score t[2]O) and with PLS (loadings p1 and p2, score t[2])]
HS_rot90.M2 (OPLS): R2X[1] = 0,4968; R2X[2] = 0,496254
HS_rot45.M2 (OPLS): R2X[1] = 0,845847; R2X[2] = 0,149945
HS_rot90.M1 (PLS): R2X[1] = 0,955399; R2X[2] = 0,0376559
HS_rot45.M1 (PLS): R2X[1] = 0,989184; R2X[2] = 0,00660776
OPLS multi-Y in multivariate
- Pure profile estimation!
• Direct calibration predicts X from Y (Classical Least Squares)
X = Y K^T + E
• Indirect calibration predicts Y from X (PLS, OPLS)
Y = X B + F
B are the regression coefficients for X (X → Y)
K are the regression coefficients for Y (Y → X)
The K matrix is useful for spectral or chromatographic data
- estimate of the pure profile for each analyte (column) in Y
- useful model diagnostics (focus on correct variation in model)
The B matrix does not have a similar interpretation
However, there is a link between them:
K = B (B^T B)^{-1}
Trygg, J. Prediction and pure profile estimation in multivariate calibration,
J Chemometr., 2004 (18) 166-172
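The link K = B (B^T B)^{-1} can be checked numerically on simulated data. The sketch below is a toy example under stated assumptions: the pure profiles are hypothetical Gaussian peaks, the data are noise-free, and B is obtained with a minimum-norm least-squares fit rather than with PLS or OPLS.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_vars, n_analytes = 30, 80, 2    # arbitrary illustrative sizes

# Pure profiles of two analytes as columns of K (hypothetical Gaussian peaks).
axis = np.arange(n_vars)
K = np.column_stack([
    np.exp(-0.5 * ((axis - 30) / 7.0) ** 2),
    np.exp(-0.5 * ((axis - 45) / 7.0) ** 2),
])

# Direct (classical) model: X = Y K^T, here without noise.
Y = rng.uniform(0.0, 1.0, size=(n_samples, n_analytes))
X = Y @ K.T

# Indirect model Y = X B, fitted by minimum-norm least squares.
B, *_ = np.linalg.lstsq(X, Y, rcond=1e-10)

# Pure profile estimate recovered from the regression coefficients.
K_hat = B @ np.linalg.inv(B.T @ B)
```

In this noise-free setting K_hat reproduces K to numerical precision, whereas the columns of B themselves do not resemble the pure profiles.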
Single-Y vs multi-Y OPLS models
Trygg J, Prediction and spectral profile estimation in multivariate
calibration JOURNAL OF CHEMOMETRICS 18 (3-4): 166-172 MAR-APR 2004
[Figure: two single-Y OPLS models compared with a multi-Y OPLS regression, K = B (B^T B)^{-1}. Single-Y models for y1 and y2: a predictive profile and a Y-orthogonal profile (po2) each, labelled 84 % variation and 15 % variation. Multi-Y model: predictive profiles for y1 and y2, labelled 50 % variation each]
Case study: Plant metabolomics on Poplar trees
PttPME1 expression was up- and down-regulated in transgenic aspen trees
PME enzyme activity in wood-forming tissues was correspondingly altered
Lines in this study
WT poplar
Line 5: down-regulated PttPME1 gene
Metabolomics study of xylem
OPLS model: between-class variation and orthogonal variation
Plant metabolomics on Poplar trees
Orthogonal variation reveals experimental problems with scraping
Line 5 vs WT
Orthogonal S‐plot
Multivariate calibration
Carrageenan application
• Carrageenans are polysaccharides extracted from seaweed, which are
used as gelling and thickening agents in a wide range of industries,
including food, pharmaceuticals and cosmetics. Five naturally occurring
carrageenan types are considered, viz. Lambda, Kappa, Iota, Mu and Nu
• Three spectral techniques (NIR, IR, Raman) – DOE mixture design
• Objectives:
– (i) to find overlapping spectral information;
– (ii) to highlight the unique features of the different spectroscopic techniques;
– (iii) to build a predictive calibration model for five different carrageenan
constituents.
Reference: M. Dyrby et al., Carbohydrate Polymers 57, 337-348, 2004.
Hi-OPLS & Hi-OPLS/PCA
Eriksson L, Toft M, Johansson E, Wold S, Trygg J, Separating Y-predictive and
Y-orthogonal variation in multi-block spectral data, JOURNAL OF CHEMOMETRICS,
20 (8-10), 352-361 2006
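• Base level: OPLS models between each spectral block & Y
• Top level: Separate OPLS and PCA models
– OPLS to focus on Y-correlating information
– PCA to focus on Y-orthogonal variation
[Figure: hierarchical model diagram. Base level: blocks NIR, IR, Raman and Y; each base-level OPLS model gives predictive (P) and orthogonal (O) scores that feed the top level. Dimension labels in the figure: 704, 667, 3406, 5, 102, 26]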
Interpretation of PCA model
Orthogonal variation
Line plot of t1 reveals time trend !
Base level OPLS model: NIR & IR reveal water peak (Raman not influenced)
[Figure: score line plot, Hi-Carra_NIR_IR_Raman_SNV.M7 (PCA-Class(1)), PCA top level, only orthogonal score variables. t[Comp. 1] vs Num, R2X[1] = 0.186045, with 2 SD and 3 SD limits; observations grouped as Day 1, Day 2, Day 3, Day 4]
[Figure: contribution plot, same model. Score Contrib(Obs Group - Obs Group), Weight = p[1], by Var ID (Primary): orthogonal score variables from the base-level models $M3 (NIR, t4 to t7), $M4 (IR, t5 to t7) and $M5 (Raman, t4 to t8)]
Orthogonal variation for
fault detection and quality control
NIR spectroscopy data
PLS scores
OPLS scores
Chemical imaging application:
FT-IR imaging spectroscopy on mouse liver
1.) Stenlund H, et al., ANALYTICAL CHEMISTRY, 2008, 80, 6898–6906
2.) Gorzsás A, et al., THE PLANT JOURNAL doi: 10.1111/j.1365-313X.2011.04542.x
Liver samples with two different cell types
- Hepatocytes (cell of the main tissue of the liver)
- Erythrocytes (red blood cells)
Bruker Equinox 55 spectrometer
FPA detector (64x64)
Orthogonal Projections
to Latent Structures (OPLS)
Example PAT: Binary powder
Access to the current data set was kindly granted by Dr Ola Berntsson of AstraZeneca,
Södertälje [Berntsson, et al., 2000; Berntsson, 2001]
• Diffuse reflectance NIR spectroscopy
• Mixture of two powders with markedly different particle size
• 11 batches of powders, 0% to 100% in steps of 10%.
• X = NIR spectra (SNV) in the range 1080-2025 nm
• Y = % binary mix of powders
PLS model scores
Figure: Schematic overview of the vertical
cone mixer and the fibre-optic probe set-up.
OPLS model scores
Example PAT: Binary powder
Non-linearity is detected in Orthogonal components
PLS loading profiles (p)
OPLS loading profiles (p)
Example: Batch processes
Orthogonal variation = Kinetics
Batch mini plant: Hydrogenation, Nitrobenzene to Aniline
OPLS model 1st derivative UV spectra vs Gas feed
Y-orthogonal (7%)
Modelled variation in X
Loading vector po2
Y-predictive(92%)
Loading vector pp1
Loading profile similarity → Kinetic differences
Not a competing side reaction.
Nitrobenzene (260nm)
Aniline (235nm)
The OPLS method was top-ranked for
microarray-based predictive models, not PLS!
(1) The performance of the prediction
models depends largely on the quality
and relevance of the data
(2) The experience and proficiency of
the data analysis team are crucial
factors for success
(3) Different prediction methods
yield similar prediction results.
Reference: Leming Shi et al., Nature Biotechnology, Aug 09 2010.
doi:10.1038/nbt.1665
O2PLS model for overview
- extended OPLS model
• Separate model for joint and orthogonal variation
• Model of X:
X = T_p P_p^T + T_o P_o^T + E
• Model of Y:
Y = U_p Q_p^T + U_o Q_o^T + F
[Figure: O2PLS diagram. X-Y joint variation: T P^T in X ('Y-predictive') and U Q^T in Y ('X-predictive'). Unique to X: T_o P_o^T ('Y-orthogonal'). Unique to Y: U_o Q_o^T ('X-unrelated')]
Trygg J, Wold S, O2-PLS, a two-block (X-Y) latent variable regression (LVR) method
with an integral OSC filter JOURNAL OF CHEMOMETRICS 17 (1): 53-64 JAN 2003
O2PLS for overview has been extended
- finds two types of Orthogonal variation
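O2PLS model
- Finds ALL systematic Orthogonal variation
- PCA on residual matrix E
- Does not affect prediction model
- Added to existing Orthogonal scores and loadings
[Figure: O2PLS model diagram. Predictive part: tp1, pp1', up, qp1'. Orth(OPLS): orthogonal scores To with loadings Po' found by the OPLS step. Orth(PCA): further orthogonal scores and loadings found by PCA on the residual matrix E and appended to To and Po'. Remaining residual XE]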
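The extension amounts to running PCA on the residual matrix and appending the result to the orthogonal part of the model. The sketch below shows only that step, using an SVD; the function name and the way the new scores are stacked onto existing ones are assumptions made for the illustration.

```python
import numpy as np

def pca_on_residual(E, n_components):
    """PCA of a residual matrix E via SVD: returns scores, loadings, new residual."""
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # PCA scores
    P = Vt[:n_components].T                      # PCA loadings (orthonormal columns)
    return T, P, E - T @ P.T

def append_orthogonal(T_o, P_o, T_pca, P_pca):
    """Stack PCA-derived components after the existing orthogonal ones."""
    return np.hstack([T_o, T_pca]), np.hstack([P_o, P_pca])
```

Since only E is decomposed, the predictive scores and loadings already in the model are left as they are.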
O2PLS for two-block overview (think PCA)
Preference mapping
• Sensory and preference data for a set of 13 apples
– 70 sensory attributes (X-variables); panel averages across 12 judges
– 108 consumer likings (Y-variables), expressed on a nine-grade scale
– Original reference [MacFie, H., et al., 1999].
[Figure: X = sensory judges data (13 × 70); Y = consumer likings (13 × 108)]
• Group formation among sensory attributes
– 1_ is a "First Bite" attribute,
– E_ is an "External Appearance" attribute,
– EA_ is "External Aroma",
– A_ is "Astringent aftertaste",
– F_ is "Flavor",
– I_ is "Internal Appearance",
– T_ is "Texture"
Preference mapping
O2PLS overview plot
[Figure: SensCons.usp_5.M16 (O2PLS), X/Y Overview. R2 and Q2 bars (scale 0 to 1) for the X model and the Y model. Legend: R2 Predictive (X), R2 Orthogonal in X (PCA), R2 Predictive (Y), R2 Orthogonal in Y (OPLS), R2 Orthogonal in Y (PCA). Generated in SIMCA 13.0, 2011-06-08]
O2PLS modeling in Preference mapping
Unique variation in Y (uncorrelated to X): 28%
Consumer likings not picked up by the sensory data
Process integration in pharma industry
Food and Drug Administration (FDA)
Instead go from product testing to quality by design!
Risk minimisation – process understanding
Systems biology approach:
Combined profiling of transgenic Poplar
Study design
O2PLS model results
G3 & G5 contain antisense constructs of
the gene PttMYB21a, affecting plant
growth. The closest ortholog to PttMYB21a
in Arabidopsis thaliana is AtMYB52
New OPLS developments:
OnPLS for modeling any number of matrices
• Extension of O2PLS; uses MAXDIFF for the predictive model; fully symmetric
• Tommy Löfstedt, Johan Trygg. OnPLS - a novel multiblock method for the modelling of predictive
and orthogonal variation. Journal of Chemometrics, 2011.
• Mohamed Hanafi and Henk A. L. Kiers. Analysis of k sets of data, with differential emphasis on
agreement between and within sets. Computational Statistics & Data Analysis, 51(3):1491-1508,
2006.
Concluding remarks
The introduction of OSC, OPLS and related methods
has made a real impact. In biology, our methods are now
even more credible and accessible to scientists outside our field
• OPLS is used by more than 150 Swedish companies, 50 international institutions
and the ten largest pharmaceutical companies in the world
• More than 500-600 citations of these methods in total so far
• Highest impact journals (Nature, Lancet)
• We have only begun to scratch the surface of the potential
[Figure: Citations; Thesis 2011]
Acknowledgements
Chemistry dep, Umeå University
M Eliasson
M. Bylesjö
H. Stenlund
R. Madsen
S. Wiklund
P. Jonsson
H. Antti
Uppsala University
T. Lundstedt
Imperial College
J. Nicholson
E. Holmes
M. Rantalainen
O. Cloarec
Umeå Plant Science Center, Umeå Univ, Sweden
T. Moritz
L. Gerber
A. Sjödin
R. Nilsson
A Grönlund
S Jansson
B Sundberg
G. Wingsle
J. Karlsson
V. Srivastava
R. Bahlerao
G. Sandberg
Riken University
M. Kusano
MKS Umetrics
E. Johansson
L. Eriksson
J. Christensen
S. Wold
Funding: Swedish Research Council
Swedish Foundation for Strategic Research (SSF)
FORMAS
Knut & Alice Wallenberg Foundation
GlaxoSmithKline
AstraZeneca
MKS Umetrics