Feature selection from gene expression data
Molecular signatures for breast cancer prognosis and inference of gene
regulatory networks.
Anne-Claire Haury
Centre for Computational Biology
Mines ParisTech, INSERM U900, Institut Curie
PhD Defense, December 14, 2012
Introduction
Gene expression
Figure: Central Dogma of Molecular Biology (Source: Wikipedia)
=> RNA as a proxy to measure gene expression.
Microarrays: measuring gene expression
Microarrays: one option.
RNA hybridized onto a chip.
Quantification of gene activity in different conditions at the
genome scale.
Resulting data: gene expression data.
[Figure: gene expression matrix (genes × samples), colored from low to high expression]
Feature selection: extract relevant information

Genome ≈ 25,000 genes
From gene expression, explain a phenomenon
Feature selection: deciding which variables/genes are relevant.

Mathematically: explain response Y using relevant variables amongst (Xj)j=1...p:

Y = f(X1, X2, X3, X4, ..., Xp) + ε

Objective 1: more accurate predictors
Objective 2: more interpretable predictors
Objective 3: faster algorithms
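As a toy illustration of this setting (synthetic data, not from the thesis), a sparse linear model fitted with the Lasso recovers the few relevant variables among many:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 200                                # fewer samples than variables
X = rng.standard_normal((n, p))                # "gene expression" matrix
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]                    # only 3 relevant variables
y = X @ beta + 0.1 * rng.standard_normal(n)    # response Y = f(X) + noise

lasso = Lasso(alpha=0.1).fit(X, y)
relevant = np.flatnonzero(lasso.coef_)         # indices of selected variables
```

With a strong signal, the three truly relevant variables are among the selected ones, and most of the 200 coefficients stay at exactly zero.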
Contributions of this thesis
Gene Regulatory Network Inference
TIGRESS: new method based on local feature selection.
Ranked 3rd/29 at DREAM5 challenge.
Linear method, competitive with more complex algorithms
Molecular signatures for breast cancer prognosis
Select biomarkers to predict metastasis/relapse in breast cancer
patients.
Complete benchmark of feature selection methods.
Investigation of the stability issue.
Gene Regulatory Network Inference with TIGRESS
ACH, Mordelet, F., Vera-Licona, P. and Vert, J.-P., TIGRESS: Trustful
Inference of Gene REgulation using Stability Selection, BMC Systems
Biology, 2012, 6:145.
Gene Regulatory Networks
Gene regulation: control gene expression.
Transcription factors (TF) activate or repress target genes (TG).
Gene Regulatory Network (GRN): representation of the regulatory
interactions between genes.
Figure: E. coli regulatory network
DREAM network inference challenge

DREAM5 results:

Method                               Network 1       Network 3       Network 4       Overall
                                     AUPR   AUROC    AUPR   AUROC    AUPR   AUROC
GENIE3 (Huynh-Thu et al., 2010)      0.291  0.815    0.093  0.617    0.021  0.518    40.28
ANOVerence (Kueffner et al., 2012)   0.245  0.780    0.119  0.671    0.022  0.519    34.02
Naive TIGRESS                        0.301  0.782    0.069  0.595    0.020  0.517    31.1
Purposes
Introduce TIGRESS: Trustful Inference of Gene REgulation using
Stability Selection.
Assess the impact of the parameters.
Test and benchmark TIGRESS on several datasets.
Outline

1 Methods
    Regression-based inference
    TIGRESS
    Material
2 Results
    In silico network results
    In vitro network results
    Undirected case: DREAM4
3 Conclusion
Regression-based inference: hypotheses

Notations
ntf transcription factors (TF), ntg target genes (TG), nexp experiments
Expression data: X (nexp × (ntf + ntg)).
Xg: expression levels of gene g.
XG: expression levels of genes in G.
Tg: candidate TFs for gene g.

Hypotheses
1. The expression level Xg of a TG g is a function of the expression levels XTg of its candidate TFs Tg:
   Xg = fg(XTg) + ε.
2. A score sg(t) can be derived from fg, for all t ∈ Tg, to assess the probability of the interaction (t, g).
Regression-based inference: main steps

Idea: consider as many problems as TGs (ntg subproblems)
subproblem g ⇔ find regulators TFs(g) of gene g

1. For each TG, score all ntf candidate interactions:

            TG 1   TG 2   ...   TG ntg
   TF 1     0.23   0      ...   0.11
   TF 2     0.97   0.03   ...   0
   ...      ...    ...    ...   ...
   TF ntf   0      0      ...   0.76

2. Rank the scores altogether:

   TF 12 → TG 17   1
   TF 23 → TG 5    0.99
   TF 2  → TG 1    0.97
   ...             ...

3. Threshold to a value or to a given number N of edges.
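The three steps above can be sketched as follows; the score matrix is the made-up example from the slide, and any per-gene scoring method could fill it in:

```python
import numpy as np

def rank_edges(scores, tf_names, tg_names, n_edges=None):
    """Flatten an (ntf x ntg) score matrix into a globally ranked edge list,
    then optionally threshold to the N highest-scoring edges."""
    ntf, ntg = scores.shape
    edges = [(float(scores[i, j]), tf_names[i], tg_names[j])
             for i in range(ntf) for j in range(ntg)]
    edges.sort(key=lambda e: e[0], reverse=True)   # highest score first
    if n_edges is not None:
        edges = edges[:n_edges]                    # keep the top N edges
    return [(tf, tg, s) for s, tf, tg in edges]

# toy score matrix (rows: TFs, columns: TGs), values from the slide example
scores = np.array([[0.23, 0.00, 0.11],
                   [0.97, 0.03, 0.00],
                   [0.00, 0.00, 0.76]])
edges = rank_edges(scores, ["TF1", "TF2", "TF3"], ["TG1", "TG2", "TG3"], n_edges=3)
```

The top-ranked edge is TF2 → TG1 (score 0.97), matching the largest entry of the matrix.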
Adding a linearity assumption

TIGRESS' first hypothesis: regulations are linear:

Xg = fg(XTg) + ε = XTg ωg + ε

Consequence: if ωgt = 0 (the coefficient of TF t in ωg), no edge between t and g.
Adding a sparsity assumption

TIGRESS' second hypothesis: few TFs regulate each TG:

∀g, #{t : ωgt ≠ 0} ≪ ntf

Possible algorithm: LARS with L steps ⇒ L TFs selected.
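A minimal sketch of this step with scikit-learn's Lars on synthetic data (the two "true regulators", indices 3 and 7, are an assumption of the toy example):

```python
import numpy as np
from sklearn.linear_model import Lars

rng = np.random.default_rng(0)
n_exp, n_tf = 50, 30
X = rng.standard_normal((n_exp, n_tf))          # candidate-TF expression levels
w = np.zeros(n_tf)
w[3], w[7] = 2.0, -1.5                          # only two true regulators
y = X @ w + 0.1 * rng.standard_normal(n_exp)    # target-gene expression

L = 5
lars = Lars(n_nonzero_coefs=L).fit(X, y)        # stop after L selected variables
selected = np.flatnonzero(lars.coef_)           # at most L TFs selected
```

With a strong signal, the true regulators are picked up within the first L steps.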
Stability Selection

Problem: LARS efficiency is limited:
bad response to correlation;
no confidence score for each TF.

Solution: Stability Selection with randomized LARS (Bach, 2008; Meinshausen and Bühlmann, 2009):
Resample the experiments: run LARS many (e.g. 1,000) times with different training sets.
"Resample" the variables: also weight the variables

    Xit ← Wt Xit    (1)

where Wt ∼ U([α, 1]) for all t = 1...ntf. The smaller α, the more randomized the variables.
Get a frequency of selection for each TF.
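The resampling loop can be sketched as follows on synthetic data (the half-sample resampling and the parameter values are illustrative choices, not necessarily TIGRESS's exact defaults):

```python
import numpy as np
from sklearn.linear_model import Lars

def selection_frequencies(X, y, L=5, R=200, alpha=0.4, seed=0):
    """Frequency with which each TF enters the first L LARS steps, over R runs
    on resampled experiments with uniformly rescaled (randomized) variables."""
    rng = np.random.default_rng(seed)
    n_exp, n_tf = X.shape
    counts = np.zeros(n_tf)
    for _ in range(R):
        idx = rng.choice(n_exp, size=n_exp // 2, replace=False)  # resample experiments
        weights = rng.uniform(alpha, 1.0, size=n_tf)             # W_t ~ U([alpha, 1])
        model = Lars(n_nonzero_coefs=L).fit(X[idx] * weights, y[idx])
        counts[np.flatnonzero(model.coef_)] += 1
    return counts / R                                            # selection frequency

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 30))
w = np.zeros(30)
w[[3, 7]] = [2.0, -1.5]                                          # two true regulators
y = X @ w + 0.1 * rng.standard_normal(60)
freq = selection_frequencies(X, y)
```

True regulators are selected in nearly every run (frequency close to 1), while irrelevant TFs come and go with the resampling.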
Stability Selection path

[Figure: frequency of selection of each TF over the subsamplings (y-axis, 0 to 1) as a function of the number L of LARS steps (x-axis, 0 to 20); example for one target gene]
Scoring

Choose L, then:
Original scoring
Area scoring (contribution)
Scoring

Let Ht be the rank of TF t. Then

score(t) = E[φ(Ht)]

with
Original: φ(h) = 1 if h ≤ L, 0 otherwise
Area: φ(h) = L + 1 − h if h ≤ L, 0 otherwise

⇒ Area scoring takes the value of the rank into account.
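The two scoring rules follow directly from the definition of φ; here `ranks` holds the rank Ht of one TF across resampling runs (the area score is shown unnormalized):

```python
import numpy as np

def original_score(ranks, L):
    """E[phi(H_t)] with phi(h) = 1 if h <= L, else 0."""
    r = np.asarray(ranks)
    return float(np.mean(r <= L))

def area_score(ranks, L):
    """E[phi(H_t)] with phi(h) = L + 1 - h if h <= L, else 0."""
    r = np.asarray(ranks)
    return float(np.mean(np.where(r <= L, L + 1 - r, 0)))
```

For ranks [1, 2, 10] and L = 5: the original score is 2/3 (selected within L steps in 2 of 3 runs), while the area score is (5 + 4 + 0)/3 = 3, rewarding the early selections.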
TIGRESS Summary

Idea: consider as many problems as TGs (ntg subproblems)
subproblem g ⇔ find regulators TFs(g) of gene g

1. For each TG, score all ntf candidate interactions (LARS + Stability Selection, choose L, score):

            TG 1   TG 2   TG 3   ...   TG ntg
   TF 1     0.23   0      ...          0.11
   TF 2     0.97   0.03   ...          0
   ...      ...    ...    ...          ...
   TF ntf   0      0      0      ...   0.76

2. Rank the scores altogether:

   TF 12 → TG 17   1
   TF 23 → TG 5    0.99
   TF 2  → TG 1    0.97
   ...             ...

3. Threshold to a value or to a given number N of edges.
Parameters
TIGRESS needs four parameters to be set:
scoring method (original, area, ...);
number of runs R: large;
randomization level α: between 0 and 1;
number of LARS steps L: not obvious.
Data

Network                               #TF   #Genes  #Chips  #Edges
DREAM5 Net 1 (in silico)              195   1643    805     4012
DREAM5 Net 3 (E. coli)                334   4511    805     2066
DREAM5 Net 4 (S. cerevisiae)          333   5950    536     3940
E. coli Net from Faith et al., 2007   180   1525    907     3812
DREAM4 Multifactorial Net 1           100   100     100     176
DREAM4 Multifactorial Net 2           100   100     100     249
DREAM4 Multifactorial Net 3           100   100     100     195
DREAM4 Multifactorial Net 4           100   100     100     211
DREAM4 Multifactorial Net 5           100   100     100     193
Impact of the parameters: results on in silico network

[Figure: overall score as a function of the randomization level α and the number of LARS steps L, for Area and Original scoring with 1,000, 4,000 and 10,000 runs]

Area less sensitive than original to α and L.
Area systematically outperforms original.
The more runs, the better.
Best values: α = 0.4, L = 2, R = 10,000.
How to choose L?

[Figure: histograms of the number of TFs per TG among the first 50,000 and 100,000 predicted edges, for L = 2 and L = 20]

L = 2: number of TFs/TG smaller and more variable.
L = 20: greater number of TFs/TG, less sparsity.
⇒ L should depend on the expected network's topology.
TIGRESS vs state-of-the-art

[Figure: ROC (TPR vs FPR) and precision-recall curves comparing TIGRESS, Naive TIGRESS, GENIE3, ANOVerence, CLR and ARACNE]

TIGRESS is competitive with the state of the art.
Results on E. coli network

[Figure: ROC (TPR vs FPR) and precision-recall curves comparing TIGRESS, GENIE3, CLR and ARACNE on the E. coli network]

GENIE3, the Random Forests-based DREAM winner, outperforms all methods.
False discovery analysis on E. coli

Main false positive pattern found by TIGRESS:

[Figure: edge patterns TIGRESS should find vs edges it does find]

Good news: spuriously inferred edges are close to true edges.
Bad news: confusion (due to the linear model?)
Undirected case: DREAM4 challenge

DREAM4: undirected networks (TFs not known in advance)
A posteriori comparison of Default TIGRESS and GENIE3:

Method    Network 1      Network 2      Network 3      Network 4      Network 5
          AUPR   AUROC   AUPR   AUROC   AUPR   AUROC   AUPR   AUROC   AUPR   AUROC
GENIE3    0.154  0.745   0.155  0.733   0.231  0.775   0.208  0.791   0.197  0.798
TIGRESS   0.165  0.769   0.161  0.717   0.233  0.781   0.228  0.791   0.234  0.764

Overall scores:
GENIE3: 37.48
TIGRESS: 38.85
Conclusion

Contributions:
Automation and adaptation of Stability Selection to GRN inference.
Area scoring: better results and lower sensitivity to the parameters.
Insights into the network's behavior.

TIGRESS is:
Linear
Competitive
Parallelizable
Available (http://cbio.ensmp.fr/tigress)
But outperformed in some cases by random forests: limits of linearity?

Perspectives:
Adaptive/changeable value for L.
Group selection of TFs.
Use of time series/knock-out/replicates information.
Molecular signatures for breast
cancer prognosis
ACH, Gestraud, P. and Vert, J.-P., The Influence of Feature Selection
Methods on Accuracy, Stability and Interpretability of Molecular
Signatures, 2011, PLoS ONE
ACH, Jacob, L. and Vert, J.-P., Improving Stability and Interpretability of
Molecular Signatures, 2010, arXiv 1001.3109
Motivation
Prediction of breast cancer outcome
Assist breast cancer prognosis based on gene expression.
Avoid adjuvant/preventive chemotherapy when not needed.
Gene expression signature
Data: primary site tumor expression arrays.
Among the genome, find the few (50-100) genes sufficient to
predict metastasis/relapse.
Main challenge: high-dimensional data (few samples, many
variables).
Background

2002: Van't Veer et al. publish the 70-gene signature MammaPrint.
Since then: at least 47 published signatures (Venet et al., 2011).
Little overlap, if any (Fan et al., 2006; Thomassen et al., 2007).
Many gene sets are equally predictive (Michiels et al., 2005; Ein-Dor et al., 2005), even within the same dataset.
Prediction discordances (Reyal et al., 2008).
Stability at the functional level? Not sure.

Seeking stability
Accuracy is not enough to choose the right genes.
⇒ Stability as a confidence indicator.
Contributions
1. Systematic comparison of feature selection methods in terms of:
   predictive performance;
   stability;
   biological stability and interpretability.
2. Group selection of genes:
   with predefined groups (Graph Lasso);
   with latent groups (k-overlap norm).
3. Evaluation of Ensemble methods.
Evaluation
Accuracy: how well selected genes + classifier predict metastatic events on test data.
Stability: how similar two lists of genes are.
Interpretability: how much biological sense selected genes make.
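Stability between two gene lists can be quantified by their overlap; a minimal sketch using the Jaccard index (the benchmark's exact similarity measure may differ):

```python
def signature_stability(sig_a, sig_b):
    """Jaccard similarity between two gene signatures (iterables of gene IDs):
    size of the intersection divided by the size of the union."""
    a, b = set(sig_a), set(sig_b)
    return len(a & b) / len(a | b)
```

For example, two 100-gene signatures sharing only 10 genes score 10/190, i.e. about 0.05, which is in the range observed for published breast cancer signatures.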
Data
Four public breast cancer datasets from the same technology (Affymetrix U133A):

GEO Reference   # genes   # samples   # positives
GSE1456          12,065        159            40
GSE2034          12,065        286           107
GSE2990          12,065        125            49
GSE4922          12,065        249            89
Outline
1. A simple start
2. An attempt at enforcing stability: Ensemble methods
3. Using prior knowledge: Graph Lasso
4. Acknowledging latent team work
5. Conclusion
Classical feature selection/feature ranking methods

Filters: univariate, fast; only depend on the data; do not use a loss function. Examples: t-test, KL divergence, Wilcoxon rank-sum test.
Wrappers: use the learning machine as a criterion; computationally expensive. Examples: SVM RFE, greedy FS.
Embedded: learning + selecting; possible use of prior knowledge. Examples: Lasso, Elastic Net.

Each of these algorithms returns a ranked list of genes to be thresholded.
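For instance, the t-test filter produces such a ranked list by sorting genes on the absolute t-statistic between the two classes; a minimal sketch (the helper name and defaults are illustrative, not from the thesis):

```python
import numpy as np
from scipy.stats import ttest_ind

def ttest_signature(X, y, n_genes=100):
    """Rank genes by |t-statistic| between classes y==1 and y==0,
    then keep the top n_genes.
    X: (n_samples, n_genes_total) expression matrix; y: binary labels."""
    t_stat, _ = ttest_ind(X[y == 1], X[y == 0], axis=0)
    ranking = np.argsort(-np.abs(t_stat))   # most discriminative gene first
    return ranking[:n_genes]                # threshold the ranked list
```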
First results
100 genes over four databases.
Figure: Stability (0-0.16) vs. accuracy (0.58-0.65) for Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso and Elastic Net.
First conclusions
Figure: Same stability vs. accuracy scatter plot as on the previous slide.
Elastic Net neither more stable nor more accurate than Lasso.
Accuracy/stability trade-off.
Random better than tossing a coin.
T-test: both simplest and best.
Next step:
Can we achieve better stability without decreasing accuracy?
Outline
1. A simple start
2. An attempt at enforcing stability: Ensemble methods
3. Using prior knowledge: Graph Lasso
4. Acknowledging latent team work
5. Conclusion
Ensemble Methods
Run each algorithm R times on subsamples.
Get R ranked lists of genes (r^b)_{b=1...R}.
Aggregate and get a score for each gene:

\[
S(g) = \frac{1}{R} \sum_{b=1}^{R} f(r_g^b)
\]

average: f(r) = (p − r)/p
exponential: f(r) = exp(−αr)
stability selection: f(r) = δ(r ≤ k)

Sort S in decreasing order and threshold to get the final signature.
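The aggregation step can be sketched directly from the formula (the values of α and k below are illustrative):

```python
import numpy as np

def ensemble_scores(ranks, f):
    """ranks: (R, p) array where ranks[b, g] is gene g's rank in run b (1 = best).
    Returns S(g) = (1/R) * sum_b f(ranks[b, g])."""
    return f(np.asarray(ranks, dtype=float)).mean(axis=0)

p = 5
average             = lambda r: (p - r) / p
exponential         = lambda r: np.exp(-0.5 * r)        # alpha = 0.5
stability_selection = lambda r: (r <= 2).astype(float)  # k = 2

ranks = np.array([[1, 2, 3, 4, 5],
                  [2, 1, 3, 5, 4]])
S = ensemble_scores(ranks, stability_selection)
signature = np.argsort(-S, kind="stable")[:2]  # sort by decreasing score, threshold
```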
Results
100 genes over four databases.
Figure: Stability (0-0.2) vs. accuracy (0.57-0.65) for each method (Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso, Elastic Net), single-run vs. ensemble with stability selection.
Functional stability
100 genes over four databases.
Figure: Functional stability (0-0.3) vs. accuracy (0.57-0.65) for each method, single-run vs. ensemble with stability selection.
Ensemble methods: conclusions
Figure: The stability vs. accuracy and functional stability vs. accuracy plots of the two previous slides, side by side.
Expected improvement in stability not happening.
Slight improvement in accuracy in some cases.
Loss in functional stability.
T-test: still the preferred method.
Next step:
Can we do better by incorporating prior knowledge?
Outline
1. A simple start
2. An attempt at enforcing stability: Ensemble methods
3. Using prior knowledge: Graph Lasso
4. Acknowledging latent team work
5. Conclusion
Data
Expression data: Van't Veer et al., 2002; Wang et al., 2005.
PPI network with 8141 genes (Chuang et al., 2007).
Assumption: genes close on the graph behave similarly.
Idea: instead of selecting single genes, select edges.
Selecting groups of genes: ℓ1 methods
Lasso: selects single genes (Tibshirani, 1996).
Why groups?
Selecting similar genes: improving stability and interpretability.
Smoothing out noise by "averaging": improving accuracy.
Group Lasso (Yuan & Lin, 2006): implies group sparsity for groups of covariates that form a partition of {1, ..., p}.
Overlapping group Lasso (Jacob et al., 2009): selects a union of potentially overlapping groups of covariates (e.g. gene pathways).
Graph Lasso: uses groups induced by the graph (e.g. edges).
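The group-sparse behavior of these penalties comes from their proximal operator: for non-overlapping groups it is block soft-thresholding, which zeroes out entire groups at once. A minimal sketch (the thesis' Graph Lasso handles overlapping edge-induced groups and is more involved):

```python
import numpy as np

def prox_group_lasso(w, groups, lam):
    """Prox of lam * sum_g ||w_g||_2 for a partition of indices:
    each block is shrunk toward zero, and dropped entirely if its norm <= lam."""
    out = np.zeros_like(w, dtype=float)
    for g in groups:                      # each g is a list of covariate indices
        norm = np.linalg.norm(w[g])
        if norm > lam:
            out[g] = (1.0 - lam / norm) * w[g]  # shrink the whole block
    return out
```

Dropping whole blocks is precisely what makes groups of genes (e.g. graph edges) enter or leave the model together.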
Is group sparsity enough?
ℓ1 methods work well, but face serious stability issues when groups are correlated.
Solution: randomization through stability selection.
Accuracy results
Test on data from Wang et al., 2005.
Figure: Balanced accuracy of signatures learnt on the Van't Veer and Wang datasets, for Lasso, Lasso + stability selection, Graph Lasso, and Graph Lasso + stability selection.
Neither prior knowledge nor stability selection brings any improvement!
Stability results
Figure: Number of genes in the overlap (0-7) vs. number of genes in the signatures (0-120), for Lasso, Lasso with stability selection, Graph Lasso, and Graph Lasso with stability selection.
Graph Lasso slightly improves stability.
Interpretability
Figure: Signature obtained using Lasso.
Figure: Signature obtained using Graph Lasso + stability selection.
Graph Lasso conclusion
Graphical prior seems to increase stability and interpretability.
However: no change in accuracy.
Next step:
Grouping increases stability. Now on to accuracy!
Outline
1. A simple start
2. An attempt at enforcing stability: Ensemble methods
3. Using prior knowledge: Graph Lasso
4. Acknowledging latent team work
5. Conclusion
Latent grouping
Grouping genes makes sense.
Let the data tell which genes to select together.

The k-support norm
Introduced by Argyriou et al., 2012.
A trade-off between ℓ1 and ℓ2.
Equivalent to overlapping group Lasso (Jacob et al., 2009) with all possible groups of size k.
Results in selecting groups that are not predefined.
Extreme randomization
Following Breiman's random forests:
Sample both the examples and the covariates.
Fewer variables = less correlation.
Give each gene a chance to be selected.

For each of the R runs:
Bootstrap the samples (classical Ensemble method).
Sample the covariates: randomly choose 10% of them.
Run the FS procedure on the restricted data.
=> Compute the frequency of selection: P(g selected | g preselected).
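A sketch of the procedure; `rank_fn` stands in for any base FS method and all parameter names are illustrative:

```python
import numpy as np

def extreme_randomization(X, y, rank_fn, R=100, frac=0.1, top=50, seed=0):
    """Double resampling: bootstrap the samples and draw a fraction frac of the
    covariates at each of the R runs; rank_fn(X_sub, y_sub) must return column
    indices of X_sub, best first. Returns, per gene, the selection frequency
    P(g in top | g preselected)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_sel = np.zeros(p)
    n_pre = np.zeros(p)
    for _ in range(R):
        rows = rng.choice(n, size=n, replace=True)                 # bootstrap samples
        cols = rng.choice(p, size=max(1, int(frac * p)), replace=False)
        n_pre[cols] += 1
        order = np.asarray(rank_fn(X[np.ix_(rows, cols)], y[rows]))
        n_sel[cols[order[:top]]] += 1                              # top genes of this run
    return np.divide(n_sel, n_pre, out=np.zeros(p), where=n_pre > 0)
```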
Accuracy or stability?
Figure: Stability (0-0.05) vs. accuracy (0.62-0.655) for Random, T-test, Lasso, Elastic Net and k-support (k = 2, 10, 20), single-run vs. extreme randomization + stability selection.
Stability or redundancy?
Figure: Stability (0-0.05) vs. correlation (0.01-0.08) for the same methods, single-run vs. extreme randomization + stability selection.
Conclusions
Figure: The stability vs. accuracy and stability vs. correlation plots of the two previous slides, side by side.
Extreme randomization improves accuracy.
Grouping improves stability.
But: both effects do not add up that well (redundancy).
T-test still the best trade-off?
Outline
1. A simple start
2. An attempt at enforcing stability: Ensemble methods
3. Using prior knowledge: Graph Lasso
4. Acknowledging latent team work
5. Conclusion
Signature selection: conclusion
Contributions:
Step-by-step study of FS methods' behavior on several breast cancer datasets.
Systematic analysis of accuracy, stability, interpretability.
Insights on the accuracy/stability trade-off.

What have we learned?
Best methods: simple t-test or complex black box.
Grouping improves gene and functional stability.
Randomization improves accuracy (sometimes) but has unwanted effects on stability.
Accuracy/stability trade-off: stability => redundancy => lower accuracy.
Signature selection: perspectives
One unique signature?
single breast cancer subtype
many samples
larger signature
Is expression data sufficient?
probably not all information is there
clinical data: same accuracy (same information?)
possibly look at genotype, methylation, clinical and expression
Is stability important?
not as important as accuracy + prediction concordance
possibly not even achievable
Conclusion
Conclusion
Gene expression data:
High-dimensional, noisy
Possibly contains important information
Feature selection:
Find the needle in the haystack.
Output relevant genes to be studied further.
Main issues:
Results are not necessarily transferable across datasets.
Models rely on hypotheses!
Fixing:
Testing on many databases.
Keeping model hypotheses in mind / not being afraid of black boxes.
Acknowledgements
Fantine Mordelet, Paola Vera-Licona, Pierre Gestraud, Laurent Jacob
The k-support norm
It can be shown that:

\[
\Omega^{sp}_k(\omega) = \left( \sum_{i=1}^{k-r-1} \big(|\omega|^\downarrow_i\big)^2 + \frac{1}{r+1} \Big( \sum_{i=k-r}^{d} |\omega|^\downarrow_i \Big)^2 \right)^{\frac{1}{2}}
\]

where r is the unique integer in {0, ..., k − 1} satisfying

\[
|\omega|^\downarrow_{k-r-1} > \frac{1}{r+1} \sum_{i=k-r}^{d} |\omega|^\downarrow_i \ge |\omega|^\downarrow_{k-r}
\]

and |ω|↓_i is the i-th largest value of |ω| (with the convention |ω|↓_0 = +∞).
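This closed form can be transcribed directly; a sketch (sanity check: k = 1 recovers the ℓ1 norm and k = d the ℓ2 norm):

```python
import numpy as np

def k_support_norm(w, k):
    """k-support norm via the closed form above.
    a[i] holds |w|_i^down (i-th largest magnitude), with a[0] = +inf."""
    a = np.concatenate(([np.inf], np.sort(np.abs(w))[::-1]))
    for r in range(k):  # find the unique r in {0, ..., k-1}
        tail = a[k - r:].sum() / (r + 1)          # (1/(r+1)) * sum_{i=k-r}^{d} a[i]
        if a[k - r - 1] > tail >= a[k - r]:       # the defining condition on r
            return float(np.sqrt((a[1:k - r] ** 2).sum() + (r + 1) * tail ** 2))
    raise ValueError("no valid r found; need 1 <= k <= len(w)")
```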
Link with overlapping Group Lasso
The k-support norm is equivalent to the overlapping group Lasso norm

\[
\Omega^{sp}_k(\omega) = \min_{v \in \mathbb{R}^{p \times \mathcal{G}_k}} \Big\{ \sum_{I \in \mathcal{G}_k} \|v_I\|_2 \;:\; \operatorname{supp}(v_I) \subseteq I, \; \sum_{I \in \mathcal{G}_k} v_I = \omega \Big\}
\]

where G_k denotes all subsets of {1, ..., d} of cardinality k.
Remark 1: it selects at least k variables.
Remark 2: the first selected group consists of the k variables most correlated with the response.
ADMM - applied to the k-support problem

Our problem:
\[
\min_{\omega, \beta} \; \hat{R}_l(\omega) + \frac{\lambda}{2} \Omega^{sp}_k(\beta)^2 \quad \text{s.t.} \quad \omega - \beta = 0
\]

Augmented Lagrangian:
\[
L_\rho(\omega, \beta, \mu) = \hat{R}_l(\omega) + \frac{\lambda}{2} \Omega^{sp}_k(\beta)^2 + \mu^\top (\omega - \beta) + \frac{\rho}{2} \|\omega - \beta\|^2
\]

Algorithm:
1. Initialize β^(1), ω^(1), µ^(1).
2. For t = 1, 2, ...:
\[
\omega^{(t+1)} = \arg\min_{\omega} \Big\{ \hat{R}_l(\omega) + \mu^{(t)\top} \omega + \frac{\rho}{2} \|\omega - \beta^{(t)}\|^2 \Big\}
\]
\[
\beta^{(t+1)} = \operatorname{prox}_{\frac{\lambda}{2\rho} \Omega^{sp}_k(\cdot)^2} \Big( \omega^{(t+1)} + \frac{\mu^{(t)}}{\rho} \Big)
\]
\[
\mu^{(t+1)} = \mu^{(t)} + \rho \big( \omega^{(t+1)} - \beta^{(t+1)} \big)
\]
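To illustrate the iteration structure end to end, here is the same splitting with the simple soft-thresholding prox of λ‖·‖1 standing in for the harder prox of the squared k-support norm, and a squared loss for R̂_l; a sketch of the scheme, not the thesis solver:

```python
import numpy as np

def admm_l1(X, y, lam, rho=1.0, n_iter=500):
    """ADMM for min_w 0.5*||Xw - y||^2 + lam*||beta||_1  s.t.  w = beta.
    Same three updates as in the slides, with the l1 prox (soft-thresholding)."""
    n, p = X.shape
    w, beta, mu = np.zeros(p), np.zeros(p), np.zeros(p)
    A = X.T @ X + rho * np.eye(p)        # system matrix of the w-update
    Xty = X.T @ y
    for _ in range(n_iter):
        w = np.linalg.solve(A, Xty - mu + rho * beta)              # w-update
        z = w + mu / rho
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / rho, 0.0) # prox step
        mu = mu + rho * (w - beta)                                 # dual ascent
    return beta
```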
ADMM - optimality conditions
Three first-order conditions:
Primal condition: ω* − β* = 0
Dual condition 1: ∇R̂_l(ω*) + µ* = 0
Dual condition 2: 0 ∈ (λ/2) ∂Ω_k^sp(β*)² + µ*

Resulting in a definition for the residuals at step t + 1:
Primal residuals: r^(t+1) = ω^(t+1) − β^(t+1)
Dual residuals: s^(t+1) = ρ(β^(t+1) − β^(t))
As the algorithm converges, the (norms of the) residuals tend to zero.
ADMM - choice of parameter
Parameter ρ is critical in ADMM: it controls how much the variables change.
It can be seen as a step size.
How to choose it?

Adaptive ADMM
One solution is to let it adapt to the problem:

\[
\rho^{(t+1)} = \begin{cases}
(1+\tau)\,\rho^{(t)} & \text{if } \|r^{(t+1)}\|_2 > \eta \|s^{(t+1)}\|_2 \text{ and } t \le t_{\max} \\
\rho^{(t)}/(1+\tau) & \text{if } \eta \|r^{(t+1)}\|_2 < \|s^{(t+1)}\|_2 \text{ and } t \le t_{\max} \\
\rho^{(t)} & \text{otherwise}
\end{cases}
\]

In practice, we use τ = 1, η = 10 and t_max = 100.
=> Adaptive ADMM forces the primal and dual residuals to be of a similar amplitude.
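The update rule reads directly as code; a sketch with the stated defaults:

```python
def update_rho(rho, r_norm, s_norm, t, tau=1.0, eta=10.0, t_max=100):
    """Adaptive ADMM step size: grow rho when the primal residual dominates,
    shrink it when the dual residual dominates, freeze after t_max iterations."""
    if t <= t_max:
        if r_norm > eta * s_norm:
            return (1.0 + tau) * rho
        if eta * r_norm < s_norm:
            return rho / (1.0 + tau)
    return rho
```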
Comparison
Figure: RDG vs. time (seconds) for k = 1, 5, 10 and 100, comparing adaptive ADMM, ADMM with ρ = 1, 10 and 100, and FISTA.
Accuracy vs size of the signature
Figure: AUC (0.45-0.65) vs. signature size (0-100) under single-run, ensemble-mean, ensemble-exponential and ensemble-stability selection aggregation; methods: Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso, Elastic Net.
Stability vs size of the signature
Figure: Stability (0-0.8) vs. signature size (0-100) under single-run, ensemble-average, ensemble-exponential and ensemble-stability selection aggregation; methods: Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso, Elastic Net.