Feature selection from gene expression data
Molecular signatures for breast cancer prognosis and inference of gene regulatory networks
Anne-Claire Haury, Centre for Computational Biology, Mines ParisTech, INSERM U900, Institut Curie
PhD Defense, December 14, 2012

Introduction

Gene expression
Figure: Central Dogma of Molecular Biology (Source: Wikipedia)
=> RNA as a proxy to measure gene expression.

Microarrays: measuring gene expression
Microarrays are one option: RNA is hybridized onto a chip, quantifying gene activity in different conditions at the genome scale. The resulting data are gene expression data.
[Figure: gene expression matrix (genes x samples), colored from low to high expression]

Feature selection: extract relevant information
Genome ≈ 25,000 genes. From gene expression, explain a phenomenon.
Feature selection: deciding which variables/genes are relevant.
Mathematically: explain a response Y using the relevant variables amongst (Xj)j=1...p:
Y = f(X1, X2, X3, X4, ..., Xp) + ε
- Objective 1: more accurate predictors.
- Objective 2: more interpretable predictors.
- Objective 3: faster algorithms.
(A minimal code sketch of this setup is given after the contributions below.)

Contributions of this thesis
Gene Regulatory Network Inference
- TIGRESS: a new method based on local feature selection.
- Ranked 3rd out of 29 at the DREAM5 challenge.
- A linear method, competitive with more complex algorithms.
Molecular signatures for breast cancer prognosis
- Select biomarkers to predict metastasis/relapse in breast cancer patients.
- Complete benchmark of feature selection methods.
- Investigation of the stability issue.
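To make the Y = f(X1, ..., Xp) + ε setup concrete, here is a minimal sketch of sparse feature selection on simulated expression-like data. The simulated data, the choice of the Lasso and all parameter values are illustrative assumptions, not the pipelines used in the thesis.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)

# Simulated expression matrix: 100 samples x 1,000 "genes" (illustrative sizes).
n_samples, n_genes = 100, 1000
X = rng.randn(n_samples, n_genes)

# Only a handful of genes actually drive the response Y = f(X) + noise.
true_support = [3, 42, 511]
y = X[:, true_support] @ np.array([2.0, -1.5, 1.0]) + 0.1 * rng.randn(n_samples)

# Feature selection with an L1-penalized linear model (one possible choice).
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("selected genes:", selected)   # ideally recovers the true support
```

With enough samples relative to the true sparsity, the selected set should coincide with the three informative genes; with real expression data the picture is far less clean, which motivates the rest of the thesis.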
Gene Regulatory Network Inference with TIGRESS
[Figure: inferred E. coli regulatory network; the individual gene labels from the original graphic are omitted here]
ACH, Mordelet, F., Vera-Licona, P. and Vert, J.-P., TIGRESS: Trustful Inference of Gene REgulation using Stability Selection, BMC Systems Biology, 2012, 6:145.
Gene Regulatory Networks
Gene regulation: the control of gene expression. Transcription factors (TFs) activate or repress target genes (TGs).
Gene Regulatory Network (GRN): a representation of the regulatory interactions between genes.
Figure: E. coli regulatory network (gene labels omitted)

DREAM5 network inference challenge
DREAM5 results:

Method             Network 1        Network 3        Network 4        Overall
                   AUPR   AUROC     AUPR   AUROC     AUPR   AUROC     score
GENIE3 [1]         0.291  0.815     0.093  0.617     0.021  0.518     40.28
ANOVerence [2]     0.245  0.780     0.119  0.671     0.022  0.519     34.02
Naive TIGRESS      0.301  0.782     0.069  0.595     0.020  0.517     31.1

[1] Huynh-Thu et al., 2010. [2] Kueffner et al., 2012.

Purposes
- Introduce TIGRESS: Trustful Inference of Gene REgulation using Stability Selection.
- Assess the impact of the parameters.
- Test and benchmark TIGRESS on several datasets.

Outline
1 Methods: regression-based inference, TIGRESS, material
2 Results: in silico network results, in vitro network results, undirected case (DREAM4)
3 Conclusion

Regression-based inference: hypotheses
Notations: ntf transcription factors (TFs), ntg target genes (TGs), nexp experiments.
Expression data: X (nexp x (ntf + ntg)). Xg: expression levels of gene g. XG: expression levels of the genes in G. Tg: candidate TFs for gene g.
Hypotheses:
1 The expression level Xg of a TG g is a function of the expression levels XTg of its candidate TFs Tg: Xg = fg(XTg) + ε.
2 A score sg(t) can be derived from fg, for all t in Tg, to assess the probability of the interaction (t, g).

Regression-based inference: main steps
Idea: consider as many problems as TGs (ntg subproblems); subproblem g ⇔ find the regulators TFs(g) of gene g.
1 For each TG, score all ntf candidate interactions:

           TG 1    TG 2    TG 3   ...   TG ntg
TF 1       0.23    0       ...          0.11
TF 2       0.97    0.03    ...          0
...        ...     ...     ...          ...
TF ntf     0       0       0      ...   0.76

2 Rank the scores altogether:
TF 12 -> TG 17    1
TF 23 -> TG 5     0.99
TF 2  -> TG 1     0.97
...               ...
3 Threshold to a value or to a given number N of edges.
(A generic code sketch of these three steps is given below.)
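The three steps above can be sketched generically as follows. The per-target scoring function used here (absolute coefficients of a ridge regression) is an assumption for illustration only; it is not the TIGRESS score introduced later.

```python
import numpy as np
from sklearn.linear_model import Ridge

def infer_network(X, tf_idx, n_edges):
    """Generic regression-based inference. X: (n_exp x n_genes) expression matrix,
    tf_idx: list of column indices that are candidate TFs."""
    n_exp, n_genes = X.shape
    scores = np.zeros((len(tf_idx), n_genes))
    for g in range(n_genes):                                  # one subproblem per target gene
        cand = [i for i, t in enumerate(tf_idx) if t != g]    # a gene cannot explain itself
        model = Ridge(alpha=1.0).fit(X[:, [tf_idx[i] for i in cand]], X[:, g])
        scores[cand, g] = np.abs(model.coef_)                 # placeholder score s_g(t)
    # Rank all candidate edges (t, g) together and keep the top n_edges.
    top = np.argsort(scores, axis=None)[::-1][:n_edges]
    t_ix, g_ix = np.unravel_index(top, scores.shape)
    return [(tf_idx[t], g, scores[t, g]) for t, g in zip(t_ix, g_ix)]
```

Any method fitting this skeleton only differs in how the (TF, TG) scores are produced; TIGRESS replaces the ridge-based placeholder with LARS plus stability selection, as described next.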
Adding a linearity assumption
TIGRESS' first hypothesis: regulations are linear:
Xg = fg(XTg) + εg = XTg ωg + εg
Consequence: if the coefficient ωg(t) = 0, there is no edge between t and g.

Adding a sparsity assumption
TIGRESS' second hypothesis: few TFs regulate each TG g:
for all g, #{t : ωg(t) ≠ 0} ≪ ntf
Possible algorithm: LARS with L steps => L TFs selected.

Stability Selection
Problem: the efficiency of LARS is limited: it responds badly to correlation and gives no confidence score for each TF.
Solution: Stability Selection with randomized LARS (Bach, 2008; Meinshausen and Bühlmann, 2009):
- Resample the experiments: run LARS many times (e.g. 1,000) with different training sets.
- "Resample" the variables: also reweight the variables, Xit <- Wt Xit, where Wt ~ U([α, 1]) for all t = 1, ..., ntf. The smaller α, the more randomized the variables.
- Get a frequency of selection for each TF.
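A minimal sketch of the randomized-LARS frequency computation described above, for a single target gene. The use of scikit-learn's Lars, the half-size subsamples and the default parameter values are assumptions for illustration, not the exact implementation distributed with TIGRESS.

```python
import numpy as np
from sklearn.linear_model import Lars

def selection_frequency(X_tf, x_g, L=5, R=1000, alpha=0.2, seed=0):
    """Frequency with which each candidate TF enters the first L LARS steps,
    over R runs on perturbed data (random subsample + random weights in [alpha, 1])."""
    rng = np.random.RandomState(seed)
    n_exp, n_tf = X_tf.shape
    counts = np.zeros(n_tf)
    for _ in range(R):
        rows = rng.choice(n_exp, size=n_exp // 2, replace=False)  # resample the experiments
        w = rng.uniform(alpha, 1.0, size=n_tf)                    # reweight the variables
        lars = Lars(n_nonzero_coefs=L).fit(X_tf[rows] * w, x_g[rows])
        counts[np.flatnonzero(lars.coef_)] += 1                   # TFs selected within L steps
    return counts / R   # one selection frequency per candidate TF
```

Calling this function once per target gene fills one column of the score matrix from the previous sketch.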
Stability selection path
[Figure: selection frequency of each TF over the subsamplings, as a function of the number L of LARS steps (example for one target gene)]

Scoring
Choose L, then score each TF with either the original scoring or the area scoring (a contribution of this work).
Let Ht be the rank at which TF t enters the LARS path. Then score(t) = E[φ(Ht)], with
- Original: φ(h) = 1 if h ≤ L, 0 otherwise.
- Area: φ(h) = L + 1 − h if h ≤ L, 0 otherwise.
=> Area scoring takes the value of the rank into account.

TIGRESS: summary
Idea: consider as many problems as TGs (ntg subproblems); subproblem g ⇔ find the regulators TFs(g) of gene g.
1 For each TG, fill its column of the score matrix using LARS + stability selection, a choice of L and a scoring rule.
2 Rank all scores together.
3 Threshold to a value or to a given number N of edges.

Parameters
TIGRESS needs four parameters to be set:
- scoring method (original, area, ...);
- number of runs R: large;
- randomization level α: between 0 and 1;
- number of LARS steps L: not obvious.

Data

Network                                #TFs   #Genes   #Chips   #Edges
DREAM5 Net 1 (in silico)               195    1,643    805      4,012
DREAM5 Net 3 (E. coli)                 334    4,511    805      2,066
DREAM5 Net 4 (S. cerevisiae)           333    5,950    536      3,940
E. coli net from Faith et al., 2007    180    1,525    907      3,812
DREAM4 Multifactorial Nets 1-5         100    100      100      176 / 249 / 195 / 211 / 193

Impact of the parameters: results on the in silico network
[Figure: overall score as a function of α and L, for area and original scoring, with 1,000, 4,000 and 10,000 runs]
- Area scoring is less sensitive than original scoring to α and L, and systematically outperforms it.
- The more runs, the better.
- Best values: α = 0.4, L = 2, R = 10,000.

How to choose L?
[Figure: distribution of the number of TFs per TG among the first 50,000 and 100,000 predicted edges, for L = 2 and L = 20]
L = 2: the number of TFs per TG is smaller and more variable. L = 20: more TFs per TG, less sparsity.
=> L should depend on the expected topology of the network.

TIGRESS vs the state of the art
[Figure: ROC and precision-recall curves for GENIE3, ANOVerence, CLR, ARACNE, Naive TIGRESS and TIGRESS]
TIGRESS is competitive with the state of the art.

Results on the E. coli network
[Figure: ROC and precision-recall curves for GENIE3, CLR, ARACNE and TIGRESS]
GENIE3, the Random Forests-based DREAM winner, outperforms all other methods.

False discovery analysis on E. coli
Main false positive pattern found by TIGRESS: [Figure: "should find" vs "does find" edge patterns]
Good news: spuriously inferred edges are close to true edges. Bad news: confusion (possibly due to the linear model?).
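The AUPR and AUROC values quoted in these benchmarks can be computed from any ranked edge list against a gold-standard network. The sketch below is one common way to do it (average precision as a stand-in for AUPR); it is not the official DREAM scoring script, and the edge encoding and variable names are assumptions.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate(edge_scores, gold_edges, candidate_edges):
    """edge_scores: dict {(tf, tg): score}; gold_edges: set of true (tf, tg) pairs;
    candidate_edges: all scorable (tf, tg) pairs (unscored edges get score 0)."""
    y_true = np.array([e in gold_edges for e in candidate_edges], dtype=int)
    y_score = np.array([edge_scores.get(e, 0.0) for e in candidate_edges])
    return average_precision_score(y_true, y_score), roc_auc_score(y_true, y_score)
```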
Undirected case: DREAM4 challenge
DREAM4: undirected networks (the TFs are not known in advance). A posteriori comparison of default TIGRESS and GENIE3:

Method     Network 1      Network 2      Network 3      Network 4      Network 5
           AUPR  AUROC    AUPR  AUROC    AUPR  AUROC    AUPR  AUROC    AUPR  AUROC
GENIE3     0.154 0.745    0.155 0.733    0.231 0.775    0.208 0.791    0.197 0.798
TIGRESS    0.165 0.769    0.161 0.717    0.233 0.781    0.228 0.791    0.234 0.764

Overall scores: GENIE3: 37.48; TIGRESS: 38.85.

Conclusion
Contributions:
- Automatization and adaptation of Stability Selection to GRN inference.
- Area scoring: better results and less sensitivity to the parameters.
- Insights into network behavior.
TIGRESS is linear, competitive, parallelizable and available (http://cbio.ensmp.fr/tigress), but is outperformed in some cases by random forests: the limits of linearity?
Perspectives: an adaptive/changeable value for L; group selection of TFs; use of time series/knock-out/replicate information.

Molecular signatures for breast cancer prognosis
ACH, Gestraud, P. and Vert, J.-P., The Influence of Feature Selection Methods on Accuracy, Stability and Interpretability of Molecular Signatures, PLoS ONE, 2011.
ACH, Jacob, L. and Vert, J.-P., Improving Stability and Interpretability of Molecular Signatures, 2010, arXiv:1001.3109.

Motivation
Prediction of breast cancer outcome: assist breast cancer prognosis based on gene expression; avoid adjuvant/preventive chemotherapy when it is not needed.
Gene expression signature: the data are expression arrays from the primary tumor site; among the genome, find the few (50-100) genes sufficient to predict metastasis/relapse.
Main challenge: high-dimensional data (few samples, many variables).

Background
- 2002: Van't Veer et al. publish the 70-gene signature MammaPrint.
- Since then: at least 47 published signatures (Venet et al., 2011).
- Little overlap, if any (Fan et al., 2006; Thomassen et al., 2007).
- Many gene sets are equally predictive (Michiels et al., 2005; Ein-Dor et al., 2005), even within the same dataset.
- Prediction discordances (Reyal et al., 2008).
- Stability at the functional level? Not sure.
Seeking stability: accuracy is not enough to choose the right genes => stability as a confidence indicator.
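One simple way to quantify the stability used as a confidence indicator above is the overlap between two signatures. The size-normalized intersection below is an illustrative choice, not necessarily the exact stability measure used in the thesis benchmark.

```python
def signature_overlap(sig_a, sig_b):
    """Fraction of shared genes between two signatures (lists of gene identifiers)."""
    a, b = set(sig_a), set(sig_b)
    return len(a & b) / float(min(len(a), len(b)))

# Example: two 100-gene signatures learnt on two different cohorts.
# An overlap close to 0 indicates unstable selection, close to 1 a reproducible signature.
```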
Contributions
1 Systematic comparison of feature selection methods in terms of: predictive performance; stability; biological stability and interpretability.
2 Group selection of genes: with predefined groups (Graph Lasso); with latent groups (k-overlap norm).
3 Evaluation of ensemble methods.

Evaluation
- Accuracy: how well the selected genes plus a classifier predict metastatic events on test data.
- Stability: how similar two lists of genes are.
- Interpretability: how much biological sense the selected genes make.

Data
Four public breast cancer datasets from the same technology (Affymetrix U133A):

GEO reference   #genes    #samples   #positives
GSE1456         12,065    159        40
GSE2034         12,065    286        107
GSE2990         12,065    125        49
GSE4922         12,065    249        89

Outline
1 A simple start
2 An attempt at enforcing stability: Ensemble methods
3 Using prior knowledge: Graph Lasso
4 Acknowledging latent team work
5 Conclusion

Classical feature selection / feature ranking methods

Type       Characteristics                                Examples
Filters    Univariate, fast; only depend on the data;     T-test, KL divergence,
           do not use the loss function                   Wilcoxon rank-sum test
Wrappers   Use the learning machine as a criterion;       SVM RFE, Greedy FS
           computationally expensive
Embedded   Learning + selecting; possible use of          Lasso, Elastic Net
           prior knowledge

Each of these algorithms returns a ranked list of genes to be thresholded (a sketch of the t-test filter, the simplest of these, is given below).
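As an example of the filter family, a t-test ranking of genes against the binary outcome can be written in a few lines. The variable names and the 100-gene signature size are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

def ttest_signature(X, y, size=100):
    """Rank genes by the two-sample t-statistic between classes y == 1 and y == 0,
    and keep the top `size` genes as the signature."""
    t_stat, _ = ttest_ind(X[y == 1], X[y == 0], axis=0, equal_var=False)
    ranking = np.argsort(-np.abs(t_stat))   # most differentially expressed genes first
    return ranking[:size]
```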
First results (100 genes over four databases)
[Figure: stability vs. accuracy for Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso and Elastic Net; accuracies roughly 0.58-0.65, stabilities below 0.16]

First conclusions
- Random selection is better than tossing a coin.
- The T-test is both the simplest and the best method.
- The Elastic Net is neither more stable nor more accurate than the Lasso.
- There is an accuracy/stability trade-off.
Next step: can we obtain better stability without decreasing accuracy?

Ensemble methods
Run each algorithm R times on subsamples, get R ranked lists of genes (r^b)_{b=1...R}, then aggregate into a score for each gene:
S(g) = (1/R) Σ_{b=1}^{R} f(r_g^b)
with
- average: f(r) = (p − r)/p
- exponential: f(r) = exp(−αr)
- stability selection: f(r) = δ(r ≤ k)
Sort S in decreasing order and threshold to get the final signature (a code sketch of this aggregation is given after the conclusions below).

Results (100 genes over four databases)
[Figure: stability vs. accuracy for the single-run and ensemble (exponential and stability-selection aggregation) versions of each method]

Functional stability (100 genes over four databases)
[Figure: functional stability vs. accuracy for the same methods]

Ensemble methods: conclusions
- The expected improvement in stability does not happen.
- There is a slight improvement in accuracy in some cases.
- There is a loss in functional stability.
- The T-test is still the preferred method.
Next step: can we do better by incorporating prior knowledge?
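Before moving on to prior knowledge, here is a minimal sketch of the aggregation rule S(g) used by the ensemble methods above. The rank convention (one rank per gene per run), parameter values and variable names are illustrative assumptions.

```python
import numpy as np

def aggregate(rank_lists, p, scheme="average", alpha=0.01, k=100):
    """rank_lists: R arrays, each giving the rank r_g of every gene g in one run;
    p: total number of genes. Returns genes sorted by decreasing aggregated score."""
    f = {
        "average":     lambda r: (p - r) / p,
        "exponential": lambda r: np.exp(-alpha * r),
        "stability":   lambda r: (r <= k).astype(float),   # stability-selection weighting
    }[scheme]
    S = np.mean([f(np.asarray(r, dtype=float)) for r in rank_lists], axis=0)
    return np.argsort(-S)
```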
Data
Expression data: Van't Veer et al., 2002; Wang et al., 2005.
PPI network with 8,141 genes (Chuang et al., 2007).
Assumption: genes that are close on the graph behave similarly.
Idea: instead of selecting single genes, select edges.

Selecting groups of genes: ℓ1 methods
Lasso: selects single genes (Tibshirani, 1996).
Why groups? Selecting similar genes improves stability and interpretability; smoothing out noise by "averaging" improves accuracy.
- Group Lasso (Yuan & Lin, 2006): implies group sparsity for groups of covariates that form a partition of {1, ..., p}.
- Overlapping group Lasso (Jacob et al., 2009): selects a union of potentially overlapping groups of covariates (e.g. gene pathways).
- Graph Lasso: uses groups induced by the graph (e.g. edges).

Is group sparsity enough?
ℓ1 methods work well, but face serious stability issues when groups are correlated.
Solution: randomization through stability selection.

Accuracy results
Test on the data from Wang et al., 2005.
[Figure: balanced accuracy of signatures learnt on the Van't Veer and Wang datasets, for the Lasso, Lasso + stability selection, Graph Lasso and Graph Lasso + stability selection]
Neither prior knowledge nor stability selection brings any improvement!

Stability results
[Figure: number of genes in the overlap vs. signature size for the Lasso, the Graph Lasso, and both with stability selection]
The Graph Lasso slightly improves stability.

Interpretability
[Figure: signature obtained using the Lasso]
[Figure: signature obtained using the Graph Lasso + stability selection]

Graph Lasso conclusion
The graphical prior seems to increase stability and interpretability. However: no change in accuracy.
Next step: grouping increases stability; now on to accuracy!

Latent grouping
Grouping genes makes sense; let the data tell which genes to select together.
The k-support norm: introduced by Argyriou et al., 2012; a trade-off between ℓ1 and ℓ2; equivalent to the overlapping group Lasso (Jacob et al., 2009) with all possible groups of size k; results in selecting groups that are not predefined.

Extreme randomization
Following Breiman's random forests: sample both the examples and the covariates; fewer variables means less correlation; give each gene a chance to be selected.
For each of the R runs: bootstrap the samples (classical ensemble method); sample the covariates by randomly choosing 10% of them; run the feature selection procedure on the restricted data.
=> Compute a frequency of selection: P(g selected | g preselected). (A code sketch of this scheme is given after the two figures below.)

Accuracy or stability?
[Figure: stability vs. accuracy for Random, T-test, Lasso, Elastic Net, k-support norm (k = 2, 10, 20), single-run and Extreme Randomization + stability selection]

Stability or redundancy?
[Figure: stability vs. correlation (redundancy) for the same methods]
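A minimal sketch of the extreme randomization scheme: the bootstrap over examples and the 10% covariate fraction come from the slide, while the inner selector (a Lasso keeping its nonzero coefficients, applied to the 0/1 outcome) and the parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def extreme_randomization(X, y, R=1000, frac=0.1, seed=0):
    """Estimate P(g selected | g preselected) by bootstrapping the samples and
    randomly preselecting a fraction `frac` of the covariates at each run."""
    rng = np.random.RandomState(seed)
    n, p = X.shape
    selected = np.zeros(p)
    preselected = np.zeros(p)
    for _ in range(R):
        rows = rng.choice(n, size=n, replace=True)                # bootstrap the examples
        cols = rng.choice(p, size=int(frac * p), replace=False)   # preselect 10% of the genes
        coef = Lasso(alpha=0.05).fit(X[np.ix_(rows, cols)], y[rows]).coef_
        preselected[cols] += 1
        selected[cols[np.flatnonzero(coef)]] += 1
    with np.errstate(invalid="ignore"):
        return selected / preselected   # conditional selection frequency per gene
```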
Conclusions
[Figure: the two previous scatter plots (accuracy/stability and redundancy/stability), shown side by side]
- Extreme Randomization improves accuracy.
- Grouping improves stability.
- But the two effects do not add up that well (redundancy).
- Is the T-test still the best trade-off?

Signature selection: conclusion
Contributions: a step-by-step study of the behavior of feature selection methods on several breast cancer datasets; a systematic analysis of accuracy, stability and interpretability; insights into the accuracy/stability trade-off.
What have we learned?
- The best methods are either the simple t-test or a complex black box.
- Grouping improves gene and functional stability.
- Randomization improves accuracy (sometimes) but has unwanted effects on stability.
- Accuracy/stability trade-off: stability => redundancy => lower accuracy.

Signature selection: perspectives
One unique signature? A single breast cancer subtype, many samples, a larger signature.
Is expression data sufficient? Probably not all the information is there; clinical data give the same accuracy (the same information?); possibly look jointly at genotype, methylation, clinical and expression data.
Is stability important? Not as important as accuracy and prediction concordance; possibly not even achievable.

Conclusion
Gene expression data: high-dimensional and noisy, but possibly containing important information.
Feature selection: find the needle in the haystack; output relevant genes to be studied further.
Main issues: results are not necessarily transferable across datasets; models rely on hypotheses!
Fixes: testing on many databases; keeping model hypotheses in mind / not being afraid of black boxes.

Acknowledgements
Fantine Mordelet, Paola Vera-Licona, Pierre Gestraud, Laurent Jacob.

The k-support norm
It can be shown that

$$\Omega^{sp}_k(\omega) = \left( \sum_{i=1}^{k-r-1} \big(|\omega|^{\downarrow}_i\big)^2 + \frac{1}{r+1} \Big( \sum_{i=k-r}^{d} |\omega|^{\downarrow}_i \Big)^2 \right)^{1/2},$$

where r is the unique integer in {0, ..., k − 1} satisfying

$$|\omega|^{\downarrow}_{k-r-1} > \frac{1}{r+1} \sum_{i=k-r}^{d} |\omega|^{\downarrow}_i \;\ge\; |\omega|^{\downarrow}_{k-r},$$

and |ω|↓_i is the i-th largest value of |ω| (with |ω|↓_0 = +∞).
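The closed form above translates directly into code; the sketch below assumes 1 ≤ k ≤ dim(ω) and does not handle numerical edge cases carefully.

```python
import numpy as np

def k_support_norm(w, k):
    """k-support norm via its closed form (1-based ranks of |w| sorted decreasingly,
    with the convention |w|_(0) = +inf)."""
    a = np.sort(np.abs(np.asarray(w, dtype=float)))[::-1]   # |w| in decreasing order
    a0 = np.concatenate(([np.inf], a))                      # a0[i] = i-th largest value
    for r in range(k):                                      # find the unique r in {0, ..., k-1}
        tail = a0[k - r:].sum() / (r + 1)                   # (1/(r+1)) * sum_{i>=k-r} |w|_(i)
        if a0[k - r - 1] > tail >= a0[k - r]:
            return np.sqrt((a0[1:k - r] ** 2).sum() + (r + 1) * tail ** 2)

# Sanity checks: k = 1 recovers the l1 norm, k = len(w) recovers the l2 norm.
```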
Link with the overlapping group Lasso
The k-support norm is equivalent to the overlapping group Lasso norm

$$\Omega^{sp}_k(\omega) = \min_{(v_I)_{I \in \mathcal{G}_k}} \left\{ \sum_{I \in \mathcal{G}_k} \|v_I\|_2 \;:\; \mathrm{supp}(v_I) \subseteq I, \; \sum_{I \in \mathcal{G}_k} v_I = \omega \right\},$$

where G_k denotes all subsets of {1, ..., d} of cardinality k.
Remark 1: it selects at least k variables.
Remark 2: the first selected group consists of the k variables most correlated with the response.

ADMM applied to the k-support problem
Our problem:
$$\min_{\omega, \beta} \; \hat{R}_l(\omega) + \frac{\lambda}{2}\, \Omega^{sp}_k(\beta)^2 \quad \text{s.t.} \quad \omega - \beta = 0.$$
Augmented Lagrangian:
$$L_\rho(\omega, \beta, \mu) = \hat{R}_l(\omega) + \frac{\lambda}{2}\, \Omega^{sp}_k(\beta)^2 + \mu^\top(\omega - \beta) + \frac{\rho}{2} \|\omega - \beta\|^2.$$
Algorithm: initialize β(1), ω(1), µ(1); then for t = 1, 2, ...:
$$\omega^{(t+1)} = \arg\min_{\omega} \Big\{ \hat{R}_l(\omega) + \mu^{(t)\top}\omega + \frac{\rho}{2}\|\omega - \beta^{(t)}\|^2 \Big\},$$
$$\beta^{(t+1)} = \mathrm{prox}_{\frac{\lambda}{2\rho}\Omega^{sp}_k(\cdot)^2}\Big(\omega^{(t+1)} + \frac{\mu^{(t)}}{\rho}\Big),$$
$$\mu^{(t+1)} = \mu^{(t)} + \rho\big(\omega^{(t+1)} - \beta^{(t+1)}\big).$$

ADMM: optimality conditions
Three first-order conditions:
- Primal condition: ω* − β* = 0.
- Dual condition 1: ∇R̂_l(ω*) + µ* = 0.
- Dual condition 2: 0 ∈ (λ/2) ∂Ω^sp_k(β*)² + µ*.
These yield the residuals at step t + 1:
- Primal residuals: r^(t+1) = ω^(t+1) − β^(t+1).
- Dual residuals: s^(t+1) = ρ(β^(t+1) − β^(t)).
As the algorithm converges, the norms of the residuals tend to zero.

ADMM: choice of the parameter
The parameter ρ is critical in ADMM: it controls how much the variables change and can be seen as a step size. How should it be chosen? One solution is to let it adapt to the problem (adaptive ADMM):

$$\rho^{(t+1)} = \begin{cases} (1+\tau)\,\rho^{(t)} & \text{if } \|r^{(t+1)}\|_2 > \eta \|s^{(t+1)}\|_2 \text{ and } t \le t_{\max}, \\ \rho^{(t)}/(1+\tau) & \text{if } \eta \|r^{(t+1)}\|_2 < \|s^{(t+1)}\|_2 \text{ and } t \le t_{\max}, \\ \rho^{(t)} & \text{otherwise.} \end{cases}$$

In practice, we use τ = 1, η = 10 and t_max = 100. => Adaptive ADMM forces the primal and dual residuals to keep a similar amplitude. (A code sketch of this update rule is given after the figures below.)

Comparison
[Figure: RDG (log scale) vs. time for adaptive ADMM, ADMM with ρ = 1, 10, 100, and FISTA, for k = 1, 5, 10 and 100]

Accuracy vs. size of the signature
[Figure: AUC vs. signature size (0-100 genes) for the single-run, ensemble-mean, ensemble-exponential and ensemble-stability-selection versions of Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso and Elastic Net]

Stability vs. size of the signature
[Figure: stability vs. signature size for the same methods and aggregation schemes]
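For completeness, the adaptive step-size rule from the ADMM slides as a standalone helper; τ, η and t_max follow the values quoted above, while the function name and the assumption that the residual norms are precomputed are illustrative.

```python
def update_rho(rho, r_norm, s_norm, t, tau=1.0, eta=10.0, t_max=100):
    """Adaptive ADMM penalty: grow rho when the primal residual dominates,
    shrink it when the dual residual dominates, freeze it after t_max iterations."""
    if t > t_max:
        return rho
    if r_norm > eta * s_norm:
        return (1.0 + tau) * rho
    if eta * r_norm < s_norm:
        return rho / (1.0 + tau)
    return rho
```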