Short lecture notes

Data Fusion Tutorial
Jane looks for help!
[BC]2 Basel, June 9, 2015
Jane's personal hairball!

"Hi Jane. NAR just published 176 new bio databases*!"
"Ohhh!"
"What's wrong?"
"How about stitching them in a single data table?"
"I could work this out, but not for every different data source out there. ... Messy."
"Think about all the different edge types! I have no idea how to make anything useful."
"Tried it. A nightmare! Think of GO annotations in the data table of yeast phenotypes!"
"Told you!"

[Cover illustration: a dense "hairball" network of DNA repair genes (TP53, BRCA1, BRCA2, RAD50, RAD51, RAD52, ATM, ATR, MSH2, MLH1, PCNA, WRN, and many more) and their annotations, including Homologous Recombination Repair, Double-Strand Break Repair, DNA Repair, Mismatch repair, Meiotic Recombination, Meiosis, BARD1 signaling events, Fanconi anemia pathway, BIOCARTA_ATM_PATHWAY and BIOCARTA_ATRBRCA_PATHWAY.]
* Fernandez-Suarez & Galperin, Nucleic Acids Research, 2013.
Large-scale data fusion
by collective matrix factorization
Tutorial at the Basel Computational Biology Conference,
Basel, Switzerland, 2015
These notes include introduction
to integrative data analysis with
examples from collaborative
filtering and systems biology,
and Orange workflows that we
will construct during the tutorial.
Tutorial instructors:
Marinka Zitnik and Blaz Zupan,
with help from members of
Bioinformatics Lab, Ljubljana.
Welcome to the hands-on Data Fusion Tutorial! This tutorial is designed
for data mining researchers and biologists with interest in data analysis
and large-scale data integration. We will explore latent factor models, a
popular class of approaches that have in recent years seen many
successful applications in integrative data analysis. We will describe the
intuition behind matrix factorization and explain why factorization
approaches are suitable when collectively analyzing many heterogeneous
data sets. To practice data fusion, we will construct visual data fusion
workflows using Orange and its Data Fusion Add-on.
If you haven’t already installed Orange, please follow the installation
guide at http://biolab.github.io/datafusion-installation-guide.
* See http://helikoid.si/recomb14/zitnik-zupan-recomb14.png for our full award-winning poster on data fusion.
Lesson 1: Everything is a Matrix
In many data mining applications there is plenty of potentially beneficial data available. However, these data naturally come in various formats and at different levels of granularity, can be represented in totally different input data spaces, and typically describe distinct data types.
A gene interaction network
can easily be converted
to a matrix. Each weighted
edge in the network
corresponds to a matrix
entry.
[Figure: a gene interaction network with nodes gacT, gemA, rdiA, racN, racJ, racI, xacA and racM.]
For joint predictive modeling of heterogeneous data we need a generic way to encode
data that might be fundamentally different from each other, both in type and in
structure. An effective way to organize a data compendium is to view each data set as a
matrix. Matrices describe dyadic relationships, that is, relationships between two
groups of objects. A matrix relates objects in the rows to objects in the columns.
Examples of data matrices commonly used in the analysis of biological data include
degrees of protein-protein interactions from the STRING database that are
represented in a gene-to-gene matrix:
[Figure: the same network encoded as a gene-to-gene matrix with rows and columns gacT, gemA, rdiA, racN, racJ, racI, xacA and racM.]
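A minimal sketch of this conversion in Python (the gene names follow the figure; the edge weights are invented for illustration):

```python
import numpy as np

# Gene names from the figure; the weights are made up for this sketch.
genes = ["gacT", "gemA", "rdiA", "racN", "racJ", "racI", "xacA", "racM"]
edges = [("gacT", "gemA", 0.9), ("gemA", "rdiA", 0.4), ("racN", "racJ", 0.7),
         ("racJ", "racI", 0.5), ("racI", "xacA", 0.8), ("xacA", "racM", 0.3)]

index = {g: i for i, g in enumerate(genes)}
R = np.zeros((len(genes), len(genes)))
for u, v, w in edges:
    # Each undirected weighted edge becomes a symmetric pair of matrix entries.
    R[index[u], index[v]] = w
    R[index[v], index[u]] = w
```

The same recipe works for any weighted network: pick an ordering of the objects, then fill in one matrix entry per edge.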
Binary matrices can be used to associate Gene ontology terms with cellular
pathways:
[Figure: part of the N-Glycan biosynthesis pathway (genes alg7, alg13, alg14, alg1, dpm1, dpm2, dpm3; also linked to fructose and mannose metabolism) with ontology terms such as protein N-linked glycosylation (GO:0006487), GO:0004168, GO:0004572 and GO:0008250, and orthology groups such as dolichol kinase (K00902), alpha-mannosidase II (K01231) and the oligosaccharyltransferase complex (K12668), together with the resulting ontology-by-pathway binary matrix.]
Binary relations between
two object types can be
represented with a binary
matrix.
research articles with Medical Subject Headings (MeSH):
Papers cited in PubMed
are tagged with MeSH
terms. We can use one
large binary matrix to
encode relations
between research articles
and MeSH terms.
[Figure: a literature-by-MeSH-term binary matrix. Example MeSH terms: Cell separation; Cytoplasmic vesicles/metabolism; Ethidium/metabolism; Immunity/innate; Mutation; Phagocytes/cytology; Phagocytes/immunology*; Phagocytosis*.]
or membership of genes in pathways, with one column for each pathway:
[Figure: genes (alg1, alg2, alg3, alg7, alg9, alg11, alg12, alg13, alg14, dpm1, dpm2, dpm3) and their membership in pathways such as N-Glycan biosynthesis, fructose and mannose metabolism, and GPI-anchor biosynthesis, encoded as a gene-by-pathway binary matrix.]
Just like the relations
between MeSH terms
and scientific papers, we
can encode pathway
memberships of genes in
one large matrix that has
genes in rows and
pathways in columns.
The structure of Gene Ontology can be represented with a real-valued matrix
whose elements represent distance or semantic similarity between the
corresponding ontological terms:
[Figure: part of the Gene Ontology graph, with terms such as response to stress, response to external stimulus, response to biotic stimulus, response to external biotic stimulus, response to other organisms, defense response, response to bacterium, defense response to other organism and defense response to bacterium, next to the corresponding square matrix of Gene Ontology terms.]
Any ontology can be
represented with a
square matrix. We can use
the ontology to measure
distances between its
entities, and encode
these distances in a
distance matrix.
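As a sketch of how such a distance matrix can be computed, the following runs breadth-first search over a small hand-made fragment of the ontology graph (term names come from the figure; the simplified is-a edges are an assumption, and shortest-path length stands in for more refined semantic-similarity measures):

```python
from collections import deque
import numpy as np

# Illustrative fragment of a GO-like graph; these simplified edges are an
# assumption for this sketch, not the real Gene Ontology structure.
edges = [
    ("defense response to bacterium", "response to bacterium"),
    ("defense response to bacterium", "defense response to other organism"),
    ("response to bacterium", "response to other organism"),
    ("defense response to other organism", "defense response"),
    ("defense response to other organism", "response to other organism"),
    ("response to other organism", "response to biotic stimulus"),
    ("defense response", "response to stress"),
]
terms = sorted({t for e in edges for t in e})
idx = {t: i for i, t in enumerate(terms)}

# Treat the graph as undirected; shortest-path length is our term distance.
adj = {t: set() for t in terms}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

D = np.full((len(terms), len(terms)), np.inf)
for s in terms:
    # Breadth-first search from each term fills one row of the matrix.
    D[idx[s], idx[s]] = 0
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if np.isinf(D[idx[s], idx[v]]):
                D[idx[s], idx[v]] = D[idx[s], idx[u]] + 1
                queue.append(v)
```

Real semantic-similarity measures (e.g. Resnik's) also use annotation statistics, but the result is the same kind of square term-by-term matrix.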
Lesson 2: The Challenge
Suppose we would like to identify genes whose mutants exhibit a certain
phenotype, e.g., genes that are sensitive to Gram negative bacteria. In addition to
current knowledge about phenotype annotations, i.e. data encoded in a gene-to-phenotype
matrix, which might be incomplete and contain some erroneous
information, there exists a variety of circumstantial evidence, such as gene
expression data, literature data, annotations of research articles, etc.
An obvious question is how to link these seemingly disparate data sets. In many
applications there exists some correspondence between different input dimensions.
For example, genes can be linked to MeSH terms via gene-to-publication and
publication-to-MeSH-term data matrices. This is an important observation, which
we exploit to define a relational structure of the entire data system.
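This kind of linking can be sketched with a matrix product over toy binary matrices (all names and values below are invented): genes become related to MeSH terms through the publications they share.

```python
import numpy as np

# Toy binary matrices (made-up values): genes-to-publications and
# publications-to-MeSH-terms.
R_gene_pub = np.array([[1, 0, 1],    # gene A cited in publications 1 and 3
                       [0, 1, 0]])   # gene B cited in publication 2
R_pub_mesh = np.array([[1, 0],       # publication 1 tagged with MeSH term X
                       [0, 1],       # publication 2 tagged with MeSH term Y
                       [1, 1]])      # publication 3 tagged with both terms

# Entry (i, j) counts publications that mention gene i and carry term j.
R_gene_mesh = R_gene_pub @ R_pub_mesh
print(R_gene_mesh)  # [[2 1]
                    #  [0 1]]
```

No gene-to-MeSH data set exists here, yet the product of the two matrices induces one; this is exactly the correspondence that data fusion exploits.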
The data excerpt on the
right comes from a gene
prioritization problem
where our goal was to
find candidates for
bacterial response genes
in the social amoeba
Dictyostelium. Other than
for a few seed genes,
there was no data
from which we could
directly infer the bacterial
phenotype of mutants.
Hence, we considered
circumstantial data sets
and hoped that their
fusion would uncover
interesting new bacterial
response genes.
[Figure excerpt: mutant phenotype labels such as Gram neg. defective, Aberrant spore color, Decreased chemotaxis and Gram pos. defective.]
The major challenge for such problems is how to jointly model multiple types of
data heterogeneity in a mutually beneficial way. For example, in the scheme below,
can information about the relatedness of MeSH terms and similarity between
phenotypes from the Phenotype Ontology help us to improve the accuracy of
recognizing Gram negative defective genes?
[Figure: a data fusion graph. Genes (spc3, swp1, kif9, alyL, nagB1, gpi, shkA, nip7) are linked to Mutant Phenotypes through phenotype data, to Publications through PubMed data, to Timepoints through expression data, and to MeSH terms through MeSH annotations; the Phenotype Ontology relates phenotypes, and the MeSH Ontology relates MeSH terms.]
Lesson 3: Recommender Systems
Sparse matrices and matrix completion have been thoroughly addressed in the area of
machine learning called recommender systems. Several methods from this field form
the foundation for matrix-based data fusion. Hence, we digress here from fusion to
recommender systems, and for a while, from biology to movies.
How would you decide which movie to recommend to a friend? Obviously, a useful
source of information might be the ratings of the movies your friend has seen in the
past, i.e. from one star up to five stars. Movie recommender systems primarily use user
ratings, from which they estimate correlations between different movies
and similarities between users, and infer a prediction model that can be used to
recommend which movie a user should see next.
For example, in the figure below we see a movie ratings data matrix containing
information for four users and four movies. Notice that in a real setting such
matrices can contain information for millions of users and hundreds of thousands of
movies. However, each individual user typically sees only a small proportion of all the
movies and rates even fewer of them. Hence, data matrices in recommender systems
are typically extremely sparse, e.g., it is common that up to ~99% of matrix elements
are unknown. Recommender systems exploit this sparsity together with the strong
relational structure of the data, i.e. "you might enjoy movies that users similar to you
are enthusiastic about" and "you might like movies that are similar to the movies you
have already seen and rated favorably."
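A sparse rating matrix of this kind is easy to mock up (the user and movie names come from the lesson; the rating values are invented); unknown entries are marked with np.nan:

```python
import numpy as np

# Movie ratings for four users and four movies; np.nan marks movies a
# user has not rated (the values are illustrative, not the figure's).
users = ["John", "Kate", "Alex", "Mike"]
movies = ["Passengers", "War of the Worlds", "Bride Wars", "The Matrix Reloaded"]
R = np.array([[2.0,    np.nan, 3.0,    np.nan],
              [np.nan, 5.0,    np.nan, 4.0],
              [4.0,    np.nan, np.nan, 5.0],
              [np.nan, 5.0,    np.nan, np.nan]])

observed = ~np.isnan(R)
sparsity = 1 - observed.mean()  # fraction of unknown entries
```

Even this toy matrix is more than half empty; production rating matrices are far sparser still.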
The movie rating matrix
has users in rows and
movies in columns. We
made this explicit in this
simple graph: object
types are represented as
nodes (User, Movie) and
an edge is labeled with a
matrix that relates them.
Is there an analogy between recommender systems and challenges in systems biology?
We will answer this question in the next lessons.
John, Kate, Alex and
Mike rated a selection
from four movies,
Passengers, War of the
Worlds, Bride Wars and
The Matrix Reloaded.
John, for example, has
seen Passengers and
Bride Wars, and did not
like them so much. Which
of the two other movies,
if any, should he see?
[Figure: a sparse 4x4 rating matrix with users John, Kate, Alex and Mike in rows and the four movies in columns (observed ratings between 2 and 5), next to a simple graph with nodes User and Movie joined by an edge labeled with the rating matrix.]
Lesson 4: Matrix Factorization and
Completion
Taking our four-by-four movie ratings matrix we can try to factorize it into a product
of two much smaller latent matrices called latent factors. One latent matrix describes
latent profiles of the users and the other matrix contains latent data representation
of the movies.
Two-factorization of the user-movie rating matrix from the previous page. A factorization rank of 2 was used. Should the factorization rank be the same for both latent matrices?

For example, each of our four users is described by a latent profile of length two
and similarly, each movie is explained via two latent components, i.e. L1 and L2.
The dimensionality of a latent model is typically called the factorization rank.

User latent matrix:
         L1    L2
  John   0.2   0
  Kate   0     0.5
  Alex   0.5   0
  Mike   0     0.5

Movie latent matrix:
       Passengers  War of the Worlds  Bride Wars  The Matrix Reloaded
  L1   6.3         0                  1.1         8
  L2   3.9         10.7               0           3.3
The latent model, that is, the
two latent matrices, is
complete, hence their
product is also a
complete matrix. This
product (matrix on the
right) is an estimate of the
original matrix (matrix on
the left). How good is our
reconstruction? Which of
the two movies should
be recommended to Mike?
The challenge of matrix factorization stems from the difficulty of estimating the
latent matrices in such a way that their matrix product minimizes some measure of
discrepancy between the input data matrix and its reconstruction obtained by
factorization. Importantly, the reconstructed matrix is complete, i.e. all of its elements
are defined, which we exploit for making predictions.
[Figure: the sparse rating matrix from Lesson 3 (left) and its complete reconstruction (right), obtained as the product of the two latent matrices:]

  John  ~  1.3   0     0.2   1.6
  Kate  ~  2     5.4   0     1.7
  Alex  ~  3.2   0     0.6   4
  Mike  ~  2     5.4   0     1.7
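The whole procedure can be sketched in a few lines of Python: gradient descent on the squared error over the observed entries only, with invented ratings and a random initialization (this is a minimal illustration of the idea, not the algorithm used by the Orange add-on):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sparse ratings (values made up); np.nan marks unrated movies.
R = np.array([[2.0,    np.nan, 3.0,    np.nan],
              [np.nan, 5.0,    np.nan, 4.0],
              [4.0,    np.nan, np.nan, 5.0],
              [np.nan, 5.0,    4.0,    np.nan]])
mask = ~np.isnan(R)
target = np.where(mask, R, 0.0)

rank = 2
U = rng.random((4, rank))  # latent user profiles
V = rng.random((rank, 4))  # latent movie profiles

lr = 0.01
for _ in range(5000):
    E = (U @ V - target) * mask  # error on observed entries only
    gU, gV = E @ V.T, U.T @ E    # gradients of 0.5 * ||E||^2
    U -= lr * gU
    V -= lr * gV

R_hat = U @ V  # complete matrix: every user-movie pair now has an estimate
```

Note that the loss never looks at the missing entries, yet the reconstruction R_hat fills them in; that is the matrix completion step.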
Lesson 5: Matrix Tri-Factorization
The latent matrices of the tri-factorization:

User recipe matrix:
         U1    U2
  John   0.2   0.3
  Kate   0.8   0.2
  Alex   0.7   1.2
  Mike   0.8   0.1

Backbone matrix:
         M1    M2
  U1    -4.4   9.1
  U2     6.7  -5.8

Movie recipe matrix (transposed):
       Passengers  War of the Worlds  Bride Wars  The Matrix Reloaded
  M1   0.9         0.2                0.2         1
  M2   0.6         0.8                0.1         0.7
Similarly to the previous lesson, the goal of matrix tri-factorization is to estimate
three latent matrices that provide a quality approximation of the observed entries in
the input data matrix. By selecting a sufficiently small factorization rank, we compress
the data, which ensures generalization and consequently prediction of how a given
user would enjoy a particular movie he has not seen before.

Just like for two-factorization, in tri-factorization the latent
matrices are complete.
So is their product (the
matrix on the right). This
product is also an
estimate of the original
matrix. How good is our
estimate? Which movie
should be recommended
to Mike?
The backbone matrix (a
2x2 matrix in the middle)
can be seen as a
compressed version of the
original user-to-movie
rating matrix. It has
"meta" users in rows and
"meta" movies in
columns. We can use the
two recipe matrices (the left
and right matrices) to
transform the backbone
matrix back to the
original user-to-movie
space.
So far, we have found a decomposition of the movie ratings matrix into two latent
matrices. An alternative approach is to factorize it into three latent matrices: one
latent matrix that expresses the degrees of user membership to each of the latent
components, i.e. the user recipe matrix; another latent matrix with memberships of movies
to movie-specific latent components, i.e. the movie recipe matrix; and a third matrix, i.e. a
backbone matrix, that captures the interactions between latent components specific to the
users, i.e. U1, U2, and components specific to the movies, i.e. M1, M2.
[Figure: the sparse rating matrix (left) and its complete reconstruction by tri-factorization (right):]

  John  ~  1.1   0.3   0.2   1.2
  Kate  ~  1.7   4.5   0.2   2.1
  Alex  ~  4.1   0.5   0.9   4.5
  Mike  ~  1.5   4.8   0.1   1.9
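The reconstruction in the figure can be reproduced by multiplying the three latent matrices (values read off the figure; Mike's second latent component is taken as 0.1, which is the reading consistent with the reconstructed matrix):

```python
import numpy as np

# Latent matrices of the tri-factorization from this lesson's figure.
U = np.array([[0.2, 0.3],    # John
              [0.8, 0.2],    # Kate
              [0.7, 1.2],    # Alex
              [0.8, 0.1]])   # Mike
S = np.array([[-4.4,  9.1],  # backbone: "meta" users x "meta" movies
              [ 6.7, -5.8]])
Vt = np.array([[0.9, 0.2, 0.2, 1.0],   # M1 across the four movies
               [0.6, 0.8, 0.1, 0.7]])  # M2

R_hat = U @ S @ Vt  # complete estimate of the rating matrix
print(np.round(R_hat, 1))
```

Rounding to one decimal recovers the four rows of the reconstructed matrix shown above.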
Lesson 6: Tri-Factorization in Orange
In an Orange workflow,
components (widgets)
load or process data and
pass the information on to
other widgets. A widget's
inputs are on its left, and
its outputs on its right. Try
adding a Data Table
widget to display an
input data set or any of
the latent factors!
Let's try matrix tri-factorization in practice. We construct a visual workflow in
Orange, a data mining suite. The workflow loads movie ratings, represents them
with a data matrix, tri-factorizes it and explores the latent factors.
In this tutorial we organize the data sets using a structure that we call a data fusion
graph. It shows the relational structure of the entire data collection. Each distinct
type of objects, e.g., users, movies, is represented with a node and each data set
corresponds to an edge that relates two types of objects, e.g., movie ratings data
relate users with the movies.
In the Latent Factors widget
one can select any of the
latent matrices and then
explore them further, say,
through hierarchical
clustering.
Lesson 7: Collective Matrix Factorization
and Sharing of Latent Factors
The Orange workflow on this
page adds another data
source: movie genres.
How does that affect the
results of the movie
clustering?
In the previous lesson we analyzed a single data set. Ultimately, we would like to
collectively tri-factorize many heterogeneous data sets across different input spaces.
Suppose we have collected information about movie genres. This is a relation that
relates movies to genres, hence our data fusion graph gets an additional node, i.e.
genres, and an edge linking movies with genres.
To fuse heterogeneous data at large scales we need to define the kind of knowledge
that can be transferred between related data matrices, types of objects and prediction
tasks. Data fusion algorithms typically rely on one of the following three
assumptions:
Relation transfer: We build the relational map called a data fusion graph of all
the relations considered in data fusion and relax the assumptions about
independently and identically distributed relations.
Object type transfer: We assume that there exists a common feature space
shared by the input spaces, which can be used as a bridge to transfer
knowledge.
Parameter transfer: We make use of latent model parameterization and
assume that heterogeneous input spaces have shared latent parameters and
hyperparameters.
In collective matrix factorization we achieve data fusion by sharing latent matrices
across related data sets.
In our running example we reuse the movie recipe matrix in both decompositions
of user-to-movie as well as movie-to-genre matrices. Importantly, collective matrix
factorization estimates the latent matrices for all data sets in a compendium
simultaneously, which ensures transfer of knowledge between data, i.e. data fusion,
and presents many unique opportunities from the application perspective and
challenges in algorithmic design.
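A minimal sketch of this idea in the spirit of collective matrix factorization: the movie factor is shared directly between the two decompositions (for brevity the backbone matrices of tri-factorization are omitted, and all data values are invented):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (made-up values): user-movie ratings and movie-genre memberships.
R_um = rng.integers(1, 6, size=(5, 4)).astype(float)  # users x movies
R_mg = rng.integers(0, 2, size=(4, 3)).astype(float)  # movies x genres

k = 2
U = rng.random((5, k))
M = rng.random((4, k))  # movie factor, SHARED by both decompositions
G = rng.random((3, k))

def loss():
    return ((U @ M.T - R_um) ** 2).sum() + ((M @ G.T - R_mg) ** 2).sum()

start = loss()
lr = 0.01
for _ in range(2000):
    E1 = U @ M.T - R_um
    E2 = M @ G.T - R_mg
    gU = E1 @ M
    gM = E1.T @ U + E2 @ G  # the shared factor sees BOTH data sets
    gG = E2.T @ M
    U -= lr * gU
    M -= lr * gM
    G -= lr * gG
```

The gradient of the shared movie factor sums contributions from both data sets; that single line is where the transfer of knowledge, i.e. data fusion, happens.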
Lesson 8: More Complex Fusion Schemes,
Data Sampling and Completion Scoring
So far we fused at most two data sets. Let’s proceed by constructing a larger data
compendium. There are many other sources of information that might be
informative for movie recommendation, for example, user demographics profiles,
movie casting, information about movie directors and screenplays, scenery, etc.
We construct an Orange workflow that considers four data sources, i.e. movie
ratings, movie casting, genres and relationships between actors, and fuses them via
collective matrix factorization.
This data fusion
configuration is already a
complex one. We are
using four different data
sources. Try having a
Fusion Graph widget
window open, so that
you can see the data
fusion schema as it
shapes up when adding
the data sets.
A simple way to assess the benefits of integrative data analysis over the analysis of a
single homogeneous data set is to compare the quality of predictions made by data
fusion with the quality of a prediction model inferred from only a part of the data
collection.
The assessment is fair if we evaluate predictions for data that are hidden from the
algorithm during prediction model inference. There are four different ways of
partitioning a data matrix into a training and a test set:
In predictive modeling tasks, such as movie recommendation, where we regress
against the target variable, i.e. the movie rating, we can evaluate model quality by
reporting a variety of measures, including the root mean squared error (RMSE). A
lower RMSE value indicates a better model. Alternatively, if our goal were to
rank the movies from what the model believes are the most enjoyable to the least
enjoyable for a given user, we would use the area under the ROC curve (AUC).
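A sketch of such an evaluation with invented ratings: hold out a few entries, fit on the rest, and report RMSE on the held-out entries (a trivial mean predictor stands in for the factorization model):

```python
import numpy as np

# Toy ratings (made-up values) for illustrating train/test splits and RMSE.
R = np.array([[2.0, 4.0, 3.0, 5.0, 4.0],
              [5.0, 3.0, 4.0, 2.0, 5.0],
              [4.0, 4.0, 2.0, 5.0, 3.0],
              [3.0, 5.0, 5.0, 4.0, 2.0]])

# Hide a few entries as the test set; a model is fit on the rest.
test = np.zeros(R.shape, dtype=bool)
test[[0, 1, 3], [2, 4, 0]] = True
train = ~test

def rmse(pred, truth, mask):
    # Root mean squared error over the masked entries only; lower is better.
    return np.sqrt(((pred - truth)[mask] ** 2).mean())

# Any matrix completion model can stand in here; as a trivial baseline we
# predict every held-out rating with the mean of the training entries.
baseline = np.full_like(R, R[train].mean())
score = rmse(baseline, R, test)
```

Swapping the baseline for the fused factorization model, and comparing the two scores, is exactly the assessment described above.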
How does the quality of
reconstruction change
when adding or
removing data sets from
the fusion schema? Try it
out! Should RMSE
always decrease with
new data sources being
added?
Lesson 9: “Meta Genes” - Latent Profiling
Until now we focused on non-biological data. We now apply a latent factor model
to gene expression data. The microarray data for this example is from an influential
paper by DeRisi, Iyer, and Brown (Science 1997), who explored the metabolic and
genetic control of gene expression on a genomic scale. The authors used DNA
microarrays to study temporal gene expression of almost all genes in baker’s yeast
Saccharomyces cerevisiae during the metabolic shift from fermentation to respiration.
Expression levels were measured at seven time points during the diauxic shift, i.e.
T1 to T7.
[Figure: the gene expression data matrix has genes in rows and experiments, i.e. time points T1 to T7, in columns; characteristic temporal expression profiles V1 to V5 are shown alongside.]
As we will see in this and in the next lessons, collective matrix factorization is a
generic and flexible tool for integrative data analysis in different domains, e.g.,
recommender systems and functional genomics.
What is similar between
matrix-based movie
recommendation system
and data fusion in
molecular biology?
Everything! We’ll use the
same set of Orange
widgets for bio data
fusion. All tricks that we
have learned so far
apply.
We construct an Orange workflow that reads the expression data into Orange using
the Table to Relation widget, tri-factorizes the data, and explores the estimated latent
data representation using various Orange widgets, such as Linear Projection,
Scatter Plot and Multi-dimensional Scaling (MDS).
By factorizing gene expression data we obtained three latent data matrices: a gene
recipe matrix, an experiment recipe matrix, and a backbone matrix that relates
both recipe matrices in the latent space.
It is common in matrix factorization algorithms to interpret the experiment recipe
matrix as a matrix that reports on expression of "meta genes," i.e. "genes" whose
profiles are obtained from the original gene expression profiles by a linear (or
non-linear, depending on the latent factor model) transformation. Similarly, one can see
the gene recipe matrix as a matrix that reports on expression of genes in "meta
experiments," i.e. "experiments" which cannot be interpreted in an intuitive
manner but which can improve the quality of prediction models applied to them,
e.g., clustering of genes based on their recipe matrix and enrichment analysis of
detected clusters.
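As a sketch of one such use of the latent matrices, the following computes pairwise distances between gene latent profiles (the gene recipe matrix here is random, standing in for the output of the Latent Factors widget):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for a gene recipe matrix: 10 genes x 3 latent components
# (in the tutorial this matrix would come from the Latent Factors widget).
G = rng.random((10, 3))

# Pairwise Euclidean distances between gene latent profiles; feeding such
# a matrix into hierarchical clustering groups genes with similar
# "meta experiment" profiles, ready for enrichment analysis.
D = np.sqrt(((G[:, None, :] - G[None, :, :]) ** 2).sum(-1))

# Each gene's nearest neighbour in the latent space (index 0 is the gene itself).
nearest = np.argsort(D, axis=1)[:, 1]
```

The distance matrix D (or the labels of the resulting clusters) can then be passed on to GO enrichment analysis, as suggested in the margin note.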
A combination of
hierarchical clustering
and GO enrichment
analysis is a cool way to
explore results of data
fusion. Genes in the data
set we are exploring are
also function-labeled.
Any other ideas how to
use latent matrices?
Classification, perhaps?
And then estimation of
AUC in cross-validation?
Lesson 10: The Yeast Case Study
Next, we collectively analyze eight data sets from the molecular biology of the yeast S.
cerevisiae (load the data sets from http://bit.ly/1Gb8SJ7). We organize them in a data
fusion graph with six object types and eight edges, one for each data set.
This schema looks
boring. But it offers so
much for the patient one!
Try adding matrix
sampling and RMSE-based
evaluation! Or
clustering with gene set
enrichment. Or data
projection based on any
of the latent matrices.
Lesson 11: Latent Matrix Chaining
The concept of chaining latent matrices is important because it allows us to profile
objects in the latent space of any other object type based on the connectivity in the
data fusion graph.
In the simplest scenario, where object types are adjacent in the fusion graph, e.g.,
"Genes" and "Experiments" from Lesson 9, chaining constructs data profiles of one
object type, e.g., genes, in the latent space of another object type, e.g., experiments,
by multiplying the recipe matrix of the first object type by the backbone matrix of
the data set. The resulting profile matrix has objects of the first type, e.g., genes, in
rows and the latent components of the second type, e.g., experiments, in columns.
However, the power of chaining becomes apparent when we would like to profile
objects whose types are not direct neighbors in the fusion graph, such as "Genes" and
"Literature Topics," i.e. MeSH terms, in the fusion graph from Lesson 10. To profile
genes in the latent space of literature topics, chaining starts with the recipe matrix
of genes and multiplies it by the backbone matrices of the gene-to-literature and
literature-to-literature-topic data sets on the path from "Genes" to "Literature Topics" in the
fusion graph. This procedure yields profiles of genes in the latent space of literature
topics.
Latent matrix chaining constructs dense profiles that include the most informative
features obtained by collectively compressing data via matrix factorization.
Intuitively, chaining is able to establish links between genes and literature topics
even though relationships between these object types are not available in input
data.
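Chaining itself is just a chain of matrix products. A sketch with invented shapes and random values (in the tutorial these matrices would come out of collective factorization):

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative latent matrices on the path Genes -> Literature -> Literature
# Topics; shapes are assumptions: 20 genes, factorization ranks 4, 3 and 2.
G_gene = rng.random((20, 4))      # gene recipe matrix
S_gene_lit = rng.random((4, 3))   # backbone of the gene-to-literature data set
S_lit_topic = rng.random((3, 2))  # backbone of the literature-to-topic data set

# Chaining: multiply the latent matrices along the path to profile genes
# in the latent space of literature topics.
profile = G_gene @ S_gene_lit @ S_lit_topic  # 20 genes x 2 topic components
```

Note that no gene-to-topic data set was ever observed; the profile matrix exists only because the latent matrices along the path can be multiplied out.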
A conceptual presentation
of profiling genes in the
latent space of MeSH
terms. The MeSH-based
gene profiles are
constructed by multiplying
the latent factors on the path
from one object type to the other.

[Figure: along the path Gene -> Literature -> Literature Topic in the fusion graph, the chain of latent matrices is multiplied out; the product is the gene profile matrix, with genes in rows and literature-topic components (topics of yeast biology) in columns.]
In Orange we chain the latent matrices of a data system using the Chaining widget.
The Chaining widget allows us to select a start object type and a target object type
(highlighted in orange below) in the latent fusion graph. It then computes the
chains associated with the selected nodes of the fusion graph. The profile
matrices obtained this way can be used for further data analysis.
Chaining and
construction of relations
for objects that were
originally not related in
any input data set comes
as an extra benefit of
matrix-based data fusion.
Try exploring the
chaining results by
feeding them into a data
table first, and then
pushing them through an
unsupervised or
supervised analysis
pipeline.
Lesson 12: Case Studies in Data Fusion
Identification of the
mechanisms of action of
chemical compounds is a
crucial task in drug
discovery. We have
integrated 6 data sets to
improve prediction of
pharmacologic actions of
chemical compounds (IEEE
TPAMI 2015).
[Figure: a data fusion graph with object types Chemical (1), Pharmacologic Action (2), PMID (3), Depositor (4), Substructure Fingerprint (5) and Depositor Category (6), relation matrices R12, R13, R14, R15 and R46, and a constraint matrix Θ1 on chemicals.]
We have fused 11 systems-level
molecular data sets to
predict disease-disease
associations (Sci Reports
2013).
[Figure: a hierarchy of disease classes inferred from the fused data, from a root layer through Level 1 (broad classes such as cancer, inherited metabolic disorders, nervous system diseases, respiratory system diseases, cardiovascular system disease), Level 2 (e.g. immune system diseases, cognitive disorders, acquired metabolic diseases, bile duct disease, hemolytic-uremic syndrome) and Level 3 (e.g. pulpitis, periodontitis), down to individual diseases such as gastric lymphoma, Hodgkin's lymphoma, Cushing's syndrome, dysgerminoma and factor XIII deficiency; class sizes range from a single disease to eighteen diseases.]
Data fusion of 11 data sets
substantially raised the
accuracy of gene function
predictions, also when
compared to a kernel-based
data integration approach
(IEEE TPAMI 2015).
[Figure: the data fusion graph from the gene function study, with object types including Gene (1), Experimental Condition (2), PMID, MeSH Descriptor, GO Term (6) and KEGG Pathway, related by matrices R12, R13, R14, R16, R42, R45 and R62; a page of the accompanying IEEE TPAMI paper is reproduced alongside.]

[Table 1 of the paper: cross-validated F1 and AUC accuracy scores for fusion by matrix factorization (DFMF), a kernel-based method (MKL), random forests (RF) and relational learning-based matrix factorization (tri-SPMF), for prediction tasks on 100 D. discoideum genes, 1000 D. discoideum genes, the whole D. discoideum genome, and pharmacologic actions.]
Prioritization of genes in
a quest to identify the
most promising
candidates for bacterial
response in
Dictyostelium fused 13
input data sets. Out of 9
top-rated candidates, 8
predictions were
confirmed in the wet lab
(submitted, 2015).
Averaged F1 score
Activation of adeny. cyc. act.
Chemotaxis
Chemotaxis to cAMP
Phagocytosis
Response to bacterium
Cell-cell adhesion
Actin binding
Lysozyme activity
Seq.-spec. DNA bind. t. f. a.
(a)0.6
A
0.4
0.0
Tri-factorization
R12 ,
R12
A
1
A-B
Rof12matrix
, R13R
12 , R13 ,
1
11
58
21
33
51
14
43
4
79
0.834
0.981
0.922
0.956
0.899
0.883
0.676
0.782
0.956
Step 1. Compressive
data fusion
0.820
Data fusion
graph
A
0.781
0.786
0.862
0.901
0.761
0.856
0.658
0.750
0.901
RF
AUC
0.758
0.538
0.798
0.789
0.785
0.728
0.642
0.754
0.732
0.601
0.724
0.767
0.619
0.761
0.725
0.737
0.625
0.759
tri-SPMF
F1
AUC
0.729
0.804
0.838
0.836
0.817
0.799
0.671
0.747
0.892
0.731
0.810
0.815
0.810
0.831
0.834
0.682
0.625
0.852
E
B
5.4 BMatrix Factor Initialization
Study
We studied the effect of matrix factor initialization on
DFMF by observing the reconstruction error after one
and after twenty iterations of optimization procedure,
the latter being
about one fourth of the iterations
Collective matrix
D
C
requiredC forfactorization
the optimization
algorithm to converge
when predicting gene functions. We estimated the
error relative to the optimal (k1 , k2 , . . . , k6 )-rank approximation given by the SVD.
For iteration v and
F
G
G
matrix Rij the error was computed by:
0.810
0.0
0.2
0.4
0.6
0.8
1.0
Proportion of 1 included in the model
(b)
D
Fig. 5. Adding
newB data sources (a) or incorporating
Backbone
Recipe
matrix of
of A
A-B
more matrix
object-type-specific
constraints in ⇥1 (b) both
increase the accuracy of matrix factorization-based models for the gene function prediction task.

(Figure 5: cross-validated AUC and F1 scores of MF, MKL and DFMF for gene function prediction; panels (a)-(c).)

(Margin figure: a data set relating objects of type A to objects of type B, reconstructed as the product of fused latent matrices.)

Err_ij(v) = ||R_ij − G_i^(v) S_ij^(v) (G_j^(v))^T||² / d_F(R_ij, [R_ij]_k),   (14)

where G_i^(v), G_j^(v) and S_ij^(v) were the matrix factors obtained after v iterations of the factorization algorithm. In Eq. (14), d_F(R_ij, [R_ij]_k) = ||R_ij − U_k Σ_k V_k^T||² denotes the Frobenius distance between R_ij and its k-rank approximation given by the SVD, where k = max(k_i, k_j) is the approximation rank. Err_ij(v) is a pessimistic measure of quantitative accuracy because of the choice of k. This error measure is similar to the error of the two-factor non-negative matrix factorization from [17].

We studied the effect of including additional data sources on the accuracy of function prediction in Fig. 5a, where we started with only the target data source R12 and then added either R13 or Θ1 or both. Similar effects were observed when we studied other combinations of data sources (not shown here for brevity). Notice also that due to ensembling, the cross-validated variance of F1 is small.

5.3 Sensitivity to Inclusion of Constraints

We varied the sparseness of the gene constraint matrix Θ1 by holding out a random subset of protein-protein interactions. We set the entries of Θ1 that corresponded to held-out constraints to zero so that they did not affect the cost function during optimization. Fig. 5b shows that including additional information on genes in the form of constraints improves the predictive performance of DFMF for gene function prediction.

(Margin figure: a recipe for gene prioritization. Step 2: object profiling by chaining of latent matrices — chains i-ix start at object type A and pass through other object types, e.g. profiling of objects of type A in the latent space of C. Step 3: similarity estimation between seed genes and candidate genes. Step 4: gene ranking by aggregating similarity score matrices into scored candidate genes.)
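The error measure of Eq. (14) can be sketched in a few lines of NumPy. The function and variable names below are ours, for illustration only; the sanity check uses factors taken directly from the rank-k SVD, for which the measure attains its best value of 1.

```python
import numpy as np

def err_ij(R, G_i, S_ij, G_j, k_i, k_j):
    """Quantitative accuracy of a tri-factor reconstruction, Eq. (14):
    squared Frobenius error of G_i S_ij G_j^T, normalized by the distance
    of R to its best rank-k approximation (k = max(k_i, k_j)) from the SVD."""
    num = np.linalg.norm(R - G_i @ S_ij @ G_j.T) ** 2
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    k = max(k_i, k_j)
    R_k = U[:, :k] * s[:k] @ Vt[:k]   # best rank-k approximation of R
    den = np.linalg.norm(R - R_k) ** 2
    return num / den

# Sanity check: factors read off the rank-k SVD give Err = 1,
# the smallest value this (pessimistic) measure can reach.
rng = np.random.default_rng(0)
R = rng.standard_normal((20, 15))
k = 4
U, s, Vt = np.linalg.svd(R, full_matrices=False)
G_i, S_ij, G_j = U[:, :k], np.diag(s[:k]), Vt[:k].T
print(err_ij(R, G_i, S_ij, G_j, k, k))  # ≈ 1.0
```

Any factorization whose reconstruction is worse than the rank-k SVD optimum yields a value above 1.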
Data Fusion Tutorial
In drug toxicity prediction
the task was to distinguish
between compounds that
represent little or no health
concern and those with the
greatest likelihood to cause
adverse effects in humans
(CAMDA 2013). High-throughput and
toxicogenomic screening
coupled with a plethora of
circumstantial evidence
provide a challenge for
improved toxicity
prediction and require
appropriate computational
methods that integrate
various biological, chemical
and toxicological data.
Fusion of 29 data sets
allowed us to improve
prediction accuracy well
above that achieved by
standard supervised
approaches (Sys Biomed
2014).
[BC]2 Basel, June 9, 2015

(Figure: the data fusion graph for the CAMDA 2013 drug toxicity study. Object types include drug, drug type, DILI potential, GO term, sample metadata, hematology/biochemistry/liver weight measurements, and samples and genes from the rat in vivo single dose, rat in vivo repeated dose, rat in vitro and human in vitro studies. The types are linked by relation matrices R_i,j, with constraint matrices Θ_1,1, Θ_2,2, Θ_3,3, Θ_4,4, Θ_10,10 and Θ_13,13 on individual object types.)
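A fusion configuration of this kind can be encoded, for illustration, as a dictionary of relation matrices keyed by pairs of object types, plus square constraint matrices per type. All type names and dimensions below are hypothetical, not the study's actual values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical numbers of objects per type (the real study used 14 types).
n = {"drug": 50, "go_term": 200, "gene_rat_vitro": 120, "dili": 3}

# Relation matrices R[(i, j)] relate objects of type i to objects of type j.
R = {
    ("drug", "gene_rat_vitro"): rng.random((n["drug"], n["gene_rat_vitro"])),
    ("gene_rat_vitro", "go_term"): rng.random((n["gene_rat_vitro"], n["go_term"])),
    ("drug", "dili"): rng.random((n["drug"], n["dili"])),
}

# Constraint matrices Theta[i] are square, over objects of a single type.
Theta = {"drug": rng.random((n["drug"], n["drug"]))}

def check(R, Theta, n):
    """Verify that every matrix agrees with the per-type object counts."""
    for (i, j), M in R.items():
        assert M.shape == (n[i], n[j]), (i, j)
    for i, C in Theta.items():
        assert C.shape == (n[i], n[i]), i
    return True

print(check(R, Theta, n))  # True
```

The one invariant worth checking before fusion is exactly this: each object type must have a single agreed dimension across every relation and constraint matrix it appears in.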
Lesson 13: Related Work on Data Fusion
Lanckriet, Gert R.G., et al. A statistical framework for genomic data fusion.
Bioinformatics 20.16 (2004): 2626-2635. The first study to propose kernel-based
integration as an approach to intermediate data integration.
Schadt, Eric E., et al. An integrative genomics approach to infer causal associations
between gene expression and disease. Nature Genetics 37.7 (2005): 710-717. This study
integrated DNA variation and gene expression data to identify drivers of complex traits.
Aerts, Stein, et al. Gene prioritization through genomic data fusion. Nature
Biotechnology 24.5 (2006): 537-544. The paper describes Endeavour, a tool to prioritize
candidate genes underlying biological processes or diseases, based on their similarity to
known genes involved in these phenomena.
Mostafavi, Sara, et al. GeneMANIA: a real-time multiple association network
integration algorithm for predicting gene function. Genome Biology 9.Suppl 1 (2008):
S4. GeneMANIA is a tool that integrates multiple functional association networks and
predicts gene functions using label propagation.
Zitnik, Marinka, et al. Discovering disease-disease associations by fusing systems-level molecular data. Scientific Reports 3 (2013). A study of relationships between
diseases based on evidence from fusing available molecular interaction and ontology data.
Wang, Bo, et al. Similarity network fusion for aggregating data types on a genomic
scale. Nature Methods 11.3 (2014): 333-337. Fusion of cancer patient similarity networks
by combining mRNA expression, DNA methylation and microRNA expression data.
Zitnik, Marinka, and Zupan, Blaz. Matrix factorization-based data fusion for drug-induced liver injury prediction. Systems Biomedicine 2.1 (2014): 16-22. An application of
a data fusion approach for prediction of drug toxicity in humans using 29 data sets
provided by the CAMDA 2013 Challenge.
Ritchie, Marylyn D., et al. Methods of integrating data to uncover genotype-phenotype interactions. Nature Reviews Genetics 16.2 (2015): 85-97. This review explores emerging approaches for data integration, including multi-staged, meta-dimensional and factor analysis.
Zitnik, Marinka, and Zupan, Blaz. Data fusion by matrix factorization. IEEE
Transactions on Pattern Analysis and Machine Intelligence 37.1 (2015): 41-53. An
introduction and formalization of collective matrix factorization as presented in this
tutorial. The paper also provides the mathematical derivation of the optimization approach.
Lesson 14: Related Tools for Data Fusion
Lesson 15: Data Fusion in Python
We have developed a scripting library in Python, which implements collective
matrix factorization and completion, and is suitable for fusion of large data
compendia.
The official source code repository is at http://github.com/marinkaz/scikit-fusion.
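The library's own API is documented in the repository above. As a minimal stand-in, the core idea it implements — reconstructing a relation matrix as a product of latent factors, R ≈ G1 S12 G2^T — can be sketched in plain NumPy with alternating least-squares updates. This is an illustration of the technique under our own variable names, not scikit-fusion's interface:

```python
import numpy as np

rng = np.random.default_rng(0)

# A rank-3 relation matrix between 30 objects of type 1 and 20 of type 2.
R = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 20))

k1 = k2 = 3                      # factorization ranks for the two types
G1 = rng.random((30, k1))        # latent factors for type-1 objects
G2 = rng.random((20, k2))        # latent factors for type-2 objects
pinv = np.linalg.pinv

# Alternating least-squares updates for R ~ G1 @ S @ G2.T:
# each step solves exactly for one factor with the other two held fixed.
S = pinv(G1) @ R @ pinv(G2.T)
for _ in range(10):
    G1 = R @ pinv(S @ G2.T)
    G2 = (pinv(G1 @ S) @ R).T
    S = pinv(G1) @ R @ pinv(G2.T)

err = np.linalg.norm(R - G1 @ S @ G2.T)
print(err)  # close to zero: the rank-3 relation is recovered exactly
```

The fused setting generalizes this to many relation matrices sharing the per-type factors G_i, so that every data source constrains the same latent representation.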