
April 10th, 2015
Universidade de Lisboa
Book of Abstracts
Editors
Catarina Martins, Universidade de Lisboa
Daniela Oliveira, Universidade de Lisboa
Joana Barros, Universidade de Lisboa
Technical Data
Title: Book of Abstracts
Editors: Catarina Martins, Daniela Oliveira and Joana Barros
Year: 2015
Published in electronic format at: bod2015.ciencias.ulisboa.pt
Table of Contents
Preface
Lecture
  Metagenomics: closing the gap
Presentations
  A role for the carboxyl-terminal domain of RNA Pol II in pre-mRNA splicing
  Genome-wide mapping of RNA Polymerase II CTD modifications with single-nucleotide resolution
  Do Parasites Count the Time? RNA-sequencing the Circadian Transcriptome of T. brucei
  Detecting footprints of HIV-1 derived sncRNAs in small-RNA-seq data
  Recognition and Normalization of Biomedical Entities Based on Ontologies
Posters
  Compound Matching of Biomedical Ontologies
  Data Mining Analysis of Lung Cancer Electronic Health Records
  Detection of alternative miRNA Processing using the IsomIR_window
  Genetics of Familial Paget’s Disease of Bone
  Integrated in silico analysis of the neuronal transcriptome of SMA disease models: untangling the gene regulatory networks underlying motor neuron degeneration at the cell- and systems-level
  Mining cardiac side-effects for known drugs
  Modelling drug response for closely related CNS GPCR proteins
  Resist: An Intelligent System to Predict Antibiotic Resistance
  VetBDI – Development of an Integrated Database in Veterinary Medicine
  Visualizing the incoherencies in Bioportal
Preface
This Book of Abstracts refers to the 4th edition of the Bioinformatics Open Days, which took
place at the Faculty of Sciences of the University of Lisbon on the 10th of April 2015.
Bioinformatics Open Days is a student-led initiative first held at the Universidade do Minho,
Braga in 2012. It aims to promote the exchange of knowledge between students, teachers and
researchers from the Bioinformatics and Computational Biology fields. This symposium’s 4th
edition is a joint collaboration with the Bioinformatics and Computational Biology Master
students of Universidade de Lisboa.
The accepted submissions aimed to promote the bioinformatics field in our country through
short oral communications or posters. Each submission was reviewed by one or two members of our
scientific committee. We had 16 submissions (6 oral communications and 10 posters) and a
lecture by Ana Teresa Freitas from Instituto Superior Técnico. The abstract of this lecture is also
included in this book.
We would like to thank all who collaborated in organizing this event, especially Professor
Francisco Couto and Professor Cátia Pesquita for their guidance and support, our reviewers for
their effort and careful analysis of the submitted abstracts, and the Faculty of Sciences and its
Student Association for their help with several logistical matters. We would also like to
thank all of our guests for agreeing to be a part of this project and, finally, we appreciate your
interest in this initiative and hope you liked the program and the subjects discussed.
Lisbon, April 2015
Catarina Martins, Daniela Oliveira and BOD organizing committee
Lecture
Metagenomics: closing the gap
Ana Teresa Freitas
Microorganisms constitute the vast majority of life forms on Earth. The diversity of
microorganisms reveals an astonishing set of functions that are necessary
for all other life forms to exist. So it seems that microbes run the world.
Most microbiological research is focused on organisms that are cultured in the
laboratory, limiting our understanding of the true magnitude of diversity within
microbial communities. For example, the idea that amniotic fluid is sterile has been a
fundamental tenet in obstetrics since the early 1900s. Nowadays, it is known that
healthy amniotic fluid is not as sterile as previously thought. Findings like this one are
only possible due to metagenomics studies that offer a window on an enormous
unknown world of microorganisms.
Due to the presence of microbes in all walks of human life, there is a constant interaction
between microbes and humans. Understanding human-associated communities, known
as the human microbiome, is one of the major frontiers of research in human health. At this
stage, metagenomics is an instrumental tool to close the genotype-phenotype gap in
human disease.
Recent advances in DNA sequencing technologies have led to a huge cost reduction, making it
now possible and financially viable to sequence environmental samples. However,
metagenomics still presents many challenges that come mainly from data analysis. DNA
extracted from environmental samples is a mixture of genomes, which brings complexity
to the assembly of the reads in order to obtain the complete genes, operons and
especially individual genomes that make up the sample. In addition, the enormous
amount of information collectively generated by metagenomics approaches poses new
challenges for processing and data analysis. The petabyte and even exabyte scale of this
'big data' generation requires the introduction of advanced parallel computing and other
high-performance computing (HPC) approaches, such as cloud infrastructures, that
effectively enable exploitation of the data.
In recent years, several computational tools have been developed in order to support
the analysis of metagenomes to quantify community structure and diversity, assemble
novel genomes, identify new taxa and genes, and determine which metabolic pathways
are encoded in the community.
In this talk we will go through the Human Microbiome Project, discussing the role of
metagenomics in clinical diagnosis and presenting its computational challenges.
Presentations
A role for the carboxyl-terminal domain of RNA Pol II
in pre-mRNA splicing
Joana Tavares, Noélia Custódio and M. Carmo Fonseca
In mammals, protein-encoding transcripts are formed by short exons separated by much longer
introns, and it is still not fully understood how these short exons are
correctly juxtaposed for splicing. According to a recent model, the carboxyl-terminal domain
(CTD) of RNA polymerase II (RNA Pol II) serves as a platform that tethers the upstream exon until
it is ligated to the downstream exon in the nascent pre-mRNA. To test this model, we generated
cell lines with mutant versions of the CTD. The transcriptome (polyA+ RNA) of cells containing
either a wild-type CTD or a CTD with point mutations or deletions was analyzed by RNA-seq. Genes that
were either differentially expressed or differentially spliced were identified using two
bioinformatics tools, MISO and MATS. The results show that distinct types of CTD mutations
have specific effects on splicing profiles. We conclude that the composition of the CTD is critical
for determining splicing decisions.
Genome-wide mapping of RNA Polymerase II CTD
modifications with single-nucleotide resolution
Tomás Gomes, Ana Rita Grosso and Maria Carmo-Fonseca
RNA Polymerase II (Pol II) is responsible for the production of most RNA molecules by
transcribing the information encoded in the DNA. The RNA is not readily used by the cell, being
first subject to several processing events, such as 5' capping, splicing or 3' processing. There has
been increasing evidence that these mechanisms also act during transcription, not just after the
production of the full molecule. Pol II C-terminal domain (CTD) coordinates interactions between
these processes' machinery and the polymerase itself by being subject to various modifications
on its conserved repeats. However, the relationship between different modifications and
co-transcriptional processing events is not fully understood. To address this, the Proudfoot lab
developed a new protocol for Native Elongating Transcript sequencing (NET-seq) that targets
different human Pol II CTD isoforms, allowing them to be mapped in the genome with
single-nucleotide resolution (Nojima, Gomes et al., 2015).
Due to the novelty and specificities of this protocol, a workflow for analyzing these datasets was
not well established. In this work, such a pipeline was defined, including critical steps to deal with
partial adapter contamination and to obtain the Pol II single-nucleotide resolution mapping.
Repetitive sequences, in particular from snRNA and snoRNA, were highly represented in the
sequenced data, and had to be taken into consideration in further downstream analysis
regarding average gene profiles for Pol II CTD isoforms. These profiles showed previously
described Serine 2-phosphorylated Pol II pausing after the 3' end, as well as unphosphorylated
Pol II at the TSS. Surprisingly, Serine 5-phosphorylated Pol II was not seen to accumulate at the
TSS as expected.
An approach based on the NET-seq data was also developed to determine which exons were
spliced during transcription. Results show that co-transcriptional splicing is correlated with
Serine 5 phosphorylation in the CTD repeats, an isoform that was not associated with the
elongation phase of transcription previously.
Do Parasites Count the Time? RNA-sequencing the
Circadian Transcriptome of T. brucei
Filipa Rijo Ferreira, Daniel Pinto Neves, Joseph S. Takahashi and Luisa M. Figueiredo
Circadian rhythms are cyclic biological processes with a period of approximately 24 hours that
allow organisms to anticipate regular changes in their environment, such as the changes in light
and temperature from day to night. These rhythms have been observed across all kingdoms of
life, including mammals, bacteria, fungi and plants. Moreover, the divergent molecular
mechanisms underlying these processes suggest their independent evolution in different
kingdoms.
African Trypanosomiasis is a major neglected tropical disease that affects both humans and
animals, threatening over 60 million people in some regions of sub-Saharan Africa, with an
incidence of 7,000 new cases per year. It is caused by Trypanosoma brucei, a unicellular
protozoan parasite whose life-cycle is entirely extracellular both in the mammalian host and
insect vector. Upon the parasite’s invasion of the central nervous system, the disease causes an
array of neurological disorders, including the disruption of the host’s sleep cycle.
In this work we used established protocols and algorithms from the circadian rhythm field to
determine whether the parasite itself has an endogenous, entrainable circadian rhythm. We used
RNA-seq to obtain temporal transcriptome data of two life-cycle stages of the parasite in culture,
using two different entrainment stimuli to synchronize their circadian clock. We found that, under
free-running conditions (in the absence of external stimuli), approximately 12% of T. brucei genes are
expressed in a time-dependent cyclic manner with a period of approximately 24 hours.
Additionally, these genes cluster into two distinct phases of maximal expression. Functional
annotation of these genes indicates their involvement in a wide range of cellular processes.
Overall these results strongly suggest the existence of a molecular clock regulating gene
expression in T. brucei.
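To make "expressed in a cyclic manner with a period of approximately 24 hours" concrete, the following is a minimal sketch of a cosinor-style fit. It is not the algorithm used in this study (the abstract does not name one); the function name `cosinor_fit` is hypothetical, and the closed-form estimate assumes evenly spaced samples that cover a whole number of periods.

```python
import math

def cosinor_fit(times_h, values, period_h=24.0):
    """Fit value(t) ~ mesor + amplitude * cos(omega * (t - peak_time)).

    Assumption: samples are evenly spaced and span a whole number of
    periods, so the sine and cosine terms are orthogonal and the
    least-squares solution reduces to simple projections.
    """
    n = len(values)
    omega = 2 * math.pi / period_h
    mesor = sum(values) / n  # rhythm-adjusted mean
    a = 2 / n * sum(v * math.cos(omega * t) for t, v in zip(times_h, values))
    b = 2 / n * sum(v * math.sin(omega * t) for t, v in zip(times_h, values))
    amplitude = math.hypot(a, b)
    peak_time = (math.atan2(b, a) / omega) % period_h  # hour of maximal expression
    return mesor, amplitude, peak_time

# Synthetic gene sampled every 4 h for 48 h, peaking 6 h into the cycle.
times = list(range(0, 48, 4))
expression = [5 + 2 * math.cos(2 * math.pi * (t - 6) / 24) for t in times]
mesor, amplitude, peak_time = cosinor_fit(times, expression)
```

A gene would then be called cyclic only if the fitted amplitude is large relative to the residual noise, which requires a statistical test that this sketch leaves out.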
Detecting footprints of HIV-1 derived sncRNAs in
small-RNA-seq data
Andreia Amaral, Paula Matoso, Rui Soares, Russel Foxall, Ana Sousa and Margarida
Gama-Carvalho
It has become widely accepted that RNA viruses do not encode miRNAs, to avoid unproductive
cleavage of their genomes or mRNAs [1]. However, a recent study demonstrated that a retrovirus,
the bovine leukemia virus (BLV) [2], encodes a conserved cluster of miRNAs which are transcribed
by pol III, escaping mRNA cleavage because only subgenomic pol III transcripts are processed into
miRNAs. Similarly, we hypothesized that HIV-1 could potentially encode non-canonical
miRNA-like molecules that cannot be predicted using miRNA prediction algorithms or have not yet been
detected due to the lack of the appropriate experimental context. Therefore, in this study we
have sought to investigate whether HIV-1 derived small non-coding RNAs (sncRNAs) could be
observed in stimulated CD4+ T-cells infected with HIV-1 in conditions mimicking physiological
infection, using high-throughput sequencing.
As in previous studies, the reads with high homology to the HIV-1 genome corresponded to
0.11% of the total dataset. However, our data were obtained under physiological conditions of
infection, with only ~10% of cells infected, meaning that the proportion of HIV-1 encoded sncRNAs
in infected cells would be 1.1%, which is the highest proportion ever reported. We further
hypothesized that by modeling the distribution of reads along the HIV-1 genome we could identify
regions with an accumulation of reads higher than that expected if they were simply derived from
RNA breakdown products.
Using this genome-wide modeling approach we have identified 6 putative HIV-1 encoded
sncRNAs that are within the size range of effector miRNA molecules, which display potential RNA
hairpin structures.
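The step from 0.11% of all reads to 1.1% in infected cells is a simple rescaling, sketched below. It rests on two assumptions that the abstract does not state explicitly: uninfected cells contribute no HIV-1 reads, and infected and uninfected cells contribute equally to the sequenced small-RNA pool. The function name is hypothetical.

```python
def fraction_in_infected_cells(fraction_of_all_reads, fraction_of_cells_infected):
    """Rescale a read fraction measured on a mixed culture to infected cells only.

    Assumes uninfected cells yield no viral reads and that every cell
    contributes the same number of reads to the library.
    """
    return fraction_of_all_reads / fraction_of_cells_infected

# 0.11% of all reads, with about 10% of the cells infected.
estimate = fraction_in_infected_cells(0.0011, 0.10)
print(f"{estimate:.1%}")  # prints 1.1%
```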
Furthermore, in silico targeting analysis revealed that these HIV-1 encoded sncRNAs may
potentially target mRNAs involved in apoptosis, mRNA transport and mRNA export from the
nucleus. These results lead us to speculate that the virus could be using these endogenous
miRNA-like molecules to manipulate T cell differentiation and the export of RNAs from the nucleus,
in particular the export of the unspliced full-length viral RNA. The therapeutic implications of such
a mechanism are intriguing, because targeting these viral miRNAs could constitute an antiviral
therapy.
Recognition and Normalization of Biomedical Entities
Based on Ontologies
André Leal, Bruno Martins and Francisco Couto
Clinical notes in textual form occur frequently in Electronic Health Records (EHRs). They are
mainly used to describe treatment plans, symptoms, diagnostics, etc. Clinical notes are recorded
in narrative language without any structured form and, since each medical professional uses
different types of terminologies according to context and to their specialization, these notes are
very challenging to process because of their complexity, heterogeneity and contextual sensitivity.
Forcing medical professionals to enter the information in a predefined structure simplifies
its interpretation. However, imposing such a rigid structure not only increases the time
needed to record data but also introduces barriers to recording unusual cases. One possible
solution consists in applying text-mining techniques to the clinical texts, in order to
support the recognition and normalization of medical concepts. Together, these techniques can
enable correct and efficient information gathering by information systems.
We developed an automated system for recognizing medical concepts (i.e., mentions of
disorders) in clinical notes, which then also normalizes them to a UMLS concept unique
identifier (CUI). This system was developed with the intention of overcoming some challenges
presented by this task, such as the recognition of non-continuous entities and the normalization
of ambiguous entities.
For the recognition we use the novel SBIEON encoding, which contains a tag to specify words
inside recognized entities that are not part of them. We also explore non-annotated clinical
notes to generate a lower-dimensional representation of the word vocabulary, and therefore
reduce data sparsity. Conditional Random Field (CRF) models were generated based on the
mentioned features, among others such as a domain-specific lexicon, token shape, etc. For
normalization we use a rule-based approach to normalize the recognized entities and we also
take into consideration the information content of each entity for disambiguation. This system
was used to participate in SemEval 2015 Task 14, achieving second place in the competition.
For future work, we intend to explore semantic similarity between disorder mentions within
individual clinical notes, to improve normalization results. This approach is based on the
assumption that entities inside an individual clinical note should be related to one another.
Posters
Compound Matching of Biomedical Ontologies
Daniela Oliveira and Cátia Pesquita
Ontologies model the knowledge in a given domain using concepts, properties, and relations
and are particularly successful in the life sciences. There are several biomedical ontologies that
cover the same field or related fields and, to guarantee their interoperability, it is necessary to
establish meaningful relationships between them.
Ontology Matching techniques were developed to address this problem: they take
ontologies as input and determine a set of correspondences between semantically related
entities of those ontologies, creating an alignment, which enables the knowledge and data
expressed in the matched ontologies to interoperate.
Compound matching algorithms can find matches between class or property expressions
involving more than two ontologies, and thus can improve the integration of ontologies covering
related domains. We define a compound mapping as the correspondence between a class of
one source ontology and two classes of two different ontologies, which together are equivalent
to the source.
We are developing novel algorithms to establish compound mappings integrated into the
AgreementMakerLight (AML) ontology matching system. In a preliminary strategy, we use a
two-step approach based on lexical similarity that first aligns the source ontology with the first target
and then matches the unmatched words of the source ontology labels to the second target
ontology. We use a modified Jaccard index to calculate the confidence of the match, by
comparing each word in every class label of both ontologies. Finally, the algorithm has a
selection step, which selects the match with the highest similarity.
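The two-step lexical strategy can be sketched as follows. This is an illustration only: it uses the plain Jaccard index rather than the modified index mentioned above (whose details are not given here), it is not AML code, and the labels and function names are hypothetical.

```python
def tokens(label):
    """Lower-cased set of words in a class label."""
    return set(label.lower().split())

def jaccard(a, b):
    """Plain Jaccard index of two word sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def compound_match(source_label, first_target_labels, second_target_labels):
    """Two-step match: best label in the first target, then match the
    source words left unmatched against the second target."""
    source_words = tokens(source_label)
    # Step 1: select the first-target label with the highest similarity.
    first = max(first_target_labels, key=lambda t: jaccard(source_words, tokens(t)))
    # Step 2: only the words not covered by step 1 go to the second target.
    leftover = source_words - tokens(first)
    second = max(second_target_labels, key=lambda t: jaccard(leftover, tokens(t)))
    return (first, second,
            jaccard(source_words, tokens(first)),
            jaccard(leftover, tokens(second)))

# Hypothetical labels, chosen only to show the mechanics.
result = compound_match("aortic valve stenosis",
                        ["mitral valve", "aortic valve"],
                        ["inflammation", "stenosis"])
```

Here the source class maps to "aortic valve" in the first target (2 of 3 words shared) and the leftover word "stenosis" matches the second target exactly. A real system would also need a threshold below which no mapping is proposed.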
To evaluate our strategy we used a set of seven reference alignments automatically created by
inferring compound mappings from logical definitions in OBO ontologies. Preliminary results
using this evaluation approach show a low F-measure; however, a manual inspection of the top
mappings has revealed the incompleteness of the reference alignments. Future work will involve
the manual evaluation of a portion of the generated mappings, to improve the coverage of the
reference alignments, and the investigation of other algorithms suited to compound matching.
Data Mining Analysis of Lung Cancer Electronic Health
Records
Ana Silva, Cátia Pesquita, Lisete Sousa, Alexandra Mayer and Ana Miranda
Lung cancer has one of the highest incidence and mortality rates in both genders worldwide.
In Portugal, both rates are currently showing a growing trend. To improve our understanding of
Portuguese lung cancer patients and their characteristics, we are mining the data collected by
ROR-Sul – Registo Oncológico Regional do Sul. This registry brings together all public health
institutions in Lisboa e Vale do Tejo, Alentejo, Algarve and Região Autónoma da Madeira. Its
main task is to collect and process all records and to publish the results periodically.
Our selected data set covers 950 cases of lung cancer which occurred during the first half of
2013. We selected a set of demographic and cancer characteristic variables from lung cancer
patients. As a first step, we cleaned the data in R. We also created
new variables, e.g., the age-at-diagnosis group variable derived from the birth date and diagnosis date
variables.
We then conducted a spatial analysis based on demographic variables, using Local Indicators
of Spatial Association (LISA) derived from Moran’s I, as implemented in an R package. This analysis
allowed the identification of regions where the incidence rate differs from that of their neighbors. We
found that the regions of Beja, Aljustrel, Serpa, Ourique, Mértola and Castro Verde have a higher
incidence than the neighboring regions.
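As an illustration of what a local indicator of spatial association measures, here is a minimal sketch of the local Moran's I statistic with row-standardized neighbor weights. It is written in Python rather than R, it is not the package used in this work, and the rates and neighbor lists are made up.

```python
def local_morans_i(rates, neighbors):
    """Local Moran's I for each region.

    rates: incidence rate per region.
    neighbors: neighbors[i] lists the indices of the regions adjacent to
    region i (row-standardized weights are assumed, i.e. each neighbor of
    region i receives weight 1 / len(neighbors[i])).
    """
    n = len(rates)
    mean = sum(rates) / n
    z = [r - mean for r in rates]
    m2 = sum(d * d for d in z) / n  # variance of the rates
    if m2 == 0:
        raise ValueError("all rates are identical; Moran's I is undefined")
    scores = []
    for i, nbrs in enumerate(neighbors):
        # Spatial lag: mean deviation of the neighbors of region i.
        lag = sum(z[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        scores.append(z[i] * lag / m2)
    return scores

# Five hypothetical regions in a row; the middle one has a much higher rate.
rates = [1, 1, 10, 1, 1]
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3]]
scores = local_morans_i(rates, neighbors)
```

A negative score combined with an above-average rate marks a region whose incidence is higher than that of its neighbors; positive scores mark regions that resemble their neighbors. Assessing significance (for example by permutation) is omitted from this sketch.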
With this information we can plan the actions for prevention and early detection of lung cancer
more efficiently and with lower costs. This work is still ongoing, and in the following steps we
will explore the full breadth of available variables and other algorithms.
Detection of alternative miRNA Processing using the
IsomIR_window
José Gil Lopes, Laura Do Souto, Paula Matoso, Rui Soares, Russel Foxall, Ana Sousa,
Margarida Gama-Carvalho and Andreia J. Amaral
MicroRNAs (miRNAs) are small non-coding RNAs involved in post-transcriptional regulation of
gene expression. IsomiRs have been described as miRNA variants that differ from the canonical
miRNAs but derive from the same pre-miRNA, the precursor molecule. The most abundant
isomiRs are classified into three types: 5’, 3’ and internal isomiRs, which are the result
of differential processing of the pre-miRNA by Dicer or of RNA editing. We have developed a
Perl pipeline, the IsomIR_Window, which allows assessing the complexity of miRNA biogenesis
in Next Generation Sequencing (NGS) data. We show the analysis of the profile of small
non-coding RNAs in CD4+ T cells by NGS using the IsomiR_window. The study included two
datasets. The first dataset comprised two experimental conditions, naive and
activated CD4+ T cells, with no biological replicates. Each library was derived from an RNA pool
generated from cells collected from nine healthy donors. The second dataset included
small-RNA-seq libraries of activated CD4+ T cells obtained from healthy donors in three different
experimental conditions: non-infected (N=3), HIV-1 infected (N=2) and HIV-2 infected
(N=3). Each library was also derived from an RNA pool, this time from three individuals. Using as
input the sequences of small non-coding RNAs (sncRNAs) and their frequencies in the data, the
IsomIR_window performs an automated search for each sequence in a database of pre-miRNAs and
coordinates of canonical miRNAs. The IsomiR_window retrieves the ID of the pre-miRNA and
classifies the isomiR. Results from both datasets show that isomiRs were
two times more frequent than canonical miRNAs. Furthermore, the most abundant
isomiRs displayed an expression equal to or greater than that of the corresponding canonical
miRNAs. Finally, differentially expressed isomiRs, with significant fold changes and read
counts, were found between the naive and stimulated conditions, as well as when comparing
healthy with infected cells. These results show that isomiRs play an important role in T cells.
Although activation and infection both led to differential isomiR expression, the effect of
T-cell activation on differential miRNA processing seems to be stronger.
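The classification step described above can be illustrated with a small sketch. This is not the IsomIR_window implementation (which is a Perl pipeline whose internals are not described here); the function name, the `max_shift` tolerance, the single-mismatch rule for internal isomiRs and the precursor sequence are all assumptions made for the example.

```python
def classify_read(read, precursor, canon_start, canon_end, max_shift=3):
    """Classify a read against one pre-miRNA and its canonical miRNA.

    canon_start/canon_end are 0-based, end-exclusive coordinates of the
    canonical miRNA on the precursor. A read is a 5' and/or 3' isomiR if
    it matches the precursor exactly with ends shifted by at most
    max_shift nucleotides, and an internal isomiR if it has the canonical
    length with exactly one mismatch.
    """
    canonical = precursor[canon_start:canon_end]
    if read == canonical:
        return "canonical"
    for shift5 in range(-max_shift, max_shift + 1):
        for shift3 in range(-max_shift, max_shift + 1):
            start, end = canon_start + shift5, canon_end + shift3
            if 0 <= start < end <= len(precursor) and precursor[start:end] == read:
                if shift5 and shift3:
                    return "5' and 3' isomiR"
                return "5' isomiR" if shift5 else "3' isomiR"
    if len(read) == len(canonical):
        mismatches = sum(a != b for a, b in zip(read, canonical))
        if mismatches == 1:
            return "internal isomiR"
    return "unassigned"

# Made-up precursor; the canonical miRNA occupies positions 10 to 31.
precursor = "UGGAUCCGAUAGCUUAUCAGACUGAUGUUGACUGUCCAG"
label = classify_read(precursor[11:32], precursor, 10, 32)  # one nt shorter at the 5' end
```

In a pipeline, such a function would be applied to every distinct read sequence together with its count, so that isomiR and canonical read totals can be compared per miRNA.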
Genetics of Familial Paget’s Disease of Bone
Patrícia Santos, Inês Sousa, Vânia Francisco, Joana Xavier, José Patto, Filipe Barcelos
and Sofia Oliveira
Paget's disease of bone (PDB) is a systemic disease characterized by increased bone resorption
and formation, causing gradual destruction of parts of the skeleton and subsequent
reconstruction of a more fragile bone. PDB has an overall incidence of 2% in the population over
55 years. PDB is a complex disease with multiple genes implicated in its pathogenesis, but in its
monogenic form, only one gene (SQSTM1) has been linked to PDB.
To identify novel genes causing familial PDB, we performed whole exome sequencing (WES) in
six individuals from a Portuguese multiplex family composed of five PDB cases, two unaffected
individuals and one individual with unclear diagnosis. Given the uncertain diagnosis for one
family member, we conducted two analyses: model 1, in which this individual is considered
affected, and model 2, in which he is considered unaffected. DNA was captured using the SureSelect
Target Enrichment System kit and sequenced on a HiSeq 2000 (Illumina). We identified three
variants (c.C4786T (KIAA1875), c.C53T (NLRC3) and c.T566C (SRL)) in model 1 and one variant
(c.G180A (SERINC2)) in model 2 that were present in all affected and absent from the unaffected
individuals in the next-generation sequencing (NGS) data. Validation of these mutations by Sanger sequencing
in all family members revealed that all model 1 mutations were present in all individuals, while
the model 2 mutation was present in all family members except the individual with unclear
diagnosis. None of these variants were present in a second Portuguese PDB multiplex family.
In conclusion, our findings support the notion that bioinformatics analysis of NGS data is a
process requiring optimization. We found four novel variants which may cause PDB in this family
with an autosomal dominant pattern of inheritance and incomplete penetrance. Further studies
in other PDB families are warranted to determine the pathogenic potential of these
genes/variants.
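The segregation filter behind the two models (keep variants present in all affected and absent from all unaffected individuals) can be sketched as below. The individual IDs, variant names and function name are hypothetical placeholders, not data from this family.

```python
def segregating_variants(carriers_by_variant, affected, unaffected):
    """Variants carried by every affected and by no unaffected individual.

    carriers_by_variant: dict mapping a variant name to the set of
    individuals carrying it.
    """
    return sorted(
        variant
        for variant, carriers in carriers_by_variant.items()
        if affected <= carriers and not (carriers & unaffected)
    )

# Toy data: P* are PDB cases, H* are unaffected, X1 has an unclear diagnosis.
carriers = {
    "varA": {"P1", "P2", "P3", "X1"},
    "varB": {"P1", "P2", "P3"},
    "varC": {"P1", "P2", "H1"},
}
cases, healthy = {"P1", "P2", "P3"}, {"H1", "H2"}

model_1 = segregating_variants(carriers, cases | {"X1"}, healthy)  # X1 treated as affected
model_2 = segregating_variants(carriers, cases, healthy | {"X1"})  # X1 treated as unaffected
```

With these toy data model 1 keeps only varA and model 2 keeps only varB, which shows how strongly a single uncertain diagnosis can change the candidate list.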
Integrated in silico analysis of the neuronal
transcriptome of SMA disease models: untangling the
gene regulatory networks underlying motor neuron
degeneration at the cell- and systems-level
Hugo A F Santos, Andreia Amaral, Takakazu Yokokura, David Van Vactor and Margarida
Gama-Carvalho
Spinal Muscular Atrophy (SMA), a lethal inherited neurodegenerative disorder, is characterized
by low levels of the Survival of Motor Neuron (Smn) protein, which is essential for the assembly
of spliceosomal small nuclear ribonucleoproteins (snRNPs). Strikingly, low levels of this
ubiquitous protein mainly affect motor neurons (MNs), disrupting neuromuscular junctions
(NMJs) and leading to MN degeneration. Despite robust knowledge of SMA’s genetics, the exact
molecular mechanisms underlying the disease’s phenotype remain largely elusive, preventing
the development of rational therapeutics. One possibility is that low levels of Smn have a greater
impact on the expression and splicing of genes critical for MN function and survival, or that these
cells are intrinsically more sensitive to global changes in RNA processing. Alternatively, Smn may
be involved in MN-specific functions. Possibly both hypotheses are applicable. To address the
relevance of Smn-dependent changes in neuronal gene expression, we performed RNA-seq to
obtain an unbiased profile of the central nervous system transcriptome of a Drosophila
melanogaster SMA disease model. Upon SMN down-regulation we observe changes in exon
usage in a particular subset of genes crucial for neuronal development, viability and NMJ
function. This suggests that SMN-dependent changes in the splicing machinery do not have
widespread effects, affecting specific genes possibly due to the existence of certain features in
their sequence or structure. Interestingly, a large proportion of the identified genes with altered
splicing are known genetic modifiers of the NMJ phenotype in SMA fly models, thereby
supporting the biological relevance of our data. By further assessing the significance of the
associated cellular functions and pathways in which the identified genes are involved, we aim to
generate and test hypotheses regarding their potential contribution to the establishment of the
SMA phenotype.
Mining cardiac side-effects for known drugs
Joana Barros and André Falcão
Despite population growth, the drug research and development (R&D) process has remained
largely unchanged since 1960 and is not well suited to today's requirements. On average,
the probability of a compound passing all the R&D stages and achieving commercialization is
estimated at only 16%. In silico methods are commonly used as an alternative research method
in the drug development process to help reduce its cost and duration. One important and
unintentional drug target is the hERG protein. This ion channel is responsible for mediating the
rapidly activating component of the delayed rectifying potassium current in the heart (IKr),
making it an important component of normal heart function. Its inhibition can lead to fatal
cardiac arrhythmias, making it an important case study. It is estimated that between 40% and 70%
of all new drug-like molecules affect hERG. This research aims to develop a prediction tool to
identify, in the early development stages, potential drug candidates that inhibit the hERG
channel. Using Quantitative Structure-Activity Relationship (QSAR) methods we built
computational models to find a significant relation between molecular properties and the
compound bioactivity. These models were built using molecular descriptors from several public
chemical packages (e.g. RDKit, CDK and E-Dragon) as well as molecular fingerprints. Since it was
expected that not all molecular descriptors would be necessary to build the prediction model, we
experimented with various methods of variable reduction. Several methodologies were used for
variable selection, namely Principal Components Analysis, Elastic Nets, Random Forests and
Linear Regression. Of those, the last method provided the best and most reliable results, which
were then used in a stepwise process in a Support Vector Machine model. This approach was
applied to each set of variables individually and to different data set combinations. The
preliminary results obtained from simple cross-validation show that significant models using the
RDKit descriptor set coupled with fingerprints produced the best results, reaching
over 61% of explained variance. To further improve the prediction model we plan to implement
different molecular similarity descriptors, obtained using NAMS, and also develop a free online
prediction tool for public use.
Modelling drug response for closely related CNS GPCR
proteins
Vânia Ferreira and André Falcão
Drug development is a complex and expensive process and one of the hottest fields of modern
science. The complexity of the field and the high cost of screening new compounds led to the
development of in silico models for the identification of new pharmacologically active compounds.
Nonetheless, a new drug that has been predicted to be active is often prone to side effects,
as most molecules can bind to more than one target. The conservation of the molecules' binding
affinity across different receptors can be a starting point to understand which molecular
structural properties have a more relevant role in receptor binding. With this information, it is
possible to narrow the range of candidate molecules for drug discovery, making the
first steps of selecting new potential drugs for testing a faster and less expensive process.
In this work we analyzed bioactivity patterns for several molecules known to bind to
different CNS G-protein-coupled receptors (GPCRs), namely
different serotonin and dopamine receptors in Homo sapiens and Rattus norvegicus. We
used a Neighbor-Joining phylogenetic tree and a sequence identity matrix to first identify the
evolutionary relationships between all the receptors and then to select the most structurally
conserved pairs among them. Subsequently, the bioactivity values (Ki) for the binding
molecules of these receptors were collected from ChEMBL. In particular, we wanted to verify
how sequence similarity impacts drug response for closely related proteins, namely dopamine
and serotonin receptors. In total we tested 19 different receptors, compared pairwise, and
their binding affinities for the same molecules.
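The pairwise comparison described above can be sketched as follows, under stated assumptions:
sequences are already aligned with "-" as the gap character, Ki values are given per molecule
in a common unit, and affinities are compared on a log10 scale with a Pearson correlation. The
function names and the molecule identifiers are hypothetical, and the abstract does not state
which statistic the authors used.

```python
from math import log10, sqrt


def sequence_identity(a, b):
    """Fraction of aligned, gap-free columns in which both residues match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return sum(x == y for x, y in columns) / len(columns)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def shared_affinity_correlation(ki_a, ki_b, min_shared=3):
    """Correlate log10(Ki) over the molecules measured on both receptors.

    Returns None when too few molecules are shared to be meaningful.
    """
    shared = sorted(set(ki_a) & set(ki_b))
    if len(shared) < min_shared:
        return None
    return pearson([log10(ki_a[m]) for m in shared],
                   [log10(ki_b[m]) for m in shared])


# Made-up values for two hypothetical receptors (not data from ChEMBL).
receptor_a = {"mol1": 1.0, "mol2": 10.0, "mol3": 100.0}
receptor_b = {"mol1": 2.0, "mol2": 20.0, "mol3": 200.0, "mol4": 5.0}
print(sequence_identity("ACDE-F", "ACDQKF"))
print(shared_affinity_correlation(receptor_a, receptor_b))
```

Restricting the correlation to shared molecules mirrors the comparison of binding affinities
"for the same molecules"; the log transform is a common choice because Ki values span several
orders of magnitude.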
Contrary to expectations, we found no significant relationship between
sequence similarity and binding affinities. Nonetheless, several highly significant binding
relationships between different receptors emerged. These patterns are apparently related
neither to sequence similarity nor to the primary binding target of those receptors. In particular,
significant relationships were identified between dopamine and serotonin receptors (e.g. 5HT2c and D1).
Moreover, we searched for a bioactivity pattern between the same receptors from Homo
sapiens and Rattus norvegicus and a strong relation became evident between bioactivity levels
of their molecules.
Resist: An Intelligent System to Predict Antibiotic
Resistance
João Nascimento and Cátia Pesquita
The recent advances in technology and computing power and the expanding use of electronic
health records have opened new avenues of research that explore the information in these
records to improve healthcare, namely in diagnosis and
therapeutic prescriptions.
One increasingly relevant public health concern is antibiotic resistance. This phenomenon
happens when some sub-populations of a microorganism survive after exposure to antibiotics,
becoming more difficult to control. The World Health Organization has already stated that unless
the growing trend in antibiotic resistance is reduced, we are heading towards a post-antibiotic
era, where the death rate from common infections will rise due to the expected failure of standard
medical treatments.
This project's goal is to investigate whether it is possible to develop supervised learning models
that can classify patients by their antibiotic resistance risk using the information that
is usually collected at the clinical and laboratory level and stored in electronic health records in
Portuguese hospitals. We are interested in investigating the potential of variables such as time
of the year, geographical location and demographics as well as suspected infection location.
After pre-processing the data using data cleaning, standardization and transformation
techniques, we are now devising and applying machine learning based strategies to train a
model for antibiotic resistance prediction at the patient level.
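A minimal sketch of the kind of standardization and transformation step mentioned above,
assuming numeric variables are z-scored and categorical ones are one-hot encoded. The field
names, the records and the helper functions are hypothetical; the abstract does not specify
which techniques were applied.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a numeric column; a constant column maps to zeros."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return the sorted categories and one indicator vector per value."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


# Invented patient records (assumption): age, time of year, suspected site.
records = [
    {"age": 34, "month": "Jan", "infection_site": "urinary"},
    {"age": 71, "month": "Jul", "infection_site": "respiratory"},
    {"age": 58, "month": "Jan", "infection_site": "urinary"},
    {"age": 45, "month": "Dec", "infection_site": "skin"},
]

ages = standardize([r["age"] for r in records])
months, month_vectors = one_hot([r["month"] for r in records])
sites, site_vectors = one_hot([r["infection_site"] for r in records])

# One flat numeric feature vector per patient, ready for a supervised learner.
features = [[a] + m + s for a, m, s in zip(ages, month_vectors, site_vectors)]
print(features[0])
```

Encoding every variable numerically lets the same feature table be passed to different
supervised learning algorithms while the modelling strategy is still being devised.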
The ability to successfully predict antibiotic resistance risk can have a significant impact
worldwide, because it can help clinicians in selecting appropriate antibiotics. This can help
reduce antibiotic resistance levels, improve patient treatment, and ultimately decrease health
care costs.
VetBDI – Development of an Integrated Database in
Veterinary Medicine
Ricardo Faustino, Daniel Simões, Renata Neves, Daniel Teixeira, Fredy Pinheiro, Micael
Faustino and Liliana Marques
The importance of bioinformatics databases for clinical decision making has been steadily
increasing over the decades. In order to develop new biomarker and statistical correlation studies,
we are working to improve an integrated database in veterinary medicine using biochemistry
and medical imaging data.
However, imaging analyses depend directly on the level of expertise of veterinarians,
sonographers or other imaging specialists. This is a very important aspect because the results, in
some cases, are very subjective or unspecific. To address this problem we are using the statistical
correlation of results between biochemistry and medical imaging information (Pierroti M.P.- S.R.L
– X-Ray Unit).
Clinical chemistry data are decisive for evaluating altered organ function in animals. Blood
samples were taken and analysed for electrolytes, substrates, metabolites and enzymes. All data
were obtained using Mindray BC-2800Vet medical devices, frequently employed in most
veterinary clinics. We used samples of two different species, cats and dogs. The biochemical
parameters used are: GLU-PS, TP-PS, ALB-PS, GPT-PS, GOT-PS, AMYL-PS, BUN-PS, CRE-PS,
Lymph#, Mon#, Gran#, Lymph%, Mon%, Gran%, Eos%, and Histograms for WBC, RBC and PLT.
We also performed multiple quantifications of discrete imaging findings using a Java-based
image-processing program. An important component of the discovery, characterization,
validation and application of biomarkers is the extraction of information and meaning from
images through image processing and subsequent analysis. Associations between these changes
and disease state can be analysed using classifiers, like support vector machines (SVM).
In conclusion, we are creating an integrated database in veterinary medicine to develop new biomarkers.
Data obtained in this process will be cross-referenced with biochemistry and imaging data, which will
produce accurate, reproducible and feasible information over time.
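To make the idea of an integrated database concrete, the sketch below links biochemistry
results and imaging measurements through a shared animal identifier, using an in-memory SQLite
database. The schema, table names, finding name and all values are assumptions made for
illustration only; the abstract does not describe the actual VetBDI design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE animal (
    id INTEGER PRIMARY KEY,
    species TEXT NOT NULL
);
CREATE TABLE biochemistry (
    animal_id INTEGER NOT NULL REFERENCES animal(id),
    parameter TEXT NOT NULL,
    value REAL NOT NULL
);
CREATE TABLE imaging_finding (
    animal_id INTEGER NOT NULL REFERENCES animal(id),
    finding TEXT NOT NULL,
    measurement REAL NOT NULL
);
""")

# Placeholder rows (invented values, not measurements from the study).
conn.executemany("INSERT INTO animal VALUES (?, ?)", [(1, "dog"), (2, "cat")])
conn.executemany(
    "INSERT INTO biochemistry VALUES (?, ?, ?)",
    [(1, "GLU-PS", 5.1), (1, "CRE-PS", 1.0), (2, "GLU-PS", 4.8)],
)
conn.executemany(
    "INSERT INTO imaging_finding VALUES (?, ?, ?)",
    [(1, "lesion_area_mm2", 12.5), (2, "lesion_area_mm2", 3.0)],
)

# Pair every biochemical value with the imaging measurements of the same animal.
rows = conn.execute("""
    SELECT a.species, b.parameter, b.value, i.finding, i.measurement
    FROM animal AS a
    JOIN biochemistry AS b ON b.animal_id = a.id
    JOIN imaging_finding AS i ON i.animal_id = a.id
    ORDER BY a.id, b.parameter
""").fetchall()

for row in rows:
    print(row)
```

Keeping both data types keyed to the same animal is what allows paired values to be extracted
for correlation studies or passed to a classifier.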
Visualizing the incoherencies in BioPortal
Catarina Martins, Cátia Pesquita, Ernesto Jimenez-Ruiz and
Emanuel Santos
BioPortal is a web portal that provides access to a large number of biomedical ontologies, in
OBO format or OWL format, and to the mappings between them. The mappings are
automatically generated or added manually by experts. However, sometimes those mappings
are not compatible with each other and can lead to conflicts in the alignments due to erroneous
mappings or even incompatibilities between the ontologies. Therefore, it is important not only
to find the conflicts between the ontologies but also to find an intuitive way of identifying them.
In order to address this problem, the repair algorithms of two ontology matching tools,
AgreementMakerLight (AML) and LogMap, were applied to 19 pairs of ontologies from BioPortal,
revealing that 11 of the 19 had logical errors involving on average 22% of the mappings. A
visualization tool to identify the incoherencies between the ontologies would be
very helpful to the scientific community, allowing a more intuitive way of analyzing possibly
conflicting mappings.
We present the preliminary version of a web tool that supports the visualization of conflicting
mappings and their context. The backend of the tool includes a database with the necessary
ontology data and mappings, as well as the conflict sets data precomputed using AML and
LogMap. The frontend allows users to select a mapping from a list and then access the
information about the associated conflicts in two different formats: as a graph that shows
the conflicting mappings and the ontology axioms that are behind the conflict, or as a table that
lists the different sets of conflicting mappings.
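The lookup behind the table view can be sketched as follows, assuming each mapping is a pair of
class identifiers and each precomputed conflict set is a set of such mappings. The identifiers,
the data and the function names are hypothetical; the tool's actual data model and the output
formats of AML and LogMap are not described in this abstract.

```python
from collections import defaultdict


def index_conflicts(conflict_sets):
    """Map each mapping to the list of conflict sets it takes part in."""
    index = defaultdict(list)
    for conflict in conflict_sets:
        for mapping in conflict:
            index[mapping].append(conflict)
    return dict(index)


def conflicting_fraction(all_mappings, conflict_sets):
    """Share of mappings that appear in at least one conflict set."""
    involved = set().union(*conflict_sets) if conflict_sets else set()
    return len(involved & set(all_mappings)) / len(all_mappings)


# Invented mappings between two hypothetical ontologies A and B.
mappings = [
    ("A:heart", "B:heart"),
    ("A:heart", "B:cardiac_muscle"),
    ("A:valve", "B:heart_valve"),
    ("A:aorta", "B:aorta"),
    ("A:vein", "B:vein"),
]
conflict_sets = [
    frozenset({mappings[0], mappings[1]}),
    frozenset({mappings[0], mappings[2]}),
]

index = index_conflicts(conflict_sets)
selected = mappings[0]
for conflict in index.get(selected, []):
    print(sorted(conflict))  # one table row per conflict set
print(conflicting_fraction(mappings, conflict_sets))
```

Because the conflict sets are precomputed, selecting a mapping only requires a dictionary
lookup, which keeps the frontend responsive even for large alignments.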
Future work will include the extension of the tool to permit users to manually resolve conflicts
and export the repaired alignments. We will also directly link our tool to BioPortal, to support
access to all the ontologies and mappings it contains.