Metrabase
The Metabolism and Transport Database user manual v2014.01

Contents
1 Introduction
2 Metrabase content
2.1 Activities (interactions)
2.2 Proteins
2.3 Compounds
2.4 Data sources
3 The transporter substrate dataset
4 Usage

1 Introduction
Metrabase is a cheminformatics and bioinformatics
resource that contains manually curated structural,
physicochemical and biological data related to small
molecule transport and metabolism.
Metrabase offers structured and easily accessible data
on interactions between proteins and chemical
compounds, providing not only actions and measured
activities, but also chemical structural information,
tissue expression data and negative action types that
are essential in modelling activity. ‘Easily accessible’
refers primarily to computational processing. Many freely
available resources publish their data without an easy
way to process it computationally (e.g. they offer an
online search and browse facility, but no download).
We aim to construct a comprehensive, thoroughly
annotated and easy-to-use resource of high-quality
small molecule metabolism and transport information.
In particular, by covering the areas of biochemistry,
pharmacology and toxicology, we hope diverse
research communities will find Metrabase useful and
valuable.
2 Metrabase content
Metrabase version 1.0 contains curated data related to human transport and metabolism of chemical
compounds. Its primary content includes over 3000 small molecule substrates and modulators of transport
proteins and, to a smaller extent, cytochrome P450 enzymes (CYPs).
Proteins      20 transporters and 13 CYPs   20 transporters   13 CYPs
Compounds     3438                          3307              212
Interactions  11649                         11143             506
References    1211                          1177              36
The major focus of Metrabase v1.0 is on transport proteins: specifically, on their interactions with small
molecules that were experimentally found to be (or not to be) substrates.
[Figure: Metrabase 1.0 schema]
2.1 Activities (interactions)
The key information held in the ‘activities’ table of the database covers the interactions between proteins and chemical
compounds, indicating the compound action type as either substrate, non-substrate, inducer, non-inducer, repressor,
inhibitor, non-inhibitor, stimulator or binder (the ‘action_type’ field).
Protein activity (transport or catalysis)
Compound activity (affecting protein activity/expression)
Action types:
• substrate
• non-substrate
• inhibitor/repressor (negative modulators)
• stimulator/inducer (positive modulators)
• non-inhibitor/non-inducer (inactive compounds)
Action type was set to binder where it did not fall into any of these categories, but the molecule was found to bind to the protein.
key fields: cmpd_id – protein_id – ref_id – action_type – species
(However, in version 1.0 species = ‘human’ for all the records and so can be omitted.)
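The key fields can be illustrated with a small query sketch. The example below uses an in-memory SQLite table as a stand-in for the MySQL ‘activities’ table; the column types, the helper name compounds_by_action and all identifiers in the sample rows are assumptions made for illustration only, and real Metrabase records will differ.

```python
import sqlite3

# In-memory stand-in for the 'activities' table; only the key fields are modelled.
# Column types are an assumption, not taken from the Metrabase schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE activities (
           cmpd_id TEXT, protein_id TEXT, ref_id TEXT,
           action_type TEXT, species TEXT DEFAULT 'human')"""
)

# Hypothetical rows: every identifier below is made up for illustration.
rows = [
    ("cmpd_A", "prot_1", "ref_1", "substrate"),
    ("cmpd_A", "prot_1", "ref_2", "substrate"),
    ("cmpd_B", "prot_1", "ref_1", "non-substrate"),
    ("cmpd_C", "prot_1", "ref_3", "inhibitor"),
]
conn.executemany(
    "INSERT INTO activities (cmpd_id, protein_id, ref_id, action_type) "
    "VALUES (?, ?, ?, ?)",
    rows,
)


def compounds_by_action(conn, protein_id, action_types):
    """Return the distinct cmpd_ids of one protein for the given action types."""
    placeholders = ", ".join("?" for _ in action_types)
    query = (
        "SELECT DISTINCT cmpd_id FROM activities "
        f"WHERE protein_id = ? AND action_type IN ({placeholders}) "
        "ORDER BY cmpd_id"
    )
    return [row[0] for row in conn.execute(query, [protein_id, *action_types])]


substrates = compounds_by_action(conn, "prot_1", ["substrate"])
negatives = compounds_by_action(conn, "prot_1", ["non-substrate", "non-inhibitor"])
print(substrates, negatives)
```

Because the same compound-protein pair may be reported in several references, the sketch selects distinct compound IDs rather than counting rows.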
Compounds were categorised as substrates or non-substrates according to the results presented in the publication
providing the data point and no further evaluation was carried out on our side.
Care must be taken with respect to the current status of the inhibition records, since depending on the
measurement threshold (e.g. percentage inhibition) some of the inhibitors can be regarded as non-inhibitors and
vice versa. A proper classification of compounds as either inhibitors or non-inhibitors is planned for subsequent
releases of the database.
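The dependence on the measurement threshold can be shown with a short sketch. The function name label_inhibition, the cut-off values and the percentage-inhibition figures are all hypothetical; Metrabase v1.0 does not prescribe a threshold.

```python
def label_inhibition(percent_inhibition, threshold):
    """Label a measurement 'inhibitor' if it reaches the threshold, else 'non-inhibitor'."""
    return "inhibitor" if percent_inhibition >= threshold else "non-inhibitor"


# Hypothetical percentage-inhibition measurements for three compounds.
measurements = {"cmpd_A": 15.0, "cmpd_B": 35.0, "cmpd_C": 80.0}

# The same measurements classified under two assumed cut-offs.
for threshold in (20.0, 50.0):
    labels = {c: label_inhibition(v, threshold) for c, v in measurements.items()}
    print(threshold, labels)
```

In this sketch cmpd_B changes class when the cut-off moves from 20% to 50%, which is the situation the caution above describes.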
Other ‘activities’ fields holding additional extracted data and annotations, such as assay descriptions, relevant
experimental measurements, cell systems, compound concentrations and the substrates used in inhibition assays,
may have only been partially completed in this release. This is partly because most of the reviews do not
include assay information.
The ‘published_label’ field contains chemical names, abbreviations or designations employed in publications to
label compounds. This field has been completed for all except records linked to the external datasets and can be
used to easily identify compounds in their respective publications.
Activity types were mostly accepted as found in the publications and therefore they may be overlapping.
Consequently, selecting all the activity types relevant for one’s search is recommended.
2.2 Proteins
The proteins contained in Metrabase are categorised as either transporters or enzymes (the ‘protein_type’ field)
and are provided with the HUGO Gene Nomenclature Committee (HGNC) approved symbols and names
(www.genenames.org) as well as UniProt IDs. Protein sequences for the indicated isoforms were included from
UniProt (www.uniprot.org). Other fields include additional information, such as Gene, RefSeq and Ensembl IDs
and TC (Transporter Classification) or EC (Enzyme Commission) numbers.
Metrabase also contains information about protein expression levels across healthy human tissues. Part of this data
is based on immunohistochemistry using tissue microarrays (gene, tissue, cell type, level, expression type and
reliability) and comes from the normal_tissue.csv file of the Human Protein Atlas (HPA) v9.0
(www.proteinatlas.org).
All other expression records contain data that was extracted from the literature. For these non-HPA records
(i.e. where ‘ref_id’ is not null), the level of expression (mRNA and/or protein level) takes one of the following values:
expressed (if the level had not been specified), none, none-low, low, low-medium, medium, medium-high and high.
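For analysis it can help to treat these levels as an ordinal scale. The sketch below assumes that the levels increase in the order in which they are listed above and that ‘expressed’ carries no rank; both the ordering and the helper names level_rank and max_level are assumptions, not part of the database.

```python
# Literature-derived expression levels, assumed to be listed in ascending order.
LEVELS = ["none", "none-low", "low", "low-medium", "medium", "medium-high", "high"]
RANK = {level: i for i, level in enumerate(LEVELS)}


def level_rank(level):
    """Return the ordinal rank of a level, or None for 'expressed' (level unspecified)."""
    return RANK.get(level)


def max_level(levels):
    """Return the highest ranked level among the inputs, ignoring unranked values."""
    ranked = [lv for lv in levels if level_rank(lv) is not None]
    return max(ranked, key=level_rank) if ranked else None


print(max_level(["low", "expressed", "medium-high"]))
```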
2.3 Compounds
The total number of records in the ‘compounds’ table is 3562, but the number of compounds with recorded
interaction data (for transporters and enzymes combined) is 3438. The remaining compounds are used in other tables,
such as ‘cmpd_variants’, which holds stereoisomers, multi-component structures and different forms of a
compound.
Molecular structures are available in MDL molfile format and as absolute (unique and isomeric) SMILES strings
(in Kekulé form). They were mostly verified using the ChemSpider (www.chemspider.com) and/or SciFinder
(www.cas.org/products/scifinder) databases. The standard InChI and InChIKey strings were generated using
v1.04 of the InChI software (http://www.inchi-trust.org).
The great majority of the compounds are small organic molecules and all the other types (coordination complexes,
inorganic compounds, metalloid-containing compounds, selenium-containing compounds and polymers) are listed
in the ‘compound_types’ table. This table also contains the DrugBank types of drugs (approved, experimental,
illicit, investigational, nutraceutical and withdrawn) taken from DrugBank v3.0 (www.drugbank.ca) and can easily
be improved by annotating compounds further, for example, as natural products including their subtypes (e.g.
natural product: terpene: sesquiterpene).
The ‘properties’ table contains selected molecular properties that were calculated/predicted for all (molecular
mass) or just the small organic single-component structures (constitutional descriptors: atom and bond counts,
hydrogen bond donor and acceptor counts, ring count and rotatable bond count; log P and log D) using
ChemAxon’s Calculator (cxcalc) v6.1.3 (www.chemaxon.com). Experimental properties are not currently
provided, i.e. ‘properties.type’ = ‘c’ for all records (where ‘c’ stands for ‘calculated’). The multi-component
structures can easily be identified using the ‘compounds.fragment_count’ field and their single-component
counterparts using the ‘cmpd_variants’ table.
The ‘synonyms’ table contains chemical names of Metrabase compounds (systematic, semi-systematic, common,
trade names, abbreviations, codes). One of the synonyms was selected as the main name (the
‘compounds.cmpd_name’ field) for each compound. Chemical names were obtained mostly from DrugBank
(these might refer to compound variants as well) and SciFinder. The systematic (IUPAC) names were computer
generated using ChemAxon’s IUPAC Naming Plugin v6.1.3 (the ‘compounds.iupac_name’ field).
The ‘cmpd_ids’ table contains external compound IDs. Most of the compounds have ChemSpider IDs (CSIDs);
a CAS Registry Number (CASRN) was provided only where no CSID could be found (CAS Registry Number is a
Registered Trademark of the American Chemical Society). DrugBank IDs are also included where identified
(especially for the approved drugs).
The MBCD number is the compound identifier in Metrabase (the MBID for compounds), e.g. mbcd0027084.
cmpd_id: mbcd0027084 (CSID:14034)
cmpd_name: Ethidium bromide
smiles: [Br-].CC[N+]1=C(C2=CC=CC=C2)C2=CC(N)=CC=C2C2=CC=C(N)C=C12
std_inchi: 1S/C21H19N3.BrH/c1-2-24-20-13-16(23)9-11-18(20)17-10-8-15(22)12-19(17)21(24)14-6-4-3-5-7-14;/h3-13,23H,2,22H2,1H3;1H
std_inchikey: ZMMJGEGLRURXTF-UHFFFAOYSA-N
iupac_name: 3,8-diamino-5-ethyl-6-phenylphenanthridin-5-ium bromide
formula_dot: C21H20N3.Br
fragment_count: 2
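As a rough consistency check on a record like the one above, the component count can be compared with the number of dot-separated parts in the SMILES and dot-formula strings. This is only a heuristic sketch: it assumes that ‘fragment_count’ equals the number of disconnected components, and the helper names are hypothetical. Counting dots can also miscount SMILES strings in which ring-closure digits join dot-separated parts, so a cheminformatics toolkit is preferable for real validation.

```python
# Fields copied from the example record above.
record = {
    "cmpd_id": "mbcd0027084",
    "smiles": "[Br-].CC[N+]1=C(C2=CC=CC=C2)C2=CC(N)=CC=C2C2=CC=C(N)C=C12",
    "formula_dot": "C21H20N3.Br",
    "fragment_count": 2,
}


def count_components(dotted):
    """Count the dot-separated parts of a SMILES or dot-formula string (heuristic)."""
    return len(dotted.split("."))


def is_multi_component(rec):
    """Treat a record as multi-component when fragment_count exceeds 1 (assumed meaning)."""
    return rec["fragment_count"] > 1


smiles_parts = count_components(record["smiles"])
formula_parts = count_components(record["formula_dot"])
print(smiles_parts, formula_parts, is_multi_component(record))
```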
2.4 Data sources
The ‘datasources’ table contains the sources of data in the database, including information about
software that was used to calculate molecular properties. The ‘datasource_id’ and ‘datasource_version’
fields indicate the source of all Metrabase records where applicable.
The ‘refs’ table contains the publications’ citation information (bibliographic fields) and links. Most of
them (91%) are original peer-reviewed research articles and 7% are reviews; the aim remains to link all
Metrabase records to primary literature sources. PubMed IDs are provided where available, as
well as DOIs (if a DOI was not available, a URL is given instead in the ‘doi_url’ field).
To resolve a DOI, prepend http://dx.doi.org/ to it, e.g. http://dx.doi.org/10.1021/ac0354342.
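A minimal sketch of turning a ‘doi_url’ value into a link follows. It assumes the field holds either a bare DOI or a full URL, as described above; the function name doi_url_to_link and the second input value are hypothetical.

```python
DOI_RESOLVER = "http://dx.doi.org/"


def doi_url_to_link(doi_url):
    """Return a link for a 'doi_url' value that is either a bare DOI or a full URL."""
    value = doi_url.strip()
    if value.lower().startswith(("http://", "https://")):
        return value  # already a URL, as used where no DOI was available
    return DOI_RESOLVER + value


print(doi_url_to_link("10.1021/ac0354342"))
print(doi_url_to_link("http://example.org/article"))  # hypothetical URL entry
```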
3 The transporter substrate dataset
We aim to provide a version of the transporter substrate dataset (MBTPsubDS) as a supplement to each
Metrabase release. Each MBTPsubDS version contains interactions between small molecules and
transporters, and includes all the unique substrate and non-substrate records obtained from Metrabase and
processed to facilitate human transporter data analysis and predictive modelling (by 'unique' we mean the
unique (cmpd_id, protein_id, action_type) tuples).
MBTPsubDS1_0: based on Metrabase v1.0; all the interactions involving conflicting action types (where a
compound was found to be both a substrate and a non-substrate of a single transporter)
were excluded
MBTPsubDS1_0a: some of the conflicting action types were resolved upon our evaluation of such records and
the corresponding compound-transporter pairs were added to MBTPsubDS1 where we
thought we could consider the compound as either a substrate or a non-substrate
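The uniqueness and conflict rules described above can be sketched as follows. This is an illustration of the stated rules, not the procedure actually used to build MBTPsubDS; the function name build_substrate_dataset and all identifiers in the sample records are hypothetical.

```python
from collections import defaultdict


def build_substrate_dataset(records):
    """Keep unique substrate/non-substrate tuples and drop conflicting pairs.

    records: iterable of (cmpd_id, protein_id, action_type) tuples.
    """
    # Unique (cmpd_id, protein_id, action_type) tuples of the two relevant types.
    unique = {r for r in records if r[2] in ("substrate", "non-substrate")}

    # Collect the action types seen for each compound-transporter pair.
    actions = defaultdict(set)
    for cmpd_id, protein_id, action_type in unique:
        actions[(cmpd_id, protein_id)].add(action_type)

    # A pair reported as both substrate and non-substrate is excluded entirely.
    return sorted(t for t in unique if len(actions[(t[0], t[1])]) == 1)


records = [
    ("cmpd_A", "prot_1", "substrate"),
    ("cmpd_A", "prot_1", "substrate"),      # duplicate, e.g. from a second reference
    ("cmpd_B", "prot_1", "substrate"),
    ("cmpd_B", "prot_1", "non-substrate"),  # conflict: both records are dropped
    ("cmpd_C", "prot_1", "inhibitor"),      # not a substrate/non-substrate record
]
dataset = build_substrate_dataset(records)
print(dataset)
```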
4 Usage
Web interface
Search by protein
Search by compound
Expression data
Protein list
Download
Local MySQL database
To load Metrabase from a dump file (metrabase1_0.sql), you
should first create a database on your system and then load
the dump file, for example like this:
# tar -xzvf metrabase1_0.tar.gz
# mysql -u username -p
mysql> CREATE DATABASE metrabase;
# mysql -u username -p metrabase < metrabase1_0.sql
MySQL Workbench can be used as an interface for MySQL.

Credits
• Metrabase was developed by Lora Mak in collaboration with David Marcus, Andreas Bender and
Robert C. Glen at the Unilever Centre for Molecular Sciences Informatics and Galina Yarova, Guus
Duchateau and Werner Klaffke at Unilever, with the much appreciated help from the following (at
the time) 2nd and 3rd year undergraduate students of the University of Cambridge: Claire Dickson,
Joseph Dixon, Ivan Lam, Richard Lewis, Callum Picken, Claudia Pop, Heyao Shi, Emma Stirk,
Yasmin Surani, Paddy Szeto, Nathaniel Wand, Julian Willis and Jing Xiangyi.
• Metrabase's web interface was developed by Andrew Howlett at the Unilever Centre for Molecular
Sciences Informatics. Andrew also designed the Metrabase logo.
• Metrabase was realised and is being maintained in the Glen group.
Licensing
Metrabase is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
(http://creativecommons.org/licenses/by-sa/4.0/)
However, with respect to the integrated data, such as the TP-Search, ChEMBL and Human Protein Atlas
records that are distributed as part of Metrabase, the user is referred to each external data source regarding
their respective licensing. This means that the integrated data retains the licensing of the original data
sources. The TP-Search and ChEMBL records may have been modified and augmented, while the Human
Protein Atlas records were included unmodified.
Attribution
We hope you find our database and the associated datasets useful. If you use it, please acknowledge:
Metrabase v1.0, University of Cambridge, http://www-metrabase.ch.cam.ac.uk
Metrabase - http://www-metrabase.ch.cam.ac.uk
Contact: [email protected]
Unilever Centre for Molecular Sciences Informatics
Department of Chemistry, University of Cambridge
Lensfield Road, Cambridge, CB2 1EW, UK
This document was prepared by Dr Lora Mak and reviewed by Prof Robert C. Glen.
© 2014 Metrabase Development Team, University of Cambridge. All rights reserved.