BioScience on the TeraGrid

Daniel S. Katz
[email protected]
Director of Science, TeraGrid GIG
Senior Fellow, Computation Institute,
University of Chicago & Argonne National Laboratory
Affiliate Faculty, Center for Computation & Technology, LSU
Adjunct Associate Professor, Electrical and Computer Engineering, LSU
What is the TeraGrid
• World’s largest distributed cyberinfrastructure for open scientific research,
supported by US NSF
• Integrated high performance computers (>2 PF HPC & >27000 HTC CPUs),
data resources (>2 PB disk, >60 PB tape, data collections), visualization,
experimental facilities (VMs, GPUs, FPGAs), network at 11 Resource Provider
sites
• Allocated to US researchers and their collaborators through national peer-review
process
• DEEP: provide powerful computational resources to enable research that can’t
otherwise be accomplished
• WIDE: grow the community of computational science and make the resources
easily accessible
• OPEN: connect with new resources and institutions
• Integration: Single {portal, sign-on, help desk, allocations process, advanced
user support, EOT, campus champions}
http://www.teragrid.org/
Governance
• 11 Resource Providers (RPs) funded under separate agreements with NSF
– Different start and end dates
– Different goals
– Different agreements
– Different funding models
• 1 Coordinating Body – Grid Infrastructure Group (GIG)
– University of Chicago/Argonne National Laboratory
– Subcontracts to all RPs and six other universities
– 7-8 Area Directors
– Working groups with members from many RPs
• TeraGrid Forum with Chair
Who Uses TeraGrid (2009)
(2008)
How TeraGrid Is Used
Use Modality                                     Community Size (rough est., number of users)
Batch Computing on Individual Resources          850
Exploratory and Application Porting              650
Workflow, Ensemble, and Parameter Sweep          250
Science Gateway Access                           500
Remote Interactive Steering and Visualization    35
Tightly-Coupled Distributed Computation          10
(2006 data)
How One Uses TeraGrid
[Diagram. Access paths: POPS (for now), User Portal, Science Gateways, Command Line. Middle layer: TeraGrid Infrastructure (Accounting, Network, Authorization, …). Resource Providers: RP 1, RP 2, RP 3, offering Compute Service, Viz Service, Data Service.]
User Portal: portal.teragrid.org
http://portal.teragrid.org/
Science Gateways
• A natural extension of Internet & Web 2.0
• Idea resonates with Scientists
– Researchers can imagine scientific capabilities provided through
familiar interface
• Mostly a web portal, or a web or client-server program
• Designed by communities; provide interfaces understood by
those communities
– Also provide access to greater capabilities (back end)
– Without the user needing to understand the details of those capabilities
– Scientists know they can undertake more complex analyses and that’s
all they want to focus on
– TeraGrid provides tools to help developers
• Seamless access doesn’t come for free
– Hinges on a very capable developer
Nancy Wilkins-Diehr
TeraGrid -> XD Future
• Current RP agreements end in March 2011
– Except track 2 centers (current and future)
• TeraGrid XD (eXtreme Digital) starts in April 2011
– Era of potential interoperation with OSG and others
– New types of science applications?
• Current TG GIG continues through July 2011
– Allows four months of overlap in coordination
– Probable overlap between GIG and XD members
• Blue Waters (track 1) production in 2011
Grid Enabled Neurosurgical Imaging Using Simulation (GENIUS)
Model large-scale patient-specific cerebral blood flow on a clinically relevant time scale
• Provide simulation support within the operating theatre for neuroradiologists
• Provide new information to surgeons for patient management and therapy:
1. Diagnosis and risk assessment
2. Predictive simulation in therapy
• Provide patient-specific information to help plan
embolisation of arterio-venous malformations,
coiling of aneurysms, etc.
Clinical workflow:
• Book computing resources in advance or use preemption
• Shift imaging data around quickly over high-bandwidth, low-latency dedicated links
• Interactive simulations and real-time visualization for immediate feedback
Peter Coveney, University College London
OLSGW Gadgets
• OLSGW integrates bio-informatics applications
• BLAST, InterProScan, CLUSTALW, MUSCLE, PSIPRED, ACCPRO, VSL2
• 454 Pyrosequencing service under development
• Four OLSGW gadgets have been published in the iGoogle gadget directory. Search for “TeraGrid Life Science”.
Wenjun Wu, Thomas Uram, Michael Papka, ANL
Multiscale Simulation of Arterial Tree
Arterioles/venules 50 microns
activated platelets
Platelet diameter is 2-4 µm
Normal platelet concentration in blood is 300,000/mm³
Functions: activation, adhesion to injured walls and to other platelets
Need to combine multi-scale models: 1D (arteries), 3D Navier Stokes
(organs, arterial junctions, etc.), Dissipative Particle Dynamics
(capillaries, venules, arterioles, blood cells, etc.), Molecular Dynamics
(blood cells, platelets, molecular adhesion, etc.)
NIH/NSF-IMAG project: George Em Karniadakis, Brown
Expressed Sequence Tag (EST) Pipeline
• ESTs are a collection of random cDNA sequences, obtained from a cDNA library or from sequencing devices
– Typical inputs are O(million) sequences
– Newer 454 devices produce higher volumes and are relatively easy to obtain and operate
– Stored using FASTA format
• ESTs are clustered and assembled to form contigs
• Contigs are then used to identify potential unknown genes by running BLAST against a known protein database
• Goal: Use TeraGrid for backend computing, with existing software, and a gateway frontend
Pipeline stages:
• RepeatMasker: cleaning sequences
– Serial execution on split input, e.g., 1000 jobs for 2 million sequences
• PaCE: clustering
– 1 MPI job, runtime of several hours
– Exponential growth in time with growth in input data; scales well
• CAP3: assembly
– Serial runs on clusters generated by PaCE; clusters can be combined
– Varied sizes with varied resource requirements (run times: ms to days)
• BLAST: identification
– Serial; takes CAP3 results. Number of jobs controlled by adjusting number of sequences per job.
Initial results: a run that took 5 days on a local cluster finished in 2 days; more optimization is underway
A. Kulshrestha, S. L. Pallickara, K. N. Muthuram, C. Kong, Q. Dong, M. Pierce, H. Tang, IU
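The "serial execution on split input" step can be illustrated with a short sketch. This is a minimal, hypothetical Python example, not the gateway's actual code: the function names (`read_fasta`, `split_records`, `jobs_needed`) are invented for illustration, and the 2000-sequences-per-job figure is simply what the slide's example implies (2 million sequences over 1000 jobs).

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', concatenated sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_records(records: Iterable[Record], per_job: int) -> Iterator[List[Record]]:
    """Group records into batches of at most per_job sequences, one batch per serial job."""
    if per_job < 1:
        raise ValueError("per_job must be at least 1")
    batch: List[Record] = []
    for rec in records:
        batch.append(rec)
        if len(batch) == per_job:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


def jobs_needed(n_sequences: int, per_job: int) -> int:
    """Number of serial jobs for n_sequences at per_job sequences each (ceiling division)."""
    return -(-n_sequences // per_job)
```

With these assumptions, `jobs_needed(2_000_000, 2000)` reproduces the 1000 jobs quoted on the slide; the same batching idea applies to the BLAST stage, where the number of jobs is tuned through the number of sequences per job.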
Multiscale Computer Simulation of the Immature HIV-1 Virion
Experimental structures
Coarse-grained (CG) model development
CG simulation
Wright, Schooler, Ding, Kieffer, Fillmore, Sundquist, Jensen, EMBO J., 26, 2218, 2007
CG model refinement
Atomic-level simulation
Key CG interactions
New CG interactions from MD
An iterative modeling approach combining experimental imaging (cryo-electron
tomography), coarse-grained (CG) simulation, and atomic-level molecular dynamics (MD)
G. A. Voth, U. of Chicago
CIPRES Portal: A New Science Gateway for Systematics
• Systematics: study of diversification of life and relationships
among living things through time
• CIPRES: a flexible web application that can be sustained by the
community at minimal cost even beyond the funding period of
the project
• Tools include parallel versions of MrBayes, RAxML, GARLI
• User requirements include:
– Access to most or all native command line options
– Add new tools quickly
– Provide personal user space for storing results
– Use TeraGrid resources to quickly provide results
• Cited in at least 35 publications, including Nature, PNAS, Cell
– Examples: New Family Tree for Arthropoda, Genome Sequence of a
Transitional Eukaryote, Co-evolution of Beetles and Flowering Plants
• Used routinely in at least 5 undergraduate classes
• Usage: 77% US (incl. 17 EPSCoR states), 23% from 33 other countries
Mark Miller, SDSC
Patient-Specific HIV Drug Therapy
HIV-1 Protease is a common target for HIV drug therapy
• Enzyme of HIV responsible for protein maturation
• Target for anti-retroviral inhibitors
• Example of structure-assisted drug design
• 9 FDA-approved inhibitors of HIV-1 protease
So what’s the problem?
• Emergence of drug-resistant mutations in protease
• These render drugs ineffective
• Drug-resistant mutants have emerged for all FDA-approved inhibitors
• Too many mutations to be interpreted by a clinician
Solution: build a Binding Affinity Calculator (BAC)
• Provide tools that allow simulations to be used in clinical context, including lightweight client
– User only needs to specify enzyme, mutations relative to wildtype, drug
• Other options can be specified but start at their default values
• Requires large number of simulations to be constructed and run automatically (across
distributed HPC resources)
– To investigate generalisation
– Automation is critical for clinical use
• Turn-around time scale of around a week is required
• Trade off between accuracy and time-to-solution
Initial results – ensemble MD calculations for lopinavir vs wildtype & five mutants –
appear promising; excellent relative ranking in binding free energies
Peter Coveney, University College London
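To make the "large number of simulations constructed automatically" point concrete, here is a minimal Python sketch of how an ensemble could be enumerated from the three user inputs (enzyme, variants, drug). It is not the BAC's actual interface: the `SimulationSpec` type, the `build_ensemble` function, the `mutant_1` … `mutant_5` labels, and the replica count are all hypothetical placeholders, since the slide names neither the five mutants nor the ensemble size.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence


@dataclass(frozen=True)
class SimulationSpec:
    """One MD simulation to be constructed and submitted (hypothetical structure)."""
    enzyme: str
    variant: str   # "wildtype" or a mutant label
    drug: str
    replica: int   # index within the ensemble for this enzyme/variant/drug


def build_ensemble(enzyme: str,
                   variants: Sequence[str],
                   drugs: Sequence[str],
                   replicas: int) -> List[SimulationSpec]:
    """Enumerate every (variant, drug, replica) combination for one enzyme."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return [
        SimulationSpec(enzyme, variant, drug, r)
        for variant, drug, r in product(variants, drugs, range(replicas))
    ]


# Placeholder inputs: wildtype plus five unnamed mutants, one drug.
variants = ["wildtype"] + [f"mutant_{i}" for i in range(1, 6)]
ensemble = build_ensemble("HIV-1 protease", variants, ["lopinavir"], replicas=10)
```

Under these assumed inputs the list holds 6 variants x 1 drug x 10 replicas = 60 simulation specifications; each would then be handed to whatever submission layer distributes work across the HPC resources.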
Scripting Protein Structure Prediction
int nSim = 1000;
int maxRounds = 3;
Protein pSet[ ] <ext; exec="Protein.map">;
float startTemp[ ] = [ 100.0, 200.0 ];
float delT[ ] = [ 1.0, 1.5, 2.0, 5.0, 10.0 ];
foreach p, pn in pSet {
    foreach t in startTemp {
        foreach d in delT {
            ItFix(p, nSim, maxRounds, t, d);
        }
    }
}
[Diagram: 1000 predict() calls … analyze()]
ItFix()
{
    foreach sim in [1:nSim] {
        (structure[sim], log[sim]) = predict(p, t, d);
    }
    result = analyze(structure);
}
10 proteins x 1000 simulations x 3 rounds x 2 temps x 5 delta-T’s
= 300K application runs
T. Sosnick, K. Freed, G. Hocky, J. DeBartolo, A. Adhikari, J. Xu, W. Wilde, U. Chicago
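The run count on this slide can be checked with a few lines of Python. This is only an arithmetic sketch, not part of the Swift script: the function names are invented here, the 10 proteins come from the slide's own total (the script itself reads the protein set from a mapper), and it assumes, as the slide's arithmetic implies, that every round of every ItFix call issues nSim predict() runs.

```python
from itertools import product
from typing import List, Sequence, Tuple


def count_runs(n_proteins: int, n_sim: int, max_rounds: int,
               start_temps: Sequence[float],
               del_ts: Sequence[float]) -> Tuple[int, int]:
    """Return (number of ItFix calls, total predict() application runs)."""
    itfix_calls = n_proteins * len(start_temps) * len(del_ts)
    return itfix_calls, itfix_calls * max_rounds * n_sim


def itfix_arguments(proteins: Sequence[str],
                    start_temps: Sequence[float],
                    del_ts: Sequence[float]) -> List[Tuple[str, float, float]]:
    """Mirror the three nested foreach loops: one tuple per ItFix call."""
    return list(product(proteins, start_temps, del_ts))


start_temps = [100.0, 200.0]
del_ts = [1.0, 1.5, 2.0, 5.0, 10.0]
calls, runs = count_runs(10, 1000, 3, start_temps, del_ts)
```

Here `calls` is 100 (10 proteins x 2 temperatures x 5 delta-T values) and `runs` is 300,000, matching the slide's 300K application runs.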