Bringing Compute to the Data
Alternatives to Moving Data
Part of EUDAT’s Training in the Fundamentals of
Data Infrastructures
Introduction
• Why consider alternatives?
• The “traditional” approach
• Alternative approaches:
– Distributed Computing
– Workflows
– Bringing the Compute to the Data
Why should alternative approaches be
considered?
• Moving data is still hard, even when you’re using
the right tools.
• Data volumes are expected to continue to increase, and to do so more rapidly than transfer speeds
• Alternatives require thinking about problems differently, so it may be wise to start considering them before current techniques break down
“Traditional” Approach
• Input data is stored at location A
• Compute resource is at location B
• Output data is required at location C
1. Move data from A to B
2. Perform computation at B
3. Move data from B to C
(Diagram: data moves from A to B, where the computation takes place, and the output moves from B to C. A and C are often the same place; in the "traditional" approach the data is brought to the compute.)
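A minimal sketch of this three-step pattern, assuming some external data-movement tool is available (the hosts, paths and the "analyse" program below are illustrative placeholders, not part of any particular infrastructure):

    import subprocess

    # Hypothetical locations: A holds the input, B is the compute resource's
    # local storage, C is where the output is needed.
    SITE_A = "siteA:/data/input"
    SITE_B = "/scratch/job42"
    SITE_C = "siteC:/data/results"

    def transfer(src, dst):
        # Stand-in for whatever transfer tool is actually used
        # (rsync, GridFTP, Globus, plain scp, ...).
        subprocess.run(["rsync", "-a", src, dst], check=True)

    transfer(SITE_A, SITE_B)                           # 1. move data from A to B
    subprocess.run(["./analyse", SITE_B], check=True)  # 2. perform computation at B
    transfer(SITE_B + "/output", SITE_C)               # 3. move results from B to C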
Alternative Approaches: A Disclaimer
• None of the following approaches provide a
silver bullet!
• Not all approaches will be useful for all problems
– and in some cases, using these approaches can make things worse
• These should complement existing approaches
and be used where appropriate
Distributed Computing
• Here, the idea is that you might not need to do
all of the compute at B.
• In general, this approach could make things
worse, depending on your data transfer pattern
• It will not be suitable for all kinds of problems
• Many of the considerations here are traditional
“parallel computing” concepts
Distributed Computing as Parallel
Computing
• Is the problem “trivially parallel”? Is it possible to
solve parts of the problem using only part of the
input data, and simply recombine the output at
the end of a run?
• If all “processors” have access to all the data at
the start, is it then possible for them to proceed
with little or no communication during the runs?
• If there is the need to communicate during a run,
how intensive are these communications? Do
you have all-to-alls?
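If the answer to the first question is yes, a sketch of the idea (Python is used here purely for illustration) might look like the following: each worker processes its own part of the input with no communication during the run, and the partial outputs are simply recombined at the end.

    from concurrent.futures import ProcessPoolExecutor

    def process_chunk(chunk):
        # Each "processor" works on its own part of the input data,
        # with no communication needed during the run.
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        n_workers = 4
        chunks = [data[i::n_workers] for i in range(n_workers)]

        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            partial_results = list(pool.map(process_chunk, chunks))

        # Recombining the outputs at the end of the run is cheap.
        print(sum(partial_results))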
When might Distributed Computing be a
good alternative?
• When input data starts off distributed (see the sketch after the diagrams below)
– Fairly common with large scale experimental data:
• Sensors, detectors, etc.
– When input data is already mirrored
– When you’ve had to move the data before anyway
and you could have moved it to multiple places
instead of just one
• When the computation is trivially parallelisable
or requires only limited communication
(Diagrams: variations on distributing the computation. A single input A may be fanned out to several compute sites B1–B4, or the input may already be distributed across sites A1–A4 and processed at compute sites close to it; in each case the outputs are combined at C.)
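When the input data starts off distributed, the pattern might be sketched as follows: each site computes a small partial result over its local share of the data, and only those partial results travel to C to be combined (the site names, data and functions are illustrative only; in a real system each partial computation would run at the remote site).

    # Heavy computation over the data held locally at one site (A1, A2, ...).
    def partial_result_at_site(local_data):
        return {"count": len(local_data), "total": sum(local_data)}

    # Input data that is already distributed across sites.
    site_data = {
        "A1": [3, 5, 8],
        "A2": [1, 1, 2],
        "A3": [9, 4],
    }

    # Only the small partial results are moved to C ...
    partials = [partial_result_at_site(d) for d in site_data.values()]

    # ... where they are combined into the final answer.
    count = sum(p["count"] for p in partials)
    mean = sum(p["total"] for p in partials) / count
    print(count, mean)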
Is this “Grid Computing”?
• There are definite overlaps between these ideas
of distributed computing and the “grid
computing” that promised so much in the last
decade…
• Grid is not such a “cool” topic anymore, but
many of the ideas could be reused in different
contexts (possibly hidden from an end-user)
• This way of computing may still come into its
own for certain kinds of big data problems
Scientific Computation “in the cloud”?
• Likely to be a while before this can get close to
existing approaches in terms of efficiency, but it is
being used in some places
– e.g. Amazon has “Cluster Compute” and “Cluster GPU”
instances (see http://aws.amazon.com/hpc-applications/)
• Some data sets are already “in the cloud”, e.g.
– Annotated Human Genome Data provided by ENSEMBL
– Various US Census Databases from The US Census
Bureau
– UniGene provided by the National Center for
Biotechnology Information
– Freebase Data Dump from Freebase.com
Big Input Data
• Likely to become more common as more and
more data is stored and available for re-use
• Projects like EUDAT will make it easier to access stored data
• This will be the case for much data-intensive science
– where the term is used here in the sense of “the fourth paradigm”: computers as datascopes
Workflows
• Related to distributed computing
• Sometimes referred to as “programming in the
large”
• Again, this potentially requires more data
movement
• The idea is to break the computation down so
that some of it can be done at A, some of it can
be done at B, and some of it can be done at C.
• Also, instead of doing everything at B, this could
instead be done at B1, B2, B3, B4, …
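A rough sketch of this decomposition (the stage functions and their mapping to sites are illustrative only): the computation is broken into stages, each of which could run at a different location.

    # Placeholder processing stages; in practice each would run at the site
    # indicated in its name, and only its (smaller) output would be moved on.

    def stage_at_A(raw):
        # e.g. select and pre-filter the input where it is stored
        return [x for x in raw if x > 0]

    def stage_at_B(filtered):
        # e.g. the main computation at a compute site
        return [x ** 2 for x in filtered]

    def stage_at_C(processed):
        # e.g. summarise the result where it is needed
        return sum(processed) / len(processed)

    raw_data = [-2, 3, 7, -1, 4]
    print(stage_at_C(stage_at_B(stage_at_A(raw_data))))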
Simple Motivating Example
(Diagrams: big input data at A, computation at B, and a small output required at C; then the same computation split so that parts of it run at B1 and B2. "...or a more realistic case?": a large scientific workflow graph. Image source: http://pegasus.isi.edu)
Difficulties with this approach
• Change to computation algorithm likely
– A trade-off, but it might only need to be done once…
• Orchestration
– Coordinating computation at multiple sites
– Workflows can help with this
• Can help to address the added complexities of
– Multiple “jurisdictions” / access policies
– Job scheduling
– Automation
Approaches to orchestration
• Local
– Each compute service works independently
– Data can be pushed or pulled between services (or some
combination)
– The route that the data should take can be
• passed with the data
• predetermined at the service
• communicated manually to the service for each “run”
• Orchestrated
– The usual workflow approach
– A workflow engine communicates with services or
processing elements to control data flow
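One minimal sketch of the "local" style in which the route is passed along with the data (the services, message format and routing scheme here are illustrative, not those of any particular workflow system):

    # Each message carries the remaining route alongside the data, so every
    # service knows where to push its output next without a central engine.

    def filter_service(data):
        return [x for x in data if x % 2 == 0]

    def stats_service(data):
        return {"n": len(data), "mean": sum(data) / len(data)}

    SERVICES = {"filter": filter_service, "stats": stats_service}

    def run_service(name, message):
        result = SERVICES[name](message["data"])
        route = message["route"]
        if route:
            # Push the result on to the next service named in the route.
            return run_service(route[0], {"data": result, "route": route[1:]})
        return result

    print(run_service("filter", {"data": [1, 2, 3, 4, 5, 6], "route": ["stats"]}))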
An aside: Push & Pull
Push
• Service 1 completes
processing.
• Service 1 makes a call to
service 2 and sends the
data to service 2
• The arrival of data
triggers service 2 to run
Pull
• Service 1 runs and stores
its output locally
• Service 2 runs (triggered
manually)
• Service 2 initiates data
transfer from service 1
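A toy sketch contrasting the two patterns, with the services modelled as plain objects (in a real system these would be separate network services and the "transfer" would cross the network):

    class Service:
        def __init__(self, name):
            self.name = name
            self.store = None               # local output storage

        def process(self, data):
            result = [x + 1 for x in data]  # stand-in for real processing
            self.store = result
            return result

    # Push: service 1 completes, then calls service 2 and sends it the data;
    # the arrival of the data triggers service 2 to run.
    def push(service1, service2, data):
        intermediate = service1.process(data)
        return service2.process(intermediate)

    # Pull: service 1 runs and stores its output locally; service 2 is
    # triggered separately and initiates the transfer from service 1.
    def pull(service1, service2, data):
        service1.process(data)
        fetched = service1.store            # service 2 fetches from service 1
        return service2.process(fetched)

    s1, s2 = Service("service 1"), Service("service 2")
    print(push(s1, s2, [1, 2, 3]))
    print(pull(s1, s2, [1, 2, 3]))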
Workflow Engines
• Scientific Workflows
– Kepler, Taverna, Triana, Pegasus (Condor), VisTrails
– Unicore, OGSA-DAI (for database-oriented flows)
• General Purpose / Business Orientated
– Service Oriented Architecture Solutions
– BPEL engines, e.g.,
• Oracle BPEL Process Manager
• SAP Exchange Infrastructure
• WebSphere Process Server
– Many of these based on web services
• Datacentre orientated
– Hadoop (MapReduce), Storm (stream processing)
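As a tiny illustration of the MapReduce model that Hadoop popularised (pure Python, not Hadoop code): the map step emits key-value pairs independently for each piece of input, the shuffle groups them by key, and the reduce step combines each group.

    from collections import defaultdict

    documents = ["to be or not to be", "to data or not to data"]

    # Map: emit (key, value) pairs independently for each piece of input.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group the emitted values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: combine the values for each key.
    counts = {key: sum(values) for key, values in groups.items()}
    print(counts)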
Moving the Compute to the Data
• A more general “idea”, related to both of the previous approaches
• This approach relies to some extent on having an infrastructure that supports it
• Can work particularly well where A and C are
the same place
Computing Close To The Data
• Relational Database Systems
– Send a query as SQL
• Virtual Machines
– Send a VM image to a virtualisation environment on a
machine which can directly mount the data
• Allow a user to submit a script or executable on a
machine close to the data
• SPARQL endpoints on RDF triple stores
• Data Services (e.g. as Web Services) with some API
beyond file transfer
– Prefiltering / transformation / subsetting
– Application As A Service
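As a small sketch of sending a query to the data rather than copying the data out (here a SPARQL query to a triple store's endpoint; the endpoint URL and the predicate are placeholders, not a real service):

    import requests

    # Hypothetical SPARQL endpoint sitting next to an RDF triple store.
    ENDPOINT = "https://example.org/sparql"

    # Only the query travels to the data; only the (small) result travels back.
    query = """
    SELECT ?dataset ?size WHERE {
      ?dataset <http://example.org/hasSizeGB> ?size .
      FILTER(?size > 100)
    } LIMIT 10
    """

    response = requests.get(
        ENDPOINT,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    response.raise_for_status()
    for binding in response.json()["results"]["bindings"]:
        print(binding["dataset"]["value"], binding["size"]["value"])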
Implications for Data Centres
• These approaches rely on data centres to
provide computational resources and services
• Cons:
– Interface required to accept query or compute job
– Compute/processing resources required
• Pros:
– Less strain on the network
Conclusions
• Data movement will always be required
• Moving large amounts of data is never likely to
be easy
• There is no single solution, but considering alternative approaches to big data problems may help you to solve problems and answer questions that would otherwise have been impossible
Acknowledgements
These slides were produced by Adam Carter
(EPCC, The University of Edinburgh) as part of the
EUDAT project (www.eudat.eu)
© 2014 The University of Edinburgh
You are welcome to re-use these slides under the terms of
CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/)