Bringing Compute to the Data
Alternatives to Moving Data
Part of EUDAT's Training in the Fundamentals of Data Infrastructures

Introduction
• Why consider alternatives?
• The "traditional" approach
• Alternative approaches:
  – Distributed Computing
  – Workflows
  – Bringing the Compute to the Data

Why should alternative approaches be considered?
• Moving data is still hard, even when you're using the right tools.
• Data volumes are expected to continue to increase, and this is expected to happen more rapidly than increases in transfer speeds.
• Alternatives require thinking about things differently, so it may be wise to start thinking about alternatives before current techniques break down.

"Traditional" Approach
• Input data is stored at location A
• Compute resource is at location B
• Output data is required at location C
1. Move data from A to B
2. Perform computation at B
3. Move data from B to C
(A & C are often the same place)
[Diagram: data moves from A to the compute at B, then on to C]

Alternative Approaches: A Disclaimer
• None of the following approaches provide a silver bullet!
• Not all approaches will be useful for all problems – and in some cases, using these approaches can make things worse.
• These should complement existing approaches and be used where appropriate.

Distributed Computing
• Here, the idea is that you might not need to do all of the compute at B.
• In general, this approach could make things worse, depending on your data transfer pattern.
• It will not be suitable for all kinds of problem.
• Many of the considerations here are traditional "parallel computing" concepts.

Distributed Computing as Parallel Computing
• Is the problem "trivially parallel"? Is it possible to solve parts of the problem using only part of the input data, and simply recombine the output at the end of a run? (See the sketch after the next slide.)
• If all "processors" have access to all the data at the start, is it then possible for them to proceed with little or no communication during the runs?
• If there is the need to communicate during a run, how intensive are these communications? Do you have all-to-alls?

When might Distributed Computing be a good alternative?
• When input data starts off distributed
  – Fairly common with large-scale experimental data: sensors, detectors, etc.
  – When input data is already mirrored
  – When you've had to move the data before anyway and you could have moved it to multiple places instead of just one
• When the computation is trivially parallelisable or requires only limited communication
[Diagrams: various configurations of distributed inputs (A1…A4), compute sites (B1…B4) and outputs (C)]
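To make the "trivially parallel" case concrete, here is a minimal Python sketch. The partition file names and the averaging task are invented for illustration: each worker computes a partial result from its own slice of the input, and only the small partial results are combined at the end.

```python
# Minimal sketch of a trivially parallel computation. The partition
# files (input_part*.txt, one number per line) are hypothetical; in a
# distributed setting they could live at sites B1..B4.
from multiprocessing import Pool

def process_partition(path):
    """Compute a partial (sum, count) result from one input partition."""
    total, count = 0.0, 0
    with open(path) as f:
        for line in f:
            total += float(line)
            count += 1
    return total, count

if __name__ == "__main__":
    partitions = ["input_part1.txt", "input_part2.txt",
                  "input_part3.txt", "input_part4.txt"]
    # Each partition is processed independently; no communication is
    # needed until the cheap recombination step below.
    with Pool(processes=4) as pool:
        partials = pool.map(process_partition, partitions)
    grand_total = sum(t for t, _ in partials)
    grand_count = sum(c for _, c in partials)
    print("mean =", grand_total / grand_count)
```

The same shape carries over to the distributed case: only the small (sum, count) pairs need to move between sites, not the input data itself.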
Is this "Grid Computing"?
• There are definite overlaps between these ideas of distributed computing and the "grid computing" that promised so much in the last decade…
• Grid is not such a "cool" topic anymore, but many of the ideas could be reused in different contexts (possibly hidden from an end-user).
• This way of computing may still come into its own for certain kinds of big data problems.

Scientific Computation "in the cloud"?
• Likely to be a while before this can get close to existing approaches in terms of efficiency, but it is being used in some places
  – e.g. Amazon has "Cluster Compute" and "Cluster GPU" instances (see http://aws.amazon.com/hpc-applications/)
• Some data sets are already "in the cloud", e.g.
  – Annotated Human Genome Data provided by ENSEMBL
  – Various US Census Databases from The US Census Bureau
  – UniGene provided by the National Center for Biotechnology Information
  – Freebase Data Dump from Freebase.com

Big Input Data
• Likely to become more common as more and more data is stored and available for re-use
• Projects like EUDAT will make it easier to access stored data
• This will be the case for much data-intensive science
  – where I'm using this term in the context of "the fourth paradigm": computers as datascopes

Workflows
• Related to distributed computing
• Sometimes referred to as "programming in the large"
• Again, this potentially requires more data movement
• The idea is to break the computation down so that some of it can be done at A, some of it can be done at B, and some of it can be done at C.
• Also, instead of doing everything at B, this could instead be done at B1, B2, B3, B4, …

Simple Motivating Example
[Diagram: big input data at A feeds compute at B, which produces small output data at C; a variant splits the work across B1 and B2]
…or a more realistic case? (Image source: http://pegasus.isi.edu)

Difficulties with this approach
• Change to computation algorithm likely
  – A trade-off, but it might only need to be done once…
• Orchestration
  – Coordinating computation at multiple sites
  – Workflows can help with this
• Workflows can also help to address the added complexities of
  – Multiple "jurisdictions" / access policies
  – Job scheduling
  – Automation

Approaches to orchestration
• Local
  – Each compute service works independently
  – Data can be pushed or pulled between services (or some combination)
  – The route that the data should take can be
    • passed with the data
    • predetermined at the service
    • communicated manually to the service for each "run"
• Orchestrated
  – The usual workflow approach
  – A workflow engine communicates with services or processing elements to control data flow

An aside: Push & Pull
Push
• Service 1 completes processing.
• Service 1 makes a call to service 2 and sends the data to service 2.
• The arrival of data triggers service 2 to run.
Pull
• Service 1 runs and stores its output locally.
• Service 2 runs (triggered manually).
• Service 2 initiates the data transfer from service 1.
(A sketch of both patterns follows.)
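A minimal sketch of the two patterns, assuming two hypothetical HTTP services; the hostnames and endpoints below are invented for illustration, and a real deployment would use whatever transfer mechanism the infrastructure provides (GridFTP, iRODS, plain HTTP, …).

```python
# Sketch of push vs. pull between two services. The URLs are
# placeholders, not real endpoints.
import requests

def push(result_bytes):
    # Push: service 1 finishes its processing, then actively sends the
    # output to service 2; the arrival of the data triggers service 2.
    requests.post("http://service2.example.org/ingest", data=result_bytes)

def pull():
    # Pull: service 2 is triggered (e.g. manually), then fetches the
    # output that service 1 stored locally, and processes it.
    response = requests.get("http://service1.example.org/output/latest")
    return response.content
```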
Workflow Engines
• Scientific Workflows
  – Kepler, Taverna, Triana, Pegasus (Condor), VisTrails
  – Unicore, OGSA-DAI (for database-oriented flows)
• General Purpose / Business Orientated
  – Service Oriented Architecture solutions
  – BPEL engines, e.g.
    • Oracle BPEL Process Manager
    • SAP Exchange Infrastructure
    • WebSphere Process Server
  – Many of these based on web services
• Datacentre orientated
  – Hadoop (MapReduce), Storm (stream processing)

Moving the Compute to the Data
• A more general "idea" which is related to both the previous approaches
• This approach relies to some extent on having an infrastructure that supports it
• Can work particularly well where A and C are the same place

Computing Close To The Data
• Relational Database Systems
  – Send a query as SQL (see the sketch below)
• Virtual Machines
  – Send a VM image to a virtualisation environment on a machine which can directly mount the data
• Allow a user to submit a script or executable on a machine close to the data
• SPARQL endpoints on RDF triple stores
• Data Services (e.g. as Web Services) with some API beyond file transfer
  – Prefiltering / transformation / subsetting
  – Application As A Service
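As a small illustration of "send a query as SQL", the sketch below uses a local SQLite file as a stand-in for a database hosted at the data centre; the file name and schema are assumptions. The point is that only the query travels in and only the small aggregate travels out, rather than the whole table.

```python
# "Send the query, not the data": ship a small SQL query to where the
# data lives and move only the (much smaller) result back. The database
# file and schema are hypothetical; a local SQLite file stands in for a
# remote database at the data centre.
import sqlite3

conn = sqlite3.connect("observations.db")   # the data stays put
cursor = conn.execute(
    "SELECT sensor_id, AVG(value) FROM readings "
    "WHERE taken_at >= '2014-01-01' GROUP BY sensor_id"
)
for sensor_id, mean_value in cursor:  # only the aggregate comes back
    print(sensor_id, mean_value)
conn.close()
```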
Implications for Data Centres
• These approaches rely on data centres to provide computational resources and services
• Cons:
  – Interface required to accept a query or compute job
  – Compute/processing resources required
• Pros:
  – Less strain on the network

Conclusions
• Data movement will always be required
• Moving large amounts of data is never likely to be easy
• There is no single solution, but considering alternative approaches to big data problems may help you to solve problems and answer questions that would otherwise have been impossible

Acknowledgements
These slides were produced by Adam Carter (EPCC, The University of Edinburgh) as part of the EUDAT project (www.eudat.eu)
© 2014 The University of Edinburgh
You are welcome to re-use these slides under the terms of CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/)