Elements of Scale
Ben Stopford

• App in a db story
• The point is that our idea of a thing often represents only a small proportion of that thing’s place in the world.
• It is the system in which it resides and the interactions between elements that really matter. The elements are second-class citizens. The interactions. The systemic effects. That’s where it’s at.
• This talk is going to cover the key elements for scaling data-centric systems.
• We’ll start small, with the elements and laws of data. It is these that shape the large.
• We’ll end with the large, because the large is where we will inevitably end up.

Intuitional building blocks
• The balance between Random & Sequential access to data
• Most data programming boils down to a mechanism for balancing this tradeoff in some way.
• When we think about this tradeoff we often think of disks. Most people are familiar with the concept that disks work well sequentially, but less well randomly.
• Latency comes from two things
– Physical distance -> this is a hardware limit
– Cost to compose
• Writes.
• A low latency resource, a long way away, is never low latency.
• Feedback cycles & concurrency
• Bloom & CQRS

Memory access
• Designed to be sequential. Precaching actually slows down random reads from memory. As datasets grow, the number of TLB misses increases nonlinearly.

Write back caching
• Cache lines are only sent to RAM when the line is evicted. There is a complicated cache coherence protocol that marks cache lines as one of four states.

Myths
• Disk is slow. Memory is fast. (it’s more complex)
• Data centric applications are IO bound (the good ones aren’t)
http://www.eecs.berkeley.edu/~keo/publications/nsdi15-final147.pdf
For example, Spark [51] was demonstrated to be 20× to 40× faster than Hadoop [51]. A close reading of that paper illustrates that much of that improvement came not from eliminating disk I/O, but from other improvements over Hadoop, including eliminating serialization time.
• The biggest bottleneck is perf. (It is, of course, people.)
• Scale out is good.
http://db.cs.berkeley.edu/papers/sigrec10-declimperative.pdf
In recent years, the exponentiation of Moore’s Law has brought the cost of computational units so low that to infrastructure services they seem almost free. For example, O’Malley and Murthy at Yahoo! reported sorting a petabyte of data with Hadoop using a cluster of 3,800 machines each with 8 processor cores, 4 disks, and 8GB of RAM each [59]. That means each core was responsible for sorting only about 32 MB (just 1/64th of their available share of RAM!), while 3799/3800 of the petabyte was passed through the cluster interconnection network during repartitioning. In rough terms, they maximized parallelism while ignoring resource utilization. But if computation and communication are nearly free in practice, what kind of complexity model captures the practical constraints of modern datacenters?
• Took 16hrs. 3800x through the interconnects
• Embarrassingly parallel algorithms.
• Hadoop 2013: 100TB, 2100 nodes, 12 cores (24 threads), 12 3TB SATA. 10GB/s network. 1.4 TB/min => ~55 MB / core / min, 72 mins
• Spark petasort: 234 mins, 190 nodes, 6080 cores.

In memory databases
• Are not faster than disk based ones because of the speed of disk. They are faster because the data structures they use do not have to be disk optimised. (Abadi paper)
• Redis is a good example.

Tradeoffs
• Random Disk IO ~5ms per 1k block
• Sequential Disk IO ~5us per 1k block
=> Disk is only really slow for random operations.

Everything is slow for random operations
• The Sequential/Random problem isn’t to do with spinny disk. It’s everything. Network, memory, everything.
• 250MB/s is roughly how fast a machine can process IO. Random memory access will be less than this on a normal machine.
• => we are stuck with the random/sequential problem on a variety of levels.
• Much of what we do is to make random stuff work sequentially
• The trick is to not make sequential stuff look random.
E.g. Kafka

The log
• Logs are great because they lack random access. They operate at disk speeds.
• Only good for

Idea around messaging
• Introduce the log
• Riak is a set of logs with an in-memory index, just not arranged for sequential scan.
• A messaging system is a log with an in-memory index.

Operation            Cost (ns)          Ratio
Clock period         0.6                1.0
Best-case CAS        37.9               63.2
Best-case lock       65.6               109.3
Single cache miss    139.5              232.5
CAS cache miss       306.0              510.0
Comms Fabric         3,000              5,000
Global Comms         130,000,000        216,000,000
https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2015.01.31a.pdf (p22)

The Physical

The conceptual
• Understanding data at scale
• Terrible interface with data today.
• Excel is still one of the best forms because it conflates data and function in a world with immediate feedback.
• But it’s flat :(
• Rows work well with our heads
• Graphs, not so well. Composites work.
• Data is a graph though, so pretending it’s a row is stupid
• Dangers of aggregate and indirect metrics.

Row databases pull entire rows
Column databases read only the columns a query requires

Denormalisation
• CQRS

Thinking in streams
• Streams encourage stateless processing
• But stream processing normally has two forms of input: the trigger, and other data needed to process the stream. Sometimes this can be precached or included in the message. Often it cannot.

Lambda Architecture
[Diagram labels: All your data; Batch Layer; Serving Layer; Stream layer (fast); Query]

Lambda Architecture
[Diagram labels: All your data; Hadoop; Cassandra; Kafka + Storm; Query]
- Cool architecture for use cases that cannot work in a single pass.
- General applicability limited by double-query & double-coding.
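The "double-query & double-coding" note on the Lambda slide can be made concrete with a small sketch. This is only an illustration under stated assumptions: the function names (`batch_view`, `speed_view`, `query`) and the event shape are hypothetical and not taken from Hadoop, Storm, Cassandra or Kafka. The point it tries to show is that the same counting logic is written twice, and every query has to read and merge two views.

```python
from collections import Counter

# Hypothetical event shape for illustration: {"key": <str>}

def batch_view(history):
    """Batch layer: recompute counts from the full history of events."""
    return Counter(event["key"] for event in history)

def speed_view(recent):
    """Stream layer: the same logic again, over events the batch run has not seen yet."""
    return Counter(event["key"] for event in recent)

def query(key, batch, speed):
    """A query reads both views and merges them (the 'double query')."""
    return batch[key] + speed[key]

history = [{"key": "a"}, {"key": "b"}, {"key": "a"}]
recent = [{"key": "a"}]

batch = batch_view(history)
speed = speed_view(recent)
```

In a real deployment the two code paths would usually live in different frameworks, which is where the double-coding cost comes from.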
Kappa Architecture
[Diagram labels: All your data; Stream; Stream Processor; Views: Search, NoSQL, SQL; Clients]

Kappa Architecture
[Diagram labels: All your data; Kafka; Samza or Storm; Views: Elastic Search, Cassandra, Oracle; Clients]
- Simpler choice where stream processors can handle full problem set

Operational/Analytic Bridge
[Diagram labels: All your data; Operational; Client; Stream; Stream Processor; Views: Search, NoSQL, SQL; Clients]

Operational/Analytic Bridge
[Diagram labels: Operational; Client; Stream; Views: Search, NoSQL, SQL; Clients]

Operational/Analytic Bridge
[Diagram labels: All your data; Coherence; Client; Kafka, RabbitMQ …; Samza; Hadoop; Views: Cassandra, MongoDB, Oracle; Clients]
- Adds coordination layer needed for collaborative updates

Split O/A Bridge
[Diagram labels: Sys 1; Sys 2; Sys 3; Client RW; Client RO; Replication streams; Full history]
- Normalised, mutable -> stream of versions
- Immutable, denormalised: Hadoop, NoSQL, Rel, Stream etc.
- Async
- Mixes all sources
- Historic view is mirrored through stream
- Sources have dedicated ‘buffers’ (isolation and local consistency)
- Similar to Kappa except sources are part of the central infrastructure

Elements of Scale
Scaling Data-Centric Applications

There are two primordial elements
• Locality
• Parallelism

Locality
• Locality is how close you are to the data you need.
• It is also how localised that data is to itself.

Pic
• Disk based data, spread all over the place
• Memory based data, spread all over the place
• Disk based data, sequential
• Memory based data, sequential

Parallelism
• Parallelism is how much of your computation can be executed at the same time.
• That means how parallelisable your computation is.

Pic
• Perfect (embarrassing) parallelism: count, sum, min, max etc.
• Parallelisable tasks -> sorting

On to this we layer data structures
• These provide cleverness. Cleverness lets us manipulate the primordial elements to our advantage.
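As a rough illustration of the embarrassingly parallel case mentioned on the Parallelism slides (count, sum, min, max), the sketch below splits the data into partitions, aggregates each independently, and merges the partial results. The function names are hypothetical, and threads are used only to keep the example self-contained; in standard CPython they will not speed up CPU-bound work like this, so treat it as a picture of the structure rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def partial_aggregate(chunk):
    """Aggregate one partition independently: (count, sum, min, max)."""
    return (len(chunk), sum(chunk), min(chunk), max(chunk))

def merge(a, b):
    """Combine two partial results; order of merging does not matter."""
    return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))

def parallel_aggregate(values, partitions=4):
    """Assumes a non-empty list of numbers."""
    size = -(-len(values) // partitions)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        partials = list(pool.map(partial_aggregate, chunks))
    return reduce(merge, partials)

result = parallel_aggregate(list(range(1, 101)))
```

Sorting does not decompose this cleanly: partitions can be sorted independently, but the merge step has to look across all of them, which is why the slide lists it as parallelisable rather than embarrassingly parallel.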
Data structures allow us to specialise
• This is important because there is no general index structure
• Dictionaries specialise point lookups
• Trees blend specificity with ordering
• Inverted indexes where elements >> terms
• Bitmap indexes are similar in this regard

Element 1: Main memory may not be the golden child you think it is
• Main memory is a variable thing.
– L1 cache
– L2 cache
– L3 cache
– NUMA local
– NUMA, non local
– Sequential throughput
– Random throughput

Programming on the JVM?
• Object allocation in garbage collected languages is generally not aligned.
• Alignment is not guaranteed, even in arrays.
• Off heap structures can guarantee this, but then you are serialising/deserialising.

Element 2: Sequential access, over relevant data, is always good
• Sequential access means the hardware and software can prefetch.
– Disk buffer will prefetch
– OS will prefetch into the file cache
– Processor will prefetch into the L3/L2/L1 caches
• Random access is less good
– Prefetching will act against you, filling caches with data you don’t need.
– Memory access is relatively slow (~40ns). That’s 200 clock cycles. (or n L1 lookups)
– > Random memory access will underperform sequential disk in terms of throughput.

Picture comparing access times

So make everything sequential?
• The problems you solve are not sequential (normally)
• Data structures turn searching into as few non-sequential operations as possible.

Example 1: Binary search
Example 2: LSM tree
Example 3: Binary Tree
Example 4: Columnar
Example 5: Inverted index
• Turn search term to a list of references
Example 6: Bitmap
• Functionally similar to an inverted index. Structurally different though.

Why do we care?
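One way to see why the inverted index and bitmap examples above are functionally similar but structurally different is to build both over the same tiny corpus. This is a minimal sketch, not any particular library's implementation: the corpus, the function names, and the use of a Python integer as the bitmap are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical corpus: document id -> text
docs = {
    0: "sequential access is fast",
    1: "random access is slow",
    2: "sequential disk",
}

def build_inverted_index(docs):
    """Term -> sorted list of document ids (a list of references)."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)
    return index

def build_bitmap_index(docs):
    """Term -> integer whose bit i is set when document i contains the term."""
    bitmaps = defaultdict(int)
    for doc_id, text in docs.items():
        for term in set(text.split()):
            bitmaps[term] |= 1 << doc_id
    return bitmaps

def bitmap_to_ids(bitmap):
    """Decode a bitmap back into the list of document ids it represents."""
    ids, position = [], 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids

inverted = build_inverted_index(docs)
bitmaps = build_bitmap_index(docs)

# AND-ing two terms is a single bitwise operation on the bitmap form.
both = bitmaps["sequential"] & bitmaps["access"]
```

Both structures answer "which documents contain this term", but the bitmap form combines terms with bitwise operations, whereas the list form needs a list intersection.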
Immutability

Element: Distribution

Data applications of the last ten years have focused on solving specific problems, in a hardware sympathetic way, without the bloat that comes with the general case

Examples
• Dynamo – linear scalability of hash based access to distributed data
• GFS, HDFS – Immutable. No random writes. Immutability facilitates easy replication.
• LSM – that writes dominate workloads. It’s hard to write quickly to disk based trees.
• Kafka – that the core of a messaging system is really just an append-only file. Sequential access is fast.

--0-0-0-0-0-0-0-0-0--

There are many other, rich types of index
• Term index (inverted index)
• Spatial index
• Temporal index
• Graph index

Drive Performance
• Sequential reads & writes ~ commodity network speeds (170MB/s).
• HDD/SSD have similar performance.
• Samsung XP941 (£300 0.5TB), Seagate Archive 8TB (£242), Western Digital Raptor 1TB (£178)
• Seq read
– Sam HDD ~ 190MB/s
– Rap HDD ~ 203MB/s
– SSD ~ 1,485 MB/s
– RAM – 10,000 MB/s
• Seq write
– Sam HDD ~ 190MB/s
– Rap HDD ~ 203MB/s
– SSD ~ 710 MB/s
• Random read
– Sam HDD ~ 300 reads/sec (4k IO/s) (1MB/s)
– Rap HDD ~ 270 reads/sec (4k IO/s)
– SSD ~ 80,000 reads/sec (4k IO/s) (320MB/s)
• Random write
– Sam HDD ~ 402 writes/sec (4k IO/s) (1MB/s)
– Rap HDD ~ 366 writes/sec (4k IO/s) (1MB/s)
– SSD ~ 50,000 writes/sec (4k IO/s) (200MB/s)
Benchmarks based on the IOmeter benchmark.

(skip) When Memory is faster
• Writes that don’t need durability
• Data structures that need not be disk optimised
• Avoid the cost of IO (locks & latches)
• But often the utility of durability and larger address spaces tips the balance.
• Many in-memory databases perform little better than comparable disk based systems over similarly sized datasets.
• Network erodes this further
• JVM Objects provide precious little locality in real world applications.

Composite disk structures
• Use multiple disks.
• Fast SSD for indexes
• Fast SSD for journal
• Cheaper HDD for large storage.
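The random-IO figures on the Drive Performance slide convert to throughput with simple arithmetic: operations per second multiplied by the block size. The helper below is a hypothetical name introduced for illustration; it assumes 4 KB blocks and 1 MB = 1000 KB, which reproduces the 320 MB/s and 200 MB/s figures on the slide, and puts the HDD case at roughly 1.2 MB/s (the slide rounds this to 1MB/s).

```python
def iops_to_mb_per_s(iops, block_kb=4):
    """Throughput in MB/s for a given number of fixed-size IOs per second."""
    return iops * block_kb / 1000

hdd_random_read = iops_to_mb_per_s(300)      # slide figure: ~300 reads/sec
ssd_random_read = iops_to_mb_per_s(80_000)   # slide figure: ~80,000 reads/sec
ssd_random_write = iops_to_mb_per_s(50_000)  # slide figure: ~50,000 writes/sec

# Sequential HDD read figure from the same slide, in MB/s.
hdd_sequential_read = 190

# How many times slower the HDD is when read randomly rather than sequentially.
hdd_slowdown = hdd_sequential_read / hdd_random_read
```

On these numbers the spinning disk loses about two orders of magnitude of throughput when accessed randomly, which is the gap the composite layout (SSD for indexes and journal, HDD for bulk storage) is designed around.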
Immutability
• Many workloads can avoid update in place.
• Take, for example, marking a trade with a risk flag.
• We could update the trade itself
• Or we could take the trade, evaluate it, and write an immutable stream of records mapping trade version to risk flag.
• The latter preserves a single writer and avoids a contended read-write view.
• Obviously it lacks consistency, but that may not matter.
• Immutable worlds are easier to scale through replication.

Immutable data makes scaling easy
• Consistency is an expensive thing.
• Read replicas are easier to manage in an immutable world.
• Long history of this in data warehousing, but without the focus on data immutability

Separating Parallelisable from Sequential
Bloom etc.
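The trade and risk-flag example from the Immutability slide can be sketched as follows. Everything here is an assumption made for illustration: the `Trade` and `RiskFlag` record shapes, the `evaluate` rule and its notional limit, and the use of a plain list as the append-only stream. The idea it tries to show is that the trade is never updated in place; the evaluator only appends records keyed by trade id and version.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trade:
    """Hypothetical immutable trade version."""
    trade_id: str
    version: int
    notional: float

@dataclass(frozen=True)
class RiskFlag:
    """Hypothetical record mapping a trade version to its risk flag."""
    trade_id: str
    version: int
    risky: bool

def evaluate(trade, limit=1_000_000):
    """Illustrative rule only: flag trades whose notional exceeds the limit."""
    return RiskFlag(trade.trade_id, trade.version, trade.notional > limit)

# Append-only stream with a single writer; nothing is updated in place.
risk_stream = []
trades = [Trade("T1", 1, 500_000), Trade("T1", 2, 2_000_000)]
for trade in trades:
    risk_stream.append(evaluate(trade))

def latest_flag(stream, trade_id):
    """Readers derive the current view by taking the highest version seen."""
    flags = [flag for flag in stream if flag.trade_id == trade_id]
    return max(flags, key=lambda flag: flag.version) if flags else None
```

A reader may briefly see the flag for version 1 after version 2 of the trade exists, which is the lack of consistency the slide mentions; in exchange, the stream can be replicated to any number of read-only consumers without coordination.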