
Elements of Scale
Ben Stopford
•  App in a db story
•  The point is that our idea of a thing often represents only a small proportion of that thing's place in the world.
•  It is the system in which it resides, and the interactions between elements, that really matter. The elements are second-class citizens. The interactions. The systemic effects. That's where it's at.
•  This talk is going to cover the key elements for scaling data-centric systems.
•  We'll start small, with the elements and laws of data. It is these that shape the large.
•  We'll end with the large, because the large is where we will inevitably end up.
Intuitive building blocks
•  The balance between Random & Sequential access to data
•  Most data programming boils down to a mechanism for balancing this tradeoff in some way.
•  When we think about this tradeoff we often think of disks. Most people are familiar with the concept that disks work well sequentially, but less well randomly.
•  Latency comes from two things
–  Physical distance -> this is a hardware limit
–  Cost to compose
•  Writes.
•  A low latency resource, a long way away, is never low latency.
•  Feedback cycles & concurrency
•  Bloom & CQRS
Memory access
•  Designed to be sequential. Prefetching actually slows down random reads from memory. As datasets grow, the number of TLB misses increases nonlinearly.
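A minimal sketch of the effect on the JVM (a naive microbenchmark with no JIT warm-up; the array size is an arbitrary assumption): the same elements are summed once in index order, which the prefetcher can stream, and once in a shuffled order, which defeats prefetching and generates cache and TLB misses.

    import java.util.Random;

    public class SeqVsRandom {
        public static void main(String[] args) {
            int n = 1 << 25;                       // ~32M ints (~128 MB)
            int[] data = new int[n];
            int[] order = new int[n];
            for (int i = 0; i < n; i++) order[i] = i;
            Random rnd = new Random(42);
            for (int i = n - 1; i > 0; i--) {      // Fisher-Yates shuffle of the visit order
                int j = rnd.nextInt(i + 1);
                int t = order[i]; order[i] = order[j]; order[j] = t;
            }

            long t0 = System.nanoTime();
            long seqSum = 0;
            for (int i = 0; i < n; i++) seqSum += data[i];        // sequential: prefetch-friendly
            long t1 = System.nanoTime();
            long rndSum = 0;
            for (int i = 0; i < n; i++) rndSum += data[order[i]]; // random: cache/TLB misses dominate
            long t2 = System.nanoTime();

            System.out.printf("sequential %d ms, random %d ms (sums %d %d)%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, seqSum, rndSum);
        }
    }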
Write back caching
•  Cache lines are only written back to RAM when the line is evicted. A cache coherence protocol (MESI) marks each cache line as one of four states: Modified, Exclusive, Shared or Invalid.
Myths
•  Disk is slow. Memory is fast. (It's more complex than that.)
•  Data-centric applications are IO bound. (The good ones aren't.)
http://www.eecs.berkeley.edu/~keo/
publications/nsdi15-final147.pdf
For example, Spark [51] was demonstrated to be 20×
to 40× faster than Hadoop [51]. A close reading of that
paper illustrates that much of that improvement came not
from eliminating disk I/O, but from other improvements
over Hadoop, including eliminating serialization time.
•  The biggest bottleneck is performance. (It is, of course, people.)
•  Scale out is good.
http://db.cs.berkeley.edu/papers/sigrec10-declimperative.pdf
In recent years, the exponentiation of Moore's Law has brought the cost of computational units so low that to infrastructure services they seem almost free. For example, O'Malley and Murthy at Yahoo! reported sorting a petabyte of data with Hadoop using a cluster of 3,800 machines each with 8 processor cores, 4 disks, and 8GB of RAM each [59]. That means each core was responsible for sorting only about 32 MB (just 1/64th of their available share of RAM!), while 3799/3800 of the petabyte was passed through the cluster interconnection network during repartitioning. In rough terms, they maximized parallelism while ignoring resource utilization. But if computation and communication are nearly free in practice, what kind of complexity model captures the practical constraints of modern datacenters?
Took 16hrs. 3800x through the interconnects.
Embarrassingly parallel algorithms.
Hadoop 2013: 100TB, 2100 nodes, 12 cores (24 threads), 12 x 3TB SATA, 10Gb/s network. 1.4 TB/min => 55 MB/core, 72 mins.
Spark petasort: 234 mins, 190 nodes, 6080 cores.
In-memory databases
•  Are not faster than disk-based ones because of the speed of disk. They are faster because the data structures they use do not have to be disk-optimised.
Abadi paper
•  Redis is a good example.
Tradeoffs
•  Random Disk IO ~5ms per 1k block
•  Sequential Disk IO ~5us per 1k block
=> Disk is only really slow for random
operations.
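In throughput terms: at ~5ms per 1k block, random IO moves roughly 1KB / 0.005s ≈ 200 KB/s, while at ~5µs per 1k block, sequential IO moves roughly 1KB / 0.000005s ≈ 200 MB/s, about a 1000x difference from the same device.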
Everything is slow for random
operations
•  The sequential/random problem isn't just about spinning disks. It's everything: network, memory, everything.
•  ~250MB/s is roughly how fast a machine can process IO. Random memory access will be less than this on a normal machine.
•  => we are stuck with the random-sequential problem at a variety of levels.
•  Much of what we do is to make random stuff work sequentially.
•  The trick is to not make sequential stuff look random. E.g. Kafka.
The log
•  Logs are great because they lack
random access. They operate at disk
speeds.
•  Only good for
Idea around messaging
•  Introduce the log
•  Riak is a set of logs with an in-memory index, just not arranged for sequential scan.
•  A messaging system is a log with an in-memory index.
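A minimal sketch of that idea (hypothetical names; no durability, replication or segment handling): appends always go to the tail of the file, and a small in-memory map from message offset to file position turns a read into one seek plus a sequential scan.

    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;
    import java.util.TreeMap;

    // Hypothetical toy: an append-only log file plus an in-memory offset index.
    public class TinyLog {
        private final RandomAccessFile file;
        private final TreeMap<Long, Long> index = new TreeMap<>(); // message offset -> file position
        private long nextOffset = 0;

        public TinyLog(String path) throws Exception {
            this.file = new RandomAccessFile(path, "rw");
        }

        public synchronized long append(String msg) throws Exception {
            byte[] bytes = msg.getBytes(StandardCharsets.UTF_8);
            long pos = file.length();
            file.seek(pos);                     // always write at the tail: sequential IO
            file.writeInt(bytes.length);
            file.write(bytes);
            index.put(nextOffset, pos);         // remember where this offset starts
            return nextOffset++;
        }

        public synchronized String read(long offset) throws Exception {
            Long pos = index.get(offset);
            if (pos == null) return null;
            file.seek(pos);                     // one seek, then a sequential read
            byte[] bytes = new byte[file.readInt()];
            file.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }

        public static void main(String[] args) throws Exception {
            TinyLog log = new TinyLog("tiny.log");
            long o = log.append("hello");
            System.out.println(log.read(o));    // prints: hello
        }
    }

A real log would also need fsync for durability, segment files and index rebuild on restart; Kafka layers partitioning and replication on top of the same basic shape.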
Operation Cost (ns) Ratio
Clock period 0.6 1.0
Best-case CAS 37.9 63.2
Best-case lock 65.6 109.3
Single cache miss 139.5 232.5
CAS cache miss 306.0 510.0
Comms Fabric 3,000 5,000
Global Comms 130,000,000 216,000,000
https://www.kernel.org/pub/linux/kernel/people/
paulmck/perfbook/perfbook.2015.01.31a.pdf
•  p22
The Physical
The conceptual
•  Understanding data at scale
•  Terrible interfaces to data today.
•  Excel is still one of the best forms because it conflates data and function in a world with immediate feedback.
•  But it's flat ☹
•  Rows work well with our heads.
•  Graphs, not so well. Composites work.
•  Data is a graph though, so pretending it's a row is stupid.
•  Dangers of aggregate and indirect metrics.
Row databases pull entire rows.
Column databases can select, at read time, only the required columns.
Denormalisation
•  CQRS
Thinking in streams
•  Streams encourage stateless processing.
•  But stream processing normally has two forms of input: the trigger, and other data needed to process the stream. Sometimes this can be pre-cached or included in the message. Often it cannot.
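A minimal sketch of that two-input shape (hypothetical names, no framework, recent JDK assumed): the trigger is the event itself, and the "other data" is a reference table that is pre-cached here but may in practice be too big or too volatile to cache.

    import java.util.Map;
    import java.util.stream.Stream;

    // Hypothetical toy: enrich a stream of trade events with reference data.
    public class StreamEnrichment {
        record Trade(String id, String counterpartyId, double notional) {}

        public static void main(String[] args) {
            // "Other data" needed to process each event, pre-cached here;
            // often it is too large or too volatile to cache and must be fetched per event.
            Map<String, String> counterpartyNames = Map.of("CP1", "Acme Bank", "CP2", "Globex");

            Stream<Trade> trades = Stream.of(
                    new Trade("T1", "CP1", 1_000_000),
                    new Trade("T2", "CP2", 250_000));

            // The trigger is the event; processing stays stateless because everything
            // needed is either in the event itself or in the cached table.
            trades.map(t -> t.id() + " with "
                            + counterpartyNames.getOrDefault(t.counterpartyId(), "unknown")
                            + " for " + t.notional())
                  .forEach(System.out::println);
        }
    }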
All your data
Lambda Architecture
Query
Batch Layer
Serving
Layer
Query
Stream layer (fast)
All your data
Lambda Architecture
Query
Hadoop
Cassandra
Query
Kafka + Storm
- Cool architecture for use cases that cannot work in a single pass.
- General applicability limited by double-query & double-coding.
Kappa Architecture
All your data
Stream Processor
Views
Search
Client
Stream
NoSQL
Client
SQL
Kappa Architecture
All your data
Samza or
Storm
Views
Elastic
Search
Client
Kafka
Cassandra
Client
Oracle
- Simpler choice where stream processors can handle full problem set
Operational /Analytic Bridge
All your data
Operational
Client
Stream Processor
Views
Search
Stream
NoSQL
SQL
Client
Client
Operational /Analytic Bridge
Coherence
Client
Views
Samza
All your data
Hadoop
Kafka,
RabbitMQ
…
Cassandra,
MongoDB
Client
Oracle
Client
- Adds coordination layer needed for collaborative updates
Split O/A Bridge
Normalised, mutable ->
stream of versions
Immutable, denormalised
Hadoop, NoSQL,
Rel, Stream etc.
Async
Mixes all sources
Replication
streams
Sys 1
RW
Sys 2
Historic view
is mirrored
through stream
Client
RO
Client
RW
Sources have dedicated ‘buffers’ (isolation and local consistency)
Sys 3
Full history
Similar to Kappa except sources are part of the central infrastructure
Elements of Scale
Scaling Data-Centric
Applications
There are two primordial elements
•  Locality
•  Parallelism
Locality
•  Locality is how close you are to the data
you need.
•  It is also how localised that data is to
itself.
Pic
•  Disk-based data, spread all over the place
•  Memory-based data, spread all over the place
•  Disk-based data, sequential
•  Memory-based data, sequential
Parallelism
•  Parallelism is how much of your computation can be executed at the same time.
•  That means how parallelisable your computation is.
Pic
•  Perfectly (embarrassingly) parallel: count, sum, min, max etc.
•  Parallelisable tasks -> sorting
On to this we layer data structures
•  These provide cleverness. Cleverness lets us manipulate the primordial elements to our advantage.
Data structures allow us to
specialise
•  This is important because there is no general index structure
•  Dictionaries specialise point lookups
•  Trees blend specificity with ordering
•  Inverted indexes where elements >> terms
•  Bitmap indexes are similar in this regard
Element 1:
Main memory may not be the
golden child you think it is
•  Main memory is a variable thing.
–  L1 cache
–  L2 cache
–  L3 cache
–  NUMA local
–  NUMA, non local
–  Sequential throughput
–  Random throughput
Programming on the JVM?
•  Object allocation in garbage-collected languages is generally not aligned.
•  Alignment is not guaranteed, even in arrays (an array of objects is an array of references).
•  Off-heap structures can guarantee this, but then you are serialising/deserialising.
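A minimal sketch of the off-heap trade-off (hypothetical record layout): fixed-width records packed into a direct ByteBuffer are contiguous and outside the GC heap, but every access becomes an explicit serialise/deserialise step rather than a plain field read.

    import java.nio.ByteBuffer;

    // Hypothetical toy: store (long id, double price) records off-heap in a direct buffer.
    public class OffHeapRecords {
        private static final int RECORD_SIZE = Long.BYTES + Double.BYTES; // 16 bytes, fixed width
        private final ByteBuffer buf;

        public OffHeapRecords(int capacity) {
            buf = ByteBuffer.allocateDirect(capacity * RECORD_SIZE); // contiguous, outside the GC heap
        }

        public void put(int slot, long id, double price) {  // serialise on the way in
            int base = slot * RECORD_SIZE;
            buf.putLong(base, id);
            buf.putDouble(base + Long.BYTES, price);
        }

        public double priceAt(int slot) {                   // deserialise on the way out
            return buf.getDouble(slot * RECORD_SIZE + Long.BYTES);
        }

        public static void main(String[] args) {
            OffHeapRecords records = new OffHeapRecords(1_000);
            records.put(0, 42L, 101.5);
            System.out.println(records.priceAt(0)); // 101.5
        }
    }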
Element 2:
Sequential access, over relevant
data, is always good
•  Sequential access means the hardware and software can prefetch.
–  Disk buffer will prefetch
–  OS will prefetch into the file cache
–  Processor will prefetch into the L3/L2/L1 caches
•  Random access is less good
–  Prefetching will act against you, filling caches with data you don't need.
–  Memory access is relatively slow (~40ns). That's ~200 clock cycles (or many L1 lookups).
–  => Random memory access will underperform sequential disk in terms of throughput.
Picture comparing access times
So make everything sequential?
•  The problems you solve are not
sequential (normally)
•  Data structures turn searching into as few
non-sequential operations as possible.
Example 1: Binary search
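For illustration, a plain binary search: a point lookup over sorted data costs roughly log2(n) non-sequential probes rather than a full scan.

    // Plain binary search: ~log2(n) non-sequential probes into sorted data.
    public class BinarySearchExample {
        static int search(long[] sorted, long key) {
            int lo = 0, hi = sorted.length - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;        // unsigned shift avoids overflow
                if (sorted[mid] < key) lo = mid + 1;
                else if (sorted[mid] > key) hi = mid - 1;
                else return mid;
            }
            return -1;                            // not found
        }

        public static void main(String[] args) {
            long[] sorted = {2, 5, 8, 13, 21, 34};
            System.out.println(search(sorted, 13)); // 3
        }
    }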
Example 2: LSM tree
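A minimal sketch of the LSM idea (hypothetical names, no compaction, and the flushed runs kept in memory to stand in for files): writes land in a sorted memtable and are flushed as immutable sorted runs, so writes to the slow medium are always sequential; reads check the memtable, then the runs, newest first.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Hypothetical toy LSM store: sorted in-memory memtable + immutable sorted runs.
    public class TinyLsm {
        private final int memtableLimit;
        private TreeMap<String, String> memtable = new TreeMap<>();
        private final Deque<SortedMap<String, String>> runs = new ArrayDeque<>(); // newest first

        public TinyLsm(int memtableLimit) { this.memtableLimit = memtableLimit; }

        public void put(String key, String value) {
            memtable.put(key, value);              // random-access structure, but it lives in memory
            if (memtable.size() >= memtableLimit) {
                runs.addFirst(memtable);           // flush: in a real store this is one sequential write
                memtable = new TreeMap<>();
            }
        }

        public String get(String key) {
            String v = memtable.get(key);
            if (v != null) return v;
            for (SortedMap<String, String> run : runs) {  // newest run wins
                v = run.get(key);
                if (v != null) return v;
            }
            return null;
        }

        public static void main(String[] args) {
            TinyLsm lsm = new TinyLsm(2);
            lsm.put("a", "1"); lsm.put("b", "2");  // triggers a flush
            lsm.put("a", "3");                     // newer value shadows the flushed one
            System.out.println(lsm.get("a") + " " + lsm.get("b")); // 3 2
        }
    }

A real LSM store writes the runs as sorted files and compacts them in the background.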
Example 3: Binary Tree
Example 4: Columnar
Example 5: Inverted index
•  Turn search term to a list of references
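A minimal sketch (hypothetical names): the index maps each term to a posting list of document ids, so a search term becomes one lookup plus a sequential scan of that list.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical toy inverted index: term -> posting list of document ids.
    public class InvertedIndex {
        private final Map<String, List<Integer>> postings = new HashMap<>();

        public void add(int docId, String text) {
            for (String term : text.toLowerCase().split("\\s+")) {
                postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
            }
        }

        public List<Integer> search(String term) {
            return postings.getOrDefault(term.toLowerCase(), List.of());
        }

        public static void main(String[] args) {
            InvertedIndex idx = new InvertedIndex();
            idx.add(1, "the quick brown fox");
            idx.add(2, "the lazy dog");
            System.out.println(idx.search("the"));  // [1, 2]
            System.out.println(idx.search("fox"));  // [1]
        }
    }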
Example 6: Bitmap
•  Functionally similar to an inverted index. Structurally different though.
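A minimal sketch (hypothetical names): one bit-set per distinct value, with row ids as bit positions; predicates then combine with bitwise AND/OR, which is very sequential work.

    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical toy bitmap index: one BitSet per column value, bit position = row id.
    public class BitmapIndex {
        private final Map<String, BitSet> bitmaps = new HashMap<>();

        public void add(int rowId, String value) {
            bitmaps.computeIfAbsent(value, v -> new BitSet()).set(rowId);
        }

        public BitSet rowsWhere(String value) {
            return (BitSet) bitmaps.getOrDefault(value, new BitSet()).clone();
        }

        public static void main(String[] args) {
            BitmapIndex country = new BitmapIndex();
            country.add(0, "UK"); country.add(1, "US"); country.add(2, "UK");

            BitmapIndex status = new BitmapIndex();
            status.add(0, "active"); status.add(2, "closed");

            BitSet result = country.rowsWhere("UK");
            result.and(status.rowsWhere("active"));   // rows where country=UK AND status=active
            System.out.println(result);               // {0}
        }
    }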
Why do we care?
Immutability
Element: Distribution
Data applications of the last ten
years have focused on solving
specific problems, in a hardware
sympathetic way, without the
bloat that comes with the
general case
Examples
•  Dynamo – linear scalability of hash-based access to distributed data.
•  GFS, HDFS – immutable; no random writes. Immutability facilitates easy replication.
•  LSM – assumes that writes dominate workloads; it's hard to write quickly to disk-based trees.
•  Kafka – the core of a messaging system is really just an append-only file. Sequential access is fast.
--0-0-0-0-0-0-0-0-0--
There are many other, rich types
of index
•  Term index (inverted index)
•  Spatial index
•  Temporal index
•  Graph index
Drive Performance
•  Sequential reads & writes ~ commodity network speeds (170MB/s). HDD/SSD have similar performance.
•  Samsung XP941 (£300 0.5TB), Seagate Archive 8TB (£242), Western Digital Raptor 1TB (£178)
•  Seq read
–  Sam HDD ~ 190MB/s
–  Rap HDD ~ 203MB/s
–  SSD ~ 1,485 MB/s
–  RAM ~ 10,000 MB/s
•  Seq write
–  Sam HDD ~ 190MB/s
–  Rap HDD ~ 203MB/s
–  SSD ~ 710 MB/s
•  Random read
–  Sam HDD ~ 300 reads/sec (4k IO/s) (~1MB/s)
–  Rap HDD ~ 270 reads/sec (4k IO/s)
–  SSD ~ 80,000 reads/sec (4k IO/s) (~320MB/s)
•  Random write
–  Sam HDD ~ 402 writes/sec (4k IO/s) (~1MB/s)
–  Rap HDD ~ 366 writes/sec (4k IO/s) (~1MB/s)
–  SSD ~ 50,000 writes/sec (4k IO/s) (~200MB/s)
•  Benchmarks based on IOmeter
(skip) When Memory is faster
•  Writes that don't need durability
•  Data structures that need not be disk-optimised
•  Avoid the cost of IO (locks & latches)
•  But often the utility of durability and larger address spaces tips the balance.
•  Many in-memory databases perform little better than comparable disk-based systems over similarly sized datasets.
•  Network erodes this further.
•  JVM objects provide precious little locality in real-world applications.
Composite disk structures
•  Use multiple disks.
•  Fast SSD for indexes
•  Fast SSD for journal
•  Cheaper HDD for large storage.
Immutability
•  Many workloads can avoid update in place.
•  Take for example marking a trade with a risk flag.
•  We could update the trade itself.
•  Or we could take the trade, evaluate it, and write an immutable stream of records mapping trade version to risk flag (see the sketch after this list).
•  The latter preserves single writer and avoids a
contended read-write view.
•  Obviously it lacks consistency, but that may not
matter.
•  Immutable worlds are easier to scale through
replication.
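A minimal sketch of the risk-flag example (hypothetical names, recent JDK assumed): the evaluator appends immutable (trade id, trade version, flag) records rather than updating the trade, so there is a single writer per stream and readers never contend with it.

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Hypothetical toy: risk flags as an append-only stream keyed by trade version.
    public class RiskFlagStream {
        record Trade(String id, int version, double notional) {}
        record RiskFlag(String tradeId, int tradeVersion, boolean breached) {}

        private final List<RiskFlag> flags = new CopyOnWriteArrayList<>(); // append-only, safe to read concurrently

        public void evaluate(Trade trade) {
            // Single writer appends an immutable record instead of mutating the trade.
            flags.add(new RiskFlag(trade.id(), trade.version(), trade.notional() > 1_000_000));
        }

        public List<RiskFlag> history() { return List.copyOf(flags); }

        public static void main(String[] args) {
            RiskFlagStream stream = new RiskFlagStream();
            stream.evaluate(new Trade("T1", 1, 500_000));
            stream.evaluate(new Trade("T1", 2, 2_000_000));   // a new version, a new record
            stream.history().forEach(System.out::println);
        }
    }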
Immutable data makes scaling
easy
•  Consistency is an expensive thing.
•  Read replicas are easier to manage in an
immutable world.
•  Long history of this in data warehousing,
but without the focus on data immutability
Separating Parallelisable from
Sequential
Bloom etc.