DROSS Andrew Hardie ECPRD WGICT 17-21 November 2010

DROSS
Distributed & Resilient Open Source Software
Andrew Hardie
http://ashardie.com
ECPRD WGICT
17-21 November 2010
Chamber of Deputies, Bucharest
17-21 November 2010
ECPRD - WGICT - Bucharest
Topics

Distributed, not virtualized or ‘cloud’
DRBD
Gluster
Heartbeat
Nginx
Trends:
• NoSQL
• Map / Reduce
• Cassandra, Hadoop & family
Other stuff ‘out there’
Predictions…
DRBD

Block-level disk replicator (effectively, net RAID-1)
DRBD – Good/bad points

Good for HA clusters (e.g. LAMP servers)
Ideal for block-level apps, e.g. MySQL
Sync/Async operation
Auto recovery from disk, net or node failure
In Linux kernels from 2.6.33 (Ubuntu 10.10 is 2.6.35)
Supports InfiniBand, LVM, Xen, dual-primary config
Hard to extend beyond two systems; three is the maximum
Remote offsite really needs DRBD Proxy (commercial)
Requires dedicated disk/partition
Moderately difficult to configure
Documentation could be better
Gluster

Filesystem-level replicator
More like NAS than RAID
Claims to scale to petabytes
Nodes can be servers, clients or both
On-the-fly reconfig of disks & nodes
Scripting interface
‘Cloud compliant’ (isn’t everything?)
Gluster – Use case - Dublin
Real-time mirroring of Digital Audio
Gluster – Good/bad points

Moving to “turnkey system” (black box)
N-way replication easy
Easier than DRBD to configure
Dedicated partitions or disks not required
Supports InfiniBand
Background self-healing (pull rather than push)
Aggregate and/or replicate volumes
POSIX support
Native support for NFS, CIFS, HTTP & FTP
No specific features for slow link replication
Similar tension between documentation and revenue earning
Heartbeat

HA cluster infrastructure (“cluster glue”)
Needs a Cluster Resource Manager (CRM), e.g. Pacemaker, to be useful
Part of the Linux-HA project
Provides:
• Hot-swap of a synthetic IP address between nodes (the synthetic IP is in addition to each node’s own IPs)
• Node failure/restore detection
• Start/stop of services to be managed, via init scripts
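The three functions listed above (synthetic IP hot-swap, failure detection, service start/stop) can be illustrated with a toy simulation. This is a minimal sketch, not Heartbeat's actual implementation or configuration: the `Node` and `Cluster` classes, the `dead_time` parameter and the address 192.0.2.10 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_beat: float = 0.0      # time the last heartbeat was received
    holds_vip: bool = False     # does this node answer on the synthetic IP?
    services_up: bool = False   # are the managed services started here?


class Cluster:
    """Toy cluster: the first live node in preference order holds the synthetic IP."""

    def __init__(self, nodes, vip, dead_time=10.0):
        self.nodes = nodes          # ordered by preference, primary first
        self.vip = vip              # hypothetical synthetic IP address
        self.dead_time = dead_time  # seconds of silence before a node counts as dead

    def beat(self, name, now):
        """Record a heartbeat from the named node."""
        for node in self.nodes:
            if node.name == name:
                node.last_beat = now

    def alive(self, node, now):
        return now - node.last_beat <= self.dead_time

    def evaluate(self, now):
        """Give the synthetic IP and services to the first live node; stop them elsewhere."""
        owner = next((n for n in self.nodes if self.alive(n, now)), None)
        for node in self.nodes:
            active = node is owner
            node.holds_vip = active
            node.services_up = active   # stands in for init-script start/stop
        return owner.name if owner else None


cluster = Cluster([Node("alpha"), Node("beta")], vip="192.0.2.10")
cluster.beat("alpha", now=0.0)
cluster.beat("beta", now=0.0)
first_owner = cluster.evaluate(now=5.0)      # both alive: the primary holds the IP

cluster.beat("beta", now=12.0)               # alpha has gone silent
second_owner = cluster.evaluate(now=20.0)    # alpha exceeded dead_time: beta takes over
```

Clients only ever connect to the synthetic address, so they do not need to know which node currently owns it.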
Heartbeat/DRBD – use case
HA LAMP Server pair
Heartbeat – good/bad points

Lots of resource agents available, e.g. Apache, Squid, Sphinx search, VMware, DB2, WebSphere, Oracle, JBoss, Tomcat, Postfix, Informix, SAP, iSCSI, DRBD, …
Beyond simple 2-way hot-swap, config can get very complicated
Good for stateless services (e.g. HTTP); not so good for file shares (e.g. Samba)
Documentation out of date in some areas, e.g. Ubuntu ‘upstart’ scripts (boot-time startup of services to be managed by Heartbeat has to be disabled)
NGINX

Fast, simple Russian HTTP server
Reverse proxy server
Mail proxy server
Fast static content serving
Very low memory footprint
Load balancing and fault tolerance
Name and IP based virtual servers
Embedded Perl
FLV streaming
Non-threaded, event-driven architecture
Modular architecture
Can front-end Apache (instead of mod_proxy)
Trends – NoSQL, etc…

NoSQL
Or is it really NoACID (atomicity, consistency, isolation, durability)?
It’s really the ACID that’s hard to scale, esp. in very large, very active data stores (e.g. social networks)
• Some NoSQLs now have SQL for query only
• Ways of solving ACID scalability being discussed

The problems:
• Huge numbers of simultaneous updates
• Large JOINs across very large tables (= big SQL query)
• Lots of updates & searches on small data elements in vast data sets

The alternative:
• Key/value stores
• De-normalized data
Consequences of De-normalizing

Order(s) of magnitude increase in storage requirements
Difficulty of updating numerous “key equivalents” in many places – can’t be done synchronously
Breaking relationship links allows parallel processing:
• Helps the bottleneck of storage read speed (storage capacity is growing much faster than transfer rates)
• No JOINs or transactions
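The trade-off above can be seen in a few lines of Python. This is a minimal sketch; the `members` and `speeches` data and both helper functions are hypothetical, invented only to show that copying data removes the JOIN but multiplies the places an update must reach.

```python
# Normalized: each fact is stored once and linked by key (a JOIN is needed to read).
members = {1: {"name": "A. Member", "party": "Blue"}}
speeches = [
    {"id": 101, "member_id": 1, "text": "First speech"},
    {"id": 102, "member_id": 1, "text": "Second speech"},
]


def denormalize(members, speeches):
    """Copy the member's details into every speech record, keyed by speech id."""
    return {
        s["id"]: {"text": s["text"], **members[s["member_id"]]}
        for s in speeches
    }


def rename_party(store, old, new):
    """An update must now touch every copy; returns how many records changed."""
    changed = 0
    for record in store.values():
        if record["party"] == old:
            record["party"] = new
            changed += 1
    return changed


store = denormalize(members, speeches)
# Each record is self-contained, so records can be read independently with no JOIN.
parties = [store[key]["party"] for key in store]
copies_updated = rename_party(store, "Blue", "Green")
```

One logical change (a party rename) becomes one write per copy, which is why such updates tend to be applied asynchronously.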
Name/Value Models

Just name/value pairs, e.g. memcachedb, Dynamo
Name/value pairs plus associated data, e.g. CouchDB, MongoDB – think document stores with metadata
Name/value pairs with nesting, e.g. Cassandra
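The three shapes above can be pictured as plain Python dictionaries. This is only an illustration of the data shapes, not of any product's API; every key, field name and value below is made up.

```python
# 1. Plain name/value pairs: the value is opaque to the store.
plain = {"session:42": b"\x01\x02\x03"}

# 2. Name/value plus associated data: the value is a document with its own fields.
document = {
    "doc:42": {"title": "Committee report", "tags": ["finance"], "pages": 12},
}

# 3. Name/value with nesting: values are themselves maps of named values.
nested = {
    "row:42": {
        "details": {"title": "Committee report", "pages": "12"},
        "history": {"created": "2010-11-17", "revised": "2010-11-19"},
    },
}


def lookup(store, *path):
    """Follow a sequence of names down through a (possibly nested) store."""
    value = store
    for name in path:
        value = value[name]
    return value
```

For example, `lookup(nested, "row:42", "history", "revised")` walks three levels of names, while `lookup(plain, "session:42")` needs only one.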
Cassandra

Distributed, fault-tolerant database, based on ideas in Dynamo (Amazon) & BigTable (Google)
Developed by Facebook, open-sourced in 2008
Now an Apache project
Key/value pairs, in column-oriented format
• Standard column: name, value, timestamp
• Super-column: name, map of columns, each with name, value, timestamp (think array of hashes)
• Grouped by column family, also either standard or super
• Column family contains ‘rows’, roughly like a DB table
• Column families then go in key-spaces
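The nesting described above (key-space, column family, row, super-column, column) can be modelled in memory as follows. This is a sketch of the data shape only, not Cassandra's client API; the `put` helper, the family names `Users` and `Addresses`, and all values are assumptions for illustration, as is the choice to let the newer timestamp win.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Column:
    name: str
    value: str
    timestamp: int


@dataclass
class SuperColumn:
    name: str
    columns: Dict[str, Column] = field(default_factory=dict)


def put(columns: Dict[str, Column], new: Column) -> None:
    """Store a column, keeping whichever version has the newer timestamp."""
    current = columns.get(new.name)
    if current is None or new.timestamp > current.timestamp:
        columns[new.name] = new


# key-space -> column family -> row key -> columns (or super-columns)
keyspace = {"Users": {}, "Addresses": {}}

# Standard column family: each row maps column names to columns.
row = keyspace["Users"].setdefault("user1", {})
put(row, Column("email", "old@example.org", timestamp=1))
put(row, Column("email", "new@example.org", timestamp=2))
put(row, Column("email", "stale@example.org", timestamp=1))  # older write is ignored

# Super column family: each row maps super-column names to super-columns.
home = SuperColumn("home")
put(home.columns, Column("city", "Bucharest", timestamp=1))
keyspace["Addresses"].setdefault("user1", {})[home.name] = home
```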
Cassandra - NoACID

Cassandra et al. (e.g. Voldemort at LinkedIn) trade consistency and atomicity for speed, distribution and availability
No single point of failure
“Eventually consistent” model
Tunable levels of consistency
Atomicity only guaranteed within a column family
Accessed using Thrift (also developed by Facebook)
Used by:
• Facebook
• Digg
• Twitter
• Reddit
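The “tunable levels of consistency” point above is usually explained with replica counts: with N replicas, a read that waits for R of them and a write that waits for W of them are guaranteed to share at least one replica only when R + W > N. The helper names below are hypothetical; this is arithmetic, not a client library call.

```python
def overlaps(n: int, r: int, w: int) -> bool:
    """True if every set of r read replicas must intersect every set of w written replicas."""
    return r + w > n


def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


N = 3
fast = overlaps(N, r=1, w=1)                     # fastest, but a read may miss the latest write
strong = overlaps(N, r=quorum(N), w=quorum(N))   # 2 + 2 > 3: reads and writes always overlap
write_all = overlaps(N, r=1, w=N)                # cheap reads, but writes need every replica up
```

Lower values of R and W favour speed and availability; higher values favour consistency, which is the trade described on this slide.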
NoSQL for Parliaments?

Much parliamentary material is naturally unstructured and suited to the name/value model (think XML)
Remember the old discussions about how to map such parliamentary material into relational databases?
Think of every MP’s contribution (speech) in chamber or committee as a key/value pair, i.e. a column
Think of every PQ & answer as a super-column of name/value pairs for question, answer, holding, supplementary, pursuant, referral …
Hansard becomes a super-column family!
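As a rough illustration of the idea above, a PQ and its related items can be held as one super-column inside a row for the sitting. Everything here is hypothetical: the row key (a sitting date), the reference `PQ-0001`, the placeholder texts and the `record_pq` helper are invented for the sketch and do not describe any real parliamentary system.

```python
import time


def column(value, timestamp=None):
    """A column value with its timestamp (the column name is the dict key)."""
    if timestamp is None:
        timestamp = int(time.time())
    return {"value": value, "timestamp": timestamp}


# Super-column family: row key -> super-column name -> column name -> column
hansard = {}


def record_pq(family, sitting, pq_ref, **parts):
    """Store the named parts of one PQ as columns of a super-column in the sitting's row."""
    row = family.setdefault(sitting, {})
    supercol = row.setdefault(pq_ref, {})
    for name, text in parts.items():
        supercol[name] = column(text)
    return supercol


record_pq(hansard, "2010-11-17", "PQ-0001",
          question="Placeholder question text",
          holding="Placeholder holding answer")
# The full answer arrives later and is simply added to the same super-column.
record_pq(hansard, "2010-11-17", "PQ-0001", answer="Placeholder answer text")

names = sorted(hansard["2010-11-17"]["PQ-0001"])
```

No schema change is needed when a new kind of item (say, a referral) turns up: it is just another named column in the same super-column.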
Map / Reduce

Column (or record) oriented design & de-normalized data power the
parallel “map reduce” model (think “sharding on speed”)
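A single-process sketch of the map/reduce pattern follows, to show why self-contained records suit it. It assumes nothing about any particular framework; the sample `records` and the function names `map_phase`, `shuffle` and `reduce_phase` are illustrative only.

```python
from collections import defaultdict
from functools import reduce

records = [
    {"speaker": "A", "text": "the budget and the bill"},
    {"speaker": "B", "text": "the bill"},
    {"speaker": "A", "text": "budget questions"},
]


def map_phase(record):
    """Emit (word, 1) pairs from one self-contained record."""
    return [(word, 1) for word in record["text"].split()]


def shuffle(pairs):
    """Group emitted values by key so each key can be reduced independently."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Combine all values for one key."""
    return reduce(lambda a, b: a + b, values)


# Each map call touches one record only, so the calls could run on different nodes.
mapped = [pair for record in records for pair in map_phase(record)]
counts = {key: reduce_phase(values) for key, values in shuffle(mapped).items()}
```

Because no map call needs another record (no JOIN), the records can be split across machines and only the small grouped results need to travel.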
Hadoop

Nothing to do with NoSQL
Hadoop is an infrastructure and now a family of tools for managing distributed systems and immense datasets
How immense? Hundreds of GB and a 10-node cluster is ‘entry-level’ in Hadoop terms
Developed by Yahoo for their cloud, now an Apache project
Supports Map/Reduce by pre-dividing & distributing data
“Moves computation to the data instead of data to the computation”
HDFS file system particularly interesting – distributed, resilient (far more advanced than DRBD or Gluster), but not real time (more eventually consistent…)
Hive data warehouse front end – has SQL-like queries
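“Moving computation to the data” can be pictured as a scheduler that runs each task on a node already holding a copy of the block it needs. The sketch below is a toy under stated assumptions, not Hadoop's actual scheduler: the block and node names, the `schedule` function and its least-loaded tie-break are all invented for illustration.

```python
def schedule(block_locations, node_load):
    """Assign each block's task to the least-loaded node that already stores the block."""
    assignments = {}
    for block, holders in sorted(block_locations.items()):
        # Only nodes holding a replica are candidates, so no block is copied over the network.
        node = min(holders, key=lambda n: (node_load[n], n))
        assignments[block] = node
        node_load[node] += 1
    return assignments


# Which nodes hold a replica of which block (hypothetical layout).
block_locations = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-a", "node-c"],
    "block-3": ["node-b", "node-c"],
}
load = {"node-a": 0, "node-b": 0, "node-c": 0}
plan = schedule(block_locations, load)
```

Every task in `plan` lands on a node that already stores its block, and replication gives the scheduler a choice of nodes when one is busy.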
Who uses Hadoop?

Twitter
AOL
IBM
Last.fm
LinkedIn
eBay
Yahoo
• 36,000 machines with > 100,000 cores running Hadoop
• Largest cluster is only 4000 nodes
Largest known cluster is Facebook!
• 2000 machines with 22,400 cores
• 21 petabytes in a single HDFS store
Hadoop for Parliaments?

Hadoop may seem overkill for parliaments now…
But when you start your legacy collection digitization and digital preservation projects, its model for managing large datasets which essentially do not change & don’t need real-time commit is a very good fit!
Other interesting Hadoop projects:
• ZooKeeper (distributed apps co-ordination)
• Hive (data warehouse infrastructure)
• Pig (high-level data flow language)
• Mahout (scalable machine learning library)
• Scribe (for aggregating streaming log data) [not strictly a Hadoop project, but can be integrated with it, using an interesting workaround for the non-real-time operation & the NameNode single point of failure]
Other things ‘out there’

Drizzle
• A database “optimized for Cloud infrastructure and Web applications”
• “Design for massive concurrency on modern multi-cpu architecture”
• But doesn’t actually explain how to use it for these…
• It’s SQL and ACID
• Mostly seems to be a reaction against what’s happening at MySQL…
• Has to be compiled from source – no distros available for it yet

CouchDB
• Distributed, fault-tolerant, schema-free document-oriented database
• RESTful JSON API (i.e. Web front end)
• Incremental replication with bi-directional conflict detection
• Written in Erlang (highly reliable language developed by Ericsson)
• Supports ‘map/reduce’-like querying and indexing
• Interesting model, different from most other offerings
• Also now an Apache project
• Still too immature for anything beyond experimentation
Also ‘out there’

Voldemort
• Another distributed key/value storage system
• Used at LinkedIn
• Doesn’t seem to have much future
• Cassandra is similar, better & more widely used

MonetDB
• “database system for high-performance applications in data mining, OLAP, GIS, XML Query, text and multimedia retrieval”
• SQL and XQuery front ends
• Also hard to see where it’s going…

MongoDB
• Tries to bridge the gap between RDBMS and map/reduce
• JSON document storage (like CouchDB)
• No JOINs, no multi-document transactions
• Supports atomic operations only on single documents
• Interesting, but may ‘fall between two stools’
Predictions

Hadoop and Cassandra are the ones to watch
There will likely be some sort of re-convergence between NoSQL and query languages of some kind – can’t do everything with map/reduce (esp. not ad hoc queries)
SQL may be destined to become like COBOL – still around and running things but not something to use for new projects
Distributed storage models (with or without map/reduce) have a good future
Datasets will only get bigger – compliance, audit, digital preservation, the shift to visuals, etc.
Information management models (“strategy”) and access speed will remain key problems
Questions
“What’s it all about?” 
http://ashardie.com