Cost-Effective Computing using Statistics

Who are we?
● SigmaStream
● Fully Apache-licensed, open-source domain-specific DSL for data analytics: AdTech, IoT
● Libraries for geo, time-series, online machine learning, and more
● Extendable framework on top of Spark & Spark Streaming jobs
● Compose custom pipelines using domain-specific packages
Case Study: Data Processing
Load customer data → Filter customers →
  • Count unique customers
  • Update most active customers
  • Check if customer already exists
  • Recommend news based on similar customers
→ Send out email
SigmaStream Script
val customers = LOAD "hdfs://HNode:9000/input/" separatedBy ","
  fields (customer_id:int, location:geo)
val filteredCustomers = FILTER customers by customer_id regex "a%"
val groupedCustomers = GROUP filteredCustomers by customer_id
val uniqueCustomers = GENERATE groupedCustomers AS Unique(groupedCustomers)
val store = uniqueCustomers STORE "hdfs://HNode:9000/output/" AS json
Previous World (1-2 Months)
Ask for machines → Machines allocated by IT → IT team takes 2 months to install software → Data processing pipeline
Previous World (4-5 Months)
Ask for machines → Machines allocated by IT → IT team takes 2 months to install software → Data processing pipeline → Capacity overshoot → back to asking for machines
Current World
Capacity overshoot → Ask for more machines, allocated on demand
Upside
• On-demand machine requests
• Scale up & down as required, based on the data
• Spot instances allow for more cost exploration
• Fault tolerance translates to $$ saved
• Developers have much broader control
Let's Optimize
Load data → Filter customers → Disk IO →
  • Count unique customers
  • Update most active customers
  • Check if customer already exists
  • Recommend news based on similar customers
→ Send out email
Disk IO
• Spark provides in-memory caching out of the box
• Highly beneficial for multiple iterations over the same data
• Extendable off-heap storage in Tachyon
• ~10x improvement over magnetic disk (see the sketch below)
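A minimal Spark sketch of the caching idea, reusing the input path and field layout from the DSL script above; the filter condition and field indices are illustrative assumptions, not the SigmaStream implementation.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("caching-example"))

    // Same input as the DSL script: CSV rows of (customer_id, location).
    val customers = sc.textFile("hdfs://HNode:9000/input/").map(_.split(","))

    // Cache the filtered rows: every downstream stage (count uniques,
    // membership check, recommendations) re-reads the same data, so only
    // the first action pays the disk-IO cost.
    val filtered = customers
      .filter(fields => fields(0).startsWith("a")) // mirrors the FILTER ... regex "a%" step
      .persist(StorageLevel.MEMORY_AND_DISK)

    val rows = filtered.count()                          // materializes the cache
    val uniques = filtered.map(_(0)).distinct().count()  // served from the cache

    println(s"$uniques unique customers across $rows rows")
    sc.stop()
  }
}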
What about the Rest?
Load data: O(N) → Filter customers → Disk IO →
  • Count unique customers: O(N log N)
  • Update most active customers: O(N log N)
  • Check if customer already exists: O(N log N)
  • Recommend news based on similar customers: O(N^2)
→ Send out email
Computational Complexity
• Complexity translates directly to time to process
• O(N): touching 1 TB of data in a distributed environment can cost ~$100
• O(N^2): translates to as much as ~$10,000
• Too much memory/disk required
• Too much time taken for real-life datasets
• Cost escalates beyond the value of the analytics
Let's Optimize
Load data: O(N) → Filter customers → In-memory cache →
  • Count unique customers: O(N log N)
  • Update most active customers: O(N log N)
  • Check if customer already exists: O(N log N)
  • Recommend news based on similar customers: O(N^2)
→ Send out email
Much Ado about Sets
Load data: O(N) → Filter customers → In-memory cache →
  • Set count (unique customers): O(N log N)
  • Rank on set cardinality (most active customers): O(N log N)
  • Set membership (customer already exists?): O(N log N)
  • Find similar sets (recommendations): O(N^2)
→ Send out email
Base Data Structure: Monoids
● A mathematical concept (a set & a function)
● Contains a "zero"
● Can be "added"
● Addition is associative
Base Data Structure: Monoids
● Simplest monoid: the integers under +
● Zero element: 0
● Associative: (a + b) + c = a + (b + c)
Not a Monoid
● The integers under /
● Candidate identity: 1
● Not associative: (a / b) / c != a / (b / c)
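A minimal sketch of these laws in Scala; SimpleMonoid and IntAddition are illustrative names for this example, not part of SigmaStream or Algebird.

// A monoid: a zero element plus an associative binary "plus".
trait SimpleMonoid[T] {
  def zero: T
  def plus(a: T, b: T): T
}

// Integer addition satisfies both laws.
object IntAddition extends SimpleMonoid[Int] {
  val zero = 0
  def plus(a: Int, b: Int): Int = a + b
}

object MonoidLaws extends App {
  val (a, b, c) = (2, 3, 4)
  // Associativity: (a + b) + c == a + (b + c)
  assert(IntAddition.plus(IntAddition.plus(a, b), c) ==
         IntAddition.plus(a, IntAddition.plus(b, c)))
  // Identity: 0 + a == a
  assert(IntAddition.plus(IntAddition.zero, a) == a)

  // Division breaks associativity: (8 / 4) / 2 = 1, but 8 / (4 / 2) = 4.
  assert((8 / 4) / 2 != 8 / (4 / 2))
  println("integer addition is a monoid; integer division is not")
}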
Bloom Filter
A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. It is well suited to cases where we need to quickly filter items by set membership.
Bloom Filter
Maybe yes, or surely no: false positives are possible, false negatives are not.
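A toy from-scratch sketch of that "maybe yes / surely no" behavior, for intuition only; the sizing is arbitrary and this is not the Algebird or production implementation.

import scala.util.hashing.MurmurHash3

class ToyBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new java.util.BitSet(numBits)

  // k hash functions, simulated with k different seeds.
  private def positions(item: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      (MurmurHash3.stringHash(item, seed) & Int.MaxValue) % numBits
    }

  def add(item: String): Unit = positions(item).foreach(i => bits.set(i))

  // false => definitely never added; true => probably added
  // (hash collisions can produce false positives, never false negatives).
  def mightContain(item: String): Boolean = positions(item).forall(i => bits.get(i))
}

object BloomDemo extends App {
  val bf = new ToyBloomFilter(numBits = 10000, numHashes = 4)
  bf.add("customer-42")
  println(bf.mightContain("customer-42")) // true  (maybe yes)
  println(bf.mightContain("customer-99")) // false (surely no), with high probability
}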
HyperLogLog
● HyperLogLog is an approximate technique for computing the number of distinct entries in a set (its cardinality) while using a small amount of memory. For instance, to achieve 99% accuracy, it needs only 16 KB.
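A toy sketch of the underlying idea: hash each item, route it to one of 2^p small registers, and keep only the longest run of leading zeros seen per register. Real implementations (e.g. Algebird's) add bias and range corrections that this illustration omits, and the register sizing here is illustrative.

import scala.util.hashing.MurmurHash3

class ToyHyperLogLog(p: Int = 12) {
  private val m = 1 << p                    // number of registers (4096 for p = 12)
  private val registers = new Array[Int](m) // a few bits each in a real HLL

  def add(item: String): Unit = {
    val h = MurmurHash3.stringHash(item)
    val bucket = h >>> (32 - p)                       // top p bits pick the register
    val rest = (h << p) | (1 << (p - 1))              // avoid an all-zero remainder
    val rank = Integer.numberOfLeadingZeros(rest) + 1 // position of the first 1-bit
    registers(bucket) = math.max(registers(bucket), rank)
  }

  // Raw harmonic-mean estimator (without the corrections a real HLL applies).
  def estimate: Double = {
    val alpha = 0.7213 / (1 + 1.079 / m)
    val z = registers.map(r => math.pow(2.0, -r)).sum
    alpha * m * m / z
  }
}

object HllDemo extends App {
  val hll = new ToyHyperLogLog()
  (1 to 100000).foreach(i => hll.add(s"customer-$i"))
  println(f"estimated uniques: ${hll.estimate}%.0f (true: 100000)")
}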
Count-Min Sketch
● The Count-Min sketch (or CM sketch) is a probabilistic, sub-linear-space streaming algorithm that can be used to summarize a data stream in many different ways.
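A toy sketch of the structure as it would be used for "most active customers": a small depth x width matrix of counters, incremented on every event and queried with a minimum across rows. The sizing is arbitrary for illustration, not tuned to an error bound, and this is not the Algebird implementation.

import scala.util.hashing.MurmurHash3

class ToyCountMinSketch(depth: Int = 5, width: Int = 2000) {
  private val table = Array.ofDim[Long](depth, width)

  private def col(item: String, row: Int): Int =
    (MurmurHash3.stringHash(item, row) & Int.MaxValue) % width

  def add(item: String, count: Long = 1L): Unit =
    (0 until depth).foreach { row => table(row)(col(item, row)) += count }

  // Never under-estimates the true count; may over-estimate due to collisions.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(col(item, row))).min
}

object CmsDemo extends App {
  val cms = new ToyCountMinSketch()
  (1 to 1000).foreach(_ => cms.add("customer-1"))
  cms.add("customer-2")
  println(cms.estimate("customer-1")) // ~1000
  println(cms.estimate("customer-2")) // ~1
}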
Solutions
Specific algorithms for specific use cases:
• Set membership: Bloom filter
• Set size (cardinality): HyperLogLog
• Set similarity: MinHash (sketch below)
• Item frequency (most active customers): Count-Min sketch
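MinHash is the one structure above without its own slide, so here is a toy sketch of it: each customer's interest set is reduced to a short signature (the minimum hash under k different seeds), and the fraction of matching signature slots estimates Jaccard similarity. The interest sets and signature length are made up for illustration.

import scala.util.hashing.MurmurHash3

object ToyMinHash {
  val numHashes = 64

  // Signature slot i = minimum hash of the set's elements under seed i.
  def signature(items: Set[String]): Array[Int] =
    Array.tabulate(numHashes) { seed =>
      items.map(item => MurmurHash3.stringHash(item, seed)).min
    }

  // Fraction of matching slots approximates Jaccard similarity.
  def similarity(a: Array[Int], b: Array[Int]): Double =
    a.zip(b).count { case (x, y) => x == y }.toDouble / numHashes
}

object MinHashDemo extends App {
  val alice = Set("sports", "tech", "travel", "finance")
  val bob   = Set("sports", "tech", "travel", "music")
  val sim = ToyMinHash.similarity(ToyMinHash.signature(alice), ToyMinHash.signature(bob))
  println(f"estimated Jaccard similarity: $sim%.2f (true: ${3.0 / 5}%.2f)")
}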
100x Faster & Bounded Error
Load data: O(N) → Filter customers → In-memory cache →
  • Count unique customers: HyperLogLog
  • Update most active customers: Count-Min sketch
  • Check if customer already exists: Bloom filter
  • Recommend news based on similar customers: MinHash
→ Send out email
Algorithm Implementations
Algebird by Twitter
● Embedded in Scalding
● Embedded in the SigmaStream DSL
● Examples in Spark (usage sketch below)
● https://github.com/twitter/algebird
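A hedged usage sketch of counting uniques with Algebird's HyperLogLog monoid: the calls follow the commonly documented API, but exact method names can vary between Algebird versions, and the customer ids are made up.

import com.twitter.algebird.HyperLogLogMonoid

object AlgebirdHllExample extends App {
  // 12 bits => 2^12 registers; more bits means lower error and more memory.
  val hllMonoid = new HyperLogLogMonoid(12)

  val customerIds = Seq("cust-1", "cust-2", "cust-1", "cust-3")
  val sketches = customerIds.map(id => hllMonoid.create(id.getBytes("UTF-8")))

  // Sketches form a monoid, so they can be merged per partition / per node
  // and combined at the end -- this is what makes them cheap on Spark.
  val merged = hllMonoid.sum(sketches)
  println(s"approximate uniques: ${merged.approximateSize.estimate}")
}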
Additional Features
● Many other algorithms
● Better data structures that capture both the result & the probability of error
● Switchable, memory-light hash implementations
● Fully composable functional API
● Configurable memory for lower error
Results
● 1 TB of data in Hadoop, spot instances in EMR: ~$1,000 USD per TB for calculating uniques
● 1 TB of data in Spark, spot instances in EMR: ~$200 USD per TB for calculating uniques using HyperLogLog with 0.2% error
What's Next for Sigmoid
SigmaStream
● Fully open-source domain-specific DSL for data analytics: AdTech, IoT
● Higher-level functions for domain-specific logic
● Uses linear algebra as base operators to build domain-specific functions
● Libraries for geo, time-series, online machine learning, and more
● Extendable framework on top of Spark & Spark Streaming jobs
CloudFlux
● Scalable, cloud-agnostic deployment framework based on Apache Mesos
● Resource- & latency-based cluster scaling
● Cost-based optimization for cluster deployment
● Cloud & cluster management united
Questions