NOT IN KANSAS ANY MORE

How we moved into “Big Data”
Dan Taylor - JDSU
www.arieso.com
Dan Taylor
An Engineering Manager, Software Developer, data enthusiast and advocate of all things Agile.
I’m currently lucky enough to manage the Data Science team in JDSU’s Location Intelligence business unit, where boredom is seldom an issue.
arieso.com
jdsu.com
Copyright © JDSU 2015
logicalgenetics.com
JDSU – Arieso Location Intelligence
Arieso’s solutions harness the power of customer-generated, geolocated intelligence to tackle some of the biggest challenges facing mobile operators today.
ariesoGEO locates and analyses data from billions of mobile connection events, giving operators a rich source of intelligence to help boost network performance and enrich user experience.
THE MOTHER OF INVENTION
Necessity… New Customer!
• Monitor LTE network for entire USA
• Geolocate every event within 2 minutes
• Process and store 34+ billion calls daily
• 24x7 Operation
• Build in the flexibility to grow with the network
Daily Data Volumes
100 TB data storage
Daily Data Size Comparison
• Facebook hits: 100 billion
• LTE Connections in new GEO: 34 billion
• Calls in GEO globally today: 12 billion
• Google searches: 4 billion
• Tweets: 500 million
• LinkedIn page views: 47 million
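As a rough sense of scale, the daily counts above can be turned into average per-second rates. The sketch below assumes events are spread evenly over 24 hours, which real traffic is not; the object and method names are illustrative only.

```scala
object DailyRates {
  val SecondsPerDay: Long = 24L * 60 * 60 // 86,400

  // Average events per second, assuming a perfectly even spread over the day.
  def perSecond(eventsPerDay: Long): Long = eventsPerDay / SecondsPerDay

  // Daily counts as quoted on the slide.
  val dailyCounts: Seq[(String, Long)] = Seq(
    "Facebook hits"                -> 100000000000L,
    "LTE Connections in new GEO"   -> 34000000000L,
    "Calls in GEO globally today"  -> 12000000000L,
    "Google searches"              -> 4000000000L,
    "Tweets"                       -> 500000000L,
    "LinkedIn page views"          -> 47000000L
  )

  def main(args: Array[String]): Unit =
    dailyCounts.foreach { case (name, count) =>
      println(f"$name%-30s ${perSecond(count)}%,d per second")
    }
}
```

Under that even-spread assumption, 34 billion LTE connections per day averages out at roughly 390,000 per second.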
A Traditional Geo System
[Diagram components: call trace data, loaders, Oracle DB server, app servers, analyses]
Lambda Architecture
• Streaming: “Real Time” Processing
• Batch: Historical Processing
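One common reading of the Lambda Architecture is that a query combines a view built by the batch layer with a view kept up to date by the streaming layer. A minimal sketch of that merge, assuming both views are simple per-cell event counts (the View type and the merged function are hypothetical, not part of ariesoGEO):

```scala
object LambdaSketch {
  // Hypothetical view: event counts keyed by cell name.
  type View = Map[String, Long]

  // Answers a query by adding historical (batch) counts to
  // recent (streaming) counts for every key seen in either view.
  def merged(batch: View, realtime: View): View =
    (batch.keySet ++ realtime.keySet).map { key =>
      key -> (batch.getOrElse(key, 0L) + realtime.getOrElse(key, 0L))
    }.toMap
}
```

The batch view can be rebuilt from scratch at any time, while the streaming view only needs to cover events that arrived since the last batch run.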
STREAMING
Streaming Technology Ecosystem
Tools – Storm
“…doing for realtime processing what Hadoop did for batch processing”
• Distributed computing platform
• Manages execution and message transport between processing blocks
• Free and open source
• Originally developed at BackType and open-sourced by Twitter to process high volumes of event data
• High performance; Fault tolerant
• Windows and Linux support
Tools – Storm
[Diagram: a Storm cluster, with Nimbus and a topology of two spouts and three bolts]
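To illustrate the spout and bolt roles listed above, here is a minimal sketch in plain Scala. It does not use the Storm API: the Spout and Bolt traits, the Event type and the two example bolts are all hypothetical stand-ins, and everything runs in one thread rather than across a cluster.

```scala
object TopologySketch {
  // A spout is a source of tuples.
  trait Spout[A] { def emit(): Iterator[A] }

  // A bolt consumes one tuple and emits zero or more tuples.
  trait Bolt[A, B] { def process(in: A): Seq[B] }

  // Hypothetical raw event: a cell name and an unparsed reading.
  final case class Event(cellName: String, rawValue: String)

  // Spout backed by an in-memory list (a real spout would read from a feed).
  final class ListSpout(events: Seq[Event]) extends Spout[Event] {
    def emit(): Iterator[Event] = events.iterator
  }

  // Parses the reading; events that do not parse are dropped.
  object ParseBolt extends Bolt[Event, (String, Int)] {
    def process(in: Event): Seq[(String, Int)] =
      scala.util.Try(in.rawValue.toInt).toOption.map(v => in.cellName -> v).toList
  }

  // Passes on only readings at or above a minimum value.
  final class ThresholdBolt(min: Int) extends Bolt[(String, Int), (String, Int)] {
    def process(in: (String, Int)): Seq[(String, Int)] =
      if (in._2 >= min) Seq(in) else Seq.empty
  }

  // Wires spout -> first bolt -> second bolt and collects the output.
  def run[A, B, C](spout: Spout[A], first: Bolt[A, B], second: Bolt[B, C]): List[C] =
    spout.emit().flatMap(a => first.process(a)).flatMap(b => second.process(b)).toList
}
```

In a real deployment the framework, not a run function, moves tuples between the processing blocks and restarts them on failure.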
Tools - Kafka
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log
• Distributed and fault tolerant messaging system
• Scalable and Durable
• Guaranteed delivery
• Built in Scala, Windows and Linux compatible
• Originally developed by LinkedIn
Tools - Kafka
[Diagram: three partitions (A - Carolinas, B - North Texas, C - Los Angeles), each a numbered sequence of messages, with writes arriving at the partitions]
• Each partition is an ordered, immutable sequence of messages
• All messages are persisted for a configurable time
Tools - Kafka
[Diagram: Partition A - Carolinas holding messages 1 to 10, with a consumer issuing Read(4), Read(5) and Read(6)]
• Consumers read at a specified offset
• It is up to the consumer to manage the offset – reads don’t have to be sequential
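The partition and offset behaviour described above can be sketched in plain Scala. This is not the Kafka client API: Partition and Consumer are hypothetical classes, the log lives in memory with no retention period, and offsets start at 0 here rather than at 1 as drawn on the slide.

```scala
object PartitionLogSketch {
  // One partition: an append-only, ordered sequence of messages.
  final class Partition[A] {
    private val log = scala.collection.mutable.ArrayBuffer.empty[A]

    // Appends a message and returns the offset it was written at.
    def append(message: A): Long = {
      log += message
      log.size - 1L
    }

    // Reads the message at an offset; reading never removes anything.
    def read(offset: Long): Option[A] =
      if (offset >= 0 && offset < log.size) Some(log(offset.toInt)) else None
  }

  // The consumer, not the partition, remembers how far it has read.
  final class Consumer[A](partition: Partition[A]) {
    private var offset: Long = 0L

    // Reads the next message, advancing the offset only on success.
    def poll(): Option[A] = {
      val message = partition.read(offset)
      if (message.isDefined) offset += 1
      message
    }

    // Jumps to any offset, forwards or backwards.
    def seek(newOffset: Long): Unit = offset = newOffset
  }
}
```

Because the offset belongs to the consumer, replaying old messages is just a matter of seeking backwards, and several consumers can read the same partition at different positions.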
Tools - Redis
• High Speed, In Memory Key-Value store
• Master/Slave Replication Support
• Primarily supports Linux; 64-bit Windows build maintained by Microsoft
Streaming Architecture - Have It Your Way
Application teams choose their favourite fillings for their own custom burger
Our platform provides the bun!
Streaming Architecture (Simplified… Lots)
[Diagram labels: Control Framework (Storm); Distributed Services (Redis); CTR; CTUM; Parser; Bridge; Quick Geolocator; Intensive Geolocator; Identity Matrix; Network Service; Data Loader; Streaming Location Feed; Geo (x28); “2 minute world”; “15 minute world”]
BATCH PROCESSING
Batch Processing Technology Ecosystem
HBase
“Use Apache HBase when you need random, realtime read/write access to your Big Data”
“Hosting of very large tables - billions of rows X millions of columns - atop clusters of commodity hardware”
HBase – Freeform App Schemas
[Diagram labels: Application Thing, Loading Thing, Data Store]
HBase – It’s not relational!
Relational model (many Cells to one RNC):
Cell: Name:string, PSC:int, Cpich:float, RNC_ID:int
RNC: Name:string, MCC:int, MNC:int
HBase – It’s not relational!
Cell: CellName:string, PSC:int, Cpich:float, RNCName:string, MCC:int, MNC:int
• Structure tables to suit the application, not the database
• Stop thinking in 3rd Normal Form
• Don’t worry about duplication or repetition
• Big wide tables are the way to do it
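The move from a normalised Cell/RNC pair to one wide Cell row can be sketched as a simple join done once, at load time. The case classes below mirror the field names on the slides; the id field on Rnc and the denormalise function are assumptions added for illustration, and nothing here touches the HBase API.

```scala
object WideRowSketch {
  // Relational shape: a Cell points at its RNC by id.
  final case class Rnc(id: Int, name: String, mcc: Int, mnc: Int)
  final case class Cell(name: String, psc: Int, cpich: Double, rncId: Int)

  // Wide shape: the RNC fields are repeated on every cell row.
  final case class WideCell(cellName: String, psc: Int, cpich: Double,
                            rncName: String, mcc: Int, mnc: Int)

  // Copies each cell's RNC fields onto the cell itself.
  // Cells whose RNC id is unknown are skipped.
  def denormalise(cells: Seq[Cell], rncs: Seq[Rnc]): Seq[WideCell] = {
    val rncById = rncs.map(r => r.id -> r).toMap
    for {
      cell <- cells
      rnc  <- rncById.get(cell.rncId).toList
    } yield WideCell(cell.name, cell.psc, cell.cpich, rnc.name, rnc.mcc, rnc.mnc)
  }
}
```

The RNC name and codes are now duplicated across every cell on that RNC, which is exactly the repetition the bullets above say not to worry about: reads need no join.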
HBase – Cell Versioning
[Diagram: row Cell_A of the Cell table, with columns CpichPower (30.0, 31.6), PSC (187, 166, 145) and Cheese (Stilton); versions are written at timestamps 01:00:00, 01:30:00, 02:00:00 and 03:00:00, and queries are made at 01:45:00, 02:34:00 and 03:15:00]
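The idea of a versioned cell queried at a point in time can be sketched in plain Scala. This is not the HBase client API: VersionedCell, put and getAsOf are hypothetical names, and the pairing of PSC values with write times below is an assumption for illustration, since the slide's exact layout is not recoverable.

```scala
object CellVersionSketch {
  import scala.collection.immutable.TreeMap

  // Seconds since midnight, to keep timestamps readable.
  def hms(h: Int, m: Int, s: Int): Int = h * 3600 + m * 60 + s

  // One cell of a table: every write is kept, keyed by its timestamp.
  final class VersionedCell[A] {
    private var versions = TreeMap.empty[Int, A]

    def put(timestamp: Int, value: A): Unit =
      versions = versions.updated(timestamp, value)

    // Latest value written at or before the query time, if any.
    def getAsOf(timestamp: Int): Option[A] =
      versions.takeWhile { case (written, _) => written <= timestamp }
        .lastOption
        .map { case (_, value) => value }
  }

  def main(args: Array[String]): Unit = {
    val psc = new VersionedCell[Int]
    psc.put(hms(1, 0, 0), 187) // assumed write times
    psc.put(hms(2, 0, 0), 166)
    psc.put(hms(3, 0, 0), 145)
    println(psc.getAsOf(hms(1, 45, 0)))
    println(psc.getAsOf(hms(2, 34, 0)))
    println(psc.getAsOf(hms(3, 15, 0)))
  }
}
```

With those assumed write times, each query sees whichever value was current at its own timestamp, so historical analyses can be rerun against the network configuration as it was at the time.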
Spark
“a fast and general engine for large-scale data processing”
• Distributed, general purpose data processing
• Scala or Java Development
• Execute analyses next to the data
THE END!
Spark Example
import org.apache.spark.{SparkConf, SparkContext}

// Trip, parseCsv() and getDistanceTo() are helpers that are not shown on the slide.

// Set up the Spark context
// (the connection and environment for the job)
val sc = new SparkContext(new SparkConf()
  .setAppName("TaxiFraudsters")
  .setMaster("local[*]"))

// Load the data into a friendly class
val tripRows = sc.textFile("D:/Data/NY Taxi/trip_data_*.csv").parseCsv()
val trips = tripRows.map(row => new Trip(
  row("medallion"),
  row("trip_distance").toDouble,
  (row("pickup_longitude").toDouble, row("pickup_latitude").toDouble),
  (row("dropoff_longitude").toDouble, row("dropoff_latitude").toDouble)))

// Data cleansing - lots of dodgy lats and longs in the files
val newYorkCity = (-73.979681, 40.7033127)
val cleanedTrips = trips
  .filter(trip => trip.pickupLocation.getDistanceTo(newYorkCity) < 100)
  .filter(trip => trip.dropoffLocation.getDistanceTo(newYorkCity) < 100)

// Find the difference between reported and straight-line distance,
// keeping trips where it exceeds 10% of the reported distance
val tripDistances = cleanedTrips
  .map(trip => (trip, trip.pointToPointDistance - trip.reportedDistance))
  .filter({case (trip, difference) => difference > trip.reportedDistance / 10.0})

// Total unaccounted-for distance per medallion
val fraudsters = tripDistances
  .map({case (trip, difference) => (trip.medallion, difference)})
  .reduceByKey(_ + _)

// Find the naughtiest drivers
val sorted = fraudsters
  .map({case (key, data) => (data, key)})
  .sortByKey(ascending = false)

// Print the result - nothing is executed until now
sorted.take(10).foreach(println)