NOT IN KANSAS ANY MORE
How we moved into “Big Data”
Dan Taylor - JDSU
www.arieso.com

Dan Taylor
An Engineering Manager, Software Developer, data enthusiast and advocate of all things Agile. I’m currently lucky enough to manage the Data Science team in JDSU’s Location Intelligence business unit, where boredom is seldom an issue.
arieso.com
jdsu.com
logicalgenetics.com
Copyright © JDSU 2015

JDSU – Arieso Location Intelligence
Arieso’s solutions harness the power of customer generated, geolocated intelligence to tackle some of the biggest challenges facing mobile operators today.
ariesoGEO locates and analyses data from billions of mobile connection events, giving operators a rich source of intelligence to help boost network performance and enrich user experience.

THE MOTHER OF INVENTION

Necessity…
New Customer!
• Monitor LTE network for entire USA
• Geolocate every event within 2 minutes
• Process and store 34+ billion calls daily
• 24x7 Operation
• Build in the flexibility to grow with the network

Daily Data Volumes
100TB data storage

Daily Data Size Comparison
Facebook hits: 100 billion
LTE Connections in new GEO: 34 billion
Calls in GEO globally today: 12 billion
Google searches: 4 billion
Tweets: 500 million
LinkedIn page views: 47 million

A Traditional Geo System
[Diagram labels: LOADERS, APP SERVERS, ORACLE DB SERVER, CALL TRACE DATA, ANALYSES]

Lambda Architecture
Streaming: “Real Time” Processing
Batch: Historical Processing

STREAMING

Streaming Technology Ecosystem

Tools – Storm
“…doing for realtime processing what Hadoop did for batch processing”
• Distributed computing platform
• Manages execution and message transport between processing blocks
• Free and open source
• Developed by Twitter to process high volumes of event data
• High performance; Fault tolerant
• Windows and Linux support

Tools – Storm
[Diagram: a Storm cluster, with labels Nimbus, Spout (x2) and Bolt (x3)]

Tools - Kafka
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log
• Distributed and fault tolerant messaging system
• Scalable and Durable
• Guaranteed delivery
• Built in Scala, Windows and Linux compatible
• Originally developed by LinkedIn

Tools - Kafka
[Diagram: Partition A - Carolinas, Partition B - North Texas and Partition C - Los Angeles, each shown as a numbered sequence of messages, with a “Writes” label]
• Each partition is an ordered, immutable sequence of messages
• All messages are persisted for a configurable time

Tools - Kafka
[Diagram: Partition A - Carolinas, messages 1 to 10, with consumers at Read(4), Read(5) and Read(6)]
• Consumers read at a specified offset
• It is up to the consumer to manage the offset – reads don’t have to be sequential

Tools - Redis
• High Speed, In Memory Key-Value store
• Master/Slave Replication Support
• Primarily supports Linux, 64-bit Windows build maintained by Microsoft

Streaming Architecture - Have It Your Way
Application teams choose their favourite fillings for their own custom burger
Our platform provides the bun!
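The two Kafka points above (ordered, immutable partitions, and consumers that manage their own offsets) can be sketched with a toy in-memory log. This is only an illustration under stated assumptions: it does not use the Kafka client API, the names PartitionedLog, append and read are hypothetical, and offsets here start at 0 whereas the slide numbers messages from 1.

```scala
import scala.collection.mutable

// Toy in-memory model of a partitioned log. NOT the Kafka client API:
// PartitionedLog, append and read are hypothetical names for illustration.
final class PartitionedLog(partitionNames: Seq[String]) {
  // Each partition is an append-only sequence; stored entries are never changed.
  private val partitions: Map[String, mutable.ArrayBuffer[String]] =
    partitionNames.map(name => name -> mutable.ArrayBuffer.empty[String]).toMap

  // Append a message to one partition and return the offset it was stored at.
  def append(partition: String, message: String): Long = {
    val log = partitions(partition)
    log += message
    log.size - 1L
  }

  // Read the message at an offset chosen by the consumer, if it exists.
  def read(partition: String, offset: Long): Option[String] = {
    val log = partitions(partition)
    if (offset >= 0 && offset < log.size) Some(log(offset.toInt)) else None
  }
}

object KafkaLogSketch {
  val log = new PartitionedLog(Seq("Carolinas", "North Texas", "Los Angeles"))
  for (i <- 1 to 10) log.append("Carolinas", s"event-$i")

  // The consumer, not the log, keeps track of its position.
  var offset = 4L
  val sequential: Seq[String] = for {
    _ <- 1 to 3
    message <- {
      val next = log.read("Carolinas", offset)
      offset += 1
      next
    }
  } yield message

  // Reads need not be sequential: jump back to replay the first message.
  val replayed: Option[String] = log.read("Carolinas", 0L)
}
```

Because the log never tracks consumers, two consumers can read the same partition at different positions, and a consumer can rewind simply by choosing an earlier offset.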
Streaming Architecture (Simplified… Lots)
[Diagram labels: Control Framework (Storm); Streaming Location Feed; Quick Geolocator; CTR; 2 minute world; Parser; Bridge; CTUM; Intensive Geolocator; Identity Matrix; Distributed Services (Redis); 15 minute world; Data Loader; Network Service; Geo (x28)]

BATCH PROCESSING

Batch Processing Technology Ecosystem

HBase
“Use Apache HBase when you need random, realtime read/write access to your Big Data”
“Hosting of very large tables - billions of rows X millions of columns - atop clusters of commodity hardware”

HBase – Freeform App Schemas
[Diagram labels: Application Thing, Loading Thing, Data Store]

HBase – It’s not relational!
[Diagram: relational schema. Cell (Name:string, PSC:int, Cpich:float, RNC_ID:int) has a many-to-one (M:1) relationship to RNC (Name:string, MCC:int, MNC:int)]

HBase – It’s not relational!
[Diagram: one wide table. Cell (CellName:string, PSC:int, Cpich:float, RNCName:string, MCC:int, MNC:int)]
• Structure tables to suit the application, not the database
• Stop thinking in 3rd Normal Form
• Don’t worry about duplication or repetition
• Big wide tables are the way to do it

HBase – Cell Versioning
[Diagram: row Cell_A with columns Cell, CpichPower, PSC and Cheese; versions written at 01:00:00, 01:30:00, 02:00:00 and 03:00:00 (values shown: 30.0, 187, 166, 31.6, 145, Stilton); queries at 01:45:00, 02:34:00 and 03:15:00]

Spark
“a fast and general engine for large-scale data processing”
• Distributed, general purpose data processing
• Scala or Java Development
• Execute analyses next to the data

THE END!
Spark Example

    import org.apache.spark.{SparkConf, SparkContext}

    // Note: Trip, parseCsv() and getDistanceTo() are helpers that are not shown on the slide

    // Set up the Spark context
    // (the connection and environment for the job)
    val sc = new SparkContext(new SparkConf()
      .setAppName("TaxiFraudsters")
      .setMaster("local[*]"))

    // Load the data into a friendly class
    val tripRows = sc.textFile("D:/Data/NY Taxi/trip_data_*.csv").parseCsv()

    val trips = tripRows.map(row => new Trip(
      row("medallion"),
      row("trip_distance").toDouble,
      (row("pickup_longitude").toDouble, row("pickup_latitude").toDouble),
      (row("dropoff_longitude").toDouble, row("dropoff_latitude").toDouble)))

    // Data cleansing - lots of dodgy lats and longs in the files
    val newYorkCity = (-73.979681, 40.7033127)
    val cleanedTrips = trips
      .filter(trip => trip.pickupLocation.getDistanceTo(newYorkCity) < 100)
      .filter(trip => trip.dropoffLocation.getDistanceTo(newYorkCity) < 100)

    // Find the difference between reported and straight-line distance,
    // keeping trips where it exceeds 10% of the reported distance
    val tripDistances = cleanedTrips
      .map(trip => (trip, trip.pointToPointDistance - trip.reportedDistance))
      .filter({case (trip, difference) => difference > trip.reportedDistance / 10.0})

    // Total unaccounted-for distance per medallion
    val fraudsters = tripDistances
      .map({case (trip, difference) => (trip.medallion, difference)})
      .reduceByKey(_ + _)

    // Find the naughtiest drivers
    val sorted = fraudsters
      .map({case (key, data) => (data, key)})
      .sortByKey(ascending = false)

    // Print the result - nothing is executed until now
    sorted.take(10).foreach(println)
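In the same spirit, the HBase cell versioning slide can be sketched as a toy model in which every write to a (row, column) pair is kept under its timestamp and a query sees the newest value written at or before the query time. This is a sketch under assumptions, not the HBase client API: VersionedTable, put and getAsOf are hypothetical names, and the pairing of values to timestamps is one reading of the slide's diagram.

```scala
import scala.collection.immutable.TreeMap

// Toy model of timestamped cell versions. NOT the HBase client API:
// VersionedTable, put and getAsOf are hypothetical names for illustration.
final case class VersionedTable(
    cells: Map[(String, String), TreeMap[Int, String]] = Map.empty) {

  // Store a new version of (row, column) at a timestamp in seconds since midnight.
  def put(row: String, column: String, timestamp: Int, value: String): VersionedTable = {
    val key = (row, column)
    val versions = cells.getOrElse(key, TreeMap.empty[Int, String])
    copy(cells = cells.updated(key, versions.updated(timestamp, value)))
  }

  // Return the newest version written at or before `asOf`, if any.
  def getAsOf(row: String, column: String, asOf: Int): Option[String] =
    cells.get((row, column)).flatMap { versions =>
      // TreeMap iterates in timestamp order, so the last match is the newest.
      versions.takeWhile { case (timestamp, _) => timestamp <= asOf }
        .lastOption
        .map { case (_, value) => value }
    }
}

object HBaseVersioningSketch {
  // Convert a wall-clock time to seconds since midnight.
  def at(hours: Int, minutes: Int): Int = hours * 3600 + minutes * 60

  // Assumed reading of the diagram: which value belongs to which timestamp is a guess.
  val table: VersionedTable = VersionedTable()
    .put("Cell_A", "CpichPower", at(1, 0), "30.0")
    .put("Cell_A", "PSC", at(1, 0), "187")
    .put("Cell_A", "PSC", at(1, 30), "166")
    .put("Cell_A", "CpichPower", at(2, 0), "31.6")
    .put("Cell_A", "PSC", at(2, 0), "145")
    .put("Cell_A", "Cheese", at(3, 0), "Stilton")

  val pscAt0145: Option[String] = table.getAsOf("Cell_A", "PSC", at(1, 45))
  val powerAt0234: Option[String] = table.getAsOf("Cell_A", "CpichPower", at(2, 34))
  val cheeseAt0234: Option[String] = table.getAsOf("Cell_A", "Cheese", at(2, 34))
  val cheeseAt0315: Option[String] = table.getAsOf("Cell_A", "Cheese", at(3, 15))
}
```

Note that the Cheese column only exists from 03:00:00 onwards, so earlier queries simply find nothing for it; no schema change is involved.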