Building a Real-Time Data Pipeline from Batch
Scott Brave, CTO ([email protected])

Data Team: Scott Brave, Brandon Vargo, Michael Rose

Turn a partial contact into a full contact

Bart Lorang, CEO / Co-Founder, FullContact, Inc.
• Gender: Male, Age: 32, Location: Boulder, Colorado
• [email protected] (WEB)
• [email protected] (EMAIL, work)
• @lorangb (TWITTER, personal)
• +1 303-736-9406 (PHONE, work)

Single Version of Truth
Complete, Accurate Contact Info. Synced Everywhere.
• for Professionals
• for Business
• as a Platform

FullContact Cloud Address Book
Sync, Security, Tagging, Transcription, Sharing, Cleansing & Correction, Enrichment, Search, Backups, Storage, Versioning, De-duplication & Merge, Validation & Verification

Identibase: identity resolution

[Diagram: identifiers such as [email protected], [email protected], scott.brave, [email protected], and @sbrave are linked edge by edge and merged into identity clusters, across ~4B contact versions.]

Identibase Data Flow - Batch (once a month)

[Diagram: FullContact apps (iPhone app, webapp, Gmail ext., Cardreader, sig. extraction, admin, API) and Sherlock Search feed contacts & feedback into a monthly Hadoop/Identibase batch run, which also draws on the public web, social network APIs, and 3rd-party DBs/APIs; HBase serves the enriched, deduplicated contacts to the Person API.]

Challenges
1. Coupled systems (w/ production impact)
2. Slow to incorporate new data
3. Difficult to test/experiment
4. Hard-to-follow code

Goals: 7 Key Data Initiatives
• Capability - What types of questions can we answer?
• Quality - How good are the answers?
• Performance - How fast can we answer questions?
• Latency - How long does it take for new data/feedback to be incorporated?
• Pluggability - How easy is it to add new data sources?
• Agility - How quickly can we iterate on new algorithms/functionality?
• Transparency - How easily can we explore & debug the underlying data/algorithms?

The Plan

[Diagram: the same architecture with a Data Bus inserted between the FullContact apps / Sherlock Search and the Hadoop/Identibase pipeline; HBase still serves enriched, deduplicated contacts to the Person API.]

Step #1: Data Bus

Apache Kafka "…is publish-subscribe messaging rethought as a distributed commit log"

Advantages (source: LinkedIn)
• "Reliable, resilient, high-throughput"
• Efficient - uses the OS well
• Trivial horizontal scalability
• It's a log, not a queue: rewind, replay, multiple consumers
• Systems don't have to know about each other
• Backbone for real-time processing

• All contact versions go on the Data Bus
• All feedback data goes on the Data Bus
• Frizzle for cleaner/domain-specific interfaces

[Diagram: FullContact apps and Sherlock Search publish contacts & feedback onto the Kafka Data Bus; Hadoop/Identibase and Secor (archiving to S3) consume from it; HBase serves enriched, deduplicated contacts to the Person API.]

FullContact on Kafka
• Frizzle Stream for message batching
• Secor (from Pinterest) for archival to S3
• Decoupling continues to pay dividends!

Kafka Topics at FullContact
Once available, data quickly gets used everywhere.

Step #2: Real-time (source: MapR)

Apache Crunch
• Abstraction layer on top of MapReduce
• More developer-friendly than Pig/Hive
• Modeled after Google's FlumeJava
• Flexible data types: PCollection<T>, PTable<K,V>
• Simple but powerful operations: parallelDo(), groupByKey(), combineValues(), flatten(), count(), join(), sort(), top()
• Implementation agnostic - runs on Hadoop, Spark, or in-memory

BEFORE: [Diagram: the batch data flow, where contacts & feedback reach Hadoop/Identibase only through the monthly batch run.]

AFTER: [Diagram: Crunch pipelines now run in three places - batch on Hadoop, real-time in-memory off the Kafka Data Bus, and at query time behind the Person API - all sharing HBase and S3.]

Crunch Snippet

Why Crunch?
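The code from the "Crunch Snippet" slide is not preserved in this transcript. As a self-contained stand-in (plain Java streams, so no Hadoop classpath is needed), the sketch below mirrors the shape of a typical Crunch pipeline: a parallelDo()-style per-record transform followed by a groupByKey()/count()-style aggregation. The ContactVersion record and field names are hypothetical, not FullContact's actual schema.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CrunchStyleSketch {
    // Hypothetical contact version: (contactId, email) pair.
    // In Crunch this would be a PCollection<ContactVersion>.
    record ContactVersion(String contactId, String email) {}

    // parallelDo()-style transform (extract the email domain from each version)
    // followed by a groupByKey()/count()-style aggregation (versions per domain),
    // which in Crunch would yield a PTable<String, Long>.
    static Map<String, Long> versionsPerDomain(List<ContactVersion> versions) {
        return versions.stream()
                .map(v -> v.email().substring(v.email().indexOf('@') + 1)) // ~ parallelDo()
                .collect(Collectors.groupingBy(d -> d, Collectors.counting())); // ~ count()
    }

    public static void main(String[] args) {
        List<ContactVersion> versions = List.of(
                new ContactVersion("1", "[email protected]"),
                new ContactVersion("2", "[email protected]"),
                new ContactVersion("3", "[email protected]"));
        System.out.println(versionsPerDomain(versions));
    }
}
```

Because the logic lives in one small function with plain collection types, the same code path can back batch, real-time, query, and test execution, which is the property the deck credits to Crunch's pluggable pipeline implementations.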
Batch: 4B contact versions
Real-time: 5K (300K) edges/sec
Query: 1000/sec, ~20ms @ 95th percentile

• Existing code: already had MapReduce jobs written in Java
• Existing experience with operating Hadoop clusters
• Cleaner code
• Consolidated pipeline definition
• Same code for batch, real-time, query, and testing!
• Interested in exploring Spark

Crunch on Spark
• For Batch
• For Real-time
• For Query
• For Testing

Lessons Learned
• "It just works"
• Data Bus - the gift that keeps on giving
• Batching - sometimes coordination is needed
• Crunch's in-memory mode is not designed out of the box for production workloads:
  - Recreating the Hadoop Configuration on each pipeline invocation is expensive!
  - Excessive locking around counters
  - Serialization verification

Thank You!
Scott Brave ([email protected])
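As a footnote to the lessons above: the deck calls out recreating the Hadoop Configuration on every pipeline invocation as expensive, since constructing one re-parses its XML resource files. A common fix is to build the expensive object once and reuse it across invocations. The sketch below shows that pattern with an initialization-on-demand holder; ExpensiveConfig is a hypothetical stand-in for Hadoop's Configuration so the snippet stays self-contained.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ConfigCache {
    // Stand-in for an expensive-to-build object such as Hadoop's Configuration,
    // which re-scans its XML resource files each time it is constructed.
    static class ExpensiveConfig {
        static final AtomicInteger BUILDS = new AtomicInteger();
        ExpensiveConfig() { BUILDS.incrementAndGet(); /* imagine XML parsing here */ }
    }

    // Initialization-on-demand holder: the JVM guarantees INSTANCE is built
    // exactly once, on first use, with no explicit locking.
    private static class Holder {
        static final ExpensiveConfig INSTANCE = new ExpensiveConfig();
    }

    static ExpensiveConfig config() { return Holder.INSTANCE; }

    public static void main(String[] args) {
        // Many "pipeline invocations" now share one config instead of rebuilding it.
        for (int i = 0; i < 1000; i++) {
            config();
        }
        System.out.println("builds = " + ExpensiveConfig.BUILDS.get()); // prints builds = 1
    }
}
```

The same caching idea applies to the other in-memory-mode costs listed above: pay the setup price once, then keep per-invocation work on the hot path minimal.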