Big Data Everywhere

Building a Real-Time Data Pipeline from Batch
Scott Brave, CTO
[email protected]
Data Team: Scott Brave, Brandon Vargo, Michael Rose
Turn a partial contact into a full contact

Partial: [email protected]

Full:
Bart Lorang, CEO / Co-Founder, FullContact, Inc.
Gender: Male · Age: 32 · Location: Boulder, Colorado
Email (work): [email protected]
Twitter (personal): @lorangb
Phone (work): +1 303-736-9406
Single Version of Truth
Complete, Accurate Contact Info. Synced Everywhere.
for Professionals · for Business · as a Platform

FullContact Cloud Address Book:
Sync · Security · Tagging · Transcription · Sharing · Cleansing & Correction · Enrichment · Search · Backups · Storage · Versioning · De-duplication & Merge · Validation & Verification
Identibase: identity resolution

[Diagram: contact versions from multiple sources (ABC, DEF), keyed by identifiers such as [email protected], [email protected], scott.brave, [email protected], and @sbrave, clustered into resolved identities. ~4B contact versions.]
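The clustering sketched in the Identibase diagram can be illustrated with a union-find structure: identifiers observed together on the same contact version are merged into one identity. A minimal sketch in plain Java; the class and method names are hypothetical and this is not Identibase's actual algorithm.

```java
import java.util.HashMap;
import java.util.Map;

// Union-find over contact identifiers: linking two identifiers seen on
// the same contact version merges their identity clusters.
public class IdentityResolver {
    private final Map<String, String> parent = new HashMap<>();

    // Find the root identifier of id's cluster, with path compression.
    private String find(String id) {
        parent.putIfAbsent(id, id);
        String root = parent.get(id);
        if (!root.equals(id)) {
            root = find(root);
            parent.put(id, root);
        }
        return root;
    }

    // Record that two identifiers were observed on the same contact version.
    public void link(String a, String b) {
        parent.put(find(a), find(b));
    }

    // True if both identifiers resolve to the same identity.
    public boolean sameIdentity(String a, String b) {
        return find(a).equals(find(b));
    }

    public static void main(String[] args) {
        IdentityResolver r = new IdentityResolver();
        r.link("[email protected]", "scott.brave");
        r.link("scott.brave", "@sbrave");
        System.out.println(r.sameIdentity("[email protected]", "@sbrave")); // true
        System.out.println(r.sameIdentity("[email protected]", "@lorangb")); // false
    }
}
```

At ~4B contact versions the real system needs this to run incrementally and distributed; the transitive-merge idea is the same.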
Data Flow - Batch (once-a-month)

[Diagram: FullContact apps (iPhone app, webapp, Gmail ext., cardreader, sig. extraction, admin, API) produce contacts into HBase; Hadoop batch jobs exchange contacts & feedback with Identibase and write back enriched contacts & deduplication; Sherlock Search serves queries; the Person API draws on the public web, social network APIs, and 3rd-party DBs/APIs.]
Challenges

1. Coupled systems (w/ production impact)
2. Slow to incorporate new data
3. Difficult to test/experiment
4. Hard-to-follow code
Goals

7 Key Data Initiatives

Capability: What types of questions can we answer?
Quality: How good are the answers?
Performance: How fast can we answer questions?
Latency: How long does it take for new data/feedback to be incorporated?
Pluggability: How easy is it to add new data sources?
Agility: How quickly can we iterate on new algorithms/functionality?
Transparency: How easily can we explore & debug the underlying data/algorithms?
The Plan

[Diagram: insert a Data Bus between the FullContact apps and the backend; contacts and contacts & feedback flow over the bus to Hadoop, Identibase, and HBase, with enriched contacts & deduplication serving the Person API and Sherlock Search.]
Step #1: Data Bus
Apache Kafka
…is publish-subscribe messaging rethought as a distributed commit log

Advantages (source: LinkedIn):
• "Reliable, resilient, high-throughput"
• Efficient: uses the OS well
• Trivial horizontal scalability
• It's a log, not a queue: rewind, replay, multiple consumers
• Systems don't have to know about each other
• Backbone for real-time processing
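The "log, not a queue" point is the heart of the model: because the log is append-only and every consumer keeps its own offset, rewind, replay, and multiple independent consumers all fall out naturally. A toy in-memory sketch of that idea (names hypothetical; real Kafka adds partitions, retention, and replication):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory sketch of a commit log: messages are appended and never
// removed; each consumer tracks its own read offset, so consumers are
// independent and any of them can rewind and replay.
public class CommitLog {
    private final List<String> messages = new ArrayList<>();
    private final Map<String, Integer> offsets = new HashMap<>();

    public void publish(String message) {
        messages.add(message);
    }

    // Return everything this consumer hasn't seen yet and advance its offset.
    public List<String> poll(String consumer) {
        int offset = offsets.getOrDefault(consumer, 0);
        List<String> batch = new ArrayList<>(messages.subList(offset, messages.size()));
        offsets.put(consumer, messages.size());
        return batch;
    }

    // Rewinding is just resetting an offset; the log keeps everything.
    public void rewind(String consumer, int offset) {
        offsets.put(consumer, offset);
    }

    public static void main(String[] args) {
        CommitLog log = new CommitLog();
        log.publish("contact-v1");
        log.publish("contact-v2");
        System.out.println(log.poll("search"));   // both messages
        System.out.println(log.poll("archiver")); // same two messages, independently
        log.rewind("search", 0);
        System.out.println(log.poll("search"));   // replayed from the start
    }
}
```

A queue would have handed each message to exactly one reader and then dropped it; the log's shared history is what lets new systems attach later and replay from the beginning.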
FullContact on Kafka

• All contact versions go on the Data Bus
• All feedback data goes on the Data Bus
• Frizzle for cleaner/domain-specific interfaces
• Frizzle Stream for message batching
• Secor (from Pinterest) for archival to S3
• Decoupling continues to pay dividends!

[Diagram: FullContact apps, HBase, the Person API, Identibase, Hadoop, and Sherlock Search all connected through the Kafka data bus, with Secor archiving to S3; contacts, contacts & feedback, and enriched contacts & deduplication flow over the bus.]
Kafka Topics at FullContact
Once available, data quickly gets used everywhere
Step #2: Real-time
Apache Crunch (source: MapR)
• Abstraction layer on top of MapReduce
• More developer-friendly than Pig/Hive
• Modeled after Google's FlumeJava
• Flexible data types: PCollection<T>, PTable<K,V>
• Simple but powerful operations: parallelDo(), groupByKey(), combineValues(), flatten(), count(), join(), sort(), top()
• Implementation-agnostic; runs on Hadoop, Spark, or in-memory
BEFORE

[Diagram: the batch pipeline as in "The Plan": FullContact apps publish contacts to the Kafka data bus; Secor archives to S3; Hadoop batch jobs consume contacts & feedback, exchange with Identibase, and write enriched contacts & deduplication to HBase for the Person API and Sherlock Search.]
AFTER

[Diagram: the same pipeline with Crunch in three modes: Crunch on Hadoop for batch, Crunch in-memory consuming the Kafka data bus in real time, and Crunch query against HBase; enriched contacts & deduplication continue to serve the Person API and Sherlock Search.]
Crunch Snippet
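As a stand-in for the slide's snippet, here is the classic Crunch-style word count expressed with plain Java streams so it runs without a cluster: flatMap plays the role of parallelDo() and groupingBy/counting the role of count(). Names are illustrative, and this is an analogy to the Crunch API, not the Crunch API itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Word count with the same shape as a Crunch pipeline:
// lines -> parallelDo(line -> words) -> count()
public class WordCountSketch {
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+"))) // parallelDo: line -> words
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting())); // count(): word -> total
    }

    public static void main(String[] args) {
        List<String> lines = List.of("enrich dedupe enrich", "dedupe enrich");
        System.out.println(wordCount(lines)); // enrich=3, dedupe=2 (map order unspecified)
    }
}
```

In real Crunch the same pipeline definition would compile to MapReduce jobs on Hadoop, Spark stages, or an in-memory run, which is exactly the portability the previous slide lists.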
Why Crunch?

• Existing code: already had MapReduce jobs written in Java
• Existing experience operating Hadoop clusters
• Cleaner code
• Consolidated pipeline definition
• Same code for batch, real-time, query, and testing!
• Interested in exploring Spark

Batch: 4B contact versions
Real-time: 5K (300K) edges/sec
Query: 1000/sec, ~20ms @ 95th percentile
Crunch on Spark
• For Batch
• For Real-time
• For Query
• For Testing
Lessons Learned

• "It just works"
• Data Bus: the gift that keeps on giving
• Batching: sometimes coordination is needed
• Crunch in-memory mode not designed out of the box for production workloads:
  - Recreating the Hadoop Configuration on each pipeline invocation is expensive!
  - Excessive locking around counters
  - Serialization verification
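The Configuration lesson above has a natural fix: build the expensive configuration object once and hand the same instance to every pipeline invocation. A minimal sketch with hypothetical names; java.util.Properties stands in here for Hadoop's Configuration, which parses its XML resources on construction.

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicInteger;

// Lazily build an expensive configuration once and reuse it, instead of
// reconstructing it on every pipeline invocation.
public class ConfigCache {
    static final AtomicInteger builds = new AtomicInteger();
    private static volatile Properties cached;

    // Simulates an expensive construction step.
    private static Properties buildExpensiveConfig() {
        builds.incrementAndGet();
        Properties p = new Properties();
        p.setProperty("pipeline.name", "enrichment");
        return p;
    }

    // Double-checked locking: safe because `cached` is volatile.
    public static Properties get() {
        Properties local = cached;
        if (local == null) {
            synchronized (ConfigCache.class) {
                if (cached == null) {
                    cached = buildExpensiveConfig();
                }
                local = cached;
            }
        }
        return local;
    }

    public static void main(String[] args) {
        Properties a = ConfigCache.get();
        Properties b = ConfigCache.get(); // reused, not rebuilt
        System.out.println(a == b);       // true
        System.out.println(builds.get()); // 1
    }
}
```

Sharing one instance is only safe if invocations treat it as read-only; per-invocation overrides should go in a copy, not the cached original.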
Thank You!
Scott Brave
[email protected]