Delivering Intelligence to Publishers Through Big Data
2015-05-21
Jonathan Sharley, Team Lead, Data Operations
www.sovrn.com
© 2015 Sovrn Holdings Inc.

Who is Sovrn?
Ø An advertising network with direct relationships with 20,000+ publisher websites
Ø An advocate for medium to long tail publishers
Ø A real-time bidding (RTB) exchange for programmatic ad buying

Case Study: BI for Publishers
"Sovrn works with online publishers of all shapes and sizes to help them better understand and engage their audiences and grow their businesses with effective site monetization."

Case Study: BI for Publishers
1. Basic quantitative metrics: requests, impressions, revenue, and unique counts, sliced by time, zone, site, category, and geo
2. Ratios: CPM, fill rate
3. Comparative metrics:
• Category averages
• Viewer segmentation
• Trending up or down?

The Tech
1. Storm: near real-time aggregation
2. Hadoop: raw data, source of record
3. Cassandra: serving layer

Storm

Storm: Implementation
• Data is continuously processed in micro-batches; updates are submitted to Cassandra.
• Measures are summed along various dimensions.
• Column families are organized as materialized views to minimize computation on retrieval.

Storm: Implementation
• Data is ingested using Storm's Trident abstraction, which supports exactly-once delivery.
• Batch size is dynamic, depending on volume and the backlog of streamed data.
• We watch spout consumer offsets in ZooKeeper to know how far behind we are.

Storm: Implementation
Throttling: logs are collected for 10 seconds, up to a maximum data size per topic partition, to force a certain amount of reduction. Otherwise Storm would make many smaller updates and saturate Cassandra. A reduce over 100,000 records produces only about 20% more keys in the Cassandra update than a reduce over 10,000.

Storm: Implementation
Problem: a Storm batch may require more than one batch update to Cassandra. Trident will rerun failed batches, but how do we ensure exactly-once semantics across multiple writes to Cassandra?

Storm: Implementation
• Batches are sorted and split into 2,000-key sub-batches for optimized writes to Cassandra.
• Each sub-batch = one Cassandra update.
• Batch processing logic is deterministic: reprocessing a batch produces exactly the same sub-batches (see the sub-batching sketch at the end of this section).

Storm: Implementation
• We use the transactional spout for Trident.
• In the case of a failure, the replayed batch contains exactly the same messages.
• Storm processing would have gaps if the exact same messages weren't available.
• This method relies on Kafka's ability to reliably replay messages.

Storm: Implementation
• Trident transactions are aided by a key store (see the state sketch at the end of this section):
> Before each sub-batch is written, we check for a key of <Trident txid> + <sub-batch id>. If we find the key, we skip that Cassandra sub-batch update.
> After the Cassandra update, we mark the sub-batch done.
• These steps are accomplished through the beginCommit() and commit() methods of the State interface.

Storm: BI Outputs
• Basic measures such as requests, impressions, and revenue are summed along dimensions of time, site, zone, and domestic/international.
• A publisher's ad traffic is counted and available in their dashboard within about 15 seconds.

[Screenshot: publisher dashboard for mydomain.com]
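Aside: the deterministic sub-batching above is what makes replays safe. Below is a minimal, illustrative sketch (not Sovrn's actual code) of the idea: reduce a micro-batch to one delta per key, sort the keys, and chunk them into 2,000-key sub-batches, so that a replayed Trident batch yields identical sub-batches with identical sub-batch ids. Representing the aggregated batch as a simple key-to-delta map is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch: deterministic sub-batching of aggregated counter deltas. */
public class SubBatcher {
    static final int SUB_BATCH_SIZE = 2000; // keys per Cassandra update, per the slides

    /**
     * Because the input batch is identical on replay (transactional spout) and
     * TreeMap iteration order is fixed by the keys, a replayed batch produces
     * the same sub-batches in the same order, so sub-batch id = list index.
     */
    static List<Map<String, Long>> toSubBatches(Map<String, Long> aggregatedDeltas) {
        // Sort keys so chunk boundaries are stable across replays.
        TreeMap<String, Long> sorted = new TreeMap<>(aggregatedDeltas);
        List<Map<String, Long>> subBatches = new ArrayList<>();
        Map<String, Long> current = new TreeMap<>();
        for (Map.Entry<String, Long> e : sorted.entrySet()) {
            current.put(e.getKey(), e.getValue());
            if (current.size() == SUB_BATCH_SIZE) {
                subBatches.add(current);
                current = new TreeMap<>();
            }
        }
        if (!current.isEmpty()) subBatches.add(current);
        return subBatches;
    }
}
```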
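And a sketch of the marker-key check itself. The slides name only the beginCommit()/commit() methods of Trident's State interface; the table and column names, the DataStax 2.x driver calls, and wrapping the check around each sub-batch write are an illustrative reconstruction, not Sovrn's implementation. Cassandra counter increments are not idempotent on their own, which is exactly why the marker is needed.

```java
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
// Package is org.apache.storm.* in Storm 1.x; it was storm.trident.* in the 0.9.x era.
import org.apache.storm.trident.state.State;

/**
 * Illustrative sketch of the marker-key pattern. Assumed schema:
 *   CREATE TABLE batch_markers (txid bigint, sub_batch int, PRIMARY KEY (txid, sub_batch));
 *   CREATE TABLE metrics (metric_key text PRIMARY KEY, value counter);
 */
public class IdempotentCounterState implements State {
    private final Session session;
    private Long currentTxId;

    public IdempotentCounterState(Session session) { this.session = session; }

    @Override public void beginCommit(Long txid) { this.currentTxId = txid; }
    @Override public void commit(Long txid) { this.currentTxId = null; }

    /** Apply one sub-batch of increments at most once per txid. */
    public void applySubBatch(int subBatchId, java.util.Map<String, Long> deltas) {
        Row marker = session.execute(
            "SELECT sub_batch FROM batch_markers WHERE txid = ? AND sub_batch = ?",
            currentTxId, subBatchId).one();
        if (marker != null) return; // already applied by an earlier attempt; skip

        for (java.util.Map.Entry<String, Long> e : deltas.entrySet()) {
            session.execute(
                "UPDATE metrics SET value = value + ? WHERE metric_key = ?",
                e.getValue(), e.getKey());
        }
        // Mark the sub-batch done only after all increments succeed, so a
        // Trident replay of this txid skips the whole sub-batch.
        session.execute(
            "INSERT INTO batch_markers (txid, sub_batch) VALUES (?, ?)",
            currentTxId, subBatchId);
    }
}
```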
Hadoop

Hadoop: Implementation
• Raw log data from Kafka is ingested through Camus into an Avro, schema-driven format.
• Camus runs as a map/reduce job in which the mappers provide ingestion concurrency: one map task per Kafka topic partition.
• We modified Camus to create metadata for Hive, available via HCatalog for Pig, etc.

[Diagram: Kafka topic partitions (P1–P4) feed Camus Hadoop mappers (M1–M4), which write Avro files to HDFS and create hourly partitions (e.g., dt=2015052110, dt=2015052111) in the Hive Metastore, making the data accessible in Hive and HCatalog.]

Hadoop: Implementation
• Avro-formatted data is normalized and written into ORC format.
• ORC benefits:
> Columnar storage with compression => less I/O pressure, longer retention
> Projection => fetch just the column(s) you need (we have about 100 columns of data on our ad activity)
> Predicate pushdown => indexed row groups allow skipping of row sets that don't meet select criteria

Hadoop: Implementation
• Ad activity is combined with user-supplied and third-party data points to segment publisher traffic sources.
• Metric data is pushed to Cassandra via Pig jobs, leveraging the DataStax client driver. Parallelism is controlled as a crude method of rate limiting.

Hadoop: BI Outputs
• Segmented site audience info:
> Who visits, and how often?
> What other sites do they browse?
> How much is each audience worth to advertisers?
• Produce counts and averages by site category
• Identify unique site viewers and assess their activity among sites in our network

Hadoop: BI Outputs
• Which brands are buying site inventory?
• Which ad tags are serving?
• Has any ad traffic been denied?

Hadoop: Other Benefits
• Source of record: the raw data supports analysts and drives algorithms that help better monetize unique viewers.
• Validate counts computed through Storm.
• Backfill new metrics for publishers, or repair BI metrics computed through Storm.

Cassandra

Cassandra: Implementation
• Measures are Cassandra counters: requests, impressions, revenue, etc.
• Counters are integer-only, so we shift the decimal point to handle decimal values such as revenue (see the sketch at the end of this section).
• Each metric written by Storm is stored at the minute, hour, day, and month levels.

Cassandra: Implementation
• Counters have no TTL, so we must explicitly delete counters for retention purposes.
• We remove some minute-level metrics in detailed views after a week.

Cassandra: Implementation
On query:
• Time ranges are summed across multiple keys via the Sovrn API. We limit the number of key fetches by using the largest possible time blocks within the date range (see the time-block sketch at the end of this section).
• Fill rate and CPM are calculated on the fly.
• The revenue counter's decimal point is shifted back to produce a decimal value.

[Screenshot: publisher dashboard for mydomain.com]
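Two details from the Cassandra slides are worth making concrete. First, the decimal shift: counters are 64-bit integers, so a fractional value like revenue must be scaled to an integer on write and scaled back on read. A minimal sketch follows; the six-decimal-place scale (micro-dollars) and the row-key layout for the minute/hour/day/month rollups are assumptions, since the slides don't give them.

```java
import java.math.BigDecimal;
import java.time.LocalDateTime;
import java.time.YearMonth;
import java.time.temporal.ChronoUnit;

/** Illustrative sketch: decimal revenue in integer-only counters, plus rollup keys. */
public class CounterMath {
    // Assumed scale: the slides say only "we shift the decimal point".
    static final int SCALE = 6; // micro-dollars

    /** Scale revenue to an integer before incrementing the counter. */
    static long toCounter(BigDecimal revenue) {
        return revenue.movePointRight(SCALE).longValueExact();
    }

    /** Shift the decimal point back when reading the counter. */
    static BigDecimal fromCounter(long counterValue) {
        return BigDecimal.valueOf(counterValue).movePointLeft(SCALE);
    }

    /** Each metric written by Storm lands at four granularities (key layout assumed). */
    static String[] rollupKeys(String metric, LocalDateTime t) {
        return new String[] {
            metric + "|minute|" + t.truncatedTo(ChronoUnit.MINUTES),
            metric + "|hour|"   + t.truncatedTo(ChronoUnit.HOURS),
            metric + "|day|"    + t.toLocalDate(),
            metric + "|month|"  + YearMonth.from(t),
        };
    }

    public static void main(String[] args) {
        long stored = toCounter(new BigDecimal("12.345678")); // -> 12345678
        System.out.println(fromCounter(stored));              // -> 12.345678
    }
}
```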
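Second, the query-side fetch limiting. Summing a month of minute-level counters would touch tens of thousands of rows; instead the API covers the requested range with the largest rollup blocks that fit. The slides state the goal but not the algorithm, so the greedy decomposition below is one plausible reading, using the same assumed key layout and assuming minute-aligned start and end times.

```java
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: cover [start, end) with the fewest rollup blocks. */
public class TimeBlockPlanner {

    /** Returns granularity-tagged block keys; the API sums one counter per block. */
    static List<String> plan(LocalDateTime start, LocalDateTime end) {
        List<String> blocks = new ArrayList<>();
        LocalDateTime cur = start;
        while (cur.isBefore(end)) {
            if (isMonthStart(cur) && !cur.plusMonths(1).isAfter(end)) {
                blocks.add("month|" + YearMonth.from(cur));
                cur = cur.plusMonths(1);
            } else if (isDayStart(cur) && !cur.plusDays(1).isAfter(end)) {
                blocks.add("day|" + cur.toLocalDate());
                cur = cur.plusDays(1);
            } else if (cur.getMinute() == 0 && !cur.plusHours(1).isAfter(end)) {
                blocks.add("hour|" + cur);
                cur = cur.plusHours(1);
            } else {
                blocks.add("minute|" + cur);
                cur = cur.plusMinutes(1);
            }
        }
        return blocks;
    }

    static boolean isMonthStart(LocalDateTime t) {
        return t.getDayOfMonth() == 1 && t.toLocalTime().equals(LocalTime.MIDNIGHT);
    }

    static boolean isDayStart(LocalDateTime t) {
        return t.toLocalTime().equals(LocalTime.MIDNIGHT);
    }
}
```

For example, the range 2015-05-21T10:15 through 2015-07-03T00:00 decomposes into 45 minute blocks, 13 hour blocks, 12 day blocks, and 1 month block: 71 fetches instead of roughly 61,000 minute rows. Fill rate (impressions / requests) and CPM (1000 × revenue / impressions) are then computed on the fly from the summed counters, with the revenue counter shifted back to a decimal.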
Case Study: Lessons
• Validate metrics between the batch and streaming paths.
• Monitoring is critical to catching discrepancies quickly.
• Build in a catch-up mechanism to recover from outages or processing incidents.
• Limit the number of Cassandra fetches when charting metrics to keep latency reasonable.

Thank You
I'm happy to take questions.
Email me at: [email protected]