Large-Scale Data Collection Using Redis
C. Aaron Cois, Ph.D. -- Tim Palko
CMU Software Engineering Institute
© 2011 Carnegie Mellon University

Us
• C. Aaron Cois, Ph.D. — hold on, plainly: Software Architect, Team Lead, CMU Software Engineering Institute, Digital Intelligence and Investigations Directorate (@aaroncois)
• Tim Palko: Senior Software Engineer, CMU Software Engineering Institute, Digital Intelligence and Investigations Directorate

Overview
• Problem Statement
• Sensor Hardware & System Requirements
• System Overview
  – Data Collection
  – Data Modeling
  – Data Access
  – Event Monitoring and Notification
• Conclusions and Future Work

The Goal
Critical infrastructure/facility protection via environmental monitoring.

Why? Stuxnet
• Stuxnet had two major components:
  1) Send centrifuges spinning wildly out of control
  2) Record 'normal operations' and play them back to operators during the attack [1]
• Environmental monitoring provides secondary indicators of such an attack, such as abnormal heat/motion/sound
[1] http://www.nytimes.com/2011/01/16/world/middleeast/16stuxnet.html?_r=2&

The Broader Vision
Quick, flexible out-of-band monitoring:
• Set up monitoring in minutes
• Versatile sensors, easily repurposed
• Data communication is secure (P2P VPN) and requires no existing systems other than outbound networking

The Platform
A CMU research project called Sensor Andrew. Features:
• Open-source sensor platform
• Scalable, general-purpose system supporting a wide variety of applications
• Extensible architecture that can integrate diverse sensor types

Sensor Andrew Overview
[diagram: Nodes -> Gateway -> Server -> End Users]

What is a Node?
A node collects data and sends it to a collector, or gateway.
• Environment node sensors: light, audio, humidity, pressure, motion, temperature, acceleration
• Power node sensors: current, voltage, true power, energy
• Radiation node sensors: alpha particle count per minute
• Particulate node sensors: small particulate count, large particulate count

What is a Gateway?
• A gateway receives UDP data from all nodes registered to it
• An internal service:
  – Receives data continuously
  – Opens a server on a specified port
  – Continually transmits UDP data over this port

Gateway Requirements
We need to:
1. Collect data from nodes once per second
2. Scale to 100 gateways, each with 64 nodes
3. Detect events in real time
4. Notify users about events in real time
5. Retain all data collected for years, at least

What Is Big Data?
"When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze and share it."

Problems
• Size
• Transmission rate
• Storage
• Retrieval

Collecting Data
Problem: Store and retrieve immense amounts of data at a high rate (8 GB / hour per gateway).
Constraints: Data cannot remain on the nodes or gateways due to security concerns. Limited infrastructure.

We Tried PostgreSQL…
• Advantages:
  – Reliable, tested, and scalable
  – Relational => complex queries => analytics
• Problems:
  – Performance problems reading while writing at a high rate; real-time event detection suffers
  – 'COPY FROM' doesn't permit horizontal scaling

Q: How can we decrease I/O load?
A: Read and write collected data directly from memory.

Enter Redis
Redis is an in-memory NoSQL database, commonly used as a web application cache or pub/sub server.

Redis
• Created in 2009
• Fully in-memory key-value store
  – Fast I/O: read and write operations are equally fast
  – Advanced data structures
• Publish/subscribe functionality
  – In addition to data store functions
  – Separate from stored key-value data

Persistence
• Snapshotting
  – Data is asynchronously transferred from memory to disk
• AOF (Append Only File)
  – Each modifying operation is written to a file
  – Can recreate the data store by replaying operations
  – Without interrupting service, Redis can rewrite the AOF as the shortest sequence of commands needed to rebuild the current dataset in memory

Replication
• Redis supports master-slave replication, and replication can be chained
• Be careful:
  – Slaves are writeable!
  – Potential for data inconsistency
• Fully compatible with pub/sub features

Redis Features: Advanced Data Structures
• List: [A, B, C, D]
• Set: {A, B, C, D}
• Hash: {field1: "A", field2: "B", field3: "C", field4: "D"}
• Sorted Set: {value: score}, e.g. {C:1, D:2, A:3, B:4}

Our Data Model
Constraints. Our data store must:
• Hold time-series data
• Be flexible in querying (by time, node, sensor)
• Allow efficient querying of many records
• Accept data out of order
Tradeoffs: efficiency vs. flexibility. One record per timestamp (motion, audio, light, pressure, humidity, acceleration, and temperature together) versus one record per sensor data type.

Our Solution: Sorted Set Datapoint
Key: sensor:env:101
Score: 1357542004000
Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}

Sorted Set
1357542001000: {"temp":545,..}   <- arrived late, but sorts first
1357542004000: {"temp":523,..}
1357542005000: {"temp":523,..}
1357542006000: {"temp":527,..}   <- fits nicely
1357542007000: {"temp":530,..}
1357542008000: {"temp":531,..}
1357542009000: {"temp":540,..}
…
Because members are ordered by score, out-of-order arrivals still land in timestamp order.

Know your data structure!
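The sorted-set layout above can be sketched with the redis-py client the deck's demo also uses. The helper names below are hypothetical; the key format (`sensor:<type>:<id>`), the millisecond-timestamp score, and the JSON payload fields come from the slides.

```python
import json


def make_entry(node_type, node_id, reading):
    """Build the (key, score, member) triple for one datapoint.

    The score is the reading's epoch-millisecond timestamp; the member
    is the full JSON payload, which also embeds the timestamp so each
    member stays unique (a sorted set cannot hold duplicate members).
    """
    key = "sensor:%s:%s" % (node_type, node_id)
    score = reading["timestamp"]
    member = json.dumps(reading, sort_keys=True)
    return key, score, member


def store_datapoint(r, node_type, node_id, reading):
    """ZADD keeps members ordered by score, so even out-of-order
    arrivals land in timestamp order (r is a redis.Redis client)."""
    key, score, member = make_entry(node_type, node_id, reading)
    r.zadd(key, {member: score})
```

Queries by time window then fall out naturally: `r.zrangebyscore(key, start_ms, end_ms)` returns exactly the members whose scores fall in the window.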
A set is still a set…
A sorted set cannot hold duplicate members: re-adding an identical value simply updates that member's score. Embedding the timestamp in each datapoint's JSON keeps every member unique.

Requirement Satisfied
Gateway -> Redis

There is a disturbance in the Force..
"In Memory" Means Many Things
• The data store capacity is aggressively capped: Redis can only store as much data as the server has RAM

We could throw away data…
• …if we only cared about current values
• However, our data:
  – Must be stored for 1+ years for compliance
  – Must be able to be queried for historical/trend analysis

We Still Need Long-term Data Storage
Solution? Migrate data to an archive with expansive storage capacity.

Winning?
An archiver process drains Redis into PostgreSQL (Gateway -> Redis -> Archiver -> PostgreSQL). But now a client has two places to look: is the data it needs still in Redis, or already archived in PostgreSQL?

Yes, Winning
An API layer in front of both stores gives clients a single, consistent access point, regardless of where the data currently lives.

Best of both worlds
• Redis allows quick access to real-time data, for monitoring and event detection
• PostgreSQL allows complex queries and scalable storage for deep and historical analysis

We Have the Data, Now What?
Incoming data must be monitored and analyzed to detect significant events.
• What is "significant"?
• What about new data types?
A Django app (with its own app database) provides a way for users to read the data and create rules, e.g.:
  motion > x && pressure < y && audio > z
An event monitor reads the rules and the data, and triggers an alarm when all of a rule's conditions are true. Event monitor services can be scaled independently.

Getting The Message Out
Considerations:
• The event monitor already has a job; avoid retasking it as a notification engine
• Notifications should be "push" rather than requiring clients to poll
• The notification system should be generalized, e.g. SMTP, SMS
If only…
Pub/sub with synchronized workers is an optimal solution to real-time event notifications. No need to add another system: Redis offers pub/sub services as well!
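The rule check and the pub/sub hand-off can be sketched with redis-py. The channel name `events`, the helper names, and the `send` callback are hypothetical; the example rule (`motion > x && pressure < y && audio > z`) is the one from the slides.

```python
import json

EVENT_CHANNEL = "events"  # hypothetical channel name


def rule_fires(dp, x, y, z):
    """The slides' example rule: all conditions must be true."""
    return dp["motion"] > x and dp["pressure"] < y and dp["audio_p2p"] > z


def publish_event(r, rule_name, datapoint):
    """Event-monitor side: fire-and-forget the event to every
    subscribed notification worker (r is a redis.Redis client)."""
    message = json.dumps({"rule": rule_name, "data": datapoint})
    r.publish(EVENT_CHANNEL, message)
    return message


def notification_worker(r, send):
    """Worker side: block on the channel and hand each event to a
    transport callback (SMTP, SMS, ...). Runs forever."""
    pubsub = r.pubsub(ignore_subscribe_messages=True)
    pubsub.subscribe(EVENT_CHANNEL)
    for msg in pubsub.listen():
        event = json.loads(msg["data"])
        send(event["rule"], event["data"])
```

Because pub/sub delivery is separate from the key-value store, the event monitor stays a pure detector and any number of workers can subscribe independently, which matches the "don't retask the monitor as a notification engine" consideration above.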
Conclusions
• Redis is a powerful tool for collecting large amounts of data in real time
• In addition to maintaining a rapid pace of data insertion, we were able to concurrently query, monitor, and detect events on our Redis data collection system
• Bonus: Redis also enabled a robust, scalable real-time notification system using pub/sub

Things to Watch
• Data persistence: if Redis needs to restart, it takes 10-20 seconds per gigabyte to reload all data into memory [1]
  – Redis is unresponsive during startup
[1] http://oldblog.antirez.com/post/redis-persistence-demystified.html

Future Work
• Improve scalability through:
  – Data encoding
  – Data compression
  – Parallel batch inserts for all nodes on a gateway
• Deep historical data analytics

Acknowledgements
• Project engineers Chris Taschner and Jeff Hamed @ CMU SEI
• Prof. Anthony Rowe & CMU ECE WiSE Lab: http://wise.ece.cmu.edu/
• Our organizations: CMU (https://www.cmu.edu), CERT (http://www.cert.org), SEI (http://www.sei.cmu.edu), CyLab (https://www.cylab.cmu.edu)

Thank You
Questions?
Slides of Live Redis Demo

A Closer Look at Redis Data
redis> keys *
1) "sensor:environment:f80"
2) "sensor:environment:f81"
3) "sensor:environment:f82"
4) "sensor:environment:f83"
5) "sensor:environment:f84"
6) "sensor:power:f85"
7) "sensor:power:f86"
8) "sensor:radiation:f87"
9) "sensor:particulate:f88"

redis> keys sensor:power:*
1) "sensor:power:f85"
2) "sensor:power:f86"

redis> zcount sensor:power:f85 -inf +inf
(integer) 3565958
(45.38s)

redis> zcount sensor:power:f85 1359728113000 +inf
(integer) 47

redis> zrange sensor:power:f85 -1000 -1
1) "{\"long_energy1\": 73692453, \"total_secs\": 6784, \"energy\": [49, 175, 62, 0, 0, 0], \"c2_center\": 485, \"socket_state\": 1, \"node_type\": \"power\", \"c_p2p_low2\": 437, \"socket_state1\": 0, \"mac_address\": \"103\", \"c_p2p_low\": 494, \"rms_current\": 6, \"true_power\": 1158, \"timestamp\": 1359728143000, \"v_p2p_low\": 170, \"c_p2p_high\": 511, \"rms_current1\": 113, \"freq\": 60, \"long_energy\": 4108081, \"v_center\": 530, \"c_p2p_high2\": 719, \"energy1\": [37, 117, 100, 4, 0, 0], \"v_p2p_high\": 883, \"c_center\": 509, \"rms_voltage\": 255, \"true_power1\": 23235}"
2) …

Redis Python API
import redis

pool = redis.ConnectionPool(host="127.0.0.1", port=6379, db=0)
r = redis.Redis(connection_pool=pool)

byindex = r.zrange("sensor:env:f85", -50, -1)
# ['{"acc_z":663,"bat":0,"gpio_state":1,"temp":663,"light":…

byscore = r.zrangebyscore("sensor:env:f85", 1361423071000, 1361423072000)
# ['{"acc_z":734,"bat":0,"gpio_state":1,"temp":734,"light":…

size = r.zcount("sensor:env:f85", "-inf", "+inf")
# 237327L
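As a companion to the demo, the archiver step described earlier (drain a time window from a Redis sorted set into PostgreSQL, then trim it from Redis) might look roughly like this. The `sensor_archive` table layout and all helper names are assumptions, not the project's actual code; `r` is a redis-py client and `pg_conn` any DB-API connection such as psycopg2's.

```python
import json

# Hypothetical archive table: (sensor_key text, ts bigint, payload text)
ARCHIVE_SQL = (
    "INSERT INTO sensor_archive (sensor_key, ts, payload) "
    "VALUES (%s, %s, %s)"
)


def archive_window(r, pg_conn, key, start_ms, end_ms):
    """Copy one time window into PostgreSQL, then reclaim Redis memory.

    ZRANGEBYSCORE with withscores=True returns (member, score) pairs,
    i.e. (JSON payload, millisecond timestamp) under our data model.
    """
    rows = r.zrangebyscore(key, start_ms, end_ms, withscores=True)
    with pg_conn.cursor() as cur:
        for member, score in rows:
            cur.execute(ARCHIVE_SQL, (key, int(score), member))
    pg_conn.commit()
    # Trim Redis only after the archive commit succeeds, so a crash
    # between the two steps can duplicate data but never lose it.
    r.zremrangebyscore(key, start_ms, end_ms)
    return len(rows)
```

Trimming only after the commit is the safe ordering here: re-archiving a window is harmless (the sorted set's scores make duplicates easy to detect), while dropping unarchived data would violate the multi-year retention requirement.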