
Large-Scale Data Collection Using
Redis
C. Aaron Cois, Ph.D. -- Tim Palko
CMU Software Engineering Institute
© 2011 Carnegie Mellon University
Us
C. Aaron Cois, Ph.D.
Software Architect, Team Lead
CMU Software Engineering Institute
Digital Intelligence and Investigations Directorate
@aaroncois

Tim Palko
Senior Software Engineer
CMU Software Engineering Institute
Digital Intelligence and Investigations Directorate
Overview
• Problem Statement
• Sensor Hardware & System Requirements
• System Overview
– Data Collection
– Data Modeling
– Data Access
– Event Monitoring and Notification
• Conclusions and Future Work
The Goal
Critical infrastructure/facility
protection
via
Environmental Monitoring
Why?
Stuxnet
• Two major components:
1) Send centrifuges spinning wildly out of control
2) Record 'normal operations' and play them back to operators during the attack [1]
• Environmental monitoring provides secondary indicators, such as abnormal heat/motion/sound
[1] http://www.nytimes.com/2011/01/16/world/middleeast/16stuxnet.html?_r=2&
The Broader Vision
Quick, flexible out-of-band monitoring
• Set up monitoring in minutes
• Versatile sensors, easily repurposed
• Data communication is secure (P2P VPN) and
requires no existing systems other than
outbound networking
The Platform
A CMU research project called Sensor Andrew
• Features:
– Open-source sensor platform
– Scalable and generalist system supporting a
wide variety of applications
– Extensible architecture
• Can integrate diverse sensor types
Sensor Andrew Overview
[Diagram: sensor nodes report to gateways, gateways forward data to a central server, and end users access the server]
What is a Node?
A node collects data and sends it to a collector, or gateway
Environment Node
Sensors
• Light
• Audio
• Humidity
• Pressure
• Motion
• Temperature
• Acceleration
Power Node
Sensors
• Current
• Voltage
• True Power
• Energy
Radiation Node
Sensors
• Alpha particle
count per minute
Particulate Node
Sensors
• Small Part. Count
• Large Part. Count
What is a Gateway?
• A gateway receives UDP data
from all nodes registered to
it
• An internal service:
– Receives data continuously
– Opens a server on a specified
port
– Continually transmits UDP
data over this port
Gateway
Requirements
We need to:
1. Collect data from nodes once per second
2. Scale to 100 gateways, each with 64 nodes
3. Detect events in real time
4. Notify users about events in real time
5. Retain all collected data for years, at least
What Is Big Data?
“When your data sets become so
large that you have to start innovating
around how to collect, store,
organize, analyze and share it.”
Problems
• Size
• Transmission Rate
• Storage
• Retrieval
Collecting Data
Problem: Store and retrieve immense amounts of data at a high rate.
Constraints: Data cannot remain on the nodes or gateways due to
security concerns.
Limited infrastructure.
8 GB/hour per gateway
[Diagram: Gateway → ?]
We Tried PostgreSQL…
• Advantages:
– Reliable, tested and scalable
– Relational => complex queries => analytics
• Problems:
– Performance problems reading while writing at a
high rate; real-time event detection suffers
– ‘COPY FROM’ doesn’t permit horizontal scaling
Q: How can we decrease I/O load?
A: Read and write collected data directly
from memory
Enter Redis
Redis is an in-memory
NoSQL database
Commonly used as a web application cache or
pub/sub server
Redis
• Created in 2009
• Fully in-memory key-value store
– Fast I/O: R/W operations are equally fast
– Advanced data structures
• Publish/Subscribe Functionality
– In addition to data store functions
– Separate from stored key-value data
Persistence
• Snapshotting
– Data is asynchronously transferred from memory
to disk
• AOF (Append Only File)
– Each modifying operation is written to a file
– Can recreate data store by replaying operations
– Without interrupting service, will rebuild AOF as
the shortest sequence of commands needed to
rebuild the current dataset in memory
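The two persistence modes above map to a handful of redis.conf directives. A minimal sketch (the values here are illustrative defaults, not the project's actual settings):

```conf
# Snapshotting: dump to disk if >= 1 key changed in 900s,
# or >= 10000 keys changed in 60s
save 900 1
save 60 10000

# AOF: log every modifying command, fsync once per second
appendonly yes
appendfsync everysec

# Rewrite the AOF in the background once it doubles in size
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
```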
Replication
• Redis supports master-slave replication
• Master-slave replication can be chained
• Be careful:
– Slaves are writeable!
– Potential for data inconsistency
• Fully compatible with Pub/Sub features
Redis Features: Advanced Data Structures
• List: [A, B, C, D]
• Set: {A, B, C, D}
• Hash: {key: value}, e.g. {field1: "A", field2: "B", field3: "C", field4: "D"}
• Sorted Set: {value: score}, e.g. {C:1, D:2, A:3, B:4}
Our Data Model
Constraints
Our data store must:
– Hold time-series data
– Be flexible in querying (by time, node, sensor)
– Allow efficient querying of many records
– Accept data out of order
Tradeoffs: Efficiency vs. Flexibility
One record per timestamp
(Motion, Audio, Light, Pressure, Humidity, Acceleration, Temperature in one record)
VS
One record per sensor data type
(Motion, Light, Temperature, Audio, Pressure, Humidity, Acceleration each in its own record)
Our Solution: Sorted Set
Datapoint key: sensor:env:101
Score: 1357542004000
Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
Sorted Set
1357542004000: {"temp": 523, ...}
1357542005000: {"temp": 523, ...}
1357542007000: {"temp": 530, ...}
1357542008000: {"temp": 531, ...}
1357542009000: {"temp": 540, ...}
1357542010000: {"temp": 545, ...}
…
Sorted Set
1357542004000: {"temp": 523, ...}
1357542005000: {"temp": 523, ...}
1357542006000: {"temp": 527, ...} <- fits nicely
1357542007000: {"temp": 530, ...}
1357542008000: {"temp": 531, ...}
1357542009000: {"temp": 540, ...}
1357542010000: {"temp": 545, ...}
…
Know your data structure!
A set is still a set…
Datapoint
Score: 1357542004000
Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
Adding an identical member a second time does not create a duplicate entry; ZADD simply updates its score.
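As a sketch of the write path implied by the slides (the key convention `sensor:<type>:<id>` is from the deck; the helper names are hypothetical, and the modern redis-py `zadd(name, mapping)` signature is assumed):

```python
import json


def make_key(node_type, node_id):
    # Key convention from the slides: sensor:<type>:<id>
    return "sensor:%s:%s" % (node_type, node_id)


def store_datapoint(r, node_type, node_id, payload):
    """Add one reading to the node's sorted set.

    The score is the millisecond timestamp and the member is the
    JSON blob, so out-of-order arrivals still land in timestamp
    order inside the sorted set.
    """
    member = json.dumps(payload, sort_keys=True)
    r.zadd(make_key(node_type, node_id), {member: payload["timestamp"]})
```

Here `r` is any `redis.Redis`-compatible client; the gateway-side caller would invoke `store_datapoint(r, "env", "101", payload)` once per second per node.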
Requirement Satisfied
Gateway
Redis
There is a disturbance in the Force..
Collecting Data
Gateway
Redis
“In Memory” Means Many Things
• The data store capacity is aggressively
capped
– Redis can only store as much data as the server
has RAM
Collecting Big Data
Gateway
Redis
We could throw away data…
• If we only cared about current values
• However, our data
– Must be stored for 1+ years for compliance
– Must be able to be queried for historical/trend
analysis
We Still Need Long-term Data Storage
Solution? Migrate data to an archive with
expansive storage capacity
Winning
[Diagram: Gateway → Redis → Archiver → PostgreSQL]
Winning?
[Diagram: Gateway → Redis → Archiver → PostgreSQL, but some poor client has no clear way to query either store]
Yes, Winning
[Diagram: Gateway → Redis → Archiver → PostgreSQL, with an API in front of both stores serving some happy client]
Best of both worlds
Redis allows quick access to
real-time data, for
monitoring and event
detection
PostgreSQL allows complex
queries and scalable storage
for deep and historical
analysis
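One way the Redis-to-PostgreSQL handoff could work in practice (a sketch only; the real archiver's interface is not shown in the slides, and the table schema here is hypothetical): periodically pull a closed time window out of the sorted set, bulk-insert it into PostgreSQL, then trim it from memory.

```python
import json


def archive_window(r, pg_cursor, key, start_ms, end_ms):
    """Move readings in [start_ms, end_ms] from Redis to PostgreSQL.

    Assumes a hypothetical table:
        readings(node_key text, ts bigint, payload text)
    `r` is a redis.Redis client, `pg_cursor` a psycopg2-style cursor.
    """
    members = r.zrangebyscore(key, start_ms, end_ms)
    rows = [(key, json.loads(m)["timestamp"], m) for m in members]
    pg_cursor.executemany(
        "INSERT INTO readings (node_key, ts, payload) VALUES (%s, %s, %s)",
        rows,
    )
    # Only after the rows are safely in PostgreSQL, free the RAM.
    r.zremrangebyscore(key, start_ms, end_ms)
    return len(rows)
```

Trimming only after the insert succeeds is what keeps the "retain all data" requirement safe if the archiver crashes mid-run.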
We Have the Data, Now What?
Incoming data must be monitored and
analyzed, to detect significant events
What is “significant”?
What about new data types?
[Diagram: Gateway → Redis → API → Archiver → PostgreSQL, annotated with a rule "motion > x && pressure < y && audio > z"; the new guy is a Django App with its own App DB, providing a way to read the data and create rules]
[Diagram: the next new guy is the Event Monitor, which reads the rules and the data through the API and triggers alarms when all conditions are true: motion > x, pressure < y, audio > z. The rest of the architecture (Gateway, Redis, Archiver, PostgreSQL, Django App, App DB) is unchanged]
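The event monitor's "all true?" check is simple enough to sketch. The rule format and function names below are hypothetical, not the project's actual schema:

```python
import operator

# Map rule operators to comparison functions.
OPS = {">": operator.gt, "<": operator.lt,
       ">=": operator.ge, "<=": operator.le}


def all_rules_match(datapoint, rules):
    """Return True only when every (field, op, threshold) rule holds.

    A missing field counts as a non-match, so partial datapoints
    never trigger an alarm.
    """
    return all(
        field in datapoint and OPS[op](datapoint[field], threshold)
        for field, op, threshold in rules
    )
```

A monitor loop would poll recent entries from the sorted set and call `all_rules_match` on each decoded datapoint.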
[Diagram: same architecture; Event Monitor services can be scaled independently]
Getting The Message Out
Considerations
• The event monitor already has a job; avoid retasking it as a notification engine
• Notifications should be pushed, rather than requiring clients to poll
• The notification system should be generalized, e.g. SMTP, SMS
If only…
Pub/Sub with synchronized
workers is an optimal solution to
real-time event notifications.
Redis Data
[Diagram: Gateway → Redis (data store plus Pub/Sub) → API; Event Monitors publish events, and Notification Workers subscribe and deliver alerts via SMTP; Archiver, PostgreSQL, Django App, and App DB are unchanged]
No need to add another system: Redis offers pub/sub services as well!
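With redis-py, the monitor-to-worker handoff might look like this. The channel name, message format, and `send_email` callback are assumptions for illustration; `r` is a `redis.Redis` client:

```python
import json

CHANNEL = "events"  # hypothetical channel name


def publish_event(r, event):
    # Event monitor side: fire-and-forget publish of a JSON event.
    r.publish(CHANNEL, json.dumps(event))


def run_notification_worker(r, send_email):
    # Worker side: block on the channel and deliver each event.
    pubsub = r.pubsub(ignore_subscribe_messages=True)
    pubsub.subscribe(CHANNEL)
    for message in pubsub.listen():
        event = json.loads(message["data"])
        send_email("Sensor event: %s" % event.get("rule", "unknown"))
```

Because pub/sub delivery is fire-and-forget, any worker that is down when an event fires will miss it; that trade-off is acceptable here since events are also derivable from the stored data.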
Conclusions
• Redis is a powerful tool for collecting large
amounts of data in real-time
• In addition to maintaining a rapid pace of
data insertion, we were able to concurrently
query, monitor, and detect events on our
Redis data collection system
• Bonus: Redis also enabled a robust, scalable
real-time notification system using pub/sub
Things to watch
• Data persistence
– If Redis needs to restart, it takes 10-20 seconds per gigabyte to re-load all data into memory [1]
– Redis is unresponsive during startup
[1] http://oldblog.antirez.com/post/redis-persistence-demystified.html
Future Work
• Improve scalability through:
– Data encoding
– Data compression
– Parallel batch inserts for all nodes on a gateway
• Deep historical data analytics
Acknowledgements
• Project engineers Chris Taschner and Jeff
Hamed @ CMU SEI
• Prof. Anthony Rowe & CMU ECE WiSE Lab
http://wise.ece.cmu.edu/
• Our organizations
CMU
CERT
SEI
Cylab
https://www.cmu.edu
http://www.cert.org
http://www.sei.cmu.edu
https://www.cylab.cmu.edu
Thank You
Questions?
Slides of Live Redis Demo
A Closer Look at Redis Data
redis> keys *
1) "sensor:environment:f80"
2) "sensor:environment:f81"
3) "sensor:environment:f82"
4) "sensor:environment:f83"
5) "sensor:environment:f84"
6) "sensor:power:f85"
7) "sensor:power:f86"
8) "sensor:radiation:f87"
9) "sensor:particulate:f88"
A Closer Look at Redis Data
redis> keys sensor:power:*
1) "sensor:power:f85"
2) "sensor:power:f86"
A Closer Look at Redis Data
redis> zcount sensor:power:f85 -inf +inf
(integer) 3565958
(45.38s)
A Closer Look at Redis Data
redis> zcount sensor:power:f85 1359728113000 +inf
(integer) 47
A Closer Look at Redis Data
redis> zrange sensor:power:f85 -1000 -1
1) "{\"long_energy1\": 73692453, \"total_secs\":
6784, \"energy\": [49, 175, 62, 0, 0, 0],
\"c2_center\": 485, \"socket_state\": 1,
\"node_type\": \"power\", \"c_p2p_low2\": 437,
\"socket_state1\": 0, \"mac_address\": \"103\",
\"c_p2p_low\": 494, \"rms_current\": 6,
\"true_power\": 1158, \"timestamp\":
1359728143000, \"v_p2p_low\": 170, \"c_p2p_high\":
511, \"rms_current1\": 113, \"freq\": 60,
\"long_energy\": 4108081, \"v_center\": 530,
\"c_p2p_high2\": 719, \"energy1\": [37, 117, 100,
4, 0, 0], \"v_p2p_high\": 883, \"c_center\": 509,
\"rms_voltage\": 255, \"true_power1\": 23235}"
2) …
Redis Python API
import redis
pool = redis.ConnectionPool(host='127.0.0.1', port=6379, db=0)
r = redis.Redis(connection_pool=pool)

byindex = r.zrange("sensor:env:f85", -50, -1)
# ['{"acc_z":663,"bat":0,"gpio_state":1,"temp":663,"light":…

byscore = r.zrangebyscore("sensor:env:f85",
                          1361423071000,
                          1361423072000)
# ['{"acc_z":734,"bat":0,"gpio_state":1,"temp":734,"light":…

size = r.zcount("sensor:env:f85", "-inf", "+inf")
# 237327L