Slides - French-Brazilian Spring School on Big Data and Smart Cities

Aggregating and Managing Big Real-time Data in the Cloud: application to intelligent transport for Smart Cities
Parisa Ghodous
U. Claude Bernard, LIRIS
Saint Martin d’Hères, 10th April, 2015
Urbanization’s rapid progress has modernized many people’s
lives and engendered big issues
Traffic Congestion, Energy Consumption and Pollution!
+ Urban computing!
- Uses data generated in cities (e.g., traffic flow, human mobility, and geographical data) for the continuous improvement of people’s lives, city operation systems, and the environment
- Connects urban sensing, data management, data analytics, and service providing into a recurrent process
Key challenges
+ Urban data!
Environmental monitoring data
Meteorological data (humidity, temperature, barometric pressure, wind speed, and weather conditions) crawled from websites
Mobile phone signals
Identifying behaviours, citywide human mobility for detecting urban anomalies, city’s functional regions & urban planning
Geographical data
Commuting data
Traffic monitoring and prediction, urban planning, routing, energy consumption analysis, POI, land use
Traffic data
Loop sensors, surveillance cameras, and floating car data
Social Networks data
Social structure: a graph denoting relationships, interdependencies, or
interactions between users. User-generated social media (texts,
photos, and videos), which reflect users’ behaviour and interests
Economy
City’s economic dynamics: transaction records of credit cards, stock prices, housing prices, and people’s incomes
Energy
City’s energy consumption: obtained directly from sensors or inferred
from data sources implicitly, e.g. from the GPS trajectory of a vehicle
+ Applications in Urban computing!
Urban planning
•  Gleaning Underlying Problems in Transportation Networks
•  Discover Functional Regions
•  Detecting a City’s Boundary
Transportation
•  Improving Driving Experiences
•  Improving Taxi Services: dispatching, recommendation, ridesharing
•  Improving Public Transportation Systems: bus, subway, bike
Environment
•  Air quality
•  Noise pollution
Social & Entertainment
•  Estimate user similarity
•  Finding local experts in a region
•  Location recommendation
•  Itinerary planning
•  Life patterns and styles understanding
Energy
•  Gas consumption
•  Electricity consumption
Economy
•  Finding trends of city economy
•  Business placement
Safety & Security
•  Detecting traffic anomalies: distance based, statistics based
•  Disaster detection and evacuation
Urban computing framework!
Framework layers:
- Service providing: urban planning, easing traffic, saving energy, reducing air pollution
- Urban data analytics: data mining, machine learning, visualization
- Urban data management: spatio-temporal index, stream, trajectory, and graph data management
- Urban sensing & data acquisition: participatory sensing, crowd sensing, mobile sensing
Data sources: social media, meteorology, human mobility, traffic, air quality, energy, POI, road networks
Challenges:
- Urban sensing & data acquisition
  - Energy consumption & privacy
  - Loose-controlled and non-uniform distributed sensors
  - Unstructured, implicit & noisy data
- Computing with heterogeneous data
  - Learn mutually reinforced knowledge from heterogeneous data
  - Both effective and efficient learning ability
  - Visualization
- Hybrid systems blending the physical and virtual worlds
Overview of existing approaches
+ Current transport projects and apps!
http://www.itsoverview.its.dot.gov
April 11, 2015
+ Current transport projects and apps!
TRAFFIC MANAGEMENT → Plan urban mobility
Sensors: C-S (e.g., Google)
Notification: push/pull
Monitoring
- Collaborative (e.g., Waze, Copenhagen Wheel)
- Traffic thermometer (e.g., incident detection, Insight, Dublin): information exchange using crowdsourcing
- Public transport monitor (e.g., Industrial transport, Urban insight, Cubic; smart ticketing, AllbikesNow, Buzzcar)
Recommendation
- Public transport monitor (Optimod, Lyon, LUTB) (urban logistics)
- Parking places
- Infrastructure (ETINA intelligent traffic lights, LAPI license plate readers, Télépéage electronic toll collection)
- Guidance: Waze, Google Maps, Walkscore
STANDARDIZATION OF ITS: Urban ITS Architectures
+ Big data and intelligent transport!
- Transdec: big data for transportation
  http://imsc.usc.edu/intelligent-transportation.html
- How big data drives intelligent transportation, Rocky Mountain Institute
  http://www.greenbiz.com/blog/2012/08/15/how-big-data-drives-intelligent-transportation
- Real-Time Data Capture and Management
  http://www.its.dot.gov/data_capture/data_capture.htm
- Traffic analytics
Tailoring urban big data storage services
Vehicle positions & energy levels
Communication of unexpected events
Queue length at the recharging station locations
Decision making for the autonomous vehicles, to help pilot the vehicles to their destination
Ensuring vehicle availability and service continuity, avoiding accidents
Ensuring optimal recharging through mobile recharging units
Real-time problems with greedy tasks requiring heavy processing:
- Lots of data (volume)
- Continuous (velocity)
- Image, sound, compass, energy level, localisation… (variety)
+ Problem statement!
- Data collection (which sources? compass, video stream, LADAR…)
- Data storage (keep or not, and for how long: e.g., a missed parked car or someone crossing the road)
- Data communication strategy to optimise the network (rate of communication, whose initiative)
- Scalability (if we need extra vehicles: make it work with 100 and with 1,000)
- Polyglot programming (different programming for different needs)
- Data → information (image, video → location information)
+ Objectives!
- Develop services using big data for decision making
- Use Cloud and Streaming as tools
- Ensure that big data, cloud and streaming work well together
Managing transport big data in Smart Cities
+ Our vision: everything as a service!
- Decision making support services
- Data analytics services
- Integration & aggregation: extended UnQL platform
- Data storage services: Neo4J, CouchDB, MongoDB (clean data collections)
- Data cleaning & processing services: Pig, Hadoop
- Data harvesting services: REST, Flume
Making global transport decisions
[Figure labels: application server (receiving thousands of requests); Storage as a Service; fragmented and duplicated data; data storage demands; data streams; on-demand data; recurrent data]
Looking for a taxi
Disseminating events
Making global transport decisions
[Figure: data flow for global transport decisions. Labels:]
- Data streams (client positions/requests)
- Predict the request crowd in a city region at a given hour
- Compute a recommendation: battery charging place, target region according to traffic
- On-demand data: prediction and recommendation requests
- State of the traffic organized by region (traffic situation)
- Recommendation
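The prediction step in this figure (request crowd in a city region at a given hour) could be sketched as follows. This is only an illustration under stated assumptions: requests arrive as (latitude, longitude, timestamp), a region is a square grid cell, and the predictor is a plain historical average per region and hour. The names region_of and RequestForecaster and the cell size are hypothetical and do not come from the slides.

```python
from collections import defaultdict
from datetime import datetime


def region_of(lat, lon, cell_deg=0.01):
    """Map a coordinate to a square grid cell (hypothetical region scheme)."""
    return (int(lat // cell_deg), int(lon // cell_deg))


class RequestForecaster:
    """Average number of requests per (region, hour) over the observed days."""

    def __init__(self):
        self._totals = defaultdict(int)  # (region, hour) -> total requests seen
        self._days = set()               # calendar days with at least one request

    def observe(self, lat, lon, ts):
        """Consume one request from the stream."""
        self._totals[(region_of(lat, lon), ts.hour)] += 1
        self._days.add(ts.date())

    def predict(self, region, hour):
        """Expected number of requests in `region` during `hour` of a day."""
        if not self._days:
            return 0.0
        return self._totals.get((region, hour), 0) / len(self._days)


forecaster = RequestForecaster()
forecaster.observe(45.764, 4.835, datetime(2015, 4, 10, 8, 5))
forecaster.observe(45.764, 4.835, datetime(2015, 4, 10, 8, 40))
forecaster.observe(45.764, 4.835, datetime(2015, 4, 11, 8, 15))
print(forecaster.predict(region_of(45.764, 4.835), 8))
```

A recommendation service could then rank regions by predicted demand. A production system would more likely use a learned model and a streaming engine than in-memory counters.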
Research milestones
+ Ongoing work!
- QDB benchmark extends YCSB [4]: FaultTolerance, Recoverability and TimeBehaviour
  - Pivot data model for representing the data models of NoSQL stores
  - Sample application: Shopping system [1] (ProductInfo)
  - Data stores: MongoDB, Couchbase, VoltDB, Redis, Neo4J
  - Cluster of four Ubuntu 12.04 servers deployed with extra large VM instances (8 virtual cores and 14 GB of RAM) in Windows Azure [2]
- Distributed polyglot (big) database engineering
  - Model2Roo: engineering data storage solutions for given data collections
  - ExSchema for supporting the maintenance of a polyglot storage solution
[1] McMurtry, D., Oakley, A., Sharp, J., Subramanian, M., Zhang, H.: Data Access for Highly-Scalable Solutions: Using SQL, NoSQL, and Polyglot Persistence. Microsoft patterns & practices, Microsoft (2013)
[2] http://www.windowsazure.com/
[3] http://forge.puppetlabs.com/puppetlabs/
[4] Yahoo Cloud Serving Benchmark, https://github.com/brianfrankcooper/YCSB/wiki
+ Future directions!
- Balanced crowd-sensing
  - Data is non-uniformly distributed in geographical and temporal spaces.
  - In some locations, we may have much more data than we really need.
  - In places where we do not have enough data, or have no data at all, incentives that motivate users to contribute data should be considered.
  - How to configure the incentive for different locations and time periods so as to maximize the quality of the received data (e.g., coverage or accuracy) for a specific application is yet to be explored.
  - A down-sampling method, e.g., compressive sensing, could be useful to reduce a system’s communication load.
- Skewed data distribution
- Managing and indexing multimode data sources
- Knowledge fusion
- Exploratory and interactive visualization for multiple data sources
- Algorithm integration
- Intervention-based analysis and prediction
+ Future directions!
- Skewed data distribution
  - Obtaining the entire dataset is often infeasible in an urban computing system.
  - Some information is transferable from the partial data to the entire dataset: the travel speed of taxis on a road can be transferred to other vehicles travelling on the same road segment.
  - Some information is not: the traffic volume of taxis on a road may differ from that of private vehicles.
+ Future directions!
- Managing and indexing multimode data sources
  - Hybrid index that can simultaneously manage multiple types of data (e.g., spatial, temporal and social media)
+ Future directions!
- Knowledge fusion
  - Learn mutually reinforced knowledge from multiple data sources
  - Requires a deep understanding of each data source and an effective use of different data sources in different parts of a computing framework
+ Future directions!
- Exploratory and interactive visualization for multiple data sources
  - Investigate the implicit relationships among multiple data sources through exploratory visualization in spatial and spatio-temporal spaces
  - Which factor has the most prominent impact on the air quality of a given location or in a given time period? What is the major root cause of PM2.5 in the winter in São Paulo?
+ Future directions!
- Algorithm integration: to provide an end-to-end urban computing scenario
  - Combine data management techniques with machine learning algorithms to provide knowledge discovery that is both efficient and effective.
  - Integrate spatio-temporal data management algorithms with optimization methods to solve the large-scale dynamic ridesharing problem.
  - Visualization techniques should be involved in the knowledge discovery process, working with machine learning and data mining algorithms.
+ Future directions!
- Intervention-based analysis and prediction: predict the impact of a change in a city’s setting
  - How will a region’s traffic change if a new road is built in the region?
  - To what extent will air pollution be reduced if we remove a factory from a city?
  - How will people’s travel patterns be affected if a new subway line is launched?
Parisa Ghodous
Genoveva Vargas-Solar
Catarina Ferreira
Christine Collet
Gavin R. Kemp
+ Urban sensing & data acquisition!
- Traditional sensing and measurement: installing sensors dedicated to some applications
- Passive crowd sensing
- Participatory sensing
+ Urban sensing & data acquisition!
- Passive crowd sensing: infrastructures built for other purposes, such as the wireless cellular networks built for mobile communication between individuals, are used to sense city dynamics (e.g., to predict traffic conditions and improve urban planning)
  - Sensing city dynamics with GPS-equipped vehicles: mobile sensors continually probing the traffic flow on road surfaces, processed by infrastructures that produce data representing citywide human mobility patterns
  - Ticketing systems of public transportation (e.g., modelling city-wide human mobility using transaction records of RFID-based card swipes)
  - Wireless communication systems (e.g., call detail records, CDR)
  - Social networking services (e.g., geo-tagged posts/photos, posts on natural disasters analysed for detecting anomalous events and mobility patterns in the city)
+ Urban sensing & data acquisition!
- Participatory sensing: people obtain information around them and contribute to formulating collective knowledge to solve a problem (i.e., human as a sensor)
  - Human crowd-sensing: users willingly sense information gathered from sensors embedded in their own devices (e.g., GPS data from a user’s mobile phone used to estimate real-time bus arrivals)
  - Human crowd-sourcing: users are proactively engaged in generating data: reports on accidents, police traps, or any other road hazard (e.g., Waze); citizens turning into cartographers to create open maps of their cities
+ Urban data management!
Harness a variety of heterogeneous data to quickly answer users’ instant queries, e.g.
predicting traffic conditions and forecasting air pollution
- Stream and Trajectory Data Management
  - Data reduction techniques for trajectories
  - Noise filtering techniques for trajectories
  - Techniques for indexing and querying trajectories
  - Techniques dealing with the uncertainty of a trajectory
  - Trajectory pattern mining
- Graph Data Management
- Hybrid Indexing Structures
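As an illustration of data reduction for trajectories, the sketch below uses the Douglas-Peucker line simplification algorithm, one common choice; the slides do not name a specific technique. It assumes points are already projected to planar (x, y) coordinates in a common unit, and the function names are hypothetical.

```python
import math


def _point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Projection of p onto the segment, clamped to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, epsilon):
    """Keep only the points that deviate more than epsilon from the simplified line."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax <= epsilon:
        return [first, last]
    # Split at the farthest point and simplify both halves recursively.
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right


track = [(0, 0), (1, 0.05), (2, -0.04), (3, 0), (4, 3), (5, 0)]
print(douglas_peucker(track, epsilon=0.1))
```

The tolerance epsilon trades storage and communication load against spatial accuracy; small jitter is dropped while the sharp detour is kept.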
+ Urban data management!
- Graph Data Management: represents urban data such as road networks, subway systems, social networks, and sensor networks
  - Graphs are usually associated with a spatial property, resulting in many spatial graphs [Angles and Gutierrez 2008]
    - For example, each node of a road network has a spatial coordinate, and each edge denoting a road segment has a spatial length
  - Graphs also contain temporal information
    - For instance, the traffic volume traversing a road segment changes over time, and the travel time between two landmarks is time dependent: st-graphs [Hong and Zheng et al. 2014]
  - Example query: find the top-k tourist attractions around a user that were most popular in the past three months
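The time-dependent travel times mentioned for road graphs can be illustrated with a small earliest-arrival search. This is a sketch under assumptions, not the st-graph model of the cited work: each edge carries a hypothetical function giving the travel time in minutes for a given departure time, times are minutes since midnight, and edges are assumed to be FIFO (leaving later never means arriving earlier), which is what makes a Dijkstra-style search valid.

```python
import heapq


def peak_profile(base, extra, peak_at, half_width):
    """Travel time with a triangular congestion peak (illustrative assumption)."""
    def travel_time(t):
        return base + extra * max(0.0, 1.0 - abs(t - peak_at) / half_width)
    return travel_time


def earliest_arrival(graph, source, target, depart):
    """Dijkstra-style search on a graph whose edge costs depend on departure time.

    graph maps a node to a list of (neighbour, travel_time_fn) pairs.
    Returns the earliest arrival time at target, or None if unreachable.
    """
    best = {source: depart}
    queue = [(depart, source)]
    while queue:
        t, node = heapq.heappop(queue)
        if node == target:
            return t
        if t > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, travel_time in graph.get(node, []):
            arrival = t + travel_time(t)
            if arrival < best.get(neighbour, float("inf")):
                best[neighbour] = arrival
                heapq.heappush(queue, (arrival, neighbour))
    return None


roads = {
    "A": [("B", peak_profile(10, 20, peak_at=510, half_width=60)),
          ("C", lambda t: 12)],
    "C": [("B", lambda t: 5)],
}
print(earliest_arrival(roads, "A", "B", depart=510))  # at the 8:30 peak
print(earliest_arrival(roads, "A", "B", depart=600))  # at 10:00, off-peak
```

At the peak the detour through C is faster, while off-peak the direct edge wins, which is the kind of behaviour a static graph cannot express.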
+ Urban data management!
- Hybrid Indexing Structures: harness a variety of data and integrate them into a data mining model, using hybrid indexing structures that organize different data sources well
  - Combine POIs, road networks, traffic, and human mobility data simultaneously
  - A city is partitioned into grids using a quad-tree-based spatial index, where each leaf node (grid) of the spatial index maintains two lists storing the POIs and road segments
  - Each road segment ID points to two sorted lists: a list of taxi IDs sorted by their arrival time t_a at the road segment; and a list of passenger drop-off and pick-up points sorted by pick-up time (t_p) and drop-off time (t_d)
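The hybrid index described above can be sketched roughly as follows. To keep the example short, a uniform grid stands in for the quad-tree-based spatial index, which is a simplification and not what the slide describes. The class and method names are hypothetical, coordinates are assumed to be planar, and times are plain numbers.

```python
import bisect
from collections import defaultdict


class HybridIndex:
    """Grid cells -> POIs and road segments; road segments -> time-sorted lists."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.pois = defaultdict(list)      # cell -> POI names
        self.segments = defaultdict(list)  # cell -> road segment IDs
        self.arrivals = defaultdict(list)  # segment ID -> sorted (t_a, taxi ID)
        self.trips = defaultdict(list)     # segment ID -> sorted (t_p, t_d, point)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def add_poi(self, name, x, y):
        self.pois[self._cell(x, y)].append(name)

    def add_segment(self, segment_id, x, y):
        self.segments[self._cell(x, y)].append(segment_id)

    def record_arrival(self, segment_id, taxi_id, t_a):
        # insort keeps the list ordered by arrival time t_a.
        bisect.insort(self.arrivals[segment_id], (t_a, taxi_id))

    def record_trip(self, segment_id, point, t_p, t_d):
        # Ordered by pick-up time t_p, then drop-off time t_d.
        bisect.insort(self.trips[segment_id], (t_p, t_d, point))

    def pois_near(self, x, y):
        return list(self.pois.get(self._cell(x, y), []))

    def taxis_near(self, x, y, t_start, t_end):
        """Taxi IDs that arrived in [t_start, t_end) on segments of the cell at (x, y)."""
        taxis = []
        for segment_id in self.segments.get(self._cell(x, y), []):
            entries = self.arrivals.get(segment_id, [])
            lo = bisect.bisect_left(entries, (t_start,))
            hi = bisect.bisect_left(entries, (t_end,))
            taxis.extend(taxi_id for _, taxi_id in entries[lo:hi])
        return taxis


index = HybridIndex(cell_size=100)
index.add_poi("station", 120, 40)
index.add_segment("s1", 150, 30)
index.record_arrival("s1", "taxi_b", 20)
index.record_arrival("s1", "taxi_a", 10)
print(index.taxis_near(110, 90, t_start=0, t_end=15))
```

Keeping the per-segment lists sorted lets a time-window query use binary search instead of scanning every record, which is the point of combining the spatial and temporal parts in one structure.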
+ Knowledge fusion across
heterogeneous data!
Harness a variety of heterogeneous data sources to fuse knowledge effectively
- Fuse different data sources at the feature level: put together the features extracted from different data sources into one feature vector before feeding it into a data analytics model
- Use different data at different stages (e.g., first partition a city into disjoint regions by major roads, and then use human mobility data to glean the problematic configuration of a city’s transportation network)
- Feed different data sets into different parts of a model simultaneously
  - E.g., infer the functional regions in a city using road network data, points of interest, and human mobility learned from a large number of taxi trips
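Feature-level fusion, the first strategy above, amounts to concatenating per-source features into one vector per region. The sketch below is illustrative only: the source names, feature names and the choice to fill missing values with 0.0 are assumptions, and a real pipeline would normally also normalize the features before feeding them to a model.

```python
def fuse_features(region_id, sources, feature_names):
    """Concatenate the features of one region from several data sources.

    sources:       source name -> {region ID -> {feature name -> value}}
    feature_names: source name -> ordered list of feature names to extract
    Sources are visited in alphabetical order so every region gets the same layout.
    Missing values are filled with 0.0 (an assumption made for this sketch).
    """
    vector = []
    for source in sorted(sources):
        features = sources[source].get(region_id, {})
        for name in feature_names[source]:
            vector.append(float(features.get(name, 0.0)))
    return vector


sources = {
    "poi": {"r1": {"restaurants": 12, "schools": 2}},
    "mobility": {"r1": {"pickups": 340}},
    "roads": {"r1": {"length_km": 8.5}},
}
feature_names = {
    "mobility": ["pickups", "dropoffs"],
    "poi": ["restaurants", "schools"],
    "roads": ["length_km"],
}
print(fuse_features("r1", sources, feature_names))
```

The other two strategies differ in where the sources enter the pipeline, at different stages or in different parts of one model, rather than in how a single vector is built.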
+ Urban data visualization!
Not solely about displaying raw data and presenting results, but also about detecting
and describing patterns, trends, and relations in data, motivated by certain
purposes of investigation
- Spatial distributions changing over time (i.e., spaces in time)
- Profiles of local temporal variation distributed over space (i.e., time in spaces) [Andrienko 2010]