Crime/Event Detection on Twitter
Data Sciences Summer Institute 2011
Multimodal Information Access & Synthesis,
University of Illinois at Urbana-Champaign
Our Team
Team members:
Elisee Habimana
Jicong Wang
Sridevi Maharaj
Ronald Doku
Mingjia Zhang
Tobias Kin Hou Lei
Ravi Khadiwala
Duber Gomez
Rui Yang
Project leaders:
Yizhou Sun
Rui Li
Motivation - why Twitter?
Wide Coverage
Real Time
Motivation - An Example
• An earthquake happened in Chile at 03:34 local time, Sat
Feb 27, 2010
• Traditional communication almost impossible for 2-3 hours,
first video image available 6-7 hours after quake
Source: <Information Credibility on Twitter>, by Carlos Castillo et al.
Motivation - Another Example
• Tweet posted at 2:22pm, June 28th, 20 minutes after the shot, while the first news report
appeared almost 3 hours later
Motivation
• Twitter reshapes the way people spread and receive
information
• The real-time feature makes Twitter a good source
of breaking news
• The official and verified accounts on Twitter provide reliable
information
• We propose to build a web application that provides
reliable, real-time, crime-related information
Demo
Table of Contents
• Major Challenges
• Crime Focused Crawling
• Tweet Classification
• Event Extraction
• Tweet Ranking
• Clustering
• Tools
• Summary
Major Challenges
• Most tweet contents are useless for us
o Pointless babble – 40%
o Conversational – 38%
o Pass-along value – 9%
o Self-promotion – 6%
o Spam – 4%
o News – 4%
o Crime related – 0.005%
• Roughly 10,000 crime related tweets each day
• Information like location and time not always explicit
• Display only the most important tweets
• Present results in an organized fashion
Source: <Twitter Study – August 2009> Kelly, Ryan, ed (August 12, 2009)
Project Flowchart
Crime Focused Crawling
Crawling crime related tweets from Twitter
Presented by Jicong Wang
A Snapshot of Twitter Data
USERID 43893075
ID 68542312782905344
TEXT Break shooting scene 1 "No More" with @dindamanda @yuyayuyi http://lockerz.com/s/100883315
LOCATION GeoLocation latitude=-6.196612, longitude=106.829552
PLACE
TIME Thu May 12 00:05:35 CDT 2011
URLS url=http://lockerz.com/s/100883315
MentionedEntities: 37623286 66072730
Hashtags:
also number of Followers, number of Friends, name of User, etc
NOT ALL TWEETS ARE CRIME RELATED!
ONLY about 0.005%!
Observation
Iteratively Refining Rules
• Repeat the above procedures until an ideal rule is obtained
Problem
However, there are STILL many "fake" crime tweets
Refine the Rules
• Single Keyword
e.g. crime, kill, death, police, cop, shot
• Combination of Keywords
• Key Phrases
o found shot OR died OR injured OR body
o armed OR unarmed robbery
o police on scene of
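The rule types above can be sketched as a small filter. This is a minimal sketch, not the team's crawler: the function names and the reading of each rule as a regular expression (for example, "found" followed by one of the listed words) are assumptions made for illustration.

```python
import re

# Single keywords from the slide: broad, so they also catch many "fake" crime tweets.
SINGLE_KEYWORDS = {"crime", "kill", "death", "police", "cop", "shot"}

# Refined rules from the slide, encoded as regexes (this encoding is an assumption).
REFINED_RULES = [
    re.compile(r"\bfound\s+(?:shot|died|injured|body)\b", re.IGNORECASE),
    re.compile(r"\b(?:armed|unarmed)\s+robbery\b", re.IGNORECASE),
    re.compile(r"\bpolice\s+on\s+scene\s+of\b", re.IGNORECASE),
]


def matches_single(text):
    """Loose rule: the tweet contains at least one single keyword."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return bool(tokens & SINGLE_KEYWORDS)


def matches_refined(text):
    """Stricter rule: the tweet matches at least one refined rule."""
    return any(rule.search(text) for rule in REFINED_RULES)
```

A tweet such as "I could kill for a pizza" passes the single-keyword rule but not the refined rules, which is the kind of "fake" crime tweet the refinement is meant to drop.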
Result
• Improved crawling result:
Keyword        Proportion of crime related tweets
Single         < 5%
Combination    50% among results from single keywords
• Crawling result: About 25,000 crawled tweets per day.
Over 13,000 users per day.
Tweet Classification
Determine whether a tweet is related to a crime event
Presented by Tobias Kin Hou Lei
Are these tweets related to crime?
A Classification approach
Features Engineering - Basic features
• Concept clusters
o Natural disaster: {earthquake, tornado, ...}
o Weapon: {weapon, weapons, gun, guns, gunshot, ...}
o Injure: {...}
o Burglar: {...}
o ...
o Non-Fire: {hilarious, weather, red, moon, sun, ..., musician,
pizza, cook, music, dance, justin bieber}
• Can generalize to unseen words, e.g. training on "tornado warning"
can help predict "earthquake warning".
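One way to see why concept clusters generalize to unseen words is to replace each word with its cluster name before counting features. The sketch below assumes that approach; the cluster contents are the small samples shown on the slide, and the names `CONCEPT_CLUSTERS` and `concept_features` are hypothetical.

```python
import re
from collections import Counter

# Two sample clusters from the slide; the real system used about 100.
CONCEPT_CLUSTERS = {
    "natural_disaster": {"earthquake", "tornado"},
    "weapon": {"weapon", "weapons", "gun", "guns", "gunshot"},
}

# Reverse index: word -> cluster name.
WORD_TO_CONCEPT = {
    word: concept
    for concept, words in CONCEPT_CLUSTERS.items()
    for word in words
}


def concept_features(text):
    """Count features, mapping clustered words to their cluster name."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Words outside every cluster are kept as plain word features.
    return Counter(WORD_TO_CONCEPT.get(token, token) for token in tokens)
```

With this mapping, "tornado warning" and "earthquake warning" produce identical feature counts, so a model trained on one can score the other.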
Traditional Classification Features
• Only Text Classification
• But tweets are short and noisy:
– at most 140 characters
– contain noisy words
– contain URLs and tags
Features Engineering - Social Features
• Special tags:
o #hpd
o #breaking news
Features Engineering - Social Features
• User as a feature
o List of verified police departments on Twitter
• URL
• Date
• Number
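The social features listed on these slides could be turned into boolean features roughly as follows. This is a sketch under assumptions: the account name in `VERIFIED_POLICE_ACCOUNTS` is a made-up placeholder, the spelling `#breakingnews` is a guess at the second tag, and "Date" and "Number" are read here as "the tweet text contains a date or a number".

```python
import re

SPECIAL_TAGS = {"#hpd", "#breakingnews"}   # second spelling is an assumption
VERIFIED_POLICE_ACCOUNTS = {"example_pd"}  # hypothetical placeholder account

URL_RE = re.compile(r"https?://\S+")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b")  # e.g. 6/28 or 6/28/2011


def social_features(text, user_name):
    """Return boolean social features for one tweet."""
    hashtags = {tag.lower() for tag in re.findall(r"#\w+", text)}
    # Remove URLs first so digits inside links do not count as numbers.
    text_without_urls = URL_RE.sub(" ", text)
    return {
        "has_special_tag": bool(hashtags & SPECIAL_TAGS),
        "from_verified_police": user_name.lower() in VERIFIED_POLICE_ACCOUNTS,
        "has_url": bool(URL_RE.search(text)),
        "has_date": bool(DATE_RE.search(text_without_urls)),
        "has_number": bool(re.search(r"\d", text_without_urls)),
    }
```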
Features Engineering - Social Features
Classification Model
• Naive Bayes
o Simple model that performs well for online
classification.
o With many meaningful features and enough training data, different
classification models perform similarly.
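For reference, a multinomial Naive Bayes classifier with add-one smoothing fits in a few lines. This is a generic textbook sketch, not the project's implementation; the class name and the list-of-features input format are assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, samples):
        """samples: iterable of (list_of_features, label) pairs."""
        for features, label in samples:
            self.class_counts[label] += 1
            self.feature_counts[label].update(features)
            self.vocabulary.update(features)

    def predict(self, features):
        """Return the label with the highest log posterior."""
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.feature_counts[label].values()) + vocab_size
            for feature in features:
                # Smoothed log likelihood of each observed feature.
                score += math.log((self.feature_counts[label][feature] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Training only updates counts, which is why the model suits online classification: new labeled tweets can be added without refitting from scratch.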
Training Data
• Crawled from Twitter over different periods of time
• Manually labeled by our team
• 2000 samples for training, among them:
o 60% positive samples
o 40% negative samples
• 1000 samples for testing
o 65% positive samples
o 35% negative samples
Summary
• About 100 concept clusters cover different areas of the
feature space
• Average accuracy on test set is 83.788%
Event Extraction
Extracting event information and grouping
Presented by Ravi Khadiwala
Event Extraction
• Within the text of an individual tweet there may be
information not previously found through data crawling
• This information is often useful to the user
o Allows user to visualize where crime occurred
o Allows user to filter the view by category
o Decreases the amount of raw tweets the user must read
• This information is also useful to improve performance
o Ranking
o Clustering
o Improves accuracy
The Social Location Web
Temporal/Spatial Information
Five potential sources of locations, listed in descending
order of perceived usefulness:
• GPS tagged tweets
latitude=57.8433342, longitude=12.6506338
• 'Place' tagged tweets
(57.6190897,12.427637),(57.6190897,12.7635394),
(57.8653997,12.7635394),(57.8653997,12.427637)
• User location
• Textual Location Extraction
o Named Entity Recognition
o Regular Expressions
Temporal/Spatial Information
• Location information hierarchically structured based on
reliability
• Use Named Entity Recognition
o Succeeds on: "I just witnessed a robbery in Champaign"
o Fails on: "Breaking and entering at 128 Maple St."
• Use regular expressions to recognize common formatting
of addresses, highways, etc.
• Time based on tweet time
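The reliability hierarchy can be sketched as a fallback chain that returns the first location source present on a tweet. The dictionary keys (`gps`, `place`, `user_location`, `text_location`) are hypothetical field names, and reducing a 'Place' bounding box to its center is an assumption made for illustration.

```python
def bounding_box_center(corners):
    """Average the corner coordinates of a 'Place' bounding box."""
    lats = [lat for lat, _ in corners]
    lons = [lon for _, lon in corners]
    return (sum(lats) / len(lats), sum(lons) / len(lons))


def resolve_location(tweet):
    """Return (source, value) from the most reliable source available."""
    if tweet.get("gps"):
        return ("gps", tweet["gps"])
    if tweet.get("place"):
        return ("place", bounding_box_center(tweet["place"]))
    if tweet.get("user_location"):
        return ("user_location", tweet["user_location"])
    if tweet.get("text_location"):
        # Location extracted from the text by NER or regular expressions.
        return ("text", tweet["text_location"])
    return (None, None)
```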
Regex Example
"[0-9]+ ([A-Z][A-Za-z]* )+
(ALLEE|ALLEY|ALLY|ALY|ANEX|ANNEX|ANNX|ANX|ARC|ARCADE|AV|AVE|AVEN|AVENU|AVENUE|AVN|AVNUE|
BAYOO|BAYOU|BCH|BEACH|BEND|BND|BLF|BLUF|BLUFF|BLUFFS|BOT|BOTTM|BOTTOM|BTM|BLVD|BOUL|BOULEVARD|
BOULV|BR|BRANCH|BRNCH|BRDGE|BRG|BRIDGE|BRK|BROOK|BROOKS|BURG|BURGS|BYP|BYPA|BYPAS|BYPASS|BYPS|
CAMP|CMP|CP|CANYN|CANYON|CNYN|CYN|CAPE|CPE|CAUSEWAY|CAUSWAY|CSWY|CEN|CENT|CENTER|CENTR|CENTRE|
CNTER|CNTR|CTR|CENTERS|CIR|CIRC|CIRCL|CIRCLE|CRCL|CRCLE|CIRCLES|CLF|CLIFF|CLFS|CLIFFS|CLB|CLUB|
COMMON|COR|CORNER|CORNERS|CORS|COURSE|CRSE|COURT|CRT|CT|COURTS|CTS|COVE|CV|COVES|CK|CR|CREEK|
CRK|CRECENT|CRES|CRESCENT|CRESENT|CRSCNT|CRSENT|CRSNT|CREST|CROSSING|CRSSING|CRSSNG|XING|
CROSSROAD|CURVE|DALE|DL|DAM|DM|DIV|DIVIDE|DV|DVD|DR|DRIV|DRIVE|DRV|DRIVES|EST|ESTATE|ESTATES|
ESTS|EXP|EXPR|EXPRESS|EXPRESSWAY|EXPW|EXPY|EXT|EXTENSION|EXTN|EXTNSN|EXTENSIONS|EXTS|FALL|
FALLS|FLS|FERRY|FRRY|FRY|FIELD|FLD|FIELDS|FLDS|FLAT|FLT|FLATS|FLTS|FORD|FRD|FORDS|FOREST|
FORESTS|FRST|FORG|FORGE|FRG|FORGES|FORK|FRK|FORKS|FRKS|FORT|FRT|FT|FREEWAY|FREEWY|FRWAY|FRWY|
FWY|GARDEN|GARDN|GDN|GRDEN|GRDN|GARDENS|GDNS|GRDNS|GATEWAY|GATEWY|GATWAY|GTWAY|GTWY|GLEN|GLN|
GLENS|GREEN|GRN|GREENS|GROV|GROVE|GRV|GROVES|HARB|HARBOR|HARBR|HBR|HRBOR|HARBORS|HAVEN|HAVN|
HVN|HEIGHT|HEIGHTS|HGTS|HT|HTS|HIGHWAY|HIGHWY|HIWAY|HIWY|HWAY|HWY|HILL|HL|HILLS|HLS|HLLW|
HOLLOW|HOLLOWS|HOLW|HOLWS|INLET|INLT|IS|ISLAND|ISLND|ISLANDS|ISLNDS|ISS|ISLE|ISLES|JCT|JCTION|
JCTN|JUNCTION|JUNCTN|JUNCTON|JCTNS|JCTS|JUNCTIONS|KEY|KY|KEYS|KYS|KNL|KNOL|KNOLL|KNLS|KNOLLS|
LAKE|LK|LAKES|LKS|LAND|LANDING|LNDG|LNDNG|LA|LANE|LANES|LN|LGT|LIGHT|LIGHTS|
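A shortened version of this pattern shows how it recovers the street address that Named Entity Recognition misses. The sketch keeps the slide's structure (house number, capitalized street words, suffix) but uses only a handful of suffixes. `ST` and `STREET` do not appear in the visible part of the slide's list and are added here as an assumption, as is the case-insensitive suffix match.

```python
import re

# A few suffixes from the slide, plus ST/STREET (assumed, beyond the visible list).
SUFFIXES = [
    "ALLEY", "AVE", "AVENUE", "BLVD", "BOULEVARD", "COURT", "CT",
    "DR", "DRIVE", "HIGHWAY", "HWY", "LANE", "LN", "ST", "STREET",
]

# House number, one or more capitalized words, then a street suffix.
# (?i:...) makes only the suffix case-insensitive, so "St" matches "ST".
ADDRESS_RE = re.compile(
    r"[0-9]+ (?:[A-Z][A-Za-z]* )+(?i:" + "|".join(SUFFIXES) + r")\b"
)


def extract_address(text):
    """Return the first street address found in the text, or None."""
    match = ADDRESS_RE.search(text)
    return match.group(0) if match else None
```

On the failing NER example, `extract_address("Breaking and entering at 128 Maple St.")` returns `"128 Maple St"`.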
Location Disambiguation
• Search extracted locations through a city-to-GPS lookup
table
• Many American city names are repeated (Atlanta, IL vs
Atlanta, GA)
o Check for well-formatted locations (city, state)
o If not, resolve by selecting the matched city with the largest
population
• Give preference to other location sources (like user
location and GPS) when there are multiple matches
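The two disambiguation rules can be sketched against a tiny lookup table. The table below is hypothetical: its coordinates and populations are rough illustrative values, not the project's data, and `disambiguate` is an assumed name.

```python
# Hypothetical table: (city, state) -> (latitude, longitude, population).
# All values are rough and for illustration only.
CITY_TABLE = {
    ("atlanta", "GA"): (33.75, -84.39, 420000),
    ("atlanta", "IL"): (40.26, -89.23, 1700),
    ("champaign", "IL"): (40.12, -88.24, 81000),
}


def disambiguate(location):
    """Map an extracted location string to (latitude, longitude), or None."""
    parts = [part.strip() for part in location.split(",")]
    city = parts[0].lower()
    if len(parts) == 2:
        # Well-formatted "city, state": look it up directly.
        entry = CITY_TABLE.get((city, parts[1].upper()))
        return entry[:2] if entry else None
    # Otherwise pick the matching city with the largest population.
    candidates = [value for (name, _), value in CITY_TABLE.items() if name == city]
    if not candidates:
        return None
    lat, lon, _ = max(candidates, key=lambda value: value[2])
    return (lat, lon)
```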
Categorization
• Would like categories with finer granularity than crime or
not crime
• Based on keyword partitions corresponding to categories,
ex:
o Robbery/Theft: {robbed,robbery,burglar,theft...}
o Natural Disaster: {tornado,typhoon,earthquake...}
• Keyword based crawling guarantees presence of words
that convey meaningful category information
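Keyword partitions lend themselves to a simple overlap count: the category whose keyword set shares the most words with the tweet wins. The sketch below assumes that scoring rule, which the slide does not spell out, and uses only the two example partitions shown.

```python
import re

# The two example partitions from the slide.
CATEGORY_KEYWORDS = {
    "Robbery/Theft": {"robbed", "robbery", "burglar", "theft"},
    "Natural Disaster": {"tornado", "typhoon", "earthquake"},
}


def categorize(text):
    """Return the category with the most keyword hits, or None if there are none."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {
        category: len(tokens & keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```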
Ranking
Scoring and Ordering Tweets based on
Importance
Presented by Ravi Khadiwala
Ranking
• We only want to display the best "n" tweets
o Nature of Twitter may result in an extremely variable amount of data
o Serves as another way to filter non-crime tweets
o May be able to highlight important events
• Summarize the most important data points
o Avoid overwhelming the user with results
Learning to Rank
Goal: Learn a function f: X -> r
where X is a vector of features
and r is an importance score
Strategy:
Take a pointwise approach and use a sample of manually
scored data to find the curve that fits our labeled data
We use linear regression with the simple least squares
method to find weights such that
r = w1x1 + w2x2 + w3x3 + ... + wnxn
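The least squares fit for this model (no intercept term, matching the formula above) can be sketched with the normal equations. This is a generic illustration in plain Python, not the team's code; the function names are assumptions, and it expects the features to be linearly independent so the system has a unique solution.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations update both sides together.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_least_squares(features, scores):
    """Return weights w minimizing sum((w . x - r)^2) via (X^T X) w = X^T r."""
    n = len(features[0])
    xtx = [[sum(x[i] * x[j] for x in features) for j in range(n)] for i in range(n)]
    xtr = [sum(x[i] * r for x, r in zip(features, scores)) for i in range(n)]
    return solve_linear_system(xtx, xtr)


def predict_score(weights, x):
    """Importance score r = w1*x1 + ... + wn*xn."""
    return sum(w * xi for w, xi in zip(weights, x))
```

Tweets can then be ordered by `predict_score` and only the top "n" displayed.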
Determine Ranking Features
• Selected from a large pool of potential features
• Social
o Number of hashtags, URLs, @ (indicates a reply), retweet count
• Contextual
o Tweet length, category, mentioned locations
• User Credibility
o Age of user account, friends, followers, status count, verification
• Classifier Confidence
Ranking Features and Weights
• Labeled ~500 tweets with a ranking (integer from 1 to 5)
• Linear regression on all features (normalized)
o Examined correlation coefficients
o Examined weights
o Pruned features
• Repeated until we had an adequate feature set with
logical weights
Ranking Features and Weights
Features        Weights
category        -0.996904004778
account age     2.87974471144
favorites       1.71671010105
status count    1.17242993534
followers       2.67005302808
confidence      -3.97882564778
Clustering
Geographical location: determinant for
grouping tweets together
Presented by Ronald Doku
Clustering tweets
• Clustering tweets means grouping overlapping tweets
found in the same location into one group.
Why is tweet clustering important?
• Clustered tweets inform the user about where most events
are happening at a particular time.
• The sizes of the clustered tweets also convey how relevant
or important the tweets are.
• e.g. A user may want to find out how far a wildfire
has spread. Clustered tweets about the
wildfire on the map show the user where the fire is or has
spread to.
Clustered tweets: high level overview
Clustered tweets: after click (California)
How do we cluster tweets?
We also define the zoom levels at which each tweet should appear,
so that tweets are clustered and fewer are
shown at a time. We
call this hierarchical clustering.
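One simple way to get zoom-dependent clusters is to bucket tweets into a grid whose cell size halves with each zoom level. The slides do not say how the system assigns zoom levels, so the grid scheme below is purely an assumption used to illustrate the idea.

```python
from collections import defaultdict


def cluster_by_zoom(points, zoom):
    """Group (lat, lon) points into grid cells; cells shrink as zoom grows."""
    cell_size = 360.0 / (2 ** zoom)  # degrees per cell at this zoom level
    clusters = defaultdict(list)
    for lat, lon in points:
        # Shift coordinates to be non-negative, then find the cell index.
        key = (int((lat + 90.0) // cell_size), int((lon + 180.0) // cell_size))
        clusters[key].append((lat, lon))
    return dict(clusters)
```

Zoomed out, nearby tweets fall into the same cell and appear as one marker; zooming in splits them into separate cells.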
Miscellaneous/Tools
Presented by Sridevi Maharaj
Tools
Summary
Conceptual Level
– Detects and monitors crime via a popular 21st-century social
media platform
Technical Level
– Developed crawler to obtain data
– Identified and explored useful features from social network to rank
and classify crime
System Level
– Built user-friendly system
– Works in real time
– Processes large collection of data
– iPhone interface supported
Questions?