Crime/Event Detection on Twitter
Data Sciences Summer Institute 2011
Multimodal Information Access & Synthesis,
University of Illinois at Urbana-Champaign
Our Team
Team members:
Elisee Habimana
Jicong Wang
Sridevi Maharaj
Ronald Doku
Mingjia Zhang
Tobias Kin Hou Lei
Ravi Khadiwala
Duber Gomez
Rui Yang
Project leaders:
Yizhou Sun
Rui Li
Motivation - why Twitter?
Wide Coverage
Real Time
Motivation - An Example
• An earthquake happened in Chile at 03:34 local time, Sat
Feb 27, 2010
• Traditional communication almost impossible for 2-3 hours,
first video image available 6-7 hours after quake
Source: <Information Credibility on Twitter>, by Carlos Castillo et al.
Motivation - Another Example
• Tweet posted at 2:22pm, June 28th, 20 minutes after the shot, while the first news report
appeared almost 3 hours later
Motivation
• Twitter reshapes the way people spread and receive
information
• The real-time feature makes Twitter a good source
of breaking news
• The official and verified accounts on Twitter provide reliable
information
• We propose to build a web application that provides
reliable, real-time, crime-related information
Demo
Table of Contents
• Major Challenges
• Crime Focused Crawling
• Tweet Classification
• Event Extraction
• Tweet Ranking
• Clustering
• Tools
• Summary
Major Challenges
• Most tweet contents are useless for us
o Pointless babble – 40%
o Conversational – 38%
o Pass-along value – 9%
o Self-promotion – 6%
o Spam – 4%
o News – 4%
o Crime related – 0.005%
• Roughly 10,000 crime related tweets each day
• Information like location and time not always explicit
• Display only the most important tweets
• Present results in an organized fashion
Source: <Twitter Study – August 2009> Kelly, Ryan, ed (August 12, 2009)
Project Flowchart
Crime Focused Crawling
Crawling crime related tweets from Twitter
Presented by Jicong Wang
A Snapshot of Twitter Data
USERID 43893075
ID 68542312782905344
TEXT Break shooting scene 1 "No More" with @dindamanda @yuyayuyi http://lockerz.com/s/100883315
LOCATION GeoLocation latitude=-6.196612, longitude=106.829552
PLACE
TIME Thu May 12 00:05:35 CDT 2011
URLS url=http://lockerz.com/s/100883315
MentionedEntities: 37623286 66072730
Hashtags:
also number of Followers, number of Friends, name of User, etc
NOT ALL TWEETS ARE CRIME RELATED!
ONLY about 0.005%!
Observation
Iteratively Refining Rules
• Repeat the above procedures until an ideal rule is obtained
Problem
However, there are STILL many "fake" crime tweets
Refine the Rules
Single Keyword
e.g. crime, kill, death, police, cop, shot
Combination of Keywords / Key Phrases
o found shot OR died OR injured OR body
o armed OR unarmed robbery
o police on scene of
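The difference between single keywords and keyword combinations can be sketched as below. This is a minimal illustration, not the team's crawler: the function names are hypothetical, and reading "found shot OR died OR injured OR body" as "found" co-occurring with any of the other four words is an assumption.

```python
import re

# Single keywords from the slide; alone they match many non-crime tweets.
SINGLE_KEYWORDS = {"crime", "kill", "death", "police", "cop", "shot"}

def tokens_of(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matches_single_keyword(text):
    """Loose rule: any single keyword appears in the tweet."""
    return bool(tokens_of(text) & SINGLE_KEYWORDS)

def matches_rule(text):
    """Stricter rule: a keyword combination or key phrase appears.

    Assumption: 'found shot OR died OR injured OR body' means the word
    'found' together with at least one of the other four words.
    """
    lowered = text.lower()
    tokens = tokens_of(text)
    if "found" in tokens and tokens & {"shot", "died", "injured", "body"}:
        return True
    if re.search(r"\b(armed|unarmed) robbery\b", lowered):
        return True
    return "police on scene of" in lowered
```

A tweet such as "he shot the video" passes the single-keyword test but fails the combination rules, which is the kind of "fake" crime tweet the refinement is meant to drop.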
Result
• Improved crawling result:
Keyword        Proportion of crime related tweets
Single         < 5%
Combination    50% among results from single keywords
• Crawling result: About 25,000 crawled tweets per day. Over 13,000 users per day.
Tweet Classification
Determine whether a tweet is related to a crime event
Presented by Tobias Kin Hou Lei
Are these tweets related to crime?
A Classification approach
Features Engineering - Basic features
• Concept clusters
o Natural disaster: {earthquake, tornado, ...}
o Weapon: {weapon, weapons, gun, guns, gunshot, ...}
o Injure: {...}
o Burglar: {...}
o ...
o Non-Fire: {hilarious, weather, red, moon, sun, ..., musician,
pizza, cook, music, dance, justin bieber}
• Could predict unseen words, e.g. train on "tornado warning",
then predict "earthquake warning".
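A small sketch of how concept clusters let a model generalize to unseen words: every word in a cluster is replaced by the cluster name before counting features. The cluster contents come from the slide; the dictionary layout and the function name `cluster_features` are assumptions made for illustration.

```python
import re
from collections import Counter

# Two of the clusters listed on the slide (truncated lists kept short).
CONCEPT_CLUSTERS = {
    "natural_disaster": {"earthquake", "tornado"},
    "weapon": {"weapon", "weapons", "gun", "guns", "gunshot"},
}

# Reverse index: word -> cluster name.
WORD_TO_CLUSTER = {
    word: cluster
    for cluster, words in CONCEPT_CLUSTERS.items()
    for word in words
}

def cluster_features(text):
    """Count tokens, replacing clustered words by their cluster name."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(WORD_TO_CLUSTER.get(tok, tok) for tok in tokens)
```

With this mapping, "tornado warning" and "earthquake warning" produce identical feature counts, so a classifier trained on one can score the other.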
Traditional Classification Features
Text-only classification
But tweets are short and noisy:
– at most 140 characters
– contain noisy words
– contain URLs and tags
Features Engineering - Social Features
• Special tags:
o #hpd
o #breaking news
Features Engineering - Social Features
• User as a feature
o List of verified police departments on Twitter
• URL
• Date
• Number
Features Engineering - Social Features
Classification Model
• Naive Bayes
o Easy, well-performing model for online classification.
o With many meaningful features and ample training data, different
classification models perform similarly.
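For readers unfamiliar with the model, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. It uses plain word counts only; the team's concept-cluster and social features are not reproduced, and the class name and toy sentences in the usage note are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and keep alphabetic tokens plus # and @ prefixes."""
    return re.findall(r"[a-z#@]+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.class_counts = Counter()            # label -> number of samples
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.vocab = set()

    def train(self, samples):
        """samples: iterable of (text, label) pairs."""
        for text, label in samples:
            self.class_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        """Return the label with the highest log posterior."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                # Add-one (Laplace) smoothing for unseen tokens.
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage on made-up data: after training on a few labeled sentences such as ("police report armed robbery", "crime") and ("pizza and music tonight", "other"), `predict("armed robbery downtown")` returns the label whose word statistics fit best.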
Training Data
• Crawled from Twitter over different time periods
• Manually labeled by our team
• 2000 samples for training, among them:
o 60% positive samples
o 40% negative samples
• 1000 samples for testing
o 65% positive samples
o 35% negative samples
Summary
• About 100 concept clusters cover different areas of the
feature space
• Average accuracy on test set is 83.788%
Event Extraction
Extracting event information and grouping
Presented by Ravi Khadiwala
Event Extraction
• Within the text of an individual tweet there may be
information not previously found through data crawling
• This information is often useful to the user
o Allows user to visualize where crime occurred
o Allows user to filter the view by category
o Decreases the number of raw tweets the user must read
• This information is also useful to improve performance
o Ranking
o Clustering
o Improves accuracy
The Social Location Web
Temporal/Spatial Information
Five potential sources of locations, listed in descending
order of perceived usefulness:
• GPS tagged tweets
latitude=57.8433342, longitude=12.6506338
• 'Place' tagged tweets
(57.6190897,12.427637),(57.6190897,12.7635394),
(57.8653997,12.7635394),(57.8653997,12.427637)
• User location
• Textual Location Extraction
o Named Entity Recognition
o Regular Expressions
Temporal/Spatial Information
• Location information hierarchically structured based on
reliability
• Use Named Entity Recognition
o Succeeds on: "I just witnessed a robbery in Champaign"
o Fails on: "Breaking and entering at 128 Maple St."
• Use regular expressions to recognize common formatting
of addresses, highways, etc.
• Time based on tweet time
Regex Example
"[0-9]+ ([A-Z][A-Za-z]* )+
(ALLEE|ALLEY|ALLY|ALY|ANEX|ANNEX|ANNX|ANX|ARC|ARCADE|AV|AVE|AVEN|AVENU|AVENUE|AVN|AVNUE|
BAYOO|BAYOU|BCH|BEACH|BEND|BND|BLF|BLUF|BLUFF|BLUFFS|BOT|BOTTM|BOTTOM|BTM|BLVD|BOUL|BOULEVARD|BOULV|
BR|BRANCH|BRNCH|BRDGE|BRG|BRIDGE|BRK|BROOK|BROOKS|BURG|BURGS|BYP|BYPA|BYPAS|BYPASS|BYPS|
CAMP|CMP|CP|CANYN|CANYON|CNYN|CYN|CAPE|CPE|CAUSEWAY|CAUSWAY|CSWY|
CEN|CENT|CENTER|CENTR|CENTRE|CNTER|CNTR|CTR|CENTERS|CIR|CIRC|CIRCL|CIRCLE|CRCL|CRCLE|CIRCLES|
CLF|CLIFF|CLFS|CLIFFS|CLB|CLUB|COMMON|COR|CORNER|CORNERS|CORS|COURSE|CRSE|COURT|CRT|CT|COURTS|CTS|
COVE|CV|COVES|CK|CR|CREEK|CRK|CRECENT|CRES|CRESCENT|CRESENT|CRSCNT|CRSENT|CRSNT|CREST|
CROSSING|CRSSING|CRSSNG|XING|CROSSROAD|CURVE|DALE|DL|DAM|DM|DIV|DIVIDE|DV|DVD|
DR|DRIV|DRIVE|DRV|DRIVES|EST|ESTATE|ESTATES|ESTS|EXP|EXPR|EXPRESS|EXPRESSWAY|EXPW|EXPY|
EXT|EXTENSION|EXTN|EXTNSN|EXTENSIONS|EXTS|FALL|FALLS|FLS|FERRY|FRRY|FRY|FIELD|FLD|FIELDS|FLDS|
FLAT|FLT|FLATS|FLTS|FORD|FRD|FORDS|FOREST|FORESTS|FRST|FORG|FORGE|FRG|FORGES|FORK|FRK|FORKS|FRKS|
FORT|FRT|FT|FREEWAY|FREEWY|FRWAY|FRWY|FWY|GARDEN|GARDN|GDN|GRDEN|GRDN|GARDENS|GDNS|GRDNS|
GATEWAY|GATEWY|GATWAY|GTWAY|GTWY|GLEN|GLN|GLENS|GREEN|GRN|GREENS|GROV|GROVE|GRV|GROVES|
HARB|HARBOR|HARBR|HBR|HRBOR|HARBORS|HAVEN|HAVN|HVN|HEIGHT|HEIGHTS|HGTS|HT|HTS|
HIGHWAY|HIGHWY|HIWAY|HIWY|HWAY|HWY|HILL|HL|HILLS|HLS|HLLW|HOLLOW|HOLLOWS|HOLW|HOLWS|
INLET|INLT|IS|ISLAND|ISLND|ISLANDS|ISLNDS|ISS|ISLE|ISLES|
JCT|JCTION|JCTN|JUNCTION|JUNCTN|JUNCTON|JCTNS|JCTS|JUNCTIONS|KEY|KY|KEYS|KYS|
KNL|KNOL|KNOLL|KNLS|KNOLLS|LAKE|LK|LAKES|LKS|LAND|LANDING|LNDG|LNDNG|LA|LANE|LANES|LN|LGT|LIGHT|LIGHTS|
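A shortened, runnable version of the same idea is sketched below. It keeps the slide's shape (house number, one or more capitalized words, street suffix) but uses only a handful of suffixes. "ST" and "STREET" are an assumption, since the slide's list is cut off before the letter S, and matching the suffix case-insensitively is also an assumption.

```python
import re

# Small subset of street suffixes. ST/STREET are assumed, not on the slide.
SUFFIXES = [
    "AVE", "AVENUE", "BLVD", "BOULEVARD", "CT", "COURT",
    "DR", "DRIVE", "HWY", "HIGHWAY", "LN", "LANE", "ST", "STREET",
]

# House number, capitalized street-name words, then a suffix.
# (?i:...) makes only the suffix part case-insensitive.
ADDRESS_RE = re.compile(
    r"\b[0-9]+ (?:[A-Z][A-Za-z]* )+(?i:" + "|".join(SUFFIXES) + r")\b"
)

def extract_addresses(text):
    """Return all street-address-like substrings found in the text."""
    return [m.group(0) for m in ADDRESS_RE.finditer(text)]
```

This pattern finds the address in "Breaking and entering at 128 Maple St.", the case where Named Entity Recognition failed, and finds nothing in "I just witnessed a robbery in Champaign".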
Location Disambiguation
• Search extracted locations through a city-to-GPS lookup
table
• Many American city names are repeated (Atlanta, IL vs
Atlanta, GA)
o Check for well-formatted locations (city, state)
o If not, resolve by selecting the matched city with the largest
population
• Give preference to other location sources (like user
location and GPS) when there are multiple matches
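The first two disambiguation steps could look roughly like this. The lookup table, its rough coordinates and population figures, and the function name `resolve` are all illustrative assumptions; the preference for user location and GPS is not modeled here.

```python
import re

# Hypothetical lookup table: (city, state) -> (lat, lon, population).
# Values are rough figures for illustration only.
CITY_TABLE = {
    ("atlanta", "GA"): (33.75, -84.39, 420000),
    ("atlanta", "IL"): (40.26, -89.23, 1700),
}

def resolve(location_text):
    """Return the (city, state) key for a location string, or None."""
    # Step 1: a well-formatted "city, state" string is looked up directly.
    m = re.fullmatch(r"\s*([A-Za-z .]+?)\s*,\s*([A-Za-z]{2})\s*", location_text)
    if m:
        key = (m.group(1).lower(), m.group(2).upper())
        if key in CITY_TABLE:
            return key
    # Step 2: otherwise pick the matching city with the largest population.
    name = location_text.strip().lower()
    candidates = [key for key in CITY_TABLE if key[0] == name]
    if not candidates:
        return None
    return max(candidates, key=lambda key: CITY_TABLE[key][2])
```

With this table, "Atlanta,IL" resolves to the Illinois entry, while a bare "Atlanta" falls back to the larger Georgia city.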
Categorization
• Would like categories with finer granularity than crime or
not crime
• Based on keyword partitions corresponding to categories,
e.g.:
o Robbery/Theft: {robbed,robbery,burglar,theft...}
o Natural Disaster: {tornado,typhoon,earthquake...}
• Keyword based crawling guarantees presence of words
that convey meaningful category information
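A minimal sketch of keyword-partition categorization follows. The two keyword sets are the ones shown on the slide; choosing the category with the most keyword hits is an assumption, as is the function name.

```python
import re

# Keyword partitions from the slide (lists are truncated there as well).
CATEGORY_KEYWORDS = {
    "Robbery/Theft": {"robbed", "robbery", "burglar", "theft"},
    "Natural Disaster": {"tornado", "typhoon", "earthquake"},
}

def categorize(text):
    """Return the category with the most keyword hits, or None if no hit."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {
        category: len(tokens & keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Because the crawler already selects tweets by keyword, most crawled tweets should hit at least one partition, which is why this simple lookup can be enough.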
Ranking
Scoring and Ordering Tweets based on
Importance
Presented by Ravi Khadiwala
Ranking
• We only want to display the best "n" tweets
o Nature of Twitter may result in an extremely variable amount of data
o Serves as another way to filter non-crime tweets
o May be able to highlight important events
• Summarize the most important data points
o Avoid overwhelming the user with results
Learning to Rank
Goal: Learn a function f: X -> r
where X is a vector of features
and r is an importance score
Strategy:
Take a pointwise approach and use a sample of manually
scored data to find the curve that fits our labeled data
We use linear regression with the simple least squares
method to find weights such that
r = w_1 x_1 + w_2 x_2 + w_3 x_3 + ... + w_n x_n
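The least squares fit can be sketched in plain Python by solving the normal equations (X^T X) w = X^T y. This is an illustrative stand-in, not the team's code: the function names are hypothetical, there is no intercept term (matching the formula above), and the sketch assumes X^T X is invertible.

```python
def fit_least_squares(X, y):
    """Return weights w minimizing sum((w . x_i - y_i)^2).

    Solves (X^T X) w = X^T y by Gauss-Jordan elimination.
    Assumes X^T X is invertible (features are not redundant).
    """
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(n)]
        + [sum(row[i] * target for row, target in zip(X, y))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

def score(w, x):
    """Importance score r = w1*x1 + ... + wn*xn."""
    return sum(wi * xi for wi, xi in zip(w, x))
```

Once the weights are fitted on the manually scored sample, tweets can be ordered by `score(w, features)` and only the top "n" displayed.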
Determine Ranking Features
• Selected from a large pool of potential features
• Social
o Number of hashtags, URLs, @ (indicates a reply), retweet count
• Contextual
o Tweet length, category, mentioned locations
• User Credibility
o Age of user account, friends, followers, status count, verification
• Classifier Confidence
Ranking Features and Weights
• Labeled ~500 tweets with a ranking (integer from 1 to 5)
• Linear regression on all features (normalized)
o Examined correlation coefficients
o Examined weights
o Pruned features
• Repeated until we had an adequate feature set with
logical weights
Ranking Features and Weights
Features        Weights
category        -0.996904004778
account age      2.87974471144
favorites        1.71671010105
status count     1.17242993534
followers        2.67005302808
confidence      -3.97882564778
Clustering
Geographical location: determinant for
grouping tweets together
Presented by Ronald Doku
Clustering tweets
• Clustering of tweets means to group overlapping tweets
found in the same location into one category.
Why is tweet clustering important?
• Clustered tweets inform the user about where most events
are happening at a particular time.
• The sizes of the clustered tweets also convey how relevant
or important the tweets are.
• e.g. A user may want to find out how far a wildfire outbreak
has spread. Clustered tweets about the wildfire on the map
show the user where the fire is or has spread to.
Clustered tweets: high level overview
Clustered tweets: after click (California)
How do we cluster tweets?
By defining the zoom levels at which each tweet should appear,
we cluster the tweets to reduce the number shown at a time.
We call this hierarchical clustering.
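The slides do not say how tweets are assigned to zoom levels, so the following is only one plausible sketch: a grid whose cell size halves with each zoom level, with all tweets in a cell merged into one marker. The function name, the grid scheme and the cell-size formula are assumptions.

```python
from collections import defaultdict

def cluster_by_zoom(points, zoom):
    """Group (lat, lon) points into grid cells for a given zoom level.

    Assumption: the cell edge is 360 / 2**zoom degrees, so zooming in by
    one level halves the cell size and splits clusters apart.
    Returns one marker per non-empty cell with its centroid and size.
    """
    cell = 360.0 / (2 ** zoom)
    cells = defaultdict(list)
    for lat, lon in points:
        key = (int((lat + 90) // cell), int((lon + 180) // cell))
        cells[key].append((lat, lon))
    return [
        {
            "lat": sum(p[0] for p in pts) / len(pts),
            "lon": sum(p[1] for p in pts) / len(pts),
            "count": len(pts),
        }
        for pts in cells.values()
    ]
```

At a low zoom level, tweets from two cities in the same state collapse into a single marker with a count of 2; zooming in separates them, similar to the California overview and after-click views.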
Miscellaneous/Tools
Presented by Sridevi Maharaj
Tools
Summary
Conceptual Level
– Detects and monitors crime via a popular 21st-century social
medium
Technical Level
– Developed crawler to obtain data
– Identified and explored useful features from the social network to rank
and classify crime-related tweets
System Level
– Built user-friendly system
– Works in real time
– Processes large collection of data
– iPhone interface supported
Questions?