Crime/Event Detection on Twitter
Data Sciences Summer Institute 2011
Multimodal Information Access & Synthesis, University of Illinois at Urbana-Champaign

Our Team
Team members: Elisee Habimana, Jicong Wang, Sridevi Maharaj, Ronald Doku, Mingjia Zhang, Tobias Kin Hou Lei, Ravi Khadiwala, Duber Gomez, Rui Yang
Project leaders: Yizhou Sun, Rui Li

Motivation - Why Twitter?
• Wide coverage
• Real time

Motivation - An Example
• An earthquake struck Chile at 03:34 local time on Saturday, Feb 27, 2010
• Traditional communication was almost impossible for 2-3 hours; the first video images became available 6-7 hours after the quake
Source: <Information Credibility on Twitter>, by Carlos Castillo et al.

Motivation - Another Example
• A tweet was posted at 2:22pm on June 28th, 20 minutes after the shot, while the first news report appeared almost 3 hours later

Motivation
• Twitter reshapes the way people spread and receive information
• Its real-time nature makes Twitter a good source of breaking news
• Official and verified accounts on Twitter provide reliable information
• We propose to build a web application that provides reliable, real-time, crime-related information

Demo

Table of Contents
• Major Challenges
• Crime Focused Crawling
• Tweet Classification
• Event Extraction
• Tweet Ranking
• Clustering
• Tools
• Summary

Major Challenges
• Most tweet content is useless for us
   o Pointless babble – 40%
   o Conversational – 38%
   o Pass-along value – 9%
   o Self-promotion – 6%
   o Spam – 4%
   o News – 4%
   o Crime related – 0.005%
• Roughly 10,000 crime-related tweets each day
• Information like location and time is not always explicit
• Display only the most important tweets
• Present results in an organized fashion
Source: <Twitter Study – August 2009> Kelly, Ryan, ed (August 12, 2009)

Project Flowchart

Crime Focused Crawling
Crawling crime-related tweets from Twitter
Presented by
Jicong Wang

A Snapshot of Twitter Data
USERID: 43893075
ID: 68542312782905344
TEXT: Break shooting scene 1 "No More" with @dindamanda @yuyayuyi http://lockerz.com/s/100883315
LOCATION: GeoLocation latitude=-6.196612, longitude=106.829552
PLACE:
TIME: Thu May 12 00:05:35 CDT 2011
URLS: url=http://lockerz.com/s/100883315
MentionedEntities: 37623286 66072730
Hashtags:
Also: number of followers, number of friends, name of user, etc.

NOT ALL TWEETS ARE CRIME RELATED! ONLY about 0.005%!

Observation

Iteratively Refining Rules
• Repeat the above procedures until an ideal rule is obtained

Problem
However, there are STILL many "fake" crime tweets

Refine the Rules
• Single Keyword, e.g. crime, kill, death, police, cop, shot
• Combination of Keywords
• Key Phrases
   o found shot OR died OR injured OR body
   o armed OR unarmed robbery
   o police on scene of

Result
• Improved crawling result:
   Keyword       Proportion of crime-related tweets
   Single        < 5%
   Combination   50% (among results from single keywords)
• Crawling result: About 25,000 crawled tweets per day. Over 13,000 users per day.

Tweet Classification
Determine whether a tweet is related to an event
Presented by Tobias Kin Hou Lei

Are these tweets related to crime?

A Classification Approach

Feature Engineering - Basic Features
• Concept clusters
   o Natural disaster: {earthquake,tornado, ...}
   o Weapon: {weapon,weapons,gun,guns,gunshot, ...}
   o Injure: {...}
   o Burglar: {...}
   o ...
   o Non-Fire: {hilarious,weather,red,moon,sun, ... ,musician,pizza,cook,music,dance,justin bieber}
• Can generalize to unseen words, e.g. after training on "tornado warning", the classifier could predict "earthquake warning".

Traditional Classification Features
• Text-only classification
• But tweets are short and noisy:
   – at most 140 characters
   – contain noisy words
   – contain URLs, tags

Feature Engineering - Social Features
• Special tags:
   o #hpd
   o #breaking news
• User as a feature
   o List of verified police departments on Twitter
• URL
• Date
• Number

Classification Model
• Naive Bayes
   o A simple model that performs well for online classification.
   o With many meaningful features and plenty of training data, different classification models give similar results.

Training Data
• Crawled from Twitter over different time periods
• Manually labeled by our team
• 2000 samples for training, among them:
   o 60% positive samples
   o 40% negative samples
• 1000 samples for testing
   o 65% positive samples
   o 35% negative samples

Summary
• About 100 concept clusters cover different areas of the feature space
• Average accuracy on the test set is 83.788%

Event Extraction
Extracting event information and grouping
Presented by Ravi Khadiwala

Event Extraction
• Within the text of an individual tweet there may be information not previously found through data crawling
• This information is often useful to the user
   o Allows the user to visualize where a crime occurred
   o Allows the user to filter the view by category
   o Decreases the number of raw tweets the user must read
• This information is also useful for improving performance
   o Ranking
   o Clustering
   o Improves accuracy

The Social Location Web

Temporal/Spatial Information
Five potential sources of locations, listed in descending order of perceived usefulness:
• GPS-tagged tweets
   latitude=57.8433342, longitude=12.6506338
• 'Place'-tagged tweets
   (57.6190897,12.427637),(57.6190897,12.7635394),(57.8653997,12.7635394),(57.8653997,12.427637)
• User location
• Textual location extraction
   o Named Entity Recognition
   o Regular expressions

Temporal/Spatial Information
• Location information is hierarchically structured based on reliability
• Use Named Entity Recognition
   o Succeeds on: "I just
     witnessed a robbery in Champaign"
   o Fails on: "Breaking and entering at 128 Maple St."
• Use regular expressions to recognize common formatting of addresses, highways, etc.
• Time is based on the tweet's timestamp

Regex Example
"[0-9]+ ([A-Z][A-Za-z]* )+(ALLEE|ALLEY|ALLY|ALY|ANEX|ANNEX|ANNX|ANX|ARC|ARCADE|AV|AVE|AVEN|AVENU|AVENUE|AVN|AVNUE|BAYOO|BAYOU|BCH|BEACH|BEND|BND|BLF|BLUF|BLUFF|BLUFFS|BOT|BOTTM|BOTTOM|BTM|BLVD|BOUL|BOULEVARD|BOULV|BR|BRANCH|BRNCH|BRDGE|BRG|BRIDGE|BRK|BROOK|BROOKS|BURG|BURGS|BYP|BYPA|BYPAS|BYPASS|BYPS|CAMP|CMP|CP|CANYN|CANYON|CNYN|CYN|CAPE|CPE|CAUSEWAY|CAUSWAY|CSWY|CEN|CENT|CENTER|CENTR|CENTRE|CNTER|CNTR|CTR|CENTERS|CIR|CIRC|CIRCL|CIRCLE|CRCL|CRCLE|CIRCLES|CLF|CLIFF|CLFS|CLIFFS|CLB|CLUB|COMMON|COR|CORNER|CORNERS|CORS|COURSE|CRSE|COURT|CRT|CT|COURTS|CTS|COVE|CV|COVES|CK|CR|CREEK|CRK|CRECENT|CRES|CRESCENT|CRESENT|CRSCNT|CRSENT|CRSNT|CREST|CROSSING|CRSSING|CRSSNG|XING|CROSSROAD|CURVE|DALE|DL|DAM|DM|DIV|DIVIDE|DV|DVD|DR|DRIV|DRIVE|DRV|DRIVES|EST|ESTATE|ESTATES|ESTS|EXP|EXPR|EXPRESS|EXPRESSWAY|EXPW|EXPY|EXT|EXTENSION|EXTN|EXTNSN|EXTENSIONS|EXTS|FALL|FALLS|FLS|FERRY|FRRY|FRY|FIELD|FLD|FIELDS|FLDS|FLAT|FLT|FLATS|FLTS|FORD|FRD|FORDS|FOREST|FORESTS|FRST|FORG|FORGE|FRG|FORGES|FORK|FRK|FORKS|FRKS|FORT|FRT|FT|FREEWAY|FREEWY|FRWAY|FRWY|FWY|GARDEN|GARDN|GDN|GRDEN|GRDN|GARDENS|GDNS|GRDNS|GATEWAY|GATEWY|GATWAY|GTWAY|GTWY|GLEN|GLN|GLENS|GREEN|GRN|GREENS|GROV|GROVE|GRV|GROVES|HARB|HARBOR|HARBR|HBR|HRBOR|HARBORS|HAVEN|HAVN|HVN|HEIGHT|HEIGHTS|HGTS|HT|HTS|HIGHWAY|HIGHWY|HIWAY|HIWY|HWAY|HWY|HILL|HL|HILLS|HLS|HLLW|HOLLOW|HOLLOWS|HOLW|HOLWS|INLET|INLT|IS|ISLAND|ISLND|ISLANDS|ISLNDS|ISS|ISLE|ISLES|JCT|JCTION|JCTN|JUNCTION|JUNCTN|JUNCTON|JCTNS|JCTS|JUNCTIONS|KEY|KY|KEYS|KYS|KNL|KNOL|KNOLL|KNLS|KNOLLS|LAKE|LK|LAKES|LKS|LAND|LANDING|LNDG|LNDNG|LA|LANE|LANES|LN|LGT|LIGHT|LIGHTS|...

Location Disambiguation
• Search extracted locations through a city-to-GPS lookup table
• Many American city names are repeated (Atlanta, IL vs Atlanta, GA)
   o Check for well-formatted
     locations (city, state)
   o If not, resolve by selecting the matched city with the largest population
• Give preference to other location sources (like user location and GPS) when there are multiple matches

Categorization
• Would like categories with finer granularity than crime or not crime
• Based on keyword partitions corresponding to categories, e.g.:
   o Robbery/Theft: {robbed,robbery,burglar,theft...}
   o Natural Disaster: {tornado,typhoon,earthquake...}
• Keyword-based crawling guarantees the presence of words that convey meaningful category information

Ranking
Scoring and ordering tweets based on importance
Presented by Ravi Khadiwala

Ranking
• We only want to display the best "n" tweets
   o The nature of Twitter may result in an extremely variable amount of data
   o Serves as another way to filter non-crime tweets
   o May be able to highlight important events
• Summarize the most important data points
   o Avoid overwhelming the user with results

Learning to Rank
Goal: Learn a function f: X -> r, where X is a vector of features and r is an importance score
Strategy: Take a pointwise approach: use a sample of manually scored data to find the curve that fits our labeled data
We use linear regression with the simple least squares method to find weights such that
r = w_1 x_1 + w_2 x_2 + w_3 x_3 + ... + w_n x_n

Determine Ranking Features
• Selected from a large pool of potential features
• Social
   o Number of hashtags, URLs, @ (indicates a reply), retweet count
• Contextual
   o Tweet length, category, mentioned locations
• User Credibility
   o Age of user account, friends, followers, status count, verification
• Classifier Confidence

Ranking Features and Weights
• Labeled ~500 tweets with a ranking (integer from 1 to 5)
• Linear regression on all features (normalized)
   o Examined correlation coefficients
   o Examined weights
   o Pruned features
• Repeated until we had an adequate feature set with logical weights

Ranking Features and Weights
   Feature        Weight
   category       -0.996904004778
   account age    2.87974471144
   favorites      1.71671010105
   status count   1.17242993534
   followers      2.67005302808
   confidence     -3.97882564778

Clustering
Geographical location: the determinant for grouping tweets together
Presented by Ronald Doku

Clustering Tweets
• Clustering tweets means grouping overlapping tweets found in the same location into one category.

Why is tweet clustering important?
• Clustered tweets inform the user about where most events are happening at a particular time.
• The sizes of the clusters also convey how relevant or important the tweets are.
• E.g., a user may want to find out how far a wildfire has spread. Clustered tweets about the wildfire on the map show the user where the fire is or has spread to.

Clustered tweets: high-level overview

Clustered tweets: after click (California)

How do we cluster tweets?
Also, by defining at which zoom levels each tweet should appear, we cluster the tweets to reduce the number shown at a time. We call this hierarchical clustering.
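
The zoom-level idea can be illustrated with a small sketch. This is not the team's implementation: it assumes a plain latitude/longitude grid with 2**zoom cells per axis, and the names `cell_for` and `cluster_by_zoom` are hypothetical. Tweets that fall into the same cell at a given zoom level are merged into one cluster, so zooming out yields fewer, larger clusters and zooming in splits them apart.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def cell_for(lat: float, lon: float, zoom: int) -> Tuple[int, int]:
    """Map a coordinate to a (row, col) grid cell.

    Assumption: a simple equirectangular grid with 2**zoom cells per axis,
    chosen only for illustration.
    """
    n = 2 ** zoom
    col = min(int((lon + 180.0) / 360.0 * n), n - 1)
    row = min(int((lat + 90.0) / 180.0 * n), n - 1)
    return row, col


def cluster_by_zoom(points: List[Point], zoom: int) -> List[dict]:
    """Group points sharing a grid cell; return clusters, largest first."""
    cells: Dict[Tuple[int, int], List[Point]] = defaultdict(list)
    for lat, lon in points:
        cells[cell_for(lat, lon, zoom)].append((lat, lon))

    clusters = []
    for members in cells.values():
        clusters.append({
            # The centroid is where a map marker could be drawn.
            "lat": sum(p[0] for p in members) / len(members),
            "lon": sum(p[1] for p in members) / len(members),
            # The size conveys how much activity the cluster represents.
            "size": len(members),
        })
    return sorted(clusters, key=lambda c: c["size"], reverse=True)


if __name__ == "__main__":
    # Made-up coordinates: two near Champaign, one each near Los Angeles
    # and San Francisco.
    tweets = [(40.11, -88.24), (40.12, -88.21), (34.05, -118.24), (37.77, -122.42)]
    for zoom in (0, 2, 6):
        print(zoom, cluster_by_zoom(tweets, zoom))
```

Computing the clusters per zoom level keeps the number of markers on screen small at a coarse view while still letting the user drill down to individual tweets.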

Miscellaneous/Tools
Presented by Sridevi Maharaj

Tools

Summary
Conceptual Level
– Detects and monitors crime via a popular 21st-century social medium
Technical Level
– Developed a crawler to obtain data
– Identified and explored useful features from the social network to rank and classify crime
System Level
– Built a user-friendly system
– Works in real time
– Processes a large collection of data
– iPhone interface supported

Questions?