Topical search in Twitter
Complex Network Research Group
Department of CSE, IIT Kharagpur
Topical search on Twitter

Twitter has emerged as an important source of
information & real-time news


Most common search in Twitter: search for trending topics
and breaking news
Topical search



Identifying topical attributes / expertise of users
Searching for topical experts
Searching for information on specific topics
Prior approaches to find topic experts

Research studies



Pal et al. (WSDM 2011) use 15 features from tweets and
the network to identify topical experts
Weng et al. (WSDM 2010) use an ML approach
Application systems


Twitter Who To Follow (WTF), Wefollow, …
Methodology not fully public, but reported to utilize
several features
Prior approaches use features extracted from

User profiles
Screen-name, bio, …

Tweets posted by a user
Hashtags, others retweeting a given user, …

Social graph of a user
#followers, PageRank, …
Problems with prior approaches

User profiles – screen-name, bio, …
Bio often does not give meaningful information
Information in user profiles is mostly unvetted

Tweets posted by a user
Tweets mostly contain day-to-day conversation

Social graph of a user – #followers, PageRank
Does not provide topical information
We propose …


Use a different way to infer topics of expertise for
an individual Twitter user
Utilize social annotations



How does the Twitter crowd describe a user?
Social annotations obtained through Twitter Lists
Approach essentially relies on crowdsourcing
Twitter Lists

A feature used to organize the people one is
following on Twitter



Create a named list, add an optional List description
Add related users to the List
Tweets posted by these users will be grouped together as
a separate stream
How Lists work
Using Lists to infer topics for users

If U is an expert / authority in a certain topic


U is likely to be included in several Lists
List names / descriptions provide valuable semantic cues
to the topics of expertise of U
Dataset

Collected Lists of 55 million Twitter users who
joined before or in 2009



88 million Lists collected in total
All studies consider 1.3 million users who are
included in 10 or more Lists
Most List names / descriptions in English, but
significant fraction also in French, Portuguese, …
Inferring topical attributes of users
Mining Lists to infer expertise

Collect Lists containing a given user U
List names / descriptions collected into
a ‘document’ for the given user

Identify U’s topics from the document





Handle CamelCase words, case-folding
Ignore domain-specific stopwords
Identify nouns and adjectives
Unify similar words based on edit-distance,
e.g., journalists and jornalistas, politicians
and politicos (not unified by stemming)
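
The edit-distance unification step can be illustrated with a minimal sketch. The Levenshtein recurrence is standard; the `similar` rule and its `max_ratio` threshold are hypothetical choices for illustration, since the slides do not state the actual cut-off used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similar(a: str, b: str, max_ratio: float = 0.4) -> bool:
    """Hypothetical rule: unify two words when their distance is small
    relative to the longer word. The threshold is an assumption."""
    return edit_distance(a, b) <= max_ratio * max(len(a), len(b))


print(edit_distance("journalists", "jornalistas"))  # 2
print(edit_distance("politicians", "politicos"))    # 3
```

A relative threshold is used here so that short, unrelated words are less likely to be merged than long near-duplicates.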
Mining Lists to infer expertise


Unigrams and bigrams considered as
topics
Result: Topics for U along with their
frequencies in the document
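
As a rough sketch of the mining steps above (CamelCase handling, case-folding, stopword removal, unigram and bigram counting), something like the following could be used. The stopword set and the function names are hypothetical, and the noun/adjective filtering and edit-distance unification steps are omitted for brevity.

```python
import re
from collections import Counter

# Hypothetical domain-specific stopwords; the slides do not list the real ones.
STOPWORDS = {"twitter", "list", "lists", "my", "the", "and", "of",
             "i", "follow", "people"}


def tokenize(text: str) -> list[str]:
    # Split CamelCase ("TechNews" -> "Tech News"), then case-fold.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return re.findall(r"[a-z]+", spaced.lower())


def topics_for_user(list_texts: list[str]) -> Counter:
    """Build the topic 'document' for one user from the names and
    descriptions of the Lists that include the user."""
    counts = Counter()
    for text in list_texts:
        words = [w for w in tokenize(text) if w not in STOPWORDS]
        counts.update(words)                                       # unigrams
        counts.update(" ".join(p) for p in zip(words, words[1:]))  # bigrams
    return counts


doc = topics_for_user(["TechNews", "tech people I follow", "Linux-Tech"])
print(doc.most_common(3))
```

Note that in this sketch bigrams are formed after stopword removal, so two words separated only by a stopword are treated as adjacent.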
Topics inferred from Lists
politics, senator, congress, government,
republicans, Iowa, gop, conservative
politics, senate, government, congress,
democrats, Missouri, progressive, women
celebs, actors, famous, movies, comedy,
funny, music, hollywood, pop culture
linux, tech, open, software, libre, gnu,
computer, developer, ubuntu, unix
Lists vs. other features
Profile bio
Most common words from tweets: love, daily, people, time, GUI, movie,
video, life, happy, game, cool
Most common words from Lists: celeb, actor, famous, movie, stars,
comedy, music, Hollywood, pop culture
Lists vs. other features
Profile bio
Most common words from tweets: Fallon, happy, love, fun, video, song,
game, hope, #fjoln, #fallonmono
Most common words from Lists: celeb, funny, humor, music, movies,
laugh, comics, television, entertainers
Who-is-who service



Developed a Who-is-Who
service for Twitter
Shows word-cloud for
major topics for a user
http://twitter-app.mpi-sws.org/who-is-who/
Inferring Who-is-who in the Twitter
Social Network, WOSN 2012
(Highest rated paper in workshop)
Identifying topical experts
Topical experts in Twitter


400 million tweets posted daily
Quality of tweets posted by different users varies
widely


News, pointless babble, conversational tweets, spam, …
Challenge: to find topical experts

Sources of authoritative information on specific topics
Basic methodology

Given a query (topic)

Identify experts on the topic using Lists
Discussed earlier

Rank identified experts w.r.t. given topic
Need ranking algorithm

Additional challenge: keeping the system up-to-date
in the face of thousands of users joining Twitter daily
Ranking experts

Used a ranking scheme solely based on Lists

Two components of ranking user U w.r.t. query Q

Relevance of user to query – cover density ranking
between topic document T_U of user and Q
Cover density ranking preferred for short queries

Popularity of user – number of Lists including the user

Ranking score: Topic relevance(T_U, Q) × log(#Lists including U)
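
The product of topic relevance and log-popularity can be sketched as follows. This is only an illustration: `topic_relevance` here is a simple term-frequency stand-in and is NOT cover density ranking, and all function names and the toy data are hypothetical.

```python
import math


def topic_relevance(topic_doc: dict[str, int], query: str) -> float:
    """Stand-in relevance (an assumption, not cover density ranking):
    the fraction of the topic document's term mass matching the query."""
    total = sum(topic_doc.values())
    if total == 0:
        return 0.0
    terms = query.lower().split()
    return sum(topic_doc.get(t, 0) for t in terms) / total


def score(topic_doc: dict[str, int], num_lists: int, query: str) -> float:
    """Relevance multiplied by log(#Lists including the user)."""
    if num_lists < 1:
        return 0.0
    return topic_relevance(topic_doc, query) * math.log(num_lists)


def rank_experts(users: dict[str, tuple[dict[str, int], int]],
                 query: str) -> list[str]:
    # users maps a user name to (topic document, number of Lists).
    return sorted(users, key=lambda u: score(*users[u], query), reverse=True)


users = {
    "a": ({"politics": 8, "news": 2}, 100),
    "b": ({"politics": 9, "news": 1}, 10),
    "c": ({"music": 10}, 1000),
}
print(rank_experts(users, "politics"))  # ['a', 'b', 'c']
```

The logarithm dampens popularity, so a user listed 1000 times but never under the queried topic ("c") still ranks below less popular but relevant users.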
Cognos

Search system for topical experts in Twitter
Publicly deployed at
http://twitter-app.mpi-sws.org/whom-to-follow/

Cognos: Crowdsourcing Search for Topic Experts in Microblogs,
ACM SIGIR 2012
Cognos results for “politics”
Cognos results for “stem cell”
Evaluation of Cognos – 1

Competes favorably with prior research attempts to
identify topical experts (Pal et al. [WSDM 2011])
Evaluation of Cognos – 2


Cognos compared with Twitter WTF

Evaluator shown top 10 results by both systems
Result-sets anonymized
Evaluator judges which is better / both good / both bad

Queries chosen by evaluators themselves

27 distinct queries were asked at least twice
In total, asked 93 times

Judgment by majority voting
Cognos vs Twitter WTF

Cognos judged better on 12 queries
Computer science, Linux, mac, Apple, ipad, India,
internet, windows phone, photography, political journalist

Twitter WTF judged better on 11 queries
Music, Sachin Tendulkar, Angelina Jolie, Harry Potter,
metallica, cloud computing, IIT Kharagpur
Mostly names of individuals or organizations

Tie on 4 queries
Microsoft, Dell, Kolkata, Sanskrit as an official language
Cognos vs Twitter WTF

Low overlap between top 10 results
… in spite of the same topic being inferred for 83% of experts

Major differences are due to List-based ranking
Top Twitter WTF results – mostly business accounts
Top Cognos results – mostly personal accounts
Keeping system up-to-date

Any search / recommendation system on OSN
platform needs to be kept up-to-date



Thousands of new users join every day
Need efficient way of discovering topical experts
Can a brute-force approach be used?

Periodically crawl data (profile, Lists) of all users
Scalability problem



200 million new users joined Twitter during 9
months in 2011 → about 740K new users join daily
Lower-bound estimate: 1480K API calls per day
required to crawl their profiles and Lists

Twitter allows only 3.6K API calls per day per IP
480K API calls per day from whitelisted IP

Plus, 465 million existing users
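
The scalability estimate can be reproduced with a few lines of arithmetic. This sketch assumes 30-day months and two API calls per new user (one for the profile, one for the Lists), which is how the lower bound appears to be derived; the variable names are illustrative.

```python
import math

NEW_USERS = 200_000_000      # joined during 9 months in 2011
DAYS = 9 * 30                # assumption: 30-day months
CALLS_PER_USER = 2           # assumption: one profile call + one Lists call

REGULAR_LIMIT = 3_600        # API calls per day per IP
WHITELISTED_LIMIT = 480_000  # API calls per day from a whitelisted IP

new_users_per_day = NEW_USERS / DAYS
calls_needed = CALLS_PER_USER * new_users_per_day

regular_ips = math.ceil(calls_needed / REGULAR_LIMIT)
whitelisted_ips = math.ceil(calls_needed / WHITELISTED_LIMIT)

print(round(new_users_per_day))  # 740741, i.e. about 740K
print(round(calls_needed))       # 1481481, i.e. about 1480K
print(regular_ips, whitelisted_ips)  # 412 4
```

Even under these optimistic assumptions, covering only the new users would need hundreds of ordinary IPs or several whitelisted ones, before counting the existing user base.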
How many experts in Twitter?

Only 1% listed 10 or more times

Only 0.12% listed 100 or more times

If experts can be identified efficiently, possible to
crawl their Lists
Identifying experts efficiently

Hubs – users who follow many experts and add
them to Lists



Identified top hubs in social network using HITS
Crawled Lists created by top 1 million hubs
Top 1M hubs listed 4.1M users


2.06M users included in 10 or more Lists (50%)
Discovered 65% of the estimated number of experts listed
100 or more times
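
For readers unfamiliar with HITS, a minimal power-iteration sketch is given below. It assumes the input is a list of (follower, followee) edges and uses a fixed number of iterations; the function name, the toy graph, and these choices are illustrative and not taken from the deployed system.

```python
import math


def hits(edges: list[tuple[str, str]], iterations: int = 50):
    """Return (hub, authority) score dicts for a directed graph.
    An edge (src, dst) means src follows dst."""
    nodes = {n for edge in edges for n in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority update: sum of hub scores of nodes pointing to it.
        auth = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub update: sum of authority scores of nodes it points to.
        hub = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


edges = [("h1", "e1"), ("h1", "e2"), ("h1", "e3"),
         ("h2", "e1"), ("h2", "e2"), ("x", "e3")]
hub, auth = hits(edges)
top_hubs = sorted(hub, key=hub.get, reverse=True)[:2]
print(top_hubs)  # ['h1', 'h2']
```

Once the top hubs are known, only the Lists they created need to be crawled, which is what makes the approach cheaper than crawling every user.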
Identifying experts efficiently




More than 42% of the users listed by top hubs have
joined Twitter after 2009
Discovered several popular experts who joined
within the duration of the crawl
All experts reported by Pal et al. discovered
Discovered all Twitter WTF top 20 results for 50%
of the queries, 15 or more for 80% of the queries
Topical search in Twitter
Looking for Tweets by Topic

Services today are limited to keyword search


Knowing which keywords to search for is itself an issue
Keyword search is not context aware

Tweets are too short to deduce topics from

Topic analysis of 400M tweets/day is a challenge
Challenges

Some tweets are more important than others



Millions of tweets are posted on popular topics
Only some are relevant to the context intended
Tweets may contain wrong or misleading info



Twitter has a large population of spammers
Twitter is also a potent source of rumors
Some tweets are outright malicious
Our Approach to the Issues

Scalability
We only look at tweets from a small subset of users who
are experts on different topics

Topic deduction
We map user expertise topics to tweets/hashtags, instead
of the other way round

Trustworthiness
Our source of tweets is a small subset of users
It is practical to vet their expertise and reputation
Advantages of list-based methodology
600K experts on 36K distinct topics
[Figure: topical diversity of the expert sample (CSCW’14), spanning popular topics and niche topics]
Challenges in the Approach Used

We assign topics to tweets/hashtags

Inferring tweet topics from tweeter expertise



Experts can have multiple topics of expertise
Experts do tweet about topics beyond their expertise
Solution: If multiple experts on a subject tweet
about something, it is most likely related to the
topic.
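
The "multiple experts" rule can be sketched as below: a hashtag is assigned a topic only when enough distinct experts on that topic have used it. The `min_experts` threshold and the data layout are hypothetical; the slides do not give the actual cut-off.

```python
from collections import defaultdict


def hashtag_topics(tweets, expertise, min_experts=2):
    """tweets: iterable of (user, hashtag) pairs.
    expertise: dict mapping user -> set of topics of expertise.
    Returns hashtag -> set of topics supported by at least
    min_experts distinct experts (threshold is an assumption)."""
    supporters = defaultdict(set)  # (hashtag, topic) -> experts who used it
    for user, tag in tweets:
        for topic in expertise.get(user, ()):
            supporters[(tag, topic)].add(user)

    result = defaultdict(set)
    for (tag, topic), users in supporters.items():
        if len(users) >= min_experts:
            result[tag].add(topic)
    return dict(result)


expertise = {"u1": {"politics", "music"}, "u2": {"politics"}, "u3": {"music"}}
tweets = [("u1", "#election"), ("u2", "#election"),
          ("u3", "#concert"), ("u1", "#election")]
print(hashtag_topics(tweets, expertise))  # {'#election': {'politics'}}
```

Here "u1" is also a music expert, but "#election" is not labeled as music because no second music expert used it; repeated tweets by the same expert do not count twice.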
Sampling Tweets from Experts

We capture all tweets from 585K topical experts
This is a set we obtained from our previous study
This is about 0.1% of the whole Twitter population

The experts generate 1.46 million tweets per day
This is 0.268% of all tweets on Twitter

Expertise in diverse topics (36K)
Our topics of expertise are crowdsourced
We will have more topics as more users show interest
Methodology at a Glance

Given a topic, we gather tweets from experts
We use hashtags to represent subjects

Clustering tweets by similar hashtags
A cluster represents information on related subjects

Ranking clusters by popularity
Number of unique experts tweeting on the subject
Number of unique tweets on the subject

Ranking tweets by authority
Tweets from the highest-ranked user are shown first
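
The pipeline on this slide can be sketched roughly as follows. The sketch assumes that tweets sharing any hashtag (directly or transitively) fall into one cluster, that clusters are ordered by unique experts and then unique tweets, and that an `authority` score per user is available; all names and these choices are illustrative rather than the system's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    text: str
    user: str
    hashtags: frozenset


def cluster_by_hashtag(tweets):
    """Group tweets whose hashtags are connected (union-find over hashtags)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for t in tweets:
        tags = sorted(t.hashtags)
        for tag in tags[1:]:
            parent[find(tags[0])] = find(tag)  # merge co-occurring hashtags

    clusters = {}
    for t in tweets:
        if t.hashtags:  # tweets without hashtags are skipped
            clusters.setdefault(find(min(t.hashtags)), []).append(t)
    return list(clusters.values())


def rank_clusters(clusters, authority):
    """Order clusters by (unique experts, unique tweets), and the tweets
    inside each cluster by the authority score of their author."""
    def popularity(cluster):
        return (len({t.user for t in cluster}), len({t.text for t in cluster}))

    ranked = sorted(clusters, key=popularity, reverse=True)
    return [sorted(c, key=lambda t: authority.get(t.user, 0.0), reverse=True)
            for c in ranked]


tweets = [
    Tweet("a", "u1", frozenset({"#vote"})),
    Tweet("b", "u2", frozenset({"#vote", "#election"})),
    Tweet("c", "u3", frozenset({"#election"})),
    Tweet("d", "u1", frozenset({"#budget"})),
]
authority = {"u1": 1.0, "u2": 3.0, "u3": 2.0}
ranked = rank_clusters(cluster_by_hashtag(tweets), authority)
print([[t.text for t in c] for c in ranked])  # [['b', 'c', 'a'], ['d']]
```

Transitive merging can over-group tweets when a very generic hashtag links unrelated subjects, so a real system would likely need a stricter similarity criterion.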
What-is-happening on Twitter

twitter-app.mpi-sws.org/what-is-happening/
Topical search in Microblogs with Cognoscenti, Or: The Wisdom of Crowdsourced
Experts

Results for the last week on Politics (a popular topic)
Related tweets are grouped together by common hashtags.
The number of experts tweeting on the subject and the number of tweets
on the subject decide the ranking.
The most popular tweet from the most authoritative user represents the group.
Our system especially excels for niche topics.
Evaluation – Relevance

We used Amazon Mechanical Turk for user
evaluation
We chose to evaluate 20 topics
We picked top 10 tweets and hashtags
We picked results for all 3 time groups

Users have to judge if the tweet/hashtag was
relevant to the given topic
Options are Relevant/Not Relevant/Can’t Say
We chose master workers only
Every tweet/hashtag was evaluated by at least 4
users
Evaluating Tweet Relevance

We obtained 3150 judgments

76% of which were Relevant

22% Not Relevant, 2% Can’t Say

80% of the Tweets were marked relevant by
majority judgment
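
The difference between the per-judgment figure (76%) and the per-tweet majority figure (80%) comes from aggregating judgments per item first. A minimal sketch of that aggregation, assuming a strict-majority rule (how ties were actually handled is not stated in the slides) and hypothetical toy data:

```python
from collections import Counter


def majority_label(judgments):
    """Return the label chosen by a strict majority of judgments,
    or None when no label has more than half (an assumed tie rule)."""
    label, count = Counter(judgments).most_common(1)[0]
    return label if count > len(judgments) / 2 else None


def fraction_relevant(items):
    """items: dict mapping a tweet id to its non-empty list of judgments."""
    labels = [majority_label(j) for j in items.values()]
    return sum(label == "Relevant" for label in labels) / len(labels)


items = {
    "t1": ["Relevant", "Relevant", "Relevant", "Not Relevant"],
    "t2": ["Relevant", "Relevant", "Not Relevant", "Not Relevant"],
    "t3": ["Not Relevant"] * 4,
    "t4": ["Relevant"] * 4,
}
print(fraction_relevant(items))  # 0.5
```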
Dissecting Negative Judgments



iPhone was the topic that received the most negative
judgments
Experts on iPhone were generally tweeting on the
broader topic (such as Android, tablets, …)

The last-week time group had the most positive results
Scarcity of information led to bad ranking
Evaluating Hashtag Relevance

Total 3200 judgments

62.3% were Relevant


Much less than tweets (76% were marked relevant)
Relevance of hashtags is very context sensitive
Perspectival relevance


The generic hashtag #sandy is very relevant to the
topics in the context of the tweet.
Such hashtags got negative judgments when shown without
the tweets.
Generic Hashtags


Some hashtags are generic, but our service brings
out their specificity with respect to the topic.
These hashtags received negative judgments when
shown without the context of the tweet.
Summary

Simple Core Observation
Users curate experts

Services
who-is-who (WOSN’12, CCR’12)
whom-to-follow (SIGIR’12)
what-is-happening (in submission)
Sample-stream (CIKM’13, CSCW’14)

Complex Network Research Group
Thank You
Complex Network Research Group (CNeRG)
CSE, IIT Kharagpur, India
http://cse.iitkgp.ac.in/resgrp/cnerg/