
Data Science for Computational Journalism
Chengkai Li
Associate Professor, Department of Computer Science and Engineering
Director, Innovative Database and Information Systems Research (IDIR) Laboratory
University of Texas at Arlington
PyData Dallas, April 26, 2015
Research at the Innovative Database and Information
Systems Research (IDIR) Laboratory
Research areas
o Big Data and Data Science (Database, Data Mining, Web Data Management,
Information Retrieval)
Theme of current research
o building large-scale human-assisting and human-assisted data and information systems
with high usability, high efficiency and applications for social good
Research directions
o computational journalism
o database testing
o crowdsourcing and human computation
o entity search and entity query
o data exploration by ranking/skyline/preference queries
o graph database usability
Our Computational Journalism Project
o Started in 2010. Collaborative project with Duke,
Google Research, HP Labs, Stanford
o Fact finding: finding and monitoring number-based
facts pertinent to real-world events. The facts are leads
to news stories.
o Fact checking: discovering and checking factual claims in
debates, speeches, interviews, news
FactWatcher
Tuple t for new real-world event appended to database
Constraint                          Measure
month=Feb                           pts, ast, reb
opp_team=Nets                       ast, reb
team=Celtics ∧ opp_team=Nets        ast, reb
…                                   …
Find constraint-measure pair (C, M) such that t is in the contextual skyline
Generate factual claim: “Wesley had 12 points, 13 assists and 5 rebounds on
February 25, 1996 to become the first player with a 12/13/5
(points/assists/rebounds) in February.”
http://en.wikipedia.org/wiki/Basketball
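The contextual-skyline check can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FactWatcher's actual algorithm: the historical rows, the `player` field and the helper functions are hypothetical, and larger measure values are assumed to be better. A tuple t is taken to be in the contextual skyline of (C, M) when no other tuple satisfying constraint C dominates t on the measures in M.

```python
# Hypothetical historical rows; only the new tuple t mirrors the example above.
history = [
    {"player": "A", "month": "Feb", "opp_team": "Nets", "pts": 10, "ast": 12, "reb": 5},
    {"player": "B", "month": "Feb", "opp_team": "Bulls", "pts": 30, "ast": 4, "reb": 9},
    {"player": "C", "month": "Mar", "opp_team": "Nets", "pts": 20, "ast": 15, "reb": 8},
]
t = {"player": "Wesley", "month": "Feb", "opp_team": "Nets", "pts": 12, "ast": 13, "reb": 5}


def satisfies(row, constraint):
    """True if the row matches every attribute=value pair in the constraint."""
    return all(row[attr] == value for attr, value in constraint.items())


def dominates(a, b, measures):
    """a dominates b if a is >= b on every measure and > b on at least one."""
    return (all(a[m] >= b[m] for m in measures)
            and any(a[m] > b[m] for m in measures))


def in_contextual_skyline(new_row, rows, constraint, measures):
    """True if no earlier row in the same context dominates new_row."""
    if not satisfies(new_row, constraint):
        return False
    context = [r for r in rows if satisfies(r, constraint)]
    return not any(dominates(r, new_row, measures) for r in context)


in_feb = in_contextual_skyline(t, history, {"month": "Feb"}, ["pts", "ast", "reb"])
vs_nets = in_contextual_skyline(t, history, {"opp_team": "Nets"}, ["ast", "reb"])
print(in_feb, vs_nets)
```

With this made-up history, t is a skyline point for (month=Feb; pts, ast, reb) but not for (opp_team=Nets; ast, reb), because hypothetical player C has more assists and more rebounds against the Nets.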
Factual Claims
Prominent streaks
o “This month the Chinese capital has experienced 10 days with a maximum temperature
in around 35 degrees Celsius – the most for the month of July in a decade.”
o “The Nikkei 225 closed below 10000 for the 12th consecutive week, the longest such
streak since June 2009.”
Situational facts
o “Paul George had 21 points, 11 rebounds and 5 assists to become the first Pacers player
with a 20/10/5 (points/rebounds/assists) game against the Bulls since Detlef Schrempf
in December 1992.”
o “The social world’s most viral photo ever generated 3.5 million likes, 170,000 comments
and 460,000 shares by Wednesday afternoon.”
Domains: politics, sports, weather, crimes, transportation,
finance, social media analytics, publications
http://idir.uta.edu/factwatcher/
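A streak claim like the Nikkei example rests on finding maximal runs of consecutive values that meet a condition. The sketch below shows only that run-finding step; the weekly closing values and the helper name `runs_below` are made up for illustration, and it does not reproduce how FactWatcher ranks streaks by prominence.

```python
def runs_below(values, threshold):
    """Return (start_index, length) for each maximal run of values below threshold."""
    runs = []
    start = None
    for i, value in enumerate(values):
        if value < threshold:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            runs.append((start, i - start))  # the run ended at i - 1
            start = None
    if start is not None:
        runs.append((start, len(values) - start))  # run reaches the end of the series
    return runs


# Hypothetical weekly closing values.
closes = [10120, 9950, 9980, 10050, 9900, 9870, 9990, 9940, 10010]
runs = runs_below(closes, 10000)
longest = max(runs, key=lambda run: run[1])
print(runs, longest)
```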
People Make Claims All The Time
“… our Navy is smaller than it's been since 1917,” said Republican
candidate Mitt Romney in the third presidential debate in 2012.
http://en.wikipedia.org/wiki/Mitt_Romney
http://www.thebrainchildgroup.com/
Fact Checking is not Easy
“… our Navy is smaller than it's been since 1917,” said Republican
candidate Mitt Romney in the third presidential debate in 2012.
http://en.wikipedia.org/wiki/Mitt_Romney
http://s3.amazonaws.com/thf_media/2010/pdf/Military_chartbook.pdf
http://en.wikipedia.org/wiki/United_States_Navy
Existing Fact Checking Projects
Journalists and reporters spend a good amount of time on
fact checking
Politifact http://www.politifact.com/
FactCheckEU https://factcheckeu.org/
FullFact http://fullfact.org/
Snopes http://www.snopes.com/info/whatsnew.asp
Factcheck http://www.factcheck.org/
ClaimBusters
Long-term goal
o (Partly) automate the fact-checking process
o Plan for Election 2016
Pipeline
o speeches, debates, interviews, social media, news
  → classification & ranking
  → factual claims ranked by importance
  → checked by algorithms / journalists / citizens / crowd (e.g., Twitter users)
Current progress
o Classification models for finding check-worthy factual statements
o Preliminary exploration of crowdsourcing fact-checking
Factual Claim Classification
Dataset: presidential debates
o Source: http://www.debates.org/index.php?page=debate-transcripts
o All 30 debates (11 elections) in history: 1960, 1976–2012
o 20k sentences by presidential candidates: removed very short (< 5 words) sentences
Classify each sentence into 1 of 3 classes
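The length filter mentioned above could look like the following sketch. Counting words by whitespace splitting is an assumption (the slides do not say how words were counted), and the sample sentences are taken from the examples in this deck plus one made-up short sentence.

```python
MIN_WORDS = 5  # sentences with fewer than 5 words are removed


def long_enough(sentence, min_words=MIN_WORDS):
    """Keep a sentence only if it has at least min_words whitespace-separated tokens."""
    return len(sentence.split()) >= min_words


sentences = [
    "Thank you.",  # made-up short sentence, 2 words
    "More people are unemployed today than four years ago.",
    "I was in Iowa yesterday.",  # exactly 5 words, so it is kept
]
kept = [s for s in sentences if long_enough(s)]
print(kept)
```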
Examples of Sentences
Important factual claims
“We spend less on the military today than at any time in our history.” “The President’s position
on gay marriage has changed.” “More people are unemployed today than four years ago.”
Unimportant factual claims
“I was in Iowa yesterday.” “My mother enjoys cooking.” “I ran for President once before.”
Sentences with no factual claims (just opinions, questions & declarations)
“Iran must not get nuclear weapons.” “7% unemployment is too high.” “My opponent is
wishy-washy.” “I will be tough on crime.” “Why should we do that?” “Hello, New
Hampshire!” “Our plan is to reduce tax rate by 10%.”
Ground Truth Collection
Each sentence is labelled by two of many participants. The ground truth includes
the sentence only if the two participants agreed on its class label.
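The agreement rule can be expressed directly. In this sketch the sentence ids and the labels are hypothetical; the class names NFS, NO and YES are the ones used later in the deck.

```python
# (sentence_id, label from participant 1, label from participant 2); hypothetical data
labels = [
    ("s1", "YES", "YES"),
    ("s2", "NO", "NFS"),   # disagreement: excluded from the ground truth
    ("s3", "NFS", "NFS"),
]

# Keep a sentence only when both participants chose the same class.
ground_truth = {sid: first for sid, first, second in labels if first == second}
print(ground_truth)
```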
How We Use Python
Data wrangling
o Use NLTK (Natural Language Toolkit) to transform debate files into structured data format
o Use mysql-python-connector to store extracted features in a MySQL database
o Use matplotlib to plot classifiers’ performance.
Feature extraction
o Use AlchemyAPI (Python wrapper) to extract rich features of sentences: keywords, POS
(part-of-speech) tags, sentiments, entities, concepts, taxonomy
Classification
o Use scikit-learn to build classification models
Feature Extraction
Keywords, POS (part-of-speech) tags
import nltk
sentence = 'The tax policy for the middle class is bad.'
pos = nltk.pos_tag(nltk.word_tokenize(sentence))
print(pos)
[('The', 'DT'), ('tax', 'NN'), ('policy', 'NN'), ('for', 'IN'), ('the', 'DT'), ('middle', 'NN'),
('class', 'NN'), ('is', 'VBZ'), ('bad', 'JJ')]
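One simple way to turn tagger output into classifier features is to count how often each tag occurs. This encoding is an assumption; the slides do not spell out how POS tags were represented. The tagged list is copied from the output shown above, and the `P_` prefix and `length` key are illustrative names.

```python
from collections import Counter

# Tagger output copied from the example above.
tagged = [('The', 'DT'), ('tax', 'NN'), ('policy', 'NN'), ('for', 'IN'), ('the', 'DT'),
          ('middle', 'NN'), ('class', 'NN'), ('is', 'VBZ'), ('bad', 'JJ')]

# Count occurrences of each POS tag and expose them as named features.
pos_counts = Counter(tag for _, tag in tagged)
pos_features = {"P_" + tag: count for tag, count in pos_counts.items()}
pos_features["length"] = len(tagged)  # sentence length in tokens
print(pos_features)
```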
Feature Extraction
Sentiments
from alchemyapi import AlchemyAPI
alchemyapi = AlchemyAPI()
sentence = 'The tax policy for the middle class is bad.'
response = alchemyapi.sentiment('text', sentence)
sentiment = response['docSentiment']['score']
print(sentiment)
-0.6532
Feature Extraction
Entities
response = alchemyapi.combined('text', sentence, {'sentiment': 1})
print(response['entities'])
[{'sentiment': {'type': 'negative', 'score': '-0.653232'}, 'count': '1', 'type':
'FieldTerminology', 'relevance': '0.33', 'text': 'tax policy'}]
Feature Extraction
Concepts
print(response['concepts'])
[{'opencyc': 'http://sw.opencyc.org/concept/Mx4rvViw25wpEbGdrcN5Y29ycA',
'dbpedia': 'http://dbpedia.org/resource/Middle_class',
'freebase': 'http://rdf.freebase.com/ns/m.01lbc_',
'text': 'Middle class', 'relevance': '0.921176'},
{'dbpedia': 'http://dbpedia.org/resource/Social_class',
'freebase': 'http://rdf.freebase.com/ns/m.07714',
'text': 'Social class', 'relevance': '0.869326'}]
Feature Extraction
Taxonomy
print(response['taxonomy'])
/law, govt and politics / legal issues / legislation
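The entity and concept responses shown earlier are nested dictionaries, so they need flattening before a classifier can use them. The sketch below is one possible encoding, not necessarily the one used in the project: the `response` literal is a trimmed copy of the outputs above (no API call is made), and the `ET_` and `C_` prefixes and the `to_features` helper are hypothetical.

```python
# Trimmed copy of the entity and concept outputs shown earlier.
response = {
    "entities": [{"type": "FieldTerminology", "text": "tax policy", "relevance": "0.33"}],
    "concepts": [{"text": "Middle class", "relevance": "0.921176"},
                 {"text": "Social class", "relevance": "0.869326"}],
}


def to_features(resp):
    """Flatten entity types into counts and concepts into relevance scores."""
    feats = {}
    for entity in resp.get("entities", []):
        key = "ET_" + entity["type"]
        feats[key] = feats.get(key, 0) + 1
    for concept in resp.get("concepts", []):
        # Relevance arrives as a string in the response, so convert it to float.
        feats["C_" + concept["text"]] = float(concept["relevance"])
    return feats


sentence_features = to_features(response)
print(sentence_features)
```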
Classification Models
Use scikit-learn to build classification models
o Naïve Bayes Classifier (NBC)
o Support Vector Machine (SVM)
LinearSVC (linear kernel, multi-class classification)
o Random Forest Classifier (RFC)
200 trees in the forest (n_estimators = 200)
Preliminary Experiments
3 classes
o NFS (non-factual-statement), NO (unimportant factual claim), YES (important
factual claim)
5 categories of features
o K: keyword; ET: entity type; P: POS tag; C: concept; T: taxonomy
5 combinations of features (+sentiment, +length)
o K; K+P; K+P+ET; K+P+ET+C; K+P+ET+C+T
Instances
o 1571 sentences in ground truth
o training data : test data = 3:1
o 4-fold cross validation
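In 4-fold cross validation each fold serves once as test data while the other three folds form the training data, which gives the 3:1 ratio above. The sketch below shows the partitioning with the standard library only; the helper `k_fold_indices`, the shuffling and the seed are illustrative assumptions, and the experiments themselves may well have relied on scikit-learn's own splitting.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds over n instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


# 1571 is the number of ground-truth sentences reported above.
splits = list(k_fold_indices(1571, 4))
sizes = [(len(train), len(test)) for train, test in splits]
print(sizes)
```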
Classification Using scikit-learn
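# assumes `data` is a pandas DataFrame whose last column, `verdict`, is the class attribute
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# every column except the last one is a feature
features = data.columns[0:-1]
# holdout split: roughly 75% of rows for training, 25% for testing
msk = np.random.rand(len(data)) <= 0.75
train = data[msk][features]
test = data[~msk][features]
train_verdict = data[msk].verdict
test_verdict = data[~msk].verdict
# build and apply the model (GaussianNB() or LinearSVC() can be swapped in)
clf = RandomForestClassifier(n_estimators=200)
clf.fit(train, train_verdict)
prediction = clf.predict(test)
# 4-fold cross validation: mean accuracy across the folds
cv = cross_val_score(clf, data[features], data.verdict, cv=4,
                     scoring='accuracy').mean()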
Results: Precision
[Figure: SVM, RFC, NBC]
Results: Recall
[Figure: SVM, RFC, NBC]
Results: F-Measure
[Figure: SVM, RFC, NBC]
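For reference, the three measures can be computed per class from true and predicted labels as sketched below. The label lists are made up for illustration and are not results from the experiments; `per_class_scores` is a hypothetical helper, and F-measure is taken here to be the balanced F1 score.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, F1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels, using the class names from this deck.
y_true = ["YES", "YES", "NO", "NFS", "NFS", "YES"]
y_pred = ["YES", "NO", "NO", "NFS", "YES", "YES"]
precision, recall, f1 = per_class_scores(y_true, y_pred, "YES")
print(precision, recall, f1)
```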
You are Invited
http://bit.ly/1FSj9pt
Acknowledgment
UTA Students
o Naeemul Hassan
o Joseph Minumol
o Afroza Sultana
o Jisa Sebastine
o Gensheng Zhang
Collaborators
o Bill Adair (Duke)
o Mark Tremayne (UTA)
o Pankaj Agarwal (Duke)
o Min Wang (Google Research)
o Sarah Cohen (Columbia)
o Jun Yang (Duke)
o James Hamilton (Stanford)
o Cong Yu (Google Research)
o Ping Luo (Chinese Academy of Sciences)
Funding sponsors
Disclaimer: This material is based upon work partially supported by the National
Science Foundation Grants 1018865, 1117369 and 1408928, 2011 and 2012 HP
Labs Innovation Research Awards, and the National Natural Science Foundation
of China Grant 61370019. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the author(s) and do not
necessarily reflect the views of the funding agencies.
Thank You! Questions?
http://ranger.uta.edu/~cli
http://idir.uta.edu
Please help us to label the data
http://bit.ly/1FSj9pt