Data Science for Computational Journalism
Chengkai Li
Associate Professor, Department of Computer Science and Engineering
Director, Innovative Database and Information Systems Research (IDIR) Laboratory
University of Texas at Arlington
PyData Dallas, April 26, 2015

Research at the Innovative Database and Information Systems Research (IDIR) Laboratory
Research areas
o Big Data and Data Science (Database, Data Mining, Web Data Management, Information Retrieval)
Theme of current research
o building large-scale human-assisting and human-assisted data and information systems with high usability, high efficiency and applications for social good
Research directions
o computational journalism
o database testing
o crowdsourcing and human computation
o entity search and entity query
o graph database usability
o data exploration by ranking/skyline/preference queries

Our Computational Journalism Project
o Started in 2010. Collaborative project with Duke, Google Research, HP Labs, Stanford
o Fact finding: finding and monitoring number-based facts pertinent to real-world events. The facts are leads to news stories.
o Fact checking: discovering and checking factual claims in debates, speeches, interviews, news

FactWatcher
A tuple t for a new real-world event is appended to the database. Find a constraint-measure pair (C, M) such that t is in the contextual skyline, then generate a factual claim.

Constraint | Measure
month=Feb | pts, ast, reb
opp_team=Nets | ast, reb
team=Celtics ∧ opp_team=Nets | ast, reb
… | …

Generated factual claim: "Wesley had 12 points, 13 assists and 5 rebounds on February 25, 1996 to become the first player with a 12/13/5 (points/assists/rebounds) in February."
http://en.wikipedia.org/wiki/Basketball
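The contextual-skyline test behind FactWatcher can be sketched in a few lines. This is an illustrative re-implementation, not FactWatcher's actual code: a tuple is in the skyline for a constraint-measure pair if no other tuple satisfying the constraint dominates it (is at least as large on every measure and strictly larger on at least one). The toy data below is made up.

```python
# Sketch (not FactWatcher's implementation): is tuple t in the
# contextual skyline for one constraint-measure pair (C, M)?

def dominates(a, b, measures):
    """True if tuple a dominates tuple b on the given measures."""
    return (all(a[m] >= b[m] for m in measures)
            and any(a[m] > b[m] for m in measures))

def in_contextual_skyline(t, table, constraint, measures):
    """True if no tuple satisfying the constraint dominates t."""
    context = [r for r in table if constraint(r) and r is not t]
    return not any(dominates(r, t, measures) for r in context)

# Toy data loosely modeled on the basketball example
games = [
    {'player': 'A', 'month': 'Feb', 'pts': 30, 'ast': 4,  'reb': 6},
    {'player': 'B', 'month': 'Feb', 'pts': 12, 'ast': 13, 'reb': 5},
    {'player': 'C', 'month': 'Jan', 'pts': 40, 'ast': 15, 'reb': 12},
]
t = games[1]
print(in_contextual_skyline(t, games, lambda r: r['month'] == 'Feb',
                            ['pts', 'ast', 'reb']))  # True: no Feb game dominates t
```

Player B's game survives because no other February game beats it on all three measures at once, which is exactly why a 12/13/5 line can headline a claim even though 12 points alone is unremarkable.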
Factual Claims
Prominent streaks
o "This month the Chinese capital has experienced 10 days with a maximum temperature at around 35 degrees Celsius – the most for the month of July in a decade."
o "The Nikkei 225 closed below 10000 for the 12th consecutive week, the longest such streak since June 2009."
Situational facts
o "Paul George had 21 points, 11 rebounds and 5 assists to become the first Pacers player with a 20/10/5 (points/rebounds/assists) game against the Bulls since Detlef Schrempf in December 1992."
o "The social world's most viral photo ever generated 3.5 million likes, 170,000 comments and 460,000 shares by Wednesday afternoon."
Domains: politics, sports, weather, crimes, transportation, finance, social media analytics, publications
http://idir.uta.edu/factwatcher/

People Make Claims All The Time
"… our Navy is smaller than it's been since 1917," said Republican candidate Mitt Romney in the third presidential debate in 2012.
http://en.wikipedia.org/wiki/Mitt_Romney
http://www.thebrainchildgroup.com/

Fact Checking is not Easy
"… our Navy is smaller than it's been since 1917," said Republican candidate Mitt Romney in the third presidential debate in 2012.
http://en.wikipedia.org/wiki/Mitt_Romney
http://s3.amazonaws.com/thf_media/2010/pdf/Military_chartbook.pdf

Fact Checking is not Easy
"… our Navy is smaller than it's been since 1917," said Republican candidate Mitt Romney in the third presidential debate in 2012.
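Streak claims like the Nikkei example above are mechanically checkable once the underlying series is in hand. A minimal sketch (the weekly closes below are made-up illustration data, not actual Nikkei values):

```python
# Sketch of checking a streak claim such as "closed below 10000 for the
# 12th consecutive week": find the lengths of all maximal runs of
# consecutive values below a threshold.

def below_threshold_streaks(values, threshold):
    """Lengths of maximal runs of consecutive values below threshold."""
    streaks, run = [], 0
    for v in values:
        if v < threshold:
            run += 1
        else:
            if run:
                streaks.append(run)
            run = 0
    if run:
        streaks.append(run)
    return streaks

closes = [10100, 9900, 9800, 9950, 10200, 9700, 9600, 9500]
print(below_threshold_streaks(closes, 10000))  # [3, 3]
```

A claim like "longest such streak since June 2009" then reduces to comparing the current (final) run against the maximum of the earlier runs in the relevant window.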
vs.
http://en.wikipedia.org/wiki/Mitt_Romney
http://s3.amazonaws.com/thf_media/2010/pdf/Military_chartbook.pdf
http://en.wikipedia.org/wiki/United_States_Navy

Existing Fact Checking Projects
Journalists and reporters spend a good amount of time on fact checking.
o Politifact http://www.politifact.com/
o FactCheckEU https://factcheckeu.org/
o FullFact http://fullfact.org/
o Snopes http://www.snopes.com/info/whatsnew.asp
o Factcheck http://www.factcheck.org/

ClaimBuster
Long-term goal
o (Partly) automate the fact-checking process: speeches, debates, interviews, social media, news → classification & ranking → factual claims ranked by importance → checked by algorithms / journalists / citizens / crowd (e.g., Twitter users)
o Plan for Election 2016
Current progress
o Classification models for finding check-worthy factual statements
o Preliminary exploration of crowdsourced fact-checking

Factual Claim Classification
Dataset: presidential debates
o Source: http://www.debates.org/index.php?page=debate-transcripts
o All 30 debates (11 elections) in history: 1960, 1976–2012
o 20k sentences by presidential candidates; removed very short (< 5 words) sentences
Classify each sentence into 1 of 3 classes.

Examples of Sentences
Important factual claims
o "We spend less on the military today than at any time in our history."
o "The President's position on gay marriage has changed."
o "More people are unemployed today than four years ago."
Unimportant factual claims
o "I was in Iowa yesterday."
o "My mother enjoys cooking."
o "I ran for President once before."
Sentences with no factual claims (just opinions, questions & declarations)
o "Iran must not get nuclear weapons."
o "7% unemployment is too high."
o "My opponent is wishy-washy."
o "I will be tough on crime."
o "Why should we do that?"
o "Hello, New Hampshire!"
o "Our plan is to reduce tax rate by 10%."

Ground Truth Collection
Each sentence is labelled by two of many participants. The ground truth includes the sentence only if the two participants agreed on its class label.
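The agreement rule above translates directly into code. A minimal sketch of the filtering step; the data layout (a dict of sentence → two labels) is illustrative, not the project's actual schema:

```python
# Sketch of the ground-truth rule: keep a sentence only if its two
# annotators assigned the same class label (NFS / NO / YES).
# Data layout and examples are illustrative.

def build_ground_truth(labels):
    """labels: dict mapping sentence -> list of two class labels."""
    return {sent: lab[0] for sent, lab in labels.items()
            if len(lab) == 2 and lab[0] == lab[1]}

labels = {
    'We spend less on the military today.': ['YES', 'YES'],
    'I was in Iowa yesterday.': ['NO', 'NFS'],   # disagreement: dropped
    'Hello, New Hampshire!': ['NFS', 'NFS'],
}
print(build_ground_truth(labels))
# {'We spend less on the military today.': 'YES', 'Hello, New Hampshire!': 'NFS'}
```

Dropping disagreements trades dataset size for label quality, which is consistent with the 1,571 ground-truth sentences reported later out of the roughly 20k labeled.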
How We Use Python
Data wrangling
o Use NLTK (Natural Language Toolkit) to transform debate files into a structured data format
o Use mysql-connector-python to store extracted features into a MySQL database
o Use matplotlib to plot classifiers' performance
Feature extraction
o Use AlchemyAPI (Python wrapper) to extract rich features of sentences: keywords, POS (part-of-speech) tags, sentiments, entities, concepts, taxonomy
Classification
o Use scikit-learn to build classification models

Feature Extraction: Keywords, POS (part-of-speech) tags

import nltk
sentence = 'The tax policy for the middle class is bad.'
pos = nltk.pos_tag(nltk.word_tokenize(sentence))
print(pos)

[('The', 'DT'), ('tax', 'NN'), ('policy', 'NN'), ('for', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('class', 'NN'), ('is', 'VBZ'), ('bad', 'JJ')]

Feature Extraction: Sentiments

from alchemyapi import AlchemyAPI
alchemyapi = AlchemyAPI()
sentence = 'The tax policy for the middle class is bad.'
response = alchemyapi.sentiment('text', sentence)
sentiment = response['docSentiment']['score']
print(sentiment)

-0.6532

Feature Extraction: Entities

response = alchemyapi.combined('text', sentence, {'sentiment': 1})
print(response['entities'])

[{'sentiment': {'type': 'negative', 'score': '-0.653232'}, 'count': '1', 'type': 'FieldTerminology', 'relevance': '0.33', 'text': 'tax policy'}]

Feature Extraction: Concepts

print(response['concepts'])

[{'opencyc': 'http://sw.opencyc.org/concept/Mx4rvViw25wpEbGdrcN5Y29ycA', 'dbpedia': 'http://dbpedia.org/resource/Middle_class', 'freebase': 'http://rdf.freebase.com/ns/m.01lbc_', 'text': 'Middle class', 'relevance': '0.921176'}, {'dbpedia': 'http://dbpedia.org/resource/Social_class', 'freebase': 'http://rdf.freebase.com/ns/m.07714', 'text': 'Social class', 'relevance': '0.869326'}]

Feature Extraction: Taxonomy

print(response['taxonomy'])

/law, govt and politics / legal issues / legislation

Classification Models
Use scikit-learn to build classification models
o Naïve Bayes Classifier (NBC)
o Support Vector Machine (SVM): LinearSVC (linear kernel, multi-class classification)
o Random Forest Classifier (RFC): 200 trees in the forest (n_estimators = 200)

Preliminary Experiments
3 classes
o NFS (non-factual statement), NO (unimportant factual claim), YES (important factual claim)
5 categories of features
o K: keyword; ET: entity type; P: POS tag; C: concept; T: taxonomy
5 combinations of features (+sentiment, +length)
o K; K+P; K+P+ET; K+P+ET+C; K+P+ET+C+T
Instances
o 1571 sentences in ground truth
o training data : test data = 3:1
o 4-fold cross validation

Classification Using scikit-learn

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# last column is the class attribute
features = data.columns[0:-1]

# splitting train/test data (holdout)
msk = np.random.rand(len(data)) <= 0.75
train = data[msk][features]
test = data[~msk][features]
train_verdict = data[msk].verdict
test_verdict = data[~msk].verdict

# building and applying the model
clf = RandomForestClassifier(n_estimators=200)  # or GaussianNB(), LinearSVC()
clf.fit(train, train_verdict)
prediction = clf.predict(test)

# cross validation: mean accuracy over 4 folds
cv = cross_val_score(clf, data[features], data.verdict, cv=4, scoring='accuracy').mean()

Results: Precision
[Chart comparing precision of SVM, RFC, and NBC across the feature combinations]
Results: Recall
[Chart comparing recall of SVM, RFC, and NBC across the feature combinations]
Results: F-Measure
[Chart comparing F-measure of SVM, RFC, and NBC across the feature combinations]

You are Invited
http://bit.ly/1FSj9pt

Acknowledgment
UTA Students
o Naeemul Hassan
o Joseph Minumol
o Afroza Sultana
o Jisa Sebastine
o Gensheng Zhang
Collaborators
o Bill Adair (Duke)
o Mark Tremayne (UTA)
o Pankaj Agarwal (Duke)
o Min Wang (Google Research)
o Sarah Cohen (Columbia)
o Jun Yang (Duke)
o James Hamilton (Stanford)
o Cong Yu (Google Research)
o Ping Luo (Chinese Academy of Sciences)

Acknowledgment
Funding sponsors
Disclaimer: This material is based upon work partially supported by the National Science Foundation Grants 1018865, 1117369 and 1408928, 2011 and 2012 HP Labs Innovation Research Awards, and the National Natural Science Foundation of China Grant 61370019.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.

Thank You! Questions?
http://ranger.uta.edu/~cli
http://idir.uta.edu
[email protected]
Please help us to label the data: http://bit.ly/1FSj9pt