Simple Text Mining Example
31 March 2015

Text Analytics

- Data is completely unstructured text, e.g. Twitter posts, newspaper articles, web pages, blogs, etc.
- Our machine-learning algorithms work on numerical data in rows and columns.
- The main question is how to convert text documents into rows and columns of numbers.

Sample data

> library(fortunes)
> sentences <- sapply(1:20, function(i) fortune(i)$quote)
> df <- data.frame(textCol = sentences)

The tm package

The documents are provided in a "source":

> library(tm)
> ds <- DataframeSource(df)

The collection of documents is called a "corpus":

> dsc <- Corpus(ds)
> summary(dsc)
A corpus with 20 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

Bag of words

- We count the frequency of each word in each document.
- Before that, we do some processing:
  + remove punctuation, and maybe also numbers,
  + also remove so-called "stop words".

> dsc.o <- dsc   # save a copy
> dsc <- tm_map(dsc, tolower)
> dsc <- tm_map(dsc, removePunctuation)
> dsc <- tm_map(dsc, removeNumbers)
> dsc <- tm_map(dsc, removeWords, stopwords("english"))
> dsc.o[[1]]
Okay, let’s stand up and be counted: who has been writing diamond graph code? Mine’s 60 lines.
> dsc[[1]]
okay lets stand counted writing diamond graph code mines lines

Stemming

- Now we strip the words down to their root, so that, for example, "count", "counting", "counted", and "counts" all become "count".
- Stemming algorithms are in the package SnowballC.

> dsc.p <- dsc   # save another copy
> library(SnowballC)
> dsc <- tm_map(dsc, stemDocument)
> dsc <- tm_map(dsc, stripWhitespace)
> dsc.p[[1]]
okay lets stand counted writing diamond graph code mines lines
> dsc[[1]]
okay let stand count write diamond graph code mine line

Term Document Matrix

- The Term Document Matrix (TDM) has a row for each term and a column for each document; the value is the count.
- The Document Term Matrix (DTM) is its transpose.

> tdm <- TermDocumentMatrix(dsc)
> tdm
A term-document matrix (144 terms, 20 documents)

Non-/sparse entries: 165/2715
Sparsity           : 94%
Maximal term length: 11
Weighting          : term frequency (tf)

Try simple clustering

> dis <- dissimilarity(tdm, method="cosine")
> (h <- hclust(dis, method="ward"))

Call:
hclust(d = dis, method = "ward")

Cluster method   : ward
Distance         : cosine
Number of objects: 20

> plot(h, sub="")

[Plot: hierarchical cluster tree of the documents]

Other things to try

- What we did is a bag of unigrams, i.e. single words. We can also try bigrams, trigrams, and in general N-grams (a bigram sketch follows below).
- The number of terms increases very fast, so remove infrequent terms (a second sketch below shows one way).
- Also, there are (many) more columns than rows, so this is a very high-dimensional problem.
- Use Part-of-Speech tagging.
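As a sketch of the bigram idea (not shown in the original slides): one common way to get N-gram terms with tm is to pass a custom tokenizer to TermDocumentMatrix(). The example below assumes the RWeka package is installed and reuses the preprocessed corpus dsc built above; names such as BigramTokenizer and tdm2 are only illustrative.

> # Sketch: bigram term-document matrix via RWeka's NGramTokenizer.
> # Assumes the RWeka package is available; not part of the original slides.
> library(RWeka)
> BigramTokenizer <- function(x)
+     NGramTokenizer(x, Weka_control(min = 2, max = 2))
> tdm2 <- TermDocumentMatrix(dsc, control = list(tokenize = BigramTokenizer))
> inspect(tdm2[1:5, 1:5])   # peek at a few bigram counts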
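Likewise, a minimal sketch for dropping infrequent terms and getting back to plain rows and columns of numbers: removeSparseTerms() is from tm, and the 0.9 sparsity threshold is an illustrative choice, not taken from the slides.

> # Sketch: remove infrequent terms, then convert the DTM to an ordinary
> # numeric matrix (documents as rows, terms as columns) for the usual
> # machine-learning algorithms.
> dtm <- DocumentTermMatrix(dsc)
> dtm <- removeSparseTerms(dtm, 0.9)   # keep terms appearing in at least ~10% of documents
> m <- as.matrix(dtm)
> dim(m)   # rows = documents, columns = remaining terms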