Simple Text Mining Example
31 March 2015

Text Analytics

- Data is completely unstructured text, e.g. Twitter posts, newspaper articles, web pages, blogs, etc.
- Our machine-learning algorithms work on numerical data in rows and columns.
- The main question is how to convert text documents into rows and columns of numbers.

Sample data

> library(fortunes)
> sentences <- sapply(1:20, function(i) fortune(i)$quote)
> df <- data.frame(textCol = sentences)

The tm package

The documents are provided in a "source":

> library(tm)
> ds <- DataframeSource(df)

The collection of documents is called a "corpus":

> dsc <- Corpus(ds)
> summary(dsc)
A corpus with 20 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

Bag of words

- We count the frequency of each word in each document.
- Before that, we do some processing:
  + remove punctuation, and maybe also numbers,
  + also remove so-called "stop words".

> dsc.o <- dsc   # save a copy
> dsc <- tm_map(dsc, tolower)
> dsc <- tm_map(dsc, removePunctuation)
> dsc <- tm_map(dsc, removeNumbers)
> dsc <- tm_map(dsc, removeWords, stopwords("english"))
> dsc.o[[1]]
Okay, let’s stand up and be counted: who has been writing diamond graph code? Mine’s 60 lines.
> dsc[[1]]
okay lets stand counted writing diamond graph code mines lines

Stemming

- Now we strip the words down to their root, so that, for example, "count", "counting", "counted", and "counts" all become "count".
- Stemming algorithms are in the package SnowballC.

> dsc.p <- dsc   # save another copy
> library(SnowballC)
> dsc <- tm_map(dsc, stemDocument)
> dsc <- tm_map(dsc, stripWhitespace)
> dsc.p[[1]]
okay lets stand counted writing diamond graph code mines lines
> dsc[[1]]
okay let stand count write diamond graph code mine line

Term Document Matrix

- The Term Document Matrix (TDM) has a row for each term and a column for each document; the value is the count.
- The Document Term Matrix (DTM) is its transpose.

> tdm <- TermDocumentMatrix(dsc)
> tdm
A term-document matrix (144 terms, 20 documents)

Non-/sparse entries: 165/2715
Sparsity           : 94%
Maximal term length: 11
Weighting          : term frequency (tf)

Try simple clustering

> dis <- dissimilarity(tdm, method="cosine")
> (h <- hclust(dis, method="ward"))

Call:
hclust(d = dis, method = "ward")

Cluster method   : ward
Distance         : cosine
Number of objects: 20

> plot(h, sub="")

[Plot: hierarchical cluster tree of the documents]

Other things to try

- What we did is a bag of unigrams, i.e. single words. We can also try bigrams, trigrams, and in general N-grams (a bigram sketch follows below).
- The number of terms increases very fast, so remove infrequent terms (a second sketch below shows one way).
- Also, there are (many) more columns than rows, so this is a very high-dimensional problem.
- Use Part-of-Speech tagging.
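As a sketch of the bigram idea (not shown in the original slides): one common way to get N-gram terms with tm is to pass a custom tokenizer to TermDocumentMatrix(). The example below assumes the RWeka package is installed and reuses the preprocessed corpus dsc built above; names such as BigramTokenizer and tdm2 are only illustrative.

> # Sketch: bigram term-document matrix via RWeka's NGramTokenizer.
> # Assumes the RWeka package is available; not part of the original slides.
> library(RWeka)
> BigramTokenizer <- function(x)
+     NGramTokenizer(x, Weka_control(min = 2, max = 2))
> tdm2 <- TermDocumentMatrix(dsc, control = list(tokenize = BigramTokenizer))
> inspect(tdm2[1:5, 1:5])   # peek at a few bigram counts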
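Likewise, a minimal sketch for dropping infrequent terms and getting back to plain rows and columns of numbers: removeSparseTerms() is from tm, and the 0.9 sparsity threshold is an illustrative choice, not taken from the slides.

> # Sketch: remove infrequent terms, then convert the DTM to an ordinary
> # numeric matrix (documents as rows, terms as columns) for the usual
> # machine-learning algorithms.
> dtm <- DocumentTermMatrix(dsc)
> dtm <- removeSparseTerms(dtm, 0.9)   # keep terms appearing in at least ~10% of documents
> m <- as.matrix(dtm)
> dim(m)   # rows = documents, columns = remaining terms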