
Weighted kNN, clustering,
more plotting, Bayes
Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 6b, February 28, 2014
Plot tools/ tips
http://statmethods.net/advgraphs/layout.html
http://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/
pairs, gpairs, scatterplot.matrix, clustergram, etc.
data()
# precip, presidents, iris, swiss, sunspot.month (!), environmental, ethanol, ionosphere
More script fragments in Lab6b_*_2014.R on the web site (escience.rpi.edu/data/DA)
Weighted kNN?
require(kknn)
data(iris)
m <- dim(iris)[1]
val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
iris.learn <- iris[-val,]
iris.valid <- iris[val,]
iris.kknn <- kknn(Species ~ ., iris.learn, iris.valid, distance = 1,
  kernel = "triangular")
summary(iris.kknn)
fit <- fitted(iris.kknn)
table(iris.valid$Species, fit)
pcol <- as.character(as.numeric(iris.valid$Species))
pairs(iris.valid[1:4], pch = pcol,
  col = c("green3", "red")[(iris.valid$Species != fit) + 1])
Try Lab6b_8_2014.R
New dataset - ionosphere
require(kknn)
data(ionosphere)
ionosphere.learn <- ionosphere[1:200,]
ionosphere.valid <- ionosphere[-c(1:200),]
fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)
table(ionosphere.valid$class, fit.kknn$fit)
# vary kernel
(fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,
  kernel = c("triangular", "rectangular", "epanechnikov", "optimal"),
  distance = 1))
table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)
# alter distance
(fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,
  kernel = c("triangular", "rectangular", "epanechnikov", "optimal"),
  distance = 2))
table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)
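train.kknn does leave-one-out cross-validation over k and the kernels, and the winning combination can be read off the fitted object; a minimal sketch (assuming fit.train1 from above):
# best kernel and k found by the leave-one-out search
fit.train1$best.parameters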
Cluster plotting
source("http://www.r-statistics.com/wpcontent/uploads/2012/01/source_https.r.txt") # source code
from github
require(RCurl)
require(colorspace)
source_https("https://raw.github.com/talgalili/R-codesnippets/master/clustergram.r")
data(iris)
set.seed(250)
par(cex.lab = 1.5, cex.main = 1.2)
Data <- scale(iris[,-5]) # scaling
clustergram(Data, k.range = 2:8, line.width = 0.004) # line.width
- adjust according to Y-scale
Clustergram
[Plot: "Clustergram of the PCA-weighted Mean of the clusters, k-means clusters vs number of clusters (k)"; x-axis: Number of clusters (k), 2 to 8; y-axis: PCA-weighted mean of the clusters, roughly -3 to 3]
Any good?
set.seed(500)
Data2 <- scale(iris[,-5])
par(cex.lab = 1.2, cex.main = .7)
par(mfrow = c(3,2))
for(i in 1:6) clustergram(Data2, k.range = 2:8,
  line.width = .004, add.center.points = TRUE)
How can you tell it is good?
set.seed(250)
Data <- rbind(
  cbind(rnorm(100, 0, sd = 0.3), rnorm(100, 0, sd = 0.3), rnorm(100, 0, sd = 0.3)),
  cbind(rnorm(100, 1, sd = 0.3), rnorm(100, 1, sd = 0.3), rnorm(100, 1, sd = 0.3)),
  cbind(rnorm(100, 2, sd = 0.3), rnorm(100, 2, sd = 0.3), rnorm(100, 2, sd = 0.3)))
clustergram(Data, k.range = 2:5, line.width = .004, add.center.points = TRUE)
More complex…
set.seed(250)
Data <- rbind(
  cbind(rnorm(100, 1, sd = 0.3), rnorm(100, 0, sd = 0.3), rnorm(100, 0, sd = 0.3), rnorm(100, 0, sd = 0.3)),
  cbind(rnorm(100, 0, sd = 0.3), rnorm(100, 1, sd = 0.3), rnorm(100, 0, sd = 0.3), rnorm(100, 0, sd = 0.3)),
  cbind(rnorm(100, 0, sd = 0.3), rnorm(100, 1, sd = 0.3), rnorm(100, 1, sd = 0.3), rnorm(100, 0, sd = 0.3)),
  cbind(rnorm(100, 0, sd = 0.3), rnorm(100, 0, sd = 0.3), rnorm(100, 0, sd = 0.3), rnorm(100, 1, sd = 0.3)))
clustergram(Data, k.range = 2:8, line.width = .004, add.center.points = TRUE)
• Look at the location of the cluster centers on the Y axis. See when they
remain stable, when they start flying around, and what happens to them at
higher numbers of clusters (do they re-group together?).
• Observe the strands of the data points. Even if the cluster centers are not
ordered, the lines for each item might (this needs more research and
thinking) tend to move together – hinting at the real number of clusters.
• Run the plot multiple times to observe the stability of the cluster
formation (and location) – see the sketch after the link below.
http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
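A minimal sketch of that stability check (not from the slides; it assumes the clustergram function and the scaled iris Data from earlier are already loaded):
# redraw the clustergram under several seeds; strands that keep the same
# shape across panels suggest the grouping is not a k-means artifact
par(mfrow = c(2, 2))
for (s in c(100, 200, 300, 400)) {
  set.seed(s)
  clustergram(Data, k.range = 2:8, line.width = 0.004)
}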
Swiss - pairs
pairs(~ Fertility + Education + Catholic, data = swiss,
  subset = Education < 20, main = "Swiss data, Education < 20")
ctree
require(party)
swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic,
  data = swiss)
plot(swiss_ctree)
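As a follow-on (not on the slide), the fitted tree can also predict; a minimal sketch assuming swiss_ctree from above:
# predicted Fertility for the first few cantons vs. the observed values
head(cbind(predict(swiss_ctree), swiss$Fertility))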
Hierarchical clustering
> dswiss <- dist(as.matrix(swiss))
> hs <- hclust(dswiss)
> plot(hs)
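To turn the dendrogram into actual group labels, cut the tree at a chosen number of clusters; a minimal sketch (an addition, assuming dswiss and hs from above):
# cut the dendrogram into 3 clusters and count the members of each
groups <- cutree(hs, k = 3)
table(groups)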
scatterplotMatrix
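The slide shows the resulting plot; the call that presumably produced it (an assumption, mirroring the car example later in the deck) is:
require(car)
scatterplotMatrix(swiss)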
require(lattice); splom(swiss)
Decision tree (reminder)
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> str(swiss)
…
Beyond plot: pairs
pairs(iris[1:4], main = "Anderson's Iris Data -- 3
species”, pch = 21, bg = c("red", "green3",
"blue")[unclass(iris$Species)])
[Plot: pairs scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, titled "Anderson's Iris Data -- 3 species", with points colored by species]
Try Lab6b_2_2014.R - USJudgeRatings
Try hclust for iris
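One possible sketch for that exercise (not the official lab solution): cluster the four numeric iris columns and compare the cut against the known species.
# hierarchical clustering of iris, checked against the species labels
d_iris <- dist(as.matrix(iris[, 1:4]))
h_iris <- hclust(d_iris)
plot(h_iris)
table(cutree(h_iris, k = 3), iris$Species)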
require(gpairs)
gpairs(iris)
Try Lab6b_3_2014.R
Better scatterplots
install.packages("car")
require(car)
scatterplotMatrix(iris)
Try Lab6b_4_2014.R
require(lattice); splom(iris) # default
Try Lab6b_7_2014.R
splom extra!
require(lattice)
super.sym <- trellis.par.get("superpose.symbol")
splom(~iris[1:4], groups = Species, data = iris,
panel = panel.superpose,
key = list(title = "Three Varieties of Iris",
columns = 3,
points = list(pch = super.sym$pch[1:3],
col = super.sym$col[1:3]),
text = list(c("Setosa", "Versicolor", "Virginica"))))
splom(~iris[1:3]|Species, data = iris,
layout=c(2,2), pscales = 0,
varnames = c("Sepal\nLength", "Sepal\nWidth", "Petal\nLength"),
page = function(...) {
ltext(x = seq(.6, .8, length.out = 4),
y = seq(.9, .6, length.out = 4),
labels = c("Three", "Varieties", "of", "Iris"),
cex = 2)
})
parallelplot(~iris[1:4] | Species, iris)
parallelplot(~iris[1:4], iris, groups = Species,
horizontal.axis = FALSE, scales = list(x = list(rot = 90)))
Ctree
> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width +
    Petal.Length + Petal.Width, data = iris)
> print(iris_ctree)

  Conditional inference tree with 4 terminal nodes

Response:  Species
Inputs:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations:  150

1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
  2)* weights = 50
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
    4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
      5)* weights = 46
    4) Petal.Length > 4.8
      6)* weights = 8
  3) Petal.Width > 1.7
    7)* weights = 46
plot(iris_ctree)
[Plot: conditional inference tree. Root split on Petal.Length (<= 1.9 vs > 1.9, p < 0.001); the right branch splits on Petal.Width (<= 1.7 vs > 1.7, p < 0.001) and then Petal.Length (<= 4.8 vs > 4.8, p < 0.001). Terminal nodes with species-proportion bars: Node 2 (n = 50), Node 5 (n = 46), Node 6 (n = 8), Node 7 (n = 46)]
> plot(iris_ctree, type="simple”) # try this
Try Lab6b_5_2014.R
Try these on mapmeans, etc.
Something simpler – kmeans and…
> mapmeans <- data.frame(as.numeric(mapcoord$NEIGHBORHOOD),
    adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE,
    adduse$'querylist$latitude', adduse$'querylist$longitude')
> mapobjnew <- kmeans(mapmeans, 5, iter.max = 10, nstart = 5,
    algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
# kmeans uses only the first entry of the algorithm vector ("Hartigan-Wong")
> fitted(mapobjnew, method = c("centers", "classes"))
# fitted() likewise takes the first method ("centers") unless one is chosen
• Others?
Plotting clusters (DIY)
library(cluster)
clusplot(mapmeans, mapobj$cluster, color = TRUE, shade = TRUE,
  labels = 2, lines = 0)
# centroid plot against the first 2 discriminant functions
library(fpc) # needed for plotcluster
plotcluster(mapmeans, mapobj$cluster)
• dendrogram?
• cluster.stats (also in fpc) – see the sketch below
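A minimal sketch of cluster.stats (an assumption about how it might be applied here; mapmeans and the k-means result mapobj come from the previous slides):
library(fpc)
# validation statistics: within/between distances, Dunn index, silhouette...
cs <- cluster.stats(dist(mapmeans), mapobj$cluster)
cs$avg.silwidth # average silhouette width as one summary of cluster quality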
Bayes
> cl <- kmeans(iris[,1:4], 3)
> table(cl$cluster, iris[,5])
    setosa versicolor virginica
  1      0         48        14
  2      0          2        36
  3     50          0         0
> require(e1071) # provides naiveBayes
> m <- naiveBayes(iris[,1:4], iris[,5])
> table(predict(m, iris[,1:4]), iris[,5])
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
pairs(iris[1:4],main="Iris Data
(red=setosa,green=versicolor
,blue=virginica)", pch=21,
bg=c("red","green3","blue")[u
nclass(iris$Species)])
Digging into iris
classifier <- naiveBayes(iris[,1:4], iris[,5])
table(predict(classifier, iris[,-5]), iris[,5],
  dnn = list('predicted','actual'))
classifier$apriori
classifier$tables$Petal.Length
# the means and sds below are the per-species rows of
# classifier$tables$Petal.Length
plot(function(x) dnorm(x, 1.462, 0.1736640), 0, 8, col = "red",
  main = "Petal length distribution for the 3 different species")
curve(dnorm(x, 4.260, 0.4699110), add = TRUE, col = "blue")
curve(dnorm(x, 5.552, 0.5518947), add = TRUE, col = "green")
Using a contingency table
> data(Titanic)
> mdl <- naiveBayes(Survived ~ ., data = Titanic)
> mdl

Naive Bayes Classifier for Discrete Predictors

Call: naiveBayes.formula(formula = Survived ~ ., data = Titanic)

A-priori probabilities:
Survived
      No      Yes
0.676965 0.323035

Conditional probabilities:
        Class
Survived        1st        2nd        3rd       Crew
     No  0.08187919 0.11208054 0.35436242 0.45167785
     Yes 0.28551336 0.16596343 0.25035162 0.29817159

        Sex
Survived       Male     Female
     No  0.91543624 0.08456376
     Yes 0.51617440 0.48382560

        Age
Survived      Child      Adult
     No  0.03489933 0.96510067
     Yes 0.08016878 0.91983122

Try Lab6b_9_2014.R
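To test the classifier, predict() can be applied to the expanded contingency table (this follows the standard e1071 example rather than anything on the slide):
# predict survival for every Class/Sex/Age cell of the Titanic table
predict(mdl, as.data.frame(Titanic))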
http://www.ugrad.stat.ubc.ca/R/library/mlbench/html/HouseVotes84.html
require(mlbench)
data(HouseVotes84)
model <- naiveBayes(Class ~ ., data = HouseVotes84)
predict(model, HouseVotes84[1:10,-1])
predict(model, HouseVotes84[1:10,-1], type = "raw")
pred <- predict(model, HouseVotes84[,-1])
table(pred, HouseVotes84$Class)
Exercise for you
> data(HairEyeColor)
> mosaicplot(HairEyeColor)
> margin.table(HairEyeColor, 3)
Sex
  Male Female
   279    313
> margin.table(HairEyeColor, c(1,3))
       Sex
Hair    Male Female
  Black   56     52
  Brown  143    143
  Red     34     37
  Blond   46     81
How would you construct a naïve Bayes classifier and test it?
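One possible starting point (a sketch, not the assigned solution; it assumes you predict Sex from Hair and Eye, and relies on naiveBayes accepting a contingency table directly, as it does for Titanic):
require(e1071)
# fit on the 3-way table, then predict Sex for each Hair/Eye combination
hec <- naiveBayes(Sex ~ ., data = HairEyeColor)
newdata <- as.data.frame(HairEyeColor)[, c("Hair", "Eye")]
cbind(newdata, predicted = predict(hec, newdata))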
Assignment 5
• Project proposals…
• Let’s look at it
• Assignment 4 – how is it going? (Or should I assume you will all start
after today?)
Assignment 6 preview
• Your term projects should fall within the scope of a data analytics
problem of the type you have worked with in class/labs, or know of
yourself – the bigger the data the better. This means that the work must
go beyond just making lots of figures. You should develop the project to
show that you are thinking about and exploring the relationships and
distributions within your data. Start with a hypothesis, think of a way to
model and use the hypothesis, find or collect the necessary data, and do
preliminary analysis, detailed modeling, and a summary (interpretation).
– Note: You do not have to come up with a positive result; i.e., disproving the
hypothesis is just as good. Please use the section numbering below for your
written submission for this assignment.
• Introduction (2%)
• Data Description (3%)
• Analysis (8%)
• Model Development (8%)
• Conclusions and Discussion (4%)
• Oral presentation (5%) (10 mins)
Assignments to come
• Term project (6). Due ~ week 13/ 14 – early May. 30% (25%
written, 5% oral; individual). Available after spring break.
• Assignment 7: Predictive and Prescriptive Analytics. Due ~
week 10. 15% (15% written; individual).
Admin info (keep/ print this slide)
• Class: ITWS-4963/ITWS 6965
• Hours: 12:00pm-1:50pm Tuesday/ Friday
• Location: SAGE 3101
• Instructor: Peter Fox
• Instructor contact: [email protected], 518.276.4862 (do not leave a msg)
• Contact hours: Monday** 3:00-4:00pm (or by email appt)
• Contact location: Winslow 2120 (sometimes Lally 207A announced by email)
• TA: Lakshmi Chenicheri [email protected]
• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
  – Schedule, lectures, syllabus, reading, assignments, etc.