Weighted kNN, clustering, more plottong, Bayes Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 6b, February 28, 2014 1 Plot tools/ tips http://statmethods.net/advgraphs/layout.html http://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-inr/ pairs, gpairs, scatterplot.matrix, clustergram, etc. data() # precip, presidents, iris, swiss, sunspot.month (!), environmental, ethanol, ionosphere More script fragments in Lab6b_*_2014.R on the web site (escience.rpi.edu/data/DA ) 2 Weighted KNN? require(kknn) data(iris) m <- dim(iris)[1] val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m)) iris.learn <- iris[-val,] iris.valid <- iris[val,] iris.kknn <- kknn(Species~., iris.learn, iris.valid, distance = 1, kernel = "triangular") summary(iris.kknn) fit <- fitted(iris.kknn) table(iris.valid$Species, fit) pcol <- as.character(as.numeric(iris.valid$Species)) pairs(iris.valid[1:4], pch = pcol, col = c("green3", "red”)[(iris.valid$Species != fit)+1]) 3 4 Try Lab6b_8_2014.R New dataset - ionosphere require(kknn) data(ionosphere) ionosphere.learn <- ionosphere[1:200,] ionosphere.valid <- ionosphere[-c(1:200),] fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid) table(ionosphere.valid$class, fit.kknn$fit) # vary kernel (fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1)) table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class) #alter distance (fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2)) table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class) 5 Cluster plotting source("http://www.r-statistics.com/wpcontent/uploads/2012/01/source_https.r.txt") # source code from github require(RCurl) require(colorspace) source_https("https://raw.github.com/talgalili/R-codesnippets/master/clustergram.r") data(iris) set.seed(250) par(cex.lab = 1.5, cex.main = 1.2) Data <- scale(iris[,-5]) # scaling clustergram(Data, k.range = 2:8, line.width = 0.004) # line.width - adjust according to Y-scale 6 Clustergram 3 2 1 0 −1 −2 −3 PCA weighted Mean of the clusters Clustergram of the PCA−weighted Mean of the clusters k−mean clusters vs number of clusters (k) 2 3 4 5 6 Number of clusters (k) 7 8 7 Any good? set.seed(500) Data2 <- scale(iris[,-5]) par(cex.lab = 1.2, cex.main = .7) par(mfrow = c(3,2)) for(i in 1:6) clustergram(Data2, k.range = 2:8 , line.width = .004, add.center.points = T) 8 9 How can you tell it is good? set.seed(250) Data <- rbind( cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)), cbind(rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3)), cbind(rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3))) clustergram(Data, k.range = 2:5 , line.width = .004, add.center.points = T) 10 More complex… set.seed(250) Data <- rbind( cbind(rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)), cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)), cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3)), cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3))) clustergram(Data, k.range = 2:8 , line.width = .004, add.center.points = T) 11 • Look at the location of the cluster points on the Y axis. See when they remain stable, when they start flying around, and what happens to them in higher number of clusters (do they re-group together) • Observe the strands of the datapoints. Even if the clusters centers are not ordered, the lines for each item might (needs more research and thinking) tend to move together – hinting at the real number of clusters • Run the plot multiple times to observe the stability of the cluster formation (and location) http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/ 12 13 Swiss - pairs pairs(~ Fertility + Education + Catholic, data = swiss, subset = Education < 20, main = "Swiss data, Education < 20") 14 ctree require(party) swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss) plot(swiss_ctree) 15 Hierarchical clustering > dswiss <- dist(as.matrix(swiss)) > hs <- hclust(dswiss) > plot(hs) 16 scatterplotMatrix 17 require(lattice); splom(swiss) 18 Decision tree (reminder) > str(iris) 'data.frame': 150 obs. of 5 variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... > str(swiss) … 19 Beyond plot: pairs pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species”, pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)]) Anderson's Iris Data −− 3 species 2.5 3.0 3.5 4.0 0.5 1.0 1.5 2.0 2.5 7.5 2.0 4.0 4.5 6.0 Sepal.Length 5 7 2.0 3.0 Sepal.Width 1.5 2.5 1 3 Petal.Length 0.5 Petal.Width 4.5 5.5 6.5 7.5 1 2 3 4 5 6 7 20 Try Lab6b_2_2014.R - USJudgeRatings Try hclust for iris 21 gpairs(iris) 22 Try Lab6b_3_2014.R Better scatterplots install.packages("car") require(car) scatterplotMatrix(iris) 23 Try Lab6b_4_2014.R splom(iris) # default 24 Try Lab6b_7_2014.R splom extra! require(lattice) super.sym <- trellis.par.get("superpose.symbol") splom(~iris[1:4], groups = Species, data = iris, panel = panel.superpose, key = list(title = "Three Varieties of Iris", columns = 3, points = list(pch = super.sym$pch[1:3], col = super.sym$col[1:3]), text = list(c("Setosa", "Versicolor", "Virginica")))) splom(~iris[1:3]|Species, data = iris, layout=c(2,2), pscales = 0, varnames = c("Sepal\nLength", "Sepal\nWidth", "Petal\nLength"), page = function(...) { ltext(x = seq(.6, .8, length.out = 4), y = seq(.9, .6, length.out = 4), labels = c("Three", "Varieties", "of", "Iris"), cex = 2) }) parallelplot(~iris[1:4] | Species, iris) parallelplot(~iris[1:4], iris, groups = Species, horizontal.axis = FALSE, scales = list(x = list(rot = 90))) > Lab6b_7_2014.R 25 26 27 28 29 Ctree > iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris) > print(iris_ctree) Conditional inference tree with 4 terminal nodes Response: Species Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width Number of observations: 150 1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264 2)* weights = 50 1) Petal.Length > 1.9 3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894 4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865 5)* weights = 46 4) Petal.Length > 4.8 6)* weights = 8 3) Petal.Width > 1.7 7)* weights = 46 30 plot(iris_ctree) 1 Petal.Length p < 0.001 £ 1.9 > 1.9 3 Petal.Width p < 0.001 £ 1.7 4 Petal.Length p < 0.001 £ 4.8 Node 2 (n = 50) 1 0.8 0.6 0.4 0.2 0 setosa Node 5 (n = 46) 1 0.8 0.6 0.4 0.2 0 setosa > 1.7 > 4.8 Node 6 (n = 8) 1 0.8 0.6 0.4 0.2 0 setosa > plot(iris_ctree, type="simple”) # try this Node 7 (n = 46) 1 0.8 0.6 0.4 0.2 0 setosa 31 Try Lab6b_5_2014.R Try these on mapmeans, etc. 32 Something simpler – kmeans and… > mapmeans<data.frame(as.numeric(mapcoord$NEIGHBORHOOD) , adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude') > mapobjnew<-kmeans(mapmeans,5, iter.max=10, nstart=5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")) > fitted(mapobjnew,method=c("centers","classes")) • Others? 33 Plotting clusters (DIY) library(cluster) clusplot(mapmeans, mapobj$cluster, color=TRUE, shade=TRUE, labels=2, lines=0) # Centroid Plot against 1st 2 discriminant functions #library(fpc) plotcluster(mapmeans, mapobj$cluster) • dendogram? library(fpc) • cluster.stats 34 Bayes > cl <- kmeans(iris[,1:4], 3) > table(cl$cluster, iris[,5]) setosa versicolor virginica 2 0 2 36 1 0 48 14 3 50 0 0 # > m <- naiveBayes(iris[,1:4], iris[,5]) > table(predict(m, iris[,1:4]), iris[,5]) setosa versicolor virginica setosa 50 0 0 versicolor 0 47 3 virginica 0 3 47 pairs(iris[1:4],main="Iris Data (red=setosa,green=versicolor ,blue=virginica)", pch=21, bg=c("red","green3","blue")[u nclass(iris$Species)]) 35 Digging into iris classifier<-naiveBayes(iris[,1:4], iris[,5]) table(predict(classifier, iris[,-5]), iris[,5], dnn=list('predicted','actual')) classifier$apriori classifier$tables$Petal.Length plot(function(x) dnorm(x, 1.462, 0.1736640), 0, 8, col="red", main="Petal length distribution for the 3 different species") curve(dnorm(x, 4.260, 0.4699110), add=TRUE, col="blue") curve(dnorm(x, 5.552, 0.5518947 ), add=TRUE, col = "green") 36 37 Using a contingency table > data(Titanic) > mdl <- naiveBayes(Survived ~ ., data = Bayes Classifier for Discrete Predictors Titanic) Naive Call: naiveBayes.formula(formula = Survived ~ ., data = Titanic) probabilities: > mdl A-priori Survived No Yes 0.676965 0.323035 Conditional probabilities: Class Survived 1st 2nd 3rd Crew No 0.08187919 0.11208054 0.35436242 0.45167785 Yes 0.28551336 0.16596343 0.25035162 0.29817159 Sex Survived Male Female No 0.91543624 0.08456376 Yes 0.51617440 0.48382560 Age 38 Survived Child Adult No 0.03489933 0.96510067 Try Lab6b_9_2014.R Yes 0.08016878 0.91983122 http://www.ugrad.stat.ubc.ca/R/library/mlb ench/html/HouseVotes84.html require(mlbench) data(HouseVotes84) model <- naiveBayes(Class ~ ., data = HouseVotes84) predict(model, HouseVotes84[1:10,-1]) predict(model, HouseVotes84[1:10,-1], type = "raw") pred <- predict(model, HouseVotes84[,-1]) table(pred, HouseVotes84$Class) 39 Exercise for you > data(HairEyeColor) > mosaicplot(HairEyeColor) > margin.table(HairEyeColor,3) Sex Male Female 279 313 > margin.table(HairEyeColor,c(1,3)) Sex Hair Male Female Black 56 52 Brown 143 143 Red 34 37 Blond 46 81 How would you construct a naïve Bayes classifier and test it? 40 Assignment 5 • Project proposals… • Let’s look at it • Assignment 4 - how is it going – assume you all start after today? 41 Assignment 6 preview • Your term projects should fall within the scope of a data analytics problem of the type you have worked with in class/ labs, or know of yourself – the bigger the data the better. This means that the work must go beyond just making lots of figures. You should develop the project to indicate you are thinking of and exploring the relationships and distributions within your data. Start with a hypothesis, think of a way to model and use the hypothesis, find or collect the necessary data, and do both preliminary analysis, detailed modeling and summary (interpretation). – Note: You do not have to come up with a positive result, i.e. disproving the hypothesis is just as good. Please use the section numbering below for your written submission for this assignment. • • • • • • Introduction (2%) Data Description (3%) Analysis (8%) Model Development (8%) Conclusions and Discussion (4%) Oral presentation (5%) (10 mins) 42 Assignments to come • Term project (6). Due ~ week 13/ 14 – early May. 30% (25% written, 5% oral; individual). Available after spring break. • Assignment 7: Predictive and Prescriptive Analytics. Due ~ week 10. 15% (15% written; individual); 43 Admin info (keep/ print this slide) • • • • • • • • • Class: ITWS-4963/ITWS 6965 Hours: 12:00pm-1:50pm Tuesday/ Friday Location: SAGE 3101 Instructor: Peter Fox Instructor contact: [email protected], 518.276.4862 (do not leave a msg) Contact hours: Monday** 3:00-4:00pm (or by email appt) Contact location: Winslow 2120 (sometimes Lally 207A announced by email) TA: Lakshmi Chenicheri [email protected] Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014 – Schedule, lectures, syllabus, reading, assignments, etc. 44
© Copyright 2025