UNIVERSITY COLLEGE DUBLIN
NATIONAL UNIVERSITY OF IRELAND, DUBLIN

An Colaiste Ollscoile Baile Atha Cliath
Ollscoil na hEireann, Baile Atha Cliath
___________________________________

WINTER EXAMINATIONS 2005
SEMESTER 2 EXAMINATION – 2006/2007
Sample Paper

SCBDF0005/ SCBDF0015 - B.Sc. HONOURS DEGREE EXAMINATION
ARBDF0015 – B.A. (COMPUTER SCIENCE) HONOURS DEGREE EXAMINATION

Machine Learning
COMP 30120

Prof. P. Cunningham*

Time Allowed: 2 hours

Instructions for Candidates
Answer any three of the following five questions. All questions carry equal weight. Calculators allowed.

Instructions for Invigilators
Calculators allowed.

Q1. Decision Trees

(1. a) Explain the principles behind the use of information gain to score the discrimination power of features in supervised feature selection or in the building of decision trees. [25%]

(1. b) Describe the steps in an algorithm for the top down induction of decision trees. [25%]

(1. c) Explain how this algorithm for building decision trees might be considered to embody the Occam’s Razor principle. [25%]

(1. d) Compare and contrast decision trees and nearest-neighbour classifiers as mechanisms for developing supervised learning systems. [25%]

Q2. Neural Networks

(2. a) Three possible neural network architectures are:
  - Feedforward Neural Nets (trained using error backpropagation).
  - Unsupervised Neural Networks (e.g. Self-Organising Feature Maps).
  - Associative Networks (e.g. Hopfield Nets).
Explain the differences between these architectures in terms of their inputs and outputs and how they are trained and used. [75%]

(2. b) Even in a simple single layer Feedforward Neural Network the units (neurons) will have a fixed bias input. What is the reason for this bias input? [25%]

Q3. Nearest Neighbour

Consider three email messages that have been saved to text files, a spam email of size 5k and a legitimate email and an unclassified email of the same size.
If the spam message is saved as a text file and compressed using gzip it compresses to 4.2k and the non-spam file compresses to 4k. If the unknown message is concatenated to the spam message and compressed, the size is 7k. The full picture is presented in the following table:

Message               X (spam)   Y (non-spam)   Z (unknown)   XZ     YZ
Size                  5k         5k             5k            10k    10k
Compressed Size C()   4.2k       4k             4.5k          7k     6k

(3. a) Based on this information what class is the unknown file Z? [15%]

(3. b) What is the principle underlying your classification? [25%]

(3. c) Propose a formula for similarity that embodies this principle, where C(X) is the compressed size of X. [30%]

(3. d) Revise this formula so that it will work with text files of different sizes. [30%]

Q4. Ensemble methods

(4. a) Explain why diversity is important in Machine Learning ensembles for classification and regression. [25%]

(4. b) Bagging (bootstrap aggregation) has a mechanism for achieving diversity; explain how it works. [25%]

(4. c) Bootstrap resampling will not produce diversity in ensembles of nearest-neighbour classifiers; describe a process that will. [25%]

(4. d) Explain what differentiates the different ensemble members in a Boosting Ensemble. [25%]

Q5. Dimension Reduction

(5. a) Explain the relevance of the concept of a random walk to information retrieval on the web. [20%]

(5. b) A set of documents can be represented as a t×d term-document matrix X where t is the number of terms and d is the number of documents. A Singular Value Decomposition of X is as follows: X = T S D, where S is the matrix of singular values.

  i. Explain in terms of the matrix algebra involved how this decomposition is achieved. [30%]

  ii. Explain the principles behind the dropping of some of the singular values prior to using this decomposition for assessing document similarity. Are there any potential pitfalls in dropping these singular values? [30%]

  iii. Discuss the problem of determining exactly how many singular values should be retained. [20%]
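
The compressed sizes quoted in the Q3 table could be measured along the following lines. This is only a minimal sketch, assuming Python's standard gzip module as the compressor. The function names compressed_size and size_table and the sample messages are hypothetical and are not part of the paper, so the sizes it reports will not match the table.

```python
import gzip


def compressed_size(data: bytes) -> int:
    """Return C(data): the length in bytes of the gzip-compressed input."""
    return len(gzip.compress(data))


def size_table(x: bytes, y: bytes, z: bytes) -> dict:
    """Measure (size, compressed size) for X, Y, Z and the concatenations XZ and YZ."""
    inputs = {"X": x, "Y": y, "Z": z, "XZ": x + z, "YZ": y + z}
    return {name: (len(data), compressed_size(data)) for name, data in inputs.items()}


if __name__ == "__main__":
    # Hypothetical stand-ins for the three saved email files.
    spam = b"Buy cheap pills now, limited offer, click here. " * 40
    legit = b"The project meeting has moved to Friday at ten. " * 40
    unknown = b"Limited offer on cheap pills, click here now. " * 40

    for name, (size, csize) in size_table(spam, legit, unknown).items():
        print(f"{name:>2}  size={size:>5}  C()={csize:>5}")
```

In practice the three byte strings would be read from the saved text files rather than written inline.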