UNIVERSITY COLLEGE DUBLIN
NATIONAL UNIVERSITY OF IRELAND, DUBLIN
An Colaiste Ollscoile Baile Atha Cliath
Ollscoil na hEireann, Baile Atha Cliath
___________________________________
WINTER EXAMINATIONS 2005
SEMESTER 2 EXAMINATION – 2006/2007
Sample Paper
SCBDF0005/ SCBDF0015 - B.Sc. HONOURS DEGREE EXAMINATION
ARBDF0015 – B.A. (COMPUTER SCIENCE) HONOURS DEGREE
EXAMINATION
Machine Learning
COMP 30120
Prof. P. Cunningham*
Time Allowed: 2 hours
Instructions for Candidates
Answer any three of the following five questions. All questions carry equal
weight.
Calculators allowed.
Instructions for Invigilators
Calculators allowed.
Q1. Decision Trees
(1. a) Explain the principles behind the use of information gain to score the discrimination
power of features in supervised feature selection or in the building of decision trees.
[25%]
(1. b) Describe the steps in an algorithm for the top down induction of decision trees.
[25%]
(1. c) Explain how this algorithm for building decision trees might be considered to embody
the Occam’s Razor principle.
[25%]
(1. d) Compare and contrast decision trees and nearest-neighbour classifiers as mechanisms
for developing supervised learning systems.
[25%]
Q2. Neural Networks
(2. a)
Three possible neural network architectures are:
Feedforward Neural Nets (trained using error backpropagation).
Unsupervised Neural Networks (e.g. Self-Organising Feature Maps).
Associative Networks (e.g. Hopfield Nets).
Explain the differences between these architectures in terms of their inputs and
outputs and how they are trained and used.
[75%]
(2. b)
Even in a simple single layer Feedforward Neural Network the units (neurons) will
have a fixed bias input. What is the reason for this bias input?
[25%]
Q3. Nearest Neighbour
Consider three email messages, each of size 5k, that have been saved to text files: a spam
email, a legitimate email and an unclassified email. If the spam file is compressed using gzip
it compresses to 4.2k, and the non-spam file compresses to 4k. If the unknown message is
concatenated to the spam message and compressed, the size is 7k. The full picture is
presented in the following table:
Message         Size    Compressed Size C()
X (spam)        5k      4.2k
Y (non-spam)    5k      4k
Z (unknown)     5k      4.5k
XZ              10k     7k
YZ              10k     6k
(3. a)
Based on this information what class is the unknown file Z?
[15%]
(3. b)
What is the principle underlying your classification?
[25%]
(3. c)
Propose a formula for similarity that embodies this principle, where C(x) is the
compressed size of x.
[30%]
(3. d)
Revise this formula so that it will work with text files of different sizes.
[30%]
Q4. Ensemble methods
(4. a) Explain why diversity is important in Machine Learning ensembles for classification
and regression.
[25%]
(4. b)
Bagging (bootstrap aggregation) has a mechanism for achieving diversity; explain
how it works.
[25%]
(4. c)
Bootstrap resampling will not produce diversity in ensembles of nearest-neighbour
classifiers; describe a process that will.
[25%]
(4. d)
Explain what differentiates the different ensemble members in a Boosting Ensemble.
[25%]
Q5. Dimension Reduction
(5. a)
Explain the relevance of the concept of a random walk to information retrieval on the
web.
[20%]
(5. b)
A set of documents can be represented as a t×d term-document matrix X where t is
the number of terms and d is the number of documents. A Singular Value
Decomposition of X is as follows: X = T S D, where S is the matrix of singular
values.
i.
Explain in terms of the matrix algebra involved how this decomposition is
achieved.
[30%]
ii.
Explain the principles behind the dropping of some of the singular values prior to
using this decomposition for assessing document similarity. Are there any
potential pitfalls in dropping these singular values?
[30%]
iii.
Discuss the problem of determining exactly how many singular values should be
retained.
[20%]
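For reference, the t×d term-document matrix X described in (5. b) might be built along the following lines. This is a sketch under stated assumptions: the function name term_document_matrix, the whitespace tokenisation and the three sample documents are all hypothetical, and raw term counts are used as the matrix entries. The decomposition itself is not shown; it would normally be delegated to a numerical library routine such as numpy.linalg.svd.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, X), where X[i][j] is the count of term i in document j.

    Assumption: terms are lower-cased, whitespace-separated tokens.
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    # One row per term (t rows), one column per document (d columns).
    X = [[c[term] for c in counts] for term in terms]
    return terms, X


# Hypothetical sample documents (not from the paper).
docs = [
    "machine learning with decision trees",
    "neural networks for machine learning",
    "decision trees and nearest neighbour",
]

terms, X = term_document_matrix(docs)
for term, row in zip(terms, X):
    print(f"{term:<10} {row}")
```

Each row of X corresponds to one term and each column to one document, matching the t×d orientation used in the question.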